Crawling web pages and parsing content in Python under Windows

by admin on 2018-10-06 13:00:32

This tutorial has three parts: installing Python 2.7, installing the required libraries, and writing the script itself.

Required environment:

Operating System: Windows 8, 10, or later
Programming Language: Python 2.7

Install Python:

1.  Install Python 2.7

Download the Python 2.7 installation package from the Python website.

URL:

https://www.python.org/downloads/release/python-2715/

Direct download URL:

https://www.python.org/ftp/python/2.7.15/python-2.7.15.amd64.msi

Download the installer, then run it to install Python.


2.  After a successful installation, add the following directories to the PATH environment variable:

C:\Python27\
C:\Python27\Scripts\       
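
One way to confirm the PATH change took effect is to check it from Python in a newly opened command prompt. The following is only a minimal sketch: the helper name `missing_path_entries` is hypothetical, and it assumes the default install location C:\Python27 used in this tutorial.

```python
import os

# Directories this tutorial adds to PATH (default Python 2.7 install location).
REQUIRED_DIRS = ["C:\\Python27", "C:\\Python27\\Scripts"]


def missing_path_entries(path_value, required, sep=";"):
    """Return the entries of `required` that are absent from `path_value`.

    The comparison ignores case and trailing slashes, since Windows
    treats C:\\Python27 and c:\\python27\\ as the same directory.
    """
    def normalize(entry):
        return entry.strip().rstrip("\\/").lower()

    present = set(normalize(p) for p in path_value.split(sep) if p.strip())
    return [d for d in required if normalize(d) not in present]


if __name__ == "__main__":
    # An empty list means both directories were found on PATH.
    print(missing_path_entries(os.environ.get("PATH", ""), REQUIRED_DIRS))
```

If the printed list is not empty, the listed directories still need to be added to PATH.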


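3.  Install pip

Python 2.7.9 and later installers already include pip, so this step is needed only if the pip command is not found.

URL:

https://bootstrap.pypa.io/get-pip.py

Download get-pip.py, open the Windows command prompt (CMD) in the folder where you saved it, and run:

python get-pip.py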


4.  Install the Python libraries

Open the Windows command prompt and run:

pip install beautifulsoup4
pip install adodbapi
pip install selenium
pip install requests
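
To check that the four installs succeeded, you can try importing each library. This is a small sketch, not part of the original script; the helper name `find_missing_modules` is made up for illustration. Note that the beautifulsoup4 package is imported under the name `bs4`.

```python
import importlib

# Import names for the packages installed above
# (the beautifulsoup4 package is imported as "bs4").
REQUIRED_MODULES = ["bs4", "adodbapi", "selenium", "requests"]


def find_missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Prints [] when every library can be imported.
    print(find_missing_modules(REQUIRED_MODULES))
```

Any name that appears in the output can be reinstalled with the matching pip command.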


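5.  If the web page uses AJAX or other asynchronous JavaScript, you also need to install ChromeDriver so that Selenium can load the page in Chrome:

ChromeDriver download URL:

http://chromedriver.chromium.org/downloads

Download the version that matches your installed Chrome, then copy chromedriver.exe to C:\Python27\ (this directory is already on the PATH from step 2).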
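
Copying chromedriver.exe into C:\Python27\ works because that directory is on the PATH. The sketch below shows one way to confirm the file can be found there; `find_in_dirs` is a hypothetical helper name, and the check simply looks for the file in each PATH directory.

```python
import os


def find_in_dirs(filename, directories):
    """Return the full path of the first `filename` found in `directories`, else None."""
    for directory in directories:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path_dirs = os.environ.get("PATH", "").split(os.pathsep)
    # None means chromedriver.exe was not found in any PATH directory.
    print(find_in_dirs("chromedriver.exe", path_dirs))
```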


Python scripting:

1.  Import the libraries

import sys 
import shutil
import HTMLParser
import urllib 
from bs4 import BeautifulSoup
import httplib, mimetypes, mimetools, urllib2, cookielib, urlparse
import json
from xml.etree.ElementTree import ElementTree
from selenium import webdriver

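2.  Get the web content

# Start Chrome through chromedriver and load the page
driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
# page_source returns the HTML of the page as currently rendered by the browser
html = driver.page_source
# quit() closes the browser and also stops the chromedriver process
driver.quit()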
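
On pages that load content asynchronously, the element you want may not exist yet when page_source is first read. The sketch below polls the page source until a marker string appears. It is only an illustration: the function name `get_html_when_ready` and the parameters `marker`, `timeout`, and `poll` are assumptions, not part of the original script. Selenium also provides WebDriverWait for this purpose; a plain loop is used here so the logic is visible.

```python
import time


def get_html_when_ready(driver, url, marker, timeout=10.0, poll=0.5):
    """Load `url` and return the page source once `marker` appears in it.

    `driver` is any object with Selenium's get() method and page_source
    attribute, for example webdriver.Chrome().
    Raises RuntimeError if `marker` has not appeared after `timeout` seconds.
    """
    driver.get(url)
    deadline = time.time() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.time() >= deadline:
            raise RuntimeError(
                "marker %r not found within %s seconds" % (marker, timeout))
        time.sleep(poll)


# Usage with a real browser (needs chromedriver; "list_main" is the
# element id that the parsing step of this tutorial looks for):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       html = get_html_when_ready(driver, "http://www.yahoo.com", "list_main")
#   finally:
#       driver.quit()
```

Wrapping the call in try/finally ensures the browser is shut down even when the wait times out.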

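3.  Parse the web content

# Name the parser explicitly to avoid a bs4 warning and parser-dependent results
soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', id='list_main')
# find() returns None when no matching element exists
text = item.get_text() if item is not None else ''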
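
The same approach extends to pulling structured data out of the container instead of one block of text. The following sketch runs on an inline HTML string so it can be tried without a browser. The sample markup and the helper name `extract_links` are invented for illustration; only the `list_main` id comes from the script above.

```python
from bs4 import BeautifulSoup

# Made-up sample page; a real script would use the HTML fetched earlier.
SAMPLE_HTML = """
<html><body>
<div id="list_main">
  <ul>
    <li><a href="/a">First item</a></li>
    <li><a href="/b">Second item</a></li>
  </ul>
</div>
</body></html>
"""


def extract_links(html, container_id):
    """Return (text, href) pairs for every link inside the given div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        # No such div on the page: return an empty result instead of failing.
        return []
    return [(a.get_text(strip=True), a.get("href"))
            for a in container.find_all("a")]


links = extract_links(SAMPLE_HTML, "list_main")
# links == [("First item", "/a"), ("Second item", "/b")]
```

Returning an empty list when the container is missing lets the calling code treat "no results" and "page layout changed" the same way without catching exceptions.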

For more on using Beautiful Soup 4, see the official documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Beautiful Soup is a capable HTML and XML parsing library. It can search a document by tag name, attribute, or CSS selector, which covers most page-parsing needs.

-End-
