Crawling web pages and parsing content in Python under Windows

by admin on 2018-10-06 13:00:32

This tutorial is divided into three parts: installing Python 2.7, installing the required libraries, and the main content of the script.

Required environment:

Operating System: Windows 8, 10 or above
Programming Language: Python 2.7

Install Python:

1. Install Python 2.7

Download the Python 2.7 installation package from the official Python website.


Direct download URL:
Download the installer and install Python.

2.  After a successful installation, add Python to the PATH environment variable
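Assuming the default install location C:\Python27, the PATH can be extended from a CMD prompt as sketched below (the GUI route via System Properties > Environment Variables works as well):

```shell
REM Append the Python install folder and its Scripts folder to the user PATH
setx PATH "%PATH%;C:\Python27;C:\Python27\Scripts"
```

Open a new CMD window afterwards so the updated PATH takes effect.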



3.  Install pip

Download get-pip.py (the standard pip bootstrap script), open Windows CMD, and run:

python get-pip.py

4.  Install the Python libraries

Open Windows CMD and run:

pip install beautifulsoup4
pip install adodbapi
pip install selenium
pip install requests

5.   If the web page includes AJAX or other asynchronous JavaScript, you also need to install ChromeDriver:

ChromeDriver download URL:

Download the archive and copy chromedriver.exe to C:\Python27\ so it is on the PATH set earlier.

Python scripting:

1.  Import the libraries

# Note: these are the Python 2 module names; not all of them are needed
# for the short snippets below.
import sys
import shutil
import HTMLParser                    # html.parser in Python 3
import urllib
from bs4 import BeautifulSoup
import httplib, mimetypes, mimetools, urllib2, cookielib, urlparse  # Python 2 stdlib
import json
from xml.etree.ElementTree import ElementTree
from selenium import webdriver

2. Get the web content

driver = webdriver.Chrome()          # uses the chromedriver.exe installed above
driver.get('http://example.com')     # load the target page (placeholder URL; replace with your own)
html = driver.page_source            # the HTML after JavaScript has run

3. Parse the web content

soup = BeautifulSoup(html, 'html.parser')   # name a parser explicitly to avoid a warning
item = soup.find('div', id='list_main')
text = item.get_text()

For the usage of beautifulsoup4, you can refer to its documentation.

It is a powerful DOM parsing library and should cover the functionality you need.
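The find/get_text calls from step 3 can be tried out on a small inline document; the markup below is made up for illustration, with the div id matching the one used in the snippet:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a fetched page; the div id mirrors step 3.
html = """
<html><body>
  <div id="list_main">
    <p>First item</p>
    <p>Second item</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', id='list_main')
# get_text() concatenates all text nodes inside the tag;
# strip=True drops the surrounding whitespace of each node.
text = item.get_text(' ', strip=True)
print(text)  # -> First item Second item
```

The same pattern applies to the page_source string obtained from Selenium in step 2.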