Crawling web pages and parsing content in Python under Windows
This tutorial is divided into three parts: installing Python 2.7, installing the required libraries, and writing the crawling script itself.
Required environment:
Operating System | Programming Language
Windows 8, 10 or later | Python 2.7
Install Python:
1. Install Python 2.7
Download the Python 2.7 installation package from the official Python website.
URL:
https://www.python.org/downloads/release/python-2715/
direct download url:
https://www.python.org/ftp/python/2.7.15/python-2.7.15.amd64.msi
2. After a successful installation, add the following directories to the PATH environment variable:
C:\Python27\
C:\Python27\Scripts\
3. Install pip
Download get-pip.py from:
https://bootstrap.pypa.io/get-pip.py
Then run it from a command prompt:
python get-pip.py
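If the PATH from step 2 is set correctly, you can check that pip is available by running:
pip --version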
4. Install the Python libraries
Open a Windows command prompt and run:
pip install beautifulsoup4
pip install adodbapi
pip install selenium
pip install requests
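To confirm the libraries were installed correctly, a quick import check can be run from cmd (a sanity check only, not part of the script below):
python -c "import bs4, selenium, requests"
If no error is printed, the packages are importable; adodbapi can be checked the same way.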
5. If the web page includes AJAX or other asynchronous JavaScript, you also need to install ChromeDriver:
ChromeDriver download URL:
http://chromedriver.chromium.org/downloads
Download it and copy chromedriver.exe to C:\Python27\
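To confirm chromedriver.exe is reachable through the PATH set in step 2, it can be run once from cmd; it should print its version and exit:
chromedriver --version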
Python scripting:
1. Import the libraries
import sys
import shutil
import HTMLParser
import urllib
# BeautifulSoup parses the HTML returned by the browser (installed in step 4)
from bs4 import BeautifulSoup
# Python 2 standard-library modules for HTTP, cookies and URL handling
import httplib, mimetypes, mimetools, urllib2, cookielib, urlparse
import json
from xml.etree.ElementTree import ElementTree
# selenium drives a real browser so that JavaScript-generated content is rendered
from selenium import webdriver
2. Get the web content
# start Chrome through chromedriver.exe (found on the PATH set in step 2)
driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
# page_source contains the HTML after JavaScript has run
html = driver.page_source
# close the browser window (use driver.quit() to also shut down chromedriver)
driver.close()
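For pages that do not rely on AJAX or other client-side JavaScript, the requests library installed in step 4 can fetch the HTML directly, without starting a browser. A minimal sketch (the URL is just the same example as above):
import requests

response = requests.get("http://www.yahoo.com")
response.raise_for_status()   # raise an error if the request failed
html = response.text          # the page HTML as a string, ready for BeautifulSoup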
3. Parse the web content
# specify the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")
# find() returns None when nothing matches, so guard before calling get_text()
item = soup.find('div', id='list_main')
if item is not None:
    text = item.get_text()
For the usage of beautifulsoup4, refer to its official documentation. It is a powerful DOM parsing library and should have the functionality you need.
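As a further illustration of the kind of queries beautifulsoup4 supports, the sketch below lists every link on the page fetched above; the tag names and attributes are only examples and are not tied to any particular site:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# iterate over all <a> tags that carry an href attribute
for link in soup.find_all('a', href=True):
    print link.get_text(), link['href']
# CSS-style selectors are also supported
first_paragraph = soup.select_one('div p')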