tonglin0325的个人主页

Python爬虫——使用request请求网页

1.安装request

1
2
pip install requests

2.请求网页

下载地址:http://phantomjs.org/download.html

1
2
3
4
>>> import requests
>>> r = requests.get('https://wwww.baidu.com')
>>> print(r.text)

3.请求失败重试

如果请求失败的话,可以使用urllib3的Retry来进行重试

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

try:
retry = Retry(
total=5,
backoff_factor=2,
status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry)
session = requests.Session()
session.mount('https://', adapter)
r = session.get('https://httpbin.org/status/502', timeout=180)
print(r.status_code)
except Exception as e:
print(e)

参考:https://oxylabs.io/blog/python-requests-retry

4.使用lxml解析网页

可以安装xpath helper插件来测试xpath,https://chromewebstore.google.com/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl,使用如下

1
2
3
4
5
6
from lxml import etree

tree = etree.HTML(r.content)
text = tree.xpath('/html/body/div/header/div/nav/ul/li')
print(text)