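I recently noticed that kennethreitz, the author of requests, has released a new library called requests-html, so I took it for a spin. The library aims to make parsing HTML (for example, scraping web pages) as simple and intuitive as possible.
Official documentation: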
http://html.python-requests.org/
Let's scrape the NetEase "11 choose 5" lottery results.
First, open the site and use the browser's developer tools to locate the relevant HTML.
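To make the target concrete, here is a minimal offline sketch of the kind of markup the selectors in this post rely on. The HTML snippet, the sample period numbers, and the `AwardCellParser` class are all assumptions made for illustration; the real page may be structured differently. It uses the standard library's `html.parser` instead of requests-html so that it runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's structure and values may differ.
SAMPLE_HTML = """
<section class="main">
  <table>
    <tbody>
      <tr>
        <td class="start" data-period="18040101" data-award="01 02 03 04 05">01</td>
        <td class="start" data-period="18040102" data-award="02 04 06 08 10">02</td>
        <td class="start">03</td>
      </tr>
    </tbody>
  </table>
</section>
"""


class AwardCellParser(HTMLParser):
    """Collects data-period -> data-award from <td class="start"> cells."""

    def __init__(self):
        super().__init__()
        self.awards = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag != "td" or "start" not in classes:
            return
        period = attr_map.get("data-period")
        award = attr_map.get("data-award")
        # Skip cells that carry no data attributes (assumed: draws not held yet).
        if period is not None and award is not None:
            self.awards[period] = award


parser = AwardCellParser()
parser.feed(SAMPLE_HTML)
print(parser.awards)
```

With requests-html, the same cells are reached through `find('td.start')` and read through `td.attrs`, as the code in this post does.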
from requests_html import HTMLSession

session = HTMLSession()

def getData():
    response = session.get('http://caipiao.163.com/award/11xuan5/')
    # Narrow the search to <section class="main">
    content = response.html.find('section.main', first=True)
    body = content.find('tbody')
    itemDicts = dict()
    for tr in body:
        # Each result cell is a <td class="start">
        cells = tr.find('td.start')
        for td in cells:
            try:
                period = td.attrs['data-period']
                award = td.attrs['data-award']
                print("No.: " + td.text + " Period: " + period + " Winning numbers: " + award)
                itemDicts[period] = award
            except KeyError as e:
                # Raised when a cell lacks one of the data attributes
                print('except: ', e)
            finally:
                print('finally')
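The `KeyError` handling above can also be written without exceptions, since `td.attrs` is an ordinary dictionary. The following is only a sketch: plain dicts stand in for `td.attrs`, the values are made up, and `collect_awards` is a hypothetical helper, not part of requests-html.

```python
# Plain dicts standing in for td.attrs; the values are made up for illustration.
cells = [
    {"data-period": "18040101", "data-award": "01 02 03 04 05"},
    {},  # a cell without the data attributes
]


def collect_awards(attr_dicts):
    """Return {period: award}, skipping cells that lack either attribute."""
    awards = {}
    for attrs in attr_dicts:
        period = attrs.get("data-period")
        award = attrs.get("data-award")
        if period is None or award is None:
            continue
        awards[period] = award
    return awards


print(collect_awards(cells))
```

Whether `dict.get` or `try...except` reads better is a matter of taste; `get` avoids the exception path when missing attributes are expected rather than exceptional.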
Some winning numbers have not been drawn yet, so the attribute lookups are wrapped in try...except. The page presents the results as a table, so we also need to sort them by period number.
    # Still inside getData(): sort the period numbers in ascending order
    sortItemDict = sorted(itemDicts.keys(), reverse=False)
    # print(sortItemDict)
    for key in sortItemDict:
        print("Period:", key, " Winning numbers:", itemDicts[key])
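One detail worth keeping in mind is that the period numbers are strings, so `sorted` compares them character by character. The sketch below uses made-up period numbers and assumes they consist only of digits; under that assumption, `key=int` gives numeric order even if the strings differ in length.

```python
# Hypothetical period -> award mapping; the values are placeholders.
awards = {
    "18040110": "03 05 07 09 11",
    "18040102": "01 02 03 04 05",
    "18040109": "02 04 06 08 10",
}

# Fixed-width keys: plain string sorting already gives ascending order.
by_string = sorted(awards)

# Mixed-width keys: string sorting puts "10" before "9".
mixed = ["9", "10", "2"]
lexicographic = sorted(mixed)
numeric = sorted(mixed, key=int)  # compare as integers instead

for period in by_string:
    print("Period:", period, " Winning numbers:", awards[period])
```

If the period numbers always have the same number of digits, the plain `sorted` call used in this post is sufficient.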
Final result: each period is printed with its winning numbers, in ascending order of period number.
Complete code (it turned out to save a lot of work: you simply find the elements directly):
from requests_html import HTMLSession

session = HTMLSession()


def getData():
    response = session.get('http://caipiao.163.com/award/11xuan5/')
    # Narrow the search to <section class="main">
    content = response.html.find('section.main', first=True)
    body = content.find('tbody')
    itemDicts = dict()
    for tr in body:
        # Each result cell is a <td class="start">
        cells = tr.find('td.start')
        for td in cells:
            try:
                period = td.attrs['data-period']
                award = td.attrs['data-award']
                print("No.: " + td.text + " Period: " + period + " Winning numbers: " + award)
                itemDicts[period] = award
            except KeyError as e:
                # Raised when a cell lacks one of the data attributes
                print('except: ', e)
            finally:
                print('finally')

    # Sort the period numbers in ascending order and print the results
    sortItemDict = sorted(itemDicts.keys(), reverse=False)
    # print(sortItemDict)
    for key in sortItemDict:
        print("Period:", key, " Winning numbers:", itemDicts[key])


if __name__ == '__main__':
    getData()