BeautifulSoup's main job is extracting data from web pages, and compared with raw regular expressions it is far more convenient. A concrete example follows:
The goal of this article is to download past exam papers from educity.cn (希賽網(wǎng)). Before using BeautifulSoup, always analyze the page structure first. Both soup.select and soup.find_all return list-like results, so to get at individual items you iterate over them. Here, collecting the download URLs takes a single line, URL = soup.find_all(href=re.compile("pdf")), which is remarkably concise.
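To illustrate that one-liner, here is a minimal, self-contained sketch; the HTML snippet below is invented for demonstration, not taken from educity.cn, and it uses the built-in "html.parser" so no lxml install is needed. The href keyword accepts a compiled regex, and find_all returns every tag whose href attribute matches it:

```python
import re
from bs4 import BeautifulSoup

# A made-up page fragment standing in for the real exam-list page
html = """
<ul>
  <li><a href="/rk/zhenti/2019.pdf">2019 exam</a></li>
  <li><a href="/rk/zhenti/2020.pdf">2020 exam</a></li>
  <li><a href="/rk/zhenti/answers.html">answers</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only tags whose href matches the regex "pdf"
links = soup.find_all(href=re.compile("pdf"))
print([a["href"] for a in links])
# ['/rk/zhenti/2019.pdf', '/rk/zhenti/2020.pdf']
```

The .html link is filtered out because its href never matches the pattern.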
Also worth noting is the enumerate() function: it takes a sequence, iterator, or any other object that supports iteration, and returns an enumerate object that yields (index, value) pairs.
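A quick illustration (the file list here is invented):

```python
papers = ["2018.pdf", "2019.pdf", "2020.pdf"]

# enumerate() pairs each item with its index, starting from 0
for i, name in enumerate(papers):
    print(i, name)
# 0 2018.pdf
# 1 2019.pdf
# 2 2020.pdf
```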
# -*- coding:utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup

def Download_URL_List():
    url = 'http://www.educity.cn/rk/zhenti/test/'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        html = requests.get(url, headers=headers)  # fetch the page
    except Exception as e:
        print(e)
        return []
    content = html.text  # response body, decoded to unicode automatically
    soup = BeautifulSoup(content, "lxml")
    URL = soup.find_all(href=re.compile("pdf"))  # every tag whose href contains "pdf"
    return URL

def Download_File(URL):
    # [a-zA-Z] (note the capital Z; [a-zA-z] is a common typo that also
    # matches punctuation) and an escaped dot, to match an absolute URL ending in .pdf
    pattern = re.compile(r'[a-zA-Z]+://\S*?\.pdf')
    for i, pdf_url in enumerate(URL):
        match = pattern.search(str(pdf_url))
        if match is None:  # skip tags whose href is not an absolute .pdf URL
            continue
        download_url = match.group(0)
        req = requests.get(download_url)
        filename = str(i + 1) + '.pdf'  # save as 1.pdf, 2.pdf, ...
        with open(filename, 'wb') as pdf:
            pdf.write(req.content)

if __name__ == '__main__':
    Download_File(Download_URL_List())
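The URL extraction in Download_File works on the string form of each tag; a minimal sketch of that one step, using a fabricated tag string for illustration:

```python
import re

# A made-up tag string, as str(pdf_url) would produce
tag = '<a href="http://www.educity.cn/rk/zhenti/2020.pdf">2020</a>'

# Scheme, then "://", then a lazy run of non-whitespace up to the first ".pdf"
pattern = re.compile(r'[a-zA-Z]+://\S*?\.pdf')
match = pattern.search(tag)
print(match.group(0))
# http://www.educity.cn/rk/zhenti/2020.pdf
```

The lazy quantifier \S*? stops at the first ".pdf", so the trailing `">2020</a>` markup is never swallowed into the match.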