Project repository: https://github.com/Kulbear/All-IT-eBooks-Spider
If you like it, a star is welcome!
Introduction
I've recently been interning at a company where most of my work involves crawlers, and the more I use them the more useful they feel, so this little program came out of some spare time.
First, my thanks to Cui Qingcai's Python crawler tutorial series; my very first crawler (which was actually probably a bit more complex than the one in this post) was built by following his tutorials.
The code here is quite simple, so experts please go easy on me. It's only meant to give people learning crawlers with Python a small idea to start from.
A few days ago, while searching for something with a friend, I stumbled on a foreign website hosting a large collection of ebooks in PDF format. I'm genuinely puzzled how, given how strict copyright enforcement is abroad, this site obtained so many ebooks: all of them properly typeset PDFs with complete tables of contents and indexes, and not a single photocopied or scanned copy.
Most of my Python development so far has been done on UNIX-like systems, so this little crawler also serves to test the environment I just set up on Windows.
Enough chatter. Today's task is to scrape the download links for the PDFs on the All IT eBooks site.
Preparation
- Install Python 3.5.x
Since configuring Python on Windows is a bit fiddly, I personally use Anaconda's all-in-one installer; for Windows, Anaconda provides a very foolproof one-click package (see the Anaconda download page).
That's it.
The structure of this project (if we can call it that) is very simple, and the target site's layout is also very clean, so we don't need any third-party libraries!
Analyzing and extracting from the page source
All this simple crawler has to do is fetch the target page's source code (usually HTML), extract the useful information, and put it to further use.
Opening the site, you can see the design is clean and tidy, so the source code is presumably well structured too.

Scrolling down, there are pagination buttons at the bottom of the page; the paginated URLs follow the format http://www.allitebooks.com/page/<page number>/.
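Based on this URL pattern, the listing-page URLs can be generated with a simple format string. A small sketch (page_url is just a helper name I'm using for illustration):

```python
BASE_URL = 'http://www.allitebooks.com'

def page_url(page_number):
    """Build the URL for one page of the book listing."""
    return '{}/page/{}/'.format(BASE_URL, page_number)

# The first three listing pages:
urls = [page_url(i) for i in range(1, 4)]
```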
Clicking a book cover or title opens that book's detail page, which has a very prominent Download PDF button (I'll skip the screenshot here).
For example, one book's detail-page URL (not in the screenshot above; I picked a book with a short URL):
http://www.allitebooks.com/big-data/
First we need the link to each book's detail page; then we follow it to the page itself and look for the download link.
Checking a few more links shows that the detail-page links on the front page are easy to find: each book's content is wrapped in an article node, for example:
<article id="post-23083" class="post-23083 post type-post status-publish format-standard has-post-thumbnail hentry category-networking-cloud-computing post-list clearfix">
<div class="entry-thumbnail hover-thumb">
<a rel="bookmark">
 </a>
</div>
<!-- END .entry-thumbnail -->
<div class="entry-body">
<header class="entry-header">
<h2 class="entry-title"><a rel="bookmark">SSL VPN</a></h2>
<!-- END .entry-title -->
<div class="entry-meta">
<span class="author vcard">
<h5 class="entry-author">By: <a rel="tag">J. Steinberg</a>, <a rel="tag">Joseph Steinberg</a>, <a rel="tag">T. Speed</a>, <a rel="tag">Tim Speed</a></h5>
</span>
</div>
<!-- END .entry-meta -->
</header>
<!-- END .entry-header -->
<div class="entry-summary">
<p>This book is a business and technical overview of SSL VPN technology in a highly readable style. It provides a vendor-neutral introduction to SSL VPN technology for system architects, analysts and managers engaged in evaluating and planning
an SSL VPN implementation. This book aimed at IT network professionals…</p>
</div>
<!-- END .entry-summary -->
</div>
<!-- END .entry-body -->
</article>
It's easy to spot the book's link in the first child node of the first div (the href attributes were stripped when this sample was pasted; on the live page the anchor carries an href):
<a rel="bookmark">
A careful look at the whole page source shows that only the tags carrying book detail-page links contain
rel="bookmark"
So now it's straightforward; we have a few options for extracting the link:
- BeautifulSoup
- Regular expressions
- Others...
I won't say much about BeautifulSoup here; in short, it's a library that parses the HTML DOM structure nicely for you, while regular expressions are the ultimate tool, able to match any string that fits a pattern. We'll use a regular expression here (I only know the basics myself, but a five-minute introduction is enough for this problem); for a proper regex tutorial, see resources online.
First, a recommendation: Regex101, a handy site for testing regular expressions online.
'href="(.*)" rel="bookmark">'

The regular expression above matches the link we just saw. Now let's start on the Python code:
import urllib.request
import re
BASE_URL = 'http://www.allitebooks.com'
BOOK_LINK_PATTERN = 'href="(.*)" rel="bookmark">'
req = urllib.request.Request(BASE_URL)
html = urllib.request.urlopen(req)
doc = html.read().decode('utf8')
# print(doc)
url_list = list(set(re.findall(BOOK_LINK_PATTERN, doc)))
The code above decodes the page source and produces the url_list we need. The re.findall(...) call finds every substring of doc matching BOOK_LINK_PATTERN and returns them as a list; converting to a set removes duplicates, and converting back to a list makes it convenient to iterate over.
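The deduplication step can be seen in isolation with a toy fragment (the HTML below is made up for illustration; two anchors point at the same book, just like the cover and title anchors in the article sample earlier):

```python
import re

BOOK_LINK_PATTERN = 'href="(.*)" rel="bookmark">'
doc = ('<a href="http://www.allitebooks.com/big-data/" rel="bookmark">cover</a>\n'
       '<a href="http://www.allitebooks.com/big-data/" rel="bookmark">title</a>')

matches = re.findall(BOOK_LINK_PATTERN, doc)  # two hits, one per anchor
url_list = list(set(matches))                 # duplicates collapsed to one URL
```

On the real pages each book's thumbnail and title are separate anchors to the same URL, which is exactly why the set conversion matters.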
Scraping only the first page is obviously not enough, so we add a loop over the page numbers:
....
i = 1
while True:
    req = urllib.request.Request('{}/page/{}/'.format(BASE_URL, i))
    html = urllib.request.urlopen(req)
    doc = html.read().decode('utf8')
    # print(doc)
    url_list = list(set(re.findall(BOOK_LINK_PATTERN, doc)))
    # Do something here
    i += 1
This doesn't yet handle the errors that can occur; we'll add that shortly.
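As a sketch of what that handling might look like (my assumption here: requesting a page past the last one gets a 404 from the site, so an HTTPError is a reasonable stop signal):

```python
import re
import urllib.error
import urllib.request

BASE_URL = 'http://www.allitebooks.com'
BOOK_LINK_PATTERN = 'href="(.*)" rel="bookmark">'

def extract_links(doc):
    """Pull the unique detail-page links out of one page of HTML."""
    return list(set(re.findall(BOOK_LINK_PATTERN, doc)))

def crawl_all_pages():
    page = 1
    while True:
        try:
            req = urllib.request.Request('{}/page/{}/'.format(BASE_URL, page))
            doc = urllib.request.urlopen(req).read().decode('utf8')
        except urllib.error.HTTPError as err:
            # Assumption: a 404 past the last listing page means we're done.
            print('Stopped at page {} ({})'.format(page, err.code))
            break
        url_list = extract_links(doc)
        # ... follow each detail-page link here ...
        page += 1
```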
At this point the program can (in theory) collect the detail-page links for every book on every page of the site.
The work on each individual page is then very simple: visit each book's detail page, inspect its source, and the download link behind the Download PDF button is easy to extract.
<span class="download-links">
<a href="http://file.allitebooks.com/20160908/Expert Android Studio.pdf" target="_blank"><i class="fa fa-download" aria-hidden="true"></i> Download PDF <span class="download-size">(48.5 MB)</span></a>
</span>
Of this,
<a href="http://file.allitebooks.com/20160908/Expert Android Studio.pdf" target="_blank"><i class="fa fa-download" aria-hidden="true"></i> Download PDF <span class="download-size">(48.5 MB)</span></a>
is the part we need.
The same trick again: the following regular expression matches that piece of HTML:
<a href="(http:\/\/file.*)" target="_blank">
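Applied to the snippet shown above, the pattern pulls out just the file URL. A quick sketch (I've written the pattern as a raw string, where the forward slashes don't need escaping):

```python
import re

DOWNLOAD_LINK_PATTERN = r'<a href="(http://file.*)" target="_blank">'

snippet = ('<a href="http://file.allitebooks.com/20160908/Expert Android Studio.pdf"'
           ' target="_blank"><i class="fa fa-download"></i> Download PDF</a>')

# findall returns the captured group: the bare PDF URL
pdf_url = re.findall(DOWNLOAD_LINK_PATTERN, snippet)[0]
```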
I won't post this code broken down step by step; give it a try yourself (the source and GitHub link are at the end).
(Honestly, it's because I came back to write this post after the whole program was finished, and I'm too lazy to take it apart now...)
Summary
Of course, this simple program is only the most basic of crawlers, still a long way from a fully featured one. Most sites have anti-crawling measures of varying strength, such as limits on how many requests a single IP may make per unit of time. A site usually publishes a robots.txt file stating its policy toward crawlers, for example whether crawling is allowed at all. You can view it directly at an address of the form www.hostname.com/robots.txt; for the site we're scraping:
http://www.allitebooks.com/robots.txt
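The standard library can evaluate such a policy via urllib.robotparser. A minimal sketch (the rules below are a made-up example, not this site's real robots.txt; when online you'd load the live file with set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse a hypothetical policy directly; rp.set_url(...) + rp.read() fetches a real one.
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

allowed = rp.can_fetch('MySpider', 'http://www.allitebooks.com/big-data/')
blocked = rp.can_fetch('MySpider', 'http://www.allitebooks.com/private/secret/')
```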
To work around different sites' anti-crawling measures we can add request headers, randomize headers, rotate IPs, and so on. And when you scrape a site heavily or at high frequency, don't forget to give something back to its owner if you can (for instance, I donated five dollars after scraping Wikipedia) to help offset the cost of running their servers.
Source code
Github: https://github.com/JiYangE/All-IT-eBooks-Spider
Criticize away, and feel free to star it!
This code was hacked together during a one-hour lunch break; it has no comments and no refactoring, the structure is a bit messy, and single responsibility is nowhere to be found. I don't care: a soldier who can fight is a good soldier. Bear with it; the logic is very simple.
File 1: crawler.py
# -*- coding: utf-8 -*-
import re
import time
import urllib.error
import urllib.request

import conf as cf

BASE_URL = 'http://www.allitebooks.com'


class MyCrawler:
    def __init__(self, base_url=cf.BASE_URL, header=cf.FAKE_HEADER, start_page=1):
        self.base_url = base_url
        self.start_page = start_page
        self.headers = header

    # Install a proxy for all subsequent requests
    def build_proxy(self):
        proxy = cf.PROXY
        proxy_support = urllib.request.ProxyHandler(proxy)
        opener = urllib.request.build_opener(proxy_support)
        urllib.request.install_opener(opener)

    def fetch_book_name_list(self):
        while True:
            try:
                req = urllib.request.Request(
                    self.base_url + '/page/{}'.format(self.start_page),
                    headers=self.headers)
                html = urllib.request.urlopen(req)
                doc = html.read().decode('utf8')
                alist = list(set(re.findall(cf.BOOK_LINK_PATTERN, doc)))
                print('Now working on page {}\n'.format(self.start_page))
                time.sleep(20)
                self.start_page += 1
                self.fetch_download_link(alist)
            except urllib.error.HTTPError as err:
                print(err.msg)
                break

    def fetch_download_link(self, alist):
        f = open('result.txt', 'a')
        for item in alist:
            req = urllib.request.Request(item, headers=self.headers)
            html = urllib.request.urlopen(req)
            doc = html.read().decode('utf8')
            url = re.findall(cf.DOWNLOAD_LINK_PATTERN, doc)[0]
            print('Storing {}'.format(url))
            f.write(url + '\n')
            time.sleep(7)
        f.close()

    def run(self):
        self.fetch_book_name_list()


if __name__ == '__main__':
    mc = MyCrawler()
    # mc.build_proxy()
    mc.run()
File 2: conf.py
# -*- coding: utf-8 -*-
import random

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]

PROXY = {'http': "http://127.0.0.1:9743/"}

BOOK_LINK_PATTERN = 'href="(.*)" rel="bookmark">'
DOWNLOAD_LINK_PATTERN = '<a href="(http:\/\/file.*)" target="_blank">'
BASE_URL = 'http://www.allitebooks.com'

FAKE_HEADER = {
    'User-Agent': random.choice(USER_AGENTS),
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://www.allitebooks.com/',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
}
Contents of the output file result.txt:
http://file.allitebooks.com/20160708/Functional Python Programming.pdf
http://file.allitebooks.com/20160709/Mastering JavaScript.pdf
http://file.allitebooks.com/20160708/ReSharper Essentials.pdf
http://file.allitebooks.com/20160714/Mastering Python.pdf
http://file.allitebooks.com/20160723/PHP in Action.pdf
http://file.allitebooks.com/20160709/Learning Google Apps Script.pdf
http://file.allitebooks.com/20160709/Mastering Yii.pdf
......
A couple more words
Python grows more capable by the day, and there are plenty of ready-made crawler frameworks to learn from. Once you're comfortable with the basics, such as network protocols and scraping, try a more complete framework like Scrapy; see Cui Qingcai's write-up for details.