1. Defining an Item
An Item is the container that holds the scraped data; it works much like a Python dict.
You define an Item by creating a scrapy.Item subclass and declaring class attributes of type scrapy.Field.
In this example we will scrape the title (title), URL (link), and description (desc) of the sites listed on http://www.dmoz.org/, so we define a corresponding field for each in the item.
Edit the items.py file in the tutorial directory:
import scrapy

# Defines the three item fields: title, link, and desc
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
2. Writing the First Spider
Create a new Python file named dmoz_spider.py in the tutorial/spiders directory and edit it as follows:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each page body to a file named after the second-to-last
        # URL segment (e.g. "Books", "Resources")
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
- name: identifies the Spider. The name must be unique; you may not give different Spiders the same name.
- start_urls: the list of URLs the Spider crawls when it starts, so the first pages fetched will be among these. Subsequent URLs are extracted from the data returned by the initial ones (listing URLs this way is shorthand for the start_requests() sketch below).
- parse(): a method of the spider. When it is called, the Response object generated for each downloaded start URL is passed to it as the only argument. The method is responsible for parsing the returned response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
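As an aside, listing URLs in start_urls is shorthand for overriding the spider's start_requests() method; here is a minimal sketch of the equivalent, assuming the same two URLs (this code is not part of the original tutorial):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    # Equivalent to listing the URLs in start_urls: Scrapy's default
    # start_requests() yields one Request per URL, and parse() is the
    # default callback for each resulting Response.
    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)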
3. Crawling
From the project's root directory, run the following command to start the spider:
scrapy crawl dmoz
crawl dmoz starts the spider that crawls dmoz.org, and you should see output similar to this:
[Anaconda2] E:\tutorial>scrapy crawl dmoz
2016-07-30 21:39:42+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2016-07-30 21:39:42+0800 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-30 21:39:42+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2016-07-30 21:39:44+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-07-30 21:39:48+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-30 21:39:49+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-30 21:39:49+0800 [scrapy] INFO: Enabled item pipelines:
2016-07-30 21:39:49+0800 [dmoz] INFO: Spider opened
2016-07-30 21:39:49+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-30 21:39:49+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-30 21:39:49+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2016-07-30 21:39:55+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-30 21:39:55+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-30 21:39:55+0800 [dmoz] INFO: Closing spider (finished)
2016-07-30 21:39:55+0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16392,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 30, 13, 39, 55, 745000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 30, 13, 39, 49, 234000)}
2016-07-30 21:39:55+0800 [dmoz] INFO: Spider closed (finished)
Scrapy created a scrapy.Request object for each URL in the Spider's start_urls attribute and assigned the parse method to each Request as its callback.
The Request objects were scheduled and executed, and each resulting scrapy.http.Response object was sent back to the spider's parse() method. Beyond saving the response, parse() may also yield further Requests of its own, as the sketch below shows.
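A minimal, hypothetical sketch of link following (the '//a/@href' query and the FollowLinksSpider name are illustrative assumptions, not part of this tutorial's spider), using the Python 2 standard library to match the environment shown in the logs:

import urlparse  # Python 2 standard library module
import scrapy

class FollowLinksSpider(scrapy.Spider):
    name = "follow_links_example"  # hypothetical spider, for illustration only
    start_urls = ["http://www.dmoz.org/"]

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            # Each yielded Request is scheduled by Scrapy; its Response
            # is delivered back to the callback, here parse() itself.
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url, callback=self.parse)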
4. Extracting Items
Scrapy uses a mechanism based on XPath and CSS expressions to extract content from web pages.
Now let's try the Selector in the shell. From the project's root directory, run the following command to start the shell:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
Note: when you run Scrapy from a terminal, always remember to put quotes around the URL; otherwise a URL containing parameters (for example the & character) will make Scrapy fail, as the example below shows.
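For instance, with an unquoted URL (a hypothetical address, for illustration) the terminal itself interprets the & character:

# Without quotes, the shell backgrounds the command at '&' and drops the second parameter:
scrapy shell http://www.dmoz.org/some_page?x=1&y=2
# Quoted, the full URL reaches Scrapy intact:
scrapy shell "http://www.dmoz.org/some_page?x=1&y=2"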
Once inside the shell, you will see output like this:
[ ... Scrapy log here ... ]
2016-07-30 22:40:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x044CEB50>
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] settings <scrapy.settings.Settings object at 0x03634810>
[s] spider <DmozSpider 'dmoz' at 0x48be330>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]:
When the shell loads, you get a local response variable containing the fetched response.
Typing response.body prints the response body, and response.headers shows the response headers:
In [1]: response.headers
Out[1]:
{'Content-Language': 'en',
'Content-Type': 'text/html;charset=UTF-8',
'Cteonnt-Length': '52225',
'Date': 'Sat, 30 Jul 2016 14:55:10 GMT',
'Server': 'Apache',
'Set-Cookie': 'JSESSIONID=A98E722CA05AE195806DB0E5E79F64E4; Path=/; HttpOnly'}
Selector has four basic methods:
- xpath(): takes an XPath expression and returns a selector list of all matching nodes.
- css(): takes a CSS expression and returns a selector list of all matching nodes (see the short comparison below).
- extract(): serializes the matched nodes and returns a list of unicode strings.
- re(): applies the given regular expression and returns a list of unicode strings with the matches.
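For example, xpath() and css() can express the same query; a small illustration (assuming CSS ::text support, which reasonably recent Scrapy versions provide):

# Two equivalent ways to get the page title text:
response.xpath('//title/text()').extract()
response.css('title::text').extract()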
Now let's try these in the shell:
In [3]: response.xpath('//title')
Out[3]: [<Selector xpath='//title' data=u'<title>DMOZ - Computers: Programming: La'>]
In [4]: response.xpath('//title').extract()
Out[4]: [u'<title>DMOZ - Computers: Programming: Languages: Python: Books</title>']
In [5]: response.xpath('//title/text()')
Out[5]: [<Selector xpath='//title/text()' data=u'DMOZ - Computers: Programming: Languages'>]
In [6]: response.xpath('//title/text()').extract()
Out[6]: [u'DMOZ - Computers: Programming: Languages: Python: Books']
In [7]: response.xpath('//title/text()').re('(\w+):')
Out[7]: [u'Computers', u'Programming', u'Languages', u'Python']
5. Extracting the Data
We can select all the <li> elements in the page's site list with this expression:
response.xpath('//ul/li')
Output:
In [1]: response.xpath('//ul/li')
Out[1]:
[<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/about.html"> '>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/help/become.html"'>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/add.html"> '>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/help/helpmain.htm'>,
<Selector xpath='//ul/li' data=u'<li> <a href="/editors/"> Login '>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li> <span><a class="social-link" target'>,
<Selector xpath='//ul/li' data=u'<li> <span><a class="social-link" target'>]
Grab the text of the links in the site's navigation bar:
response.xpath('//ul/li/a/text()').extract()
Output:
In [2]: response.xpath('//ul/li/a/text()').extract()
Out[2]:
[u' About ',
u' Become an Editor ',
u' Suggest a Site ',
u' Help ',
u' Login ']
Grab the navigation bar's link URLs:
response.xpath('//ul/li/a/@href').extract()
Output:
In [3]: response.xpath('//ul/li/a/@href').extract()
Out[3]:
[u'/docs/en/about.html',
u'/docs/en/help/become.html',
u'/docs/en/add.html',
u'/docs/en/help/helpmain.html',
u'/editors/']
As you can see, each .xpath() call returns a list of selectors, so we can chain further .xpath() calls and iterate over the results to drill down to a particular node. We will use exactly that feature below.
Modify the spiders/dmoz_spider.py file:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc  # Python 2 print statement, matching this tutorial's environment
Start the Scrapy project again:
scrapy crawl dmoz
You will see the scraped site information printed successfully:
2016-08-01 10:06:48+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[u' About '] [u'/docs/en/about.html'] [u' ', u' ']
[u' Become an Editor '] [u'/docs/en/help/become.html'] [u' ', u' ']
[u' Suggest a Site '] [u'/docs/en/add.html'] [u' ', u' ']
[u' Help '] [u'/docs/en/help/helpmain.html'] [u' ', u' ']
[u' Login '] [u'/editors/'] [u' ', u' ']
[] [] [u' ', u' Share via Facebook ']
[] [] [u' ', u' Share via Twitter ']
[] [] [u' ', u' Share via LinkedIn ']
[] [] [u' ', u' Share via e-Mail ']
[] [] [u' ', u' ']
[] [] [u' ', u' ']
2016-08-01 10:06:48+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
[u' About '] [u'/docs/en/about.html'] [u' ', u' ']
[u' Become an Editor '] [u'/docs/en/help/become.html'] [u' ', u' ']
[u' Suggest a Site '] [u'/docs/en/add.html'] [u' ', u' ']
[u' Help '] [u'/docs/en/help/helpmain.html'] [u' ', u' ']
[u' Login '] [u'/editors/'] [u' ', u' ']
[] [] [u' ', u' Share via Facebook ']
[] [] [u' ', u' Share via Twitter ']
[] [] [u' ', u' Share via LinkedIn ']
[] [] [u' ', u' Share via e-Mail ']
[] [] [u' ', u' ']
[] [] [u' ', u' ']
2016-08-01 10:06:48+0800 [dmoz] INFO: Closing spider (finished)
6. Using the Item
Item objects are custom Python dicts; you can read each field's value with standard dict syntax (the fields being the attributes we assigned with Field earlier):
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
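Since Item implements the standard dict interface, the usual membership tests and conversions work as well; a small illustration, continuing the session above:

>>> 'title' in item
True
>>> item.keys()
['title']
>>> dict(item)
{'title': 'Example title'}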
Modify the spiders/dmoz_spider.py file once more:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
Crawling dmoz.org will now produce DmozItem objects.
7. Storing the Scraped Data
The simplest way to store the scraped data is with Feed exports:
scrapy crawl dmoz -o items.json
This command serializes the scraped data as JSON and writes it to a file named items.json.
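Other serialization formats are available through the same feed export mechanism; a sketch, assuming your Scrapy version infers the format from the file extension (older releases may need an explicit -t flag):

scrapy crawl dmoz -o items.jl    # JSON Lines, one item per line
scrapy crawl dmoz -o items.csv   # comma-separated values
scrapy crawl dmoz -o items.xml   # XML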
