1. Defining an Item
An Item is the container that holds the scraped data; it works much like a Python dict.
You define an Item by creating a scrapy.Item subclass and declaring class attributes of type scrapy.Field.
In this example we will scrape the title (title), URL (link), and description (desc) of the sites listed on http://www.dmoz.org/, so we define a corresponding field for each in the item.
Edit the items.py file in the tutorial directory:
import scrapy

# Defines the three item fields: title, link, and desc
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
2. Writing the First Spider
Create a new Python file named dmoz_spider.py in the tutorial/spiders directory and edit it as follows:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each page body to a file named after the second-to-last
        # URL segment (e.g. "Books", "Resources")
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
- name: identifies the Spider. The name must be unique; you may not give different Spiders the same name.
- start_urls: the list of URLs the Spider crawls when it starts, so the first pages fetched will be among these. Subsequent URLs are extracted from the data returned by the initial ones (listing URLs this way is shorthand for the start_requests() sketch below).
- parse(): a method of the spider. When it is called, the Response object generated for each downloaded start URL is passed to it as the only argument. The method is responsible for parsing the returned response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
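As an aside, listing URLs in start_urls is shorthand for overriding the spider's start_requests() method; here is a minimal sketch of the equivalent, assuming the same two URLs (this code is not part of the original tutorial):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]

    # Equivalent to listing the URLs in start_urls: Scrapy's default
    # start_requests() yields one Request per URL, and parse() is the
    # default callback for each resulting Response.
    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)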
3. Crawling
From the project's root directory, run the following command to start the spider:
scrapy crawl dmoz
crawl dmoz starts the spider that crawls dmoz.org, and you should see output similar to this:
[Anaconda2] E:\tutorial>scrapy crawl dmoz
2016-07-30 21:39:42+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2016-07-30 21:39:42+0800 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-30 21:39:42+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2016-07-30 21:39:44+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2016-07-30 21:39:48+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-30 21:39:49+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-30 21:39:49+0800 [scrapy] INFO: Enabled item pipelines:
2016-07-30 21:39:49+0800 [dmoz] INFO: Spider opened
2016-07-30 21:39:49+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-30 21:39:49+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-30 21:39:49+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2016-07-30 21:39:55+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-07-30 21:39:55+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-07-30 21:39:55+0800 [dmoz] INFO: Closing spider (finished)
2016-07-30 21:39:55+0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16392,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 30, 13, 39, 55, 745000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 30, 13, 39, 49, 234000)}
2016-07-30 21:39:55+0800 [dmoz] INFO: Spider closed (finished)
Scrapy created a scrapy.Request object for each URL in the Spider's start_urls attribute and assigned the parse method to each Request as its callback.
The Request objects were scheduled and executed, and each resulting scrapy.http.Response object was sent back to the spider's parse() method. Beyond saving the response, parse() may also yield further Requests of its own, as the sketch below shows.
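A minimal, hypothetical sketch of link following (the '//a/@href' query and the FollowLinksSpider name are illustrative assumptions, not part of this tutorial's spider), using the Python 2 standard library to match the environment shown in the logs:

import urlparse  # Python 2 standard library module
import scrapy

class FollowLinksSpider(scrapy.Spider):
    name = "follow_links_example"  # hypothetical spider, for illustration only
    start_urls = ["http://www.dmoz.org/"]

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            # Each yielded Request is scheduled by Scrapy; its Response
            # is delivered back to the callback, here parse() itself.
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url, callback=self.parse)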
4. Extracting Items
Scrapy uses a mechanism based on XPath and CSS expressions to extract content from web pages.
Now let's try the Selector in the shell. From the project's root directory, run the following command to start the shell:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
Note: when you run Scrapy from a terminal, always remember to put quotes around the URL; otherwise a URL containing parameters (for example the & character) will make Scrapy fail, as the example below shows.
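For instance, with an unquoted URL (a hypothetical address, for illustration) the terminal itself interprets the & character:

# Without quotes, the shell backgrounds the command at '&' and drops the second parameter:
scrapy shell http://www.dmoz.org/some_page?x=1&y=2
# Quoted, the full URL reaches Scrapy intact:
scrapy shell "http://www.dmoz.org/some_page?x=1&y=2"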
Once inside the shell, you will see output like this:
[ ... Scrapy log here ... ]
2016-07-30 22:40:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x044CEB50>
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] settings <scrapy.settings.Settings object at 0x03634810>
[s] spider <DmozSpider 'dmoz' at 0x48be330>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]:
When the shell loads, you get a local response variable containing the fetched response.
Typing response.body prints the response body, and response.headers shows the response headers:
In [1]: response.headers
Out[1]:
{'Content-Language': 'en',
'Content-Type': 'text/html;charset=UTF-8',
'Cteonnt-Length': '52225',
'Date': 'Sat, 30 Jul 2016 14:55:10 GMT',
'Server': 'Apache',
'Set-Cookie': 'JSESSIONID=A98E722CA05AE195806DB0E5E79F64E4; Path=/; HttpOnly'}
Selector has four basic methods:
- xpath(): takes an XPath expression and returns a selector list of all matching nodes.
- css(): takes a CSS expression and returns a selector list of all matching nodes (see the short comparison below).
- extract(): serializes the matched nodes and returns a list of unicode strings.
- re(): applies the given regular expression and returns a list of unicode strings with the matches.
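For example, xpath() and css() can express the same query; a small illustration (assuming CSS ::text support, which reasonably recent Scrapy versions provide):

# Two equivalent ways to get the page title text:
response.xpath('//title/text()').extract()
response.css('title::text').extract()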
Now let's try these in the shell:
In [3]: response.xpath('//title')
Out[3]: [<Selector xpath='//title' data=u'<title>DMOZ - Computers: Programming: La'>]
In [4]: response.xpath('//title').extract()
Out[4]: [u'<title>DMOZ - Computers: Programming: Languages: Python: Books</title>']
In [5]: response.xpath('//title/text()')
Out[5]: [<Selector xpath='//title/text()' data=u'DMOZ - Computers: Programming: Languages'>]
In [6]: response.xpath('//title/text()').extract()
Out[6]: [u'DMOZ - Computers: Programming: Languages: Python: Books']
In [7]: response.xpath('//title/text()').re('(\w+):')
Out[7]: [u'Computers', u'Programming', u'Languages', u'Python']
5. Extracting the Data
We can select all the <li> elements in the page's site list with this expression:
response.xpath('//ul/li')
Output:
In [1]: response.xpath('//ul/li')
Out[1]:
[<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/about.html"> '>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/help/become.html"'>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/add.html"> '>,
<Selector xpath='//ul/li' data=u'<li> <a href="/docs/en/help/helpmain.htm'>,
<Selector xpath='//ul/li' data=u'<li> <a href="/editors/"> Login '>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li class="social-link" onclick="share(\''>,
<Selector xpath='//ul/li' data=u'<li> <span><a class="social-link" target'>,
<Selector xpath='//ul/li' data=u'<li> <span><a class="social-link" target'>]
Grab the text of the links in the site's navigation bar:
response.xpath('//ul/li/a/text()').extract()
Output:
In [2]: response.xpath('//ul/li/a/text()').extract()
Out[2]:
[u' About ',
u' Become an Editor ',
u' Suggest a Site ',
u' Help ',
u' Login ']
Grab the navigation bar's link URLs:
response.xpath('//ul/li/a/@href').extract()
Output:
In [3]: response.xpath('//ul/li/a/@href').extract()
Out[3]:
[u'/docs/en/about.html',
u'/docs/en/help/become.html',
u'/docs/en/add.html',
u'/docs/en/help/helpmain.html',
u'/editors/']
As you can see, each .xpath() call returns a list of selectors, so we can chain further .xpath() calls and iterate over the results to drill down to a particular node. We will use exactly that feature below.
Modify the spiders/dmoz_spider.py file:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc  # Python 2 print statement, matching this tutorial's environment
Start the Scrapy project again:
scrapy crawl dmoz
You will see the scraped site information printed successfully:
2016-08-01 10:06:48+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[u' About '] [u'/docs/en/about.html'] [u' ', u' ']
[u' Become an Editor '] [u'/docs/en/help/become.html'] [u' ', u' ']
[u' Suggest a Site '] [u'/docs/en/add.html'] [u' ', u' ']
[u' Help '] [u'/docs/en/help/helpmain.html'] [u' ', u' ']
[u' Login '] [u'/editors/'] [u' ', u' ']
[] [] [u' ', u' Share via Facebook ']
[] [] [u' ', u' Share via Twitter ']
[] [] [u' ', u' Share via LinkedIn ']
[] [] [u' ', u' Share via e-Mail ']
[] [] [u' ', u' ']
[] [] [u' ', u' ']
2016-08-01 10:06:48+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
[u' About '] [u'/docs/en/about.html'] [u' ', u' ']
[u' Become an Editor '] [u'/docs/en/help/become.html'] [u' ', u' ']
[u' Suggest a Site '] [u'/docs/en/add.html'] [u' ', u' ']
[u' Help '] [u'/docs/en/help/helpmain.html'] [u' ', u' ']
[u' Login '] [u'/editors/'] [u' ', u' ']
[] [] [u' ', u' Share via Facebook ']
[] [] [u' ', u' Share via Twitter ']
[] [] [u' ', u' Share via LinkedIn ']
[] [] [u' ', u' Share via e-Mail ']
[] [] [u' ', u' ']
[] [] [u' ', u' ']
2016-08-01 10:06:48+0800 [dmoz] INFO: Closing spider (finished)
6. Using the Item
Item objects are custom Python dicts; you can read each field's value with standard dict syntax (the fields being the attributes we assigned with Field earlier):
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
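Since Item implements the standard dict interface, the usual membership tests and conversions work as well; a small illustration, continuing the session above:

>>> 'title' in item
True
>>> item.keys()
['title']
>>> dict(item)
{'title': 'Example title'}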
Modify the spiders/dmoz_spider.py file once more:
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
Crawling dmoz.org will now produce DmozItem objects.
7. Storing the Scraped Data
The simplest way to store the scraped data is with Feed exports:
scrapy crawl dmoz -o items.json
This command serializes the scraped data as JSON and writes it to a file named items.json.
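Other serialization formats are available through the same feed export mechanism; a sketch, assuming your Scrapy version infers the format from the file extension (older releases may need an explicit -t flag):

scrapy crawl dmoz -o items.jl    # JSON Lines, one item per line
scrapy crawl dmoz -o items.csv   # comma-separated values
scrapy crawl dmoz -o items.xml   # XML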
