
I covered the basics of Scrapy earlier but did not include a worked example, so here is a small one for reference.

Note: from here on I will not mention the Python version; assume Python 3.x.

Crawl target

We pick a simple image site and collect some information about its images.

Site URL: http://www.58pic.com/c/

Creating the project

Run the following command in a terminal:

scrapy startproject AdilCrawler

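After the command runs, it generates a project with the following structure.

The output is as follows.

As the command output suggests, cd into the project directory; from there you can run scrapy genspider example example.com to create a spider file named example for the domain example.com.

Writing items.py

For now we only scrape simple information such as the image author's name and the image theme.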

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AdilcrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author
    theme = scrapy.Field()  # theme

Writing the spider file

Enter the AdilCrawler directory and create a basic spider class with this command:

scrapy genspider thousandPic www.58pic.com

Here thousandPic is the spider name and www.58pic.com is the domain the spider is restricted to.

The command creates a file named thousandPic.py in the spiders folder. Now fill it in:

# -*- coding: utf-8 -*-
import scrapy


# A first try at a spider
class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this path:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, told apart as
        /html/body/div[4]/div[3]/div[i], where i is the varying index.
        Leaving i out makes the expression match every image under that path.
        '''
        # author: the image author
        # theme: the image theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

        # Use the spider's log method to print the scraped content to the console.
        self.log(author)
        self.log(theme)

        # Print the scraped content entry by entry; the page currently shows 20 images.
        for i in range(1, 21):
            print(i, ' **** ', theme[i - 1], ': ', author[i - 1])
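The final loop assumes exactly 20 images and indexes both lists by position, so it would raise an IndexError if the page returned fewer entries. Below is a minimal sketch of a variant that pairs the two lists with zip and numbers them with enumerate. The sample values and the helper name format_rows are made up for illustration; they stand in for the lists that response.xpath(...).extract() returns.

```python
# Made-up sample data standing in for the lists returned by extract().
theme = ["Sunset", "Forest", "Harbor"]
author = ["alice", "bob", "carol"]


def format_rows(themes, authors):
    """Pair each theme with its author; zip stops at the shorter list."""
    return [
        f"{i} **** {t}: {a}"
        for i, (t, a) in enumerate(zip(themes, authors), start=1)
    ]


for row in format_rows(theme, author):
    print(row)
```

Inside parse(), the same comprehension could be fed the real theme and author lists, and the page size would no longer need to be hard-coded.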

Run the command and check the printed output:

scrapy crawl thousandPic

The output is as follows; the lines marked DEBUG come from the log calls.
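The spider's docstring notes that leaving out the index on the image div makes the XPath match every image block. The sketch below shows that effect with the standard library's ElementTree, which supports only a subset of XPath (Scrapy's selectors are more complete). The markup is a hypothetical, much simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more deeply nested.
PAGE = """
<body>
  <div>
    <div><a><p><span>Sunset</span></p><p><span>alice</span></p></a></div>
    <div><a><p><span>Forest</span></p><p><span>bob</span></p></a></div>
    <div><a><p><span>Harbor</span></p><p><span>carol</span></p></a></div>
  </div>
</body>
"""

root = ET.fromstring(PAGE)

# With an index on the image div: only the second block matches.
second_author = [s.text for s in root.findall("./div/div[2]/a/p[2]/span")]

# Without an index: every image block under the container matches.
all_authors = [s.text for s in root.findall("./div/div/a/p[2]/span")]
all_themes = [s.text for s in root.findall("./div/div/a/p[1]/span")]

print(second_author)
print(all_authors)
print(all_themes)
```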

Code optimization

Import the item class AdilcrawlerItem

# -*- coding: utf-8 -*-
import scrapy

# Either this import or the from form below works. Which one resolves depends on
# whether AdilCrawler is opened as a project in PyCharm. This form is recommended.
import AdilCrawler.items as items

# The from form requires AdilCrawler to be opened as a project.
# from AdilCrawler.items import AdilcrawlerItem


class ThousandpicSpider(scrapy.Spider):
    name = 'thousandPic'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this path:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, told apart as
        /html/body/div[4]/div[3]/div[i], where i is the varying index.
        Leaving i out makes the expression match every image under that path.
        '''
        item = items.AdilcrawlerItem()

        # author: the image author
        # theme: the image theme
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()

        item['author'] = author
        item['theme'] = theme

        return item

Run the spider again; the output is as follows.

Saving the results to a file

Run the following command:

scrapy crawl thousandPic -o items.json

This generates a file like the one shown in the figure.
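Because parse() returns a single item whose author and theme fields are whole lists, the exported file holds one object per parsed page rather than one per image. The sketch below reshapes such an export into one record per image. The sample text is made up to mirror that shape, and flatten is a hypothetical helper name.

```python
import json

# Made-up sample in the shape the spider above would export: a JSON array
# with one item per parsed page, each field being a list of strings.
SAMPLE_EXPORT = """
[
  {"author": ["alice", "bob"], "theme": ["Sunset", "Forest"]}
]
"""


def flatten(export_text):
    """Turn page-level items into one {'theme', 'author'} record per image."""
    records = []
    for item in json.loads(export_text):
        for theme, author in zip(item["theme"], item["author"]):
            records.append({"theme": theme, "author": author})
    return records


records = flatten(SAMPLE_EXPORT)
print(records)
```

To try it on the real export, the text of items.json could be read from disk and passed to flatten in place of SAMPLE_EXPORT.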

Optimizing again with the ItemLoader helper class

Use ItemLoader to replace the messy extract() and xpath() calls.

The code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from AdilCrawler.items import AdilcrawlerItem
# Import the ItemLoader helper class
from scrapy.loader import ItemLoader


# Optimized version of the spider
class ThousandpicoptimizeSpider(scrapy.Spider):
    name = 'thousandPicOptimize'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/c/']

    def parse(self, response):
        '''
        Inspecting the page elements gives this path:
        /html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()
        The page holds many images, told apart as
        /html/body/div[4]/div[3]/div[i], where i is the varying index.
        Leaving i out makes the expression match every image under that path.
        '''
        # Use the ItemLoader helper class in place of the messy-looking
        # extract() and xpath() calls.
        i = ItemLoader(item=AdilcrawlerItem(), response=response)

        # author: the image author
        # theme: the image theme
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')

        return i.load_item()

Writing the pipelines file

The default pipelines.py file:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class Adilcrawler1Pipeline(object):
    def process_item(self, item, spider):
        return item

The optimized code is as follows:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class AdilcrawlerPipeline(object):
    '''
    Save item data
    '''

    def __init__(self):
        # Explicit UTF-8 so the non-ASCII text written below does not
        # depend on the platform's default encoding.
        self.filename = open('thousandPic.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the JSON file
        # instead of writing \u escapes.
        # Items are stored one dict at a time, so ',\n' is appended to
        # separate them and start a new line.
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
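Since the pipeline writes one JSON object per line followed by a comma, with no enclosing brackets, the file as a whole is not a valid JSON document, and parsing it in one call fails. A minimal sketch of reading it back line by line follows. Here io.StringIO stands in for thousandPic.json, the items are made up, and write_items and read_items are hypothetical helper names.

```python
import io
import json


def write_items(handle, items):
    """Mimic the pipeline's format: one JSON object per line, then a comma."""
    for item in items:
        handle.write(json.dumps(item, ensure_ascii=False) + ',\n')


def read_items(handle):
    """Parse each non-empty line after dropping its trailing comma."""
    result = []
    for line in handle:
        line = line.strip().rstrip(',')
        if line:
            result.append(json.loads(line))
    return result


# Made-up items; the in-memory buffer stands in for the output file.
sample = [
    {"author": ["alice"], "theme": ["日落"]},
    {"author": ["bob"], "theme": ["Forest"]},
]
buffer = io.StringIO()
write_items(buffer, sample)
buffer.seek(0)
loaded = read_items(buffer)
print(loaded == sample)
```

Dropping the comma and writing only '\n' would give the JSON Lines layout, which many tools read directly; that is a possible alternative, not what the pipeline here does.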

Configuring the settings file

Edit the settings.py configuration file.

Find the pipelines setting and change it:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
# }

# To enable a pipeline, it must be added to the ITEM_PIPELINES setting.
# Here AdilCrawler is the project package, pipelines is the pipeline module,
# and AdilcrawlerPipeline is the class name.
ITEM_PIPELINES = {
    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
}

# Once added, the pipeline is enabled: running the spider now calls the listed
# class in pipelines and its methods, such as the data-saving logic above.
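The number 300 in ITEM_PIPELINES is a priority: items pass through the enabled pipelines from the lowest value to the highest, and values are conventionally chosen in the 0 to 1000 range. The sketch below only illustrates that ordering with a plain sort; it is not how Scrapy itself loads pipelines, and CleanupPipeline is a hypothetical class that does not exist in this project.

```python
# Hypothetical settings: 'CleanupPipeline' does not exist in this project
# and is listed only to show how the numbers order the pipelines.
ITEM_PIPELINES = {
    'AdilCrawler.pipelines.AdilcrawlerPipeline': 300,
    'AdilCrawler.pipelines.CleanupPipeline': 100,
}

# Items pass through pipelines from the lowest number to the highest.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for path in run_order:
    print(path)
```

With these values the hypothetical cleanup step would see each item before the file-writing pipeline does.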

Run the command:

scrapy crawl thousandPicOptimize

After the run, the file shown below is generated, containing the saved data.

Using the CrawlSpider class to follow pagination

Create a CrawlSpider from the crawl template.

Run the following command:

scrapy genspider -t crawl thousandPicPaging www.58pic.com

items.py stays unchanged. Look at the generated spider file thousandPicPaging.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    start_urls = ['http://www.58pic.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

After modification it looks like this:

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor, used to pick out links that match a rule
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingSpider(CrawlSpider):
    name = 'thousandPicPaging'
    allowed_domains = ['www.58pic.com']
    # Changed start page
    start_urls = ['http://www.58pic.com/c/']

    # Link extraction rule applied to each response; it yields the links that match.
    # Pagination links look like http://www.58pic.com/c/1-0-0-03.html,
    # so 1-0-0-03 becomes the regular expression \S-\S-\S-\S\S.
    # Use allow here, not restrict_xpaths: in the author's experience the
    # regex stopped taking effect when restrict_xpaths was used.
    page_link = LinkExtractor(allow=r'http://www.58pic.com/c/\S-\S-\S-\S\S.html', allow_domains='www.58pic.com')

    rules = (
        # Request each extracted link in turn, keep following, and hand the
        # response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),
        # Note the trailing ',' above: without it rules is not a tuple and an error is raised.
    )

    # Added because parse_item() does not receive the first page.
    # parse_start_url is a CrawlSpider method; overriding it here is enough.
    def parse_start_url(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme
        yield i

    # The callback named in the rule
    def parse_item(self, response):
        i = items.AdilcrawlerItem()
        author = response.xpath('/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()').extract()
        theme = response.xpath('/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()').extract()
        i['author'] = author
        i['theme'] = theme
        yield i

Run it again:

scrapy crawl thousandPicPaging

Check the output; you can see it contains 4 pages of content.

Optimizing again by introducing the ItemLoader class

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor, used to pick out links that match a rule
from scrapy.linkextractors import LinkExtractor
# Import the ItemLoader helper class, plus CrawlSpider and Rule
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
import AdilCrawler.items as items


class ThousandpicpagingopSpider(CrawlSpider):
    name = 'thousandPicPagingOp'
    allowed_domains = ['www.58pic.com']
    # Changed start page
    start_urls = ['http://www.58pic.com/c/']

    # Link extraction rule applied to each response; it yields the links that match.
    # Pagination links look like http://www.58pic.com/c/1-0-0-03.html,
    # so 1-0-0-03 becomes the regular expression \S-\S-\S-\S\S.
    # Use allow here, not restrict_xpaths: in the author's experience the
    # regex stopped taking effect when restrict_xpaths was used.
    page_link = LinkExtractor(allow=r'http://www.58pic.com/c/\S-\S-\S-\S\S.html', allow_domains='www.58pic.com')

    rules = (
        # Request each extracted link in turn, keep following, and hand the
        # response to the named callback.
        Rule(page_link, callback='parse_item', follow=True),
        # Note the trailing ',' above: without it rules is not a tuple and an error is raised.
    )

    # Added because parse_item() does not receive the first page.
    # parse_start_url is a CrawlSpider method; overriding it here is enough.
    def parse_start_url(self, response):
        i = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        yield i.load_item()

    # The callback named in the rule
    def parse_item(self, response):
        i = ItemLoader(item=items.AdilcrawlerItem(), response=response)
        i.add_xpath('author', '/html/body/div[4]/div[3]/div/a/p[2]/span/span[2]/text()')
        i.add_xpath('theme', '/html/body/div[4]/div[3]/div/a/p[1]/span[1]/text()')
        yield i.load_item()

The result is the same.

Finally, a quick plug for an online regular expression tester: http://tool.oschina.net/regex/

It is used as follows.
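The same check can be done locally with Python's re module. The sketch below tests the allow pattern from the spiders against a few URLs, assuming the link extractor applies the pattern with search semantics. Only the first URL comes from the article; the other two are made up, and the STRICT pattern is an assumed tighter alternative, not the author's.

```python
import re

# The pattern from the spiders above, written as a raw string.
LOOSE = re.compile(r'http://www.58pic.com/c/\S-\S-\S-\S\S.html')
# A stricter variant (an assumption, not the author's pattern):
# dots are escaped and each placeholder must be a digit.
STRICT = re.compile(r'http://www\.58pic\.com/c/\d-\d-\d-\d\d\.html')

urls = [
    'http://www.58pic.com/c/1-0-0-03.html',  # pagination link from the article
    'http://www.58pic.com/c/a-b-c-de.html',  # made-up: letters instead of digits
    'http://www.58pic.com/c/1-0-0-03xhtml',  # made-up: no literal dot before html
]

# For each URL, record whether the loose and the strict pattern match.
results = [(bool(LOOSE.search(u)), bool(STRICT.search(u))) for u in urls]

for url, (loose, strict) in zip(urls, results):
    print(url, loose, strict)
```

The loose pattern accepts all three URLs because \S matches any non-whitespace character and an unescaped dot matches any character, while the stricter variant accepts only the first.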

That completes a simple crawl of basic information from one site. More topics will follow in later posts.

If you found this useful, please consider leaving a tip; it is what keeps me going. Thank you!


Python Scrapy Crawler Framework Example (Part 1)