
Installing Scrapy

Official Scrapy documentation: http://doc.scrapy.org/en/latest

Community-maintained Chinese documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Installing on Windows

Python 3

Upgrade pip:

pip3 install --upgrade pip

Install the Scrapy framework with pip:

pip3 install Scrapy

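Installing on Ubuntu

Install the Scrapy framework with pip3:

sudo pip3 install scrapy
If the installation fails, install the following dependencies and try again.

Install the non-Python dependencies:

sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev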
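After installing, it can help to confirm that the package is importable from the interpreter you intend to use. The sketch below relies only on the standard library; the helper name is_installed is made up for this illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "scrapy" is the import name of the package installed above
    print("scrapy importable:", is_installed("scrapy"))
```

If this prints False, pip most likely installed the package for a different Python interpreter than the one running the script.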

Architecture diagram

[Figure: Scrapy architecture and data-flow diagram]

1. Scrapy Engine: coordinates communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

2. Scheduler: receives the Requests sent by the engine, orders and enqueues them, and hands them back to the engine when the engine asks for the next one.

3. Downloader: downloads every Request the engine sends it and returns the resulting Responses to the engine, which passes them to the Spider for processing.

4. Spider: processes the Responses, extracts the data needed for the Item fields, and submits any URLs to follow to the engine, which puts them back into the Scheduler.

5. Item Pipeline: post-processes the Items produced by the Spider (detailed analysis, filtering, storage, and so on).

6. Downloader Middlewares: components you can use to customize and extend the download step.

7. Spider Middlewares: components you can use to customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).
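The way these components hand work to one another can be sketched with a toy loop. This is a simplified model written for illustration, not Scrapy's implementation: it is synchronous, has no middlewares, and the names toy_crawl, SITE, fake_download, and fake_parse are all hypothetical.

```python
from collections import deque


def toy_crawl(start_urls, download, parse, pipeline):
    """Toy engine loop: scheduler -> downloader -> spider -> pipeline."""
    scheduler = deque(start_urls)   # Scheduler: FIFO queue of pending requests
    seen = set(start_urls)          # do not schedule the same URL twice
    while scheduler:
        url = scheduler.popleft()            # engine takes the next request
        response = download(url)             # Downloader returns a response
        for result in parse(url, response):  # Spider yields items or new URLs
            if isinstance(result, str):      # a string stands for a follow-up request
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                            # anything else is treated as an item
                pipeline(result)


# Hypothetical in-memory "site": each URL maps to (title, linked URLs)
SITE = {
    "/a": ("Page A", ["/b", "/c"]),
    "/b": ("Page B", ["/c"]),
    "/c": ("Page C", []),
}


def fake_download(url):
    return SITE[url]


def fake_parse(url, response):
    title, links = response
    yield {"url": url, "title": title}  # an item for the pipeline
    yield from links                     # URLs to follow


items = []
toy_crawl(["/a"], fake_download, fake_parse, items.append)
print(items)
```

Each page is visited once even though "/c" is linked from two pages, because the toy scheduler remembers which URLs it has already queued.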

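Spiders fall into two kinds: basic spiders and generic spiders

Basic spider

1. Create a project
Before you start crawling, you must create a new Scrapy project. Change into the directory where you keep your projects and run:
scrapy startproject myspider
Note: you can work inside a virtual environment when writing the project.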

2. Create the spider file
scrapy genspider jobbole jobbole.com

3. Decide on the target URL
https://www.baidu.com/

4. Open items.py and declare the fields you want to scrape (inside the project's Item class)
# Title
title = scrapy.Field()
# Creation date
create_date = scrapy.Field()
# Article URL
url = scrapy.Field()

5. Write the spider (spiders/jobbole.py)
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        pass

6. Parse and store the data
Store the data in the pipeline file (pipelines.py).
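As one possible way to do the storage step, the sketch below writes each item to a SQLite table using only the standard library. It assumes the three fields declared earlier (title, create_date, url) and follows the open_spider / process_item / close_spider hooks that Scrapy calls on a pipeline; the class name SQLitePipeline, the table name, and the database path are all made up for illustration.

```python
import sqlite3


class SQLitePipeline:
    """Sketch of a pipeline that stores each item as a row in SQLite."""

    def __init__(self, db_path="articles.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS article "
            "(title TEXT, create_date TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields
        row = dict(item)
        self.conn.execute(
            "INSERT INTO article (title, create_date, url) VALUES (?, ?, ?)",
            (row.get("title"), row.get("create_date"), row.get("url")),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()
```

A pipeline like this only runs once it is listed in ITEM_PIPELINES in settings.py.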

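Generic spider

The following command quickly generates code from the CrawlSpider template:
scrapy genspider -t crawl <spider name> <domain>

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines a set of rules (Rule objects) that provide a convenient mechanism for following links: it extracts links from the crawled pages and keeps crawling them.

Besides the attributes it inherits from Spider (name, allowed_domains), CrawlSpider provides new attributes and methods: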

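rules

CrawlSpider uses the rules attribute to decide how the spider crawls, and submits the requests for matching URLs to the engine to carry on the crawl.
rules contains one or more Rule objects. Each Rule defines a particular behavior for crawling the site, such as which links to extract from the current response, whether to follow the extracted links, and which callback to set on the submitted requests.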

class scrapy.spiders.Rule(
link_extractor,
callback = None,
cb_kwargs = None,
follow = None,
process_links = None,
process_request = None
)
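The link_extractor argument decides which links a Rule picks up, and it is commonly configured with an allow regular expression that is searched for in each absolute URL. The standard-library sketch below only approximates that matching idea and is not Scrapy's code; the pattern and the URLs are made up for illustration.

```python
import re

# Unescaped dot: "." matches any character, so this pattern is looser than it looks
loose = re.compile(r'jobbole.com/\d+/')
# Escaped dot: matches a literal "." only
strict = re.compile(r'jobbole\.com/\d+/')

# Hypothetical URLs, invented for this illustration
urls = [
    'http://blog.jobbole.com/114256/',
    'http://blog.jobbole.com/all-posts/',
    'http://blog.jobboleXcom/114256/',
]

# A URL is kept if the pattern is found anywhere inside it
loose_matches = [u for u in urls if loose.search(u)]
strict_matches = [u for u in urls if strict.search(u)]

print(loose_matches)
print(strict_matches)
```

The loose pattern also accepts the third URL, which is why escaping the dots in an allow pattern is usually worth the extra backslashes.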

Using the generic spider

Step 1: decide which fields to save, based on the pages you want to crawl
import scrapy


class ZhilianItem(scrapy.Item):
    # define the fields for your item here like:

    name = scrapy.Field()
    job_title = scrapy.Field()

Step 2: write the spider class
Create a LinkExtractor instance:
from scrapy.linkextractors import LinkExtractor
jobListRult = LinkExtractor(allow=r'sou.zhaopin.com/jobs')

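Step 3: save the data
pipelines.py
import json


class ZhilianPipeline(object):

    def __init__(self):
        # Append one JSON object per line; UTF-8 because ensure_ascii=False
        self.file = open('zhilian.json', 'a+', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        # Return the item so that later pipelines can process it
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()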

Step 4: configure settings.py
1. ROBOTSTXT_OBEY = False sets whether to obey robots.txt (False means the rules are ignored)
2. DOWNLOAD_DELAY = 3 sets the download delay, in seconds
3. Set the global request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:59.0) Gecko/20100101 Firefox/59.0',
}
4. Enable the item pipeline
ITEM_PIPELINES = {
    'zhilian.pipelines.ZhilianPipeline': 300,
}

Step 5: run the spider
scrapy crawl zhilianCrawl
