Project directory layout:

scrapy.cfg: the project's configuration file.

tutorial/: the project's Python module; your code goes here.

tutorial/items.py: the project's item definitions.

tutorial/pipelines.py: the project's item pipelines.

tutorial/settings.py: the project's settings file.

tutorial/spiders/: the directory that holds the spider code.

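Defining an item

An Item is a container for scraped data. It is used much like a Python dict, and it adds a safeguard: only declared fields can be set, so a misspelled field name raises an error instead of silently creating a new key.

Example code: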

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
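To make the safeguard against misspelled field names concrete, here is a minimal pure-Python sketch of the idea. It is not Scrapy's implementation; `StrictRecord` is a hypothetical stand-in that only mimics the behaviour of rejecting undeclared fields.

```python
class StrictRecord(dict):
    """Dict that only accepts keys listed in `fields` (hypothetical helper)."""

    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        # Refuse any key that was not declared up front.
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


record = StrictRecord()
record["title"] = "Example book"   # declared field: accepted
try:
    record["titel"] = "typo"       # misspelled field: rejected
except KeyError as exc:
    print("rejected:", exc)
```

A plain dict would have stored the misspelled key without complaint, which is the kind of mistake the declared fields are meant to catch.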

Defining a spider

name: identifies the spider. It must be unique; two spiders cannot share the same name.

start_urls: the list of URLs the spider starts crawling from, so the first pages fetched come from this list. Subsequent URLs are extracted from the data retrieved from these initial URLs.

parse(): a method of the spider. It is called with the Response object of each downloaded start URL as its only argument. It is responsible for parsing the response data, extracting scraped data (producing items), and producing Request objects for further URLs to follow.

Example code:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        # Save the raw page body to that file
        with open(filename, 'wb') as f:
            f.write(response.body)
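The file name in parse() comes from `response.url.split("/")[-2]`. The short sketch below applies the same expression to the two start URLs to show which names result; `filename_for` is a helper name introduced here only for illustration.

```python
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # The trailing "/" leaves an empty last element, so [-2] is the final path segment.
    return url.split("/")[-2]


filenames = [filename_for(url) for url in start_urls]
print(filenames)  # ['Books', 'Resources']
```

Running the spider therefore writes two files, Books and Resources, into the current working directory.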

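Scrapy provides an item pipeline for downloading the images attached to a particular item, for example when you scrape products and also want to save their images locally.

This pipeline, called the images pipeline, is implemented in the ImagesPipeline class. It offers a convenient way to download images and store them locally, with some extra features.

Using the images pipeline

Example code: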

import scrapy
# Older Scrapy releases exposed this class as scrapy.contrib.pipeline.images.ImagesPipeline
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results holds one (success, info) tuple per request; keep the paths of the successful ones
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
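The list comprehension in item_completed() is easier to follow with sample data. The sketch below uses a made-up `results` list: the URL, path and checksum values are invented, and a plain Exception stands in for the failure object Scrapy would pass for a failed download.

```python
# Hypothetical sample of what item_completed() receives: one (success, detail) pair per image.
results = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/a.jpg", "checksum": "abc"}),
    (False, Exception("download failed")),
]

# Keep only the stored paths of downloads that succeeded.
image_paths = [detail["path"] for ok, detail in results if ok]
print(image_paths)  # ['full/a.jpg']
```

If every download fails, `image_paths` is empty and the pipeline above drops the item.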

Configuration changes

Enable the images pipeline:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

Set the storage path:

IMAGES_STORE = '/path/to/valid/dir'
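The setting above registers the stock ImagesPipeline. To run the MyImagesPipeline subclass from the example instead, the dictionary would point at that class. A sketch, assuming the class is saved in tutorial/pipelines.py (that module path is an assumption, not something the original states):

```python
# settings.py fragment; 'tutorial.pipelines.MyImagesPipeline' is an assumed import path.
ITEM_PIPELINES = {
    'tutorial.pipelines.MyImagesPipeline': 1,
}

# Directory where the downloaded images are written.
IMAGES_STORE = '/path/to/valid/dir'
```

For this to work, the item class also has to declare the image_urls and image_paths fields that the pipeline reads and writes, and the images pipeline needs the Pillow library to be installed.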

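Writing a main file

After building a crawler project with Scrapy, we usually define a main.py file so that the project can be launched from inside the editor and debugged with breakpoints.

Taking the jobbole spider as an example, main.py looks like this: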

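from scrapy.cmdline import execute
import sys
import os

# Add the project directory to the module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Launch the spider; same as running "scrapy crawl jobbole" in a terminal
execute(['scrapy', 'crawl', 'jobbole'])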
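The list passed to execute() mirrors the words of the command line. As a small illustration, the standard-library shlex module can turn a command string into that list form; `to_argv` is a helper name made up for this sketch.

```python
import shlex


def to_argv(command_line):
    """Split a shell-style command line into the list form execute() expects."""
    return shlex.split(command_line)


argv = to_argv("scrapy crawl jobbole")
print(argv)  # ['scrapy', 'crawl', 'jobbole']

# Extra command-line options become extra list elements, e.g. an output file.
argv_with_output = to_argv("scrapy crawl jobbole -o items.json")
print(argv_with_output)  # ['scrapy', 'crawl', 'jobbole', '-o', 'items.json']
```

Any option you would type after `scrapy crawl` in a terminal can be passed to execute() the same way.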

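Downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

Activating a downloader middleware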

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
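The setting only names the middleware class; the class itself is not shown. Below is a minimal sketch of what such a class might look like, assuming the long-standing process_request/process_response hook signatures (check the documentation for your Scrapy version). The User-Agent value is invented, and the stand-in request and response objects exist only so the sketch runs without Scrapy.

```python
from types import SimpleNamespace


class CustomDownloaderMiddleware:
    """Sketch of a downloader middleware; the header value is a made-up example."""

    def process_request(self, request, spider):
        # Runs for each outgoing request. Returning None lets processing continue.
        request.headers.setdefault("User-Agent", "tutorial-bot/0.1")
        return None

    def process_response(self, request, response, spider):
        # Runs for each downloaded response. Hands the response on unchanged.
        return response


# Stand-in objects so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(headers={})
fake_response = SimpleNamespace(status=200)

middleware = CustomDownloaderMiddleware()
middleware.process_request(fake_request, spider=None)
returned = middleware.process_response(fake_request, fake_response, spider=None)
print(fake_request.headers)  # {'User-Agent': 'tutorial-bot/0.1'}
```

Because the hooks see every request and response, this is a natural place for global changes such as default headers or proxies.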

The spider settings.py file

# -*- coding: utf-8 -*-

# Scrapy settings for tutorial project

# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:

#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tutorial'  # name of the crawler project

SPIDER_MODULES = ['tutorial.spiders']  # modules where Scrapy looks for spiders

NEWSPIDER_MODULE = 'tutorial.spiders'  # module where genspider creates new spiders

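# Crawl responsibly by identifying yourself (and your website) on the user-agent
# The User-Agent sent with requests; it is often set to a browser string so that requests look like they come from a browser
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether to obey robots.txt. The generated project template sets this to True (obey); crawlers often set it to False
ROBOTSTXT_OBEY = True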

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests the downloader may issue; the default is 16
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay: the wait, in seconds, between consecutive requests to the same website
DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
# Maximum number of concurrent requests allowed per domain; the default is 8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum number of concurrent requests allowed per IP; the default is 0, which means no limit
# Note: if this is non-zero, the per-domain setting above is ignored,
# and the concurrency limit applies per IP rather than per website
# Note: if this is non-zero, the download delay is also enforced per IP rather than per website
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# Cookie handling; the default is True. Not sending cookies is a common way to get around anti-scraping measures
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# The telnet console is an extension bundled with Scrapy; it can print log information and monitor the crawl status of a running spider
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers; these apply globally
DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# Spider middlewares; custom spider extensions can be enabled here
#SPIDER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Downloader middlewares; custom ones are activated here. The number is the priority: the smaller the number, the higher the priority
#DOWNLOADER_MIDDLEWARES = {
#    'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# Scrapy extensions
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipelines are activated here. The number is again the priority: the smaller the number, the higher the priority
ITEM_PIPELINES = {
}

# Directory where downloaded images are saved
IMAGES_STORE = '/path/to/valid/dir'

# The AutoThrottle extension makes up for the fixed delay above by adjusting Scrapy's download delay automatically
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.
# The initial download delay
# The maximum download delay to be set in case of high latencies
# Maximum download delay, in seconds, under high latency
AUTOTHROTTLE_MAX_DELAY = 60
# each remote server
# Enable showing throttling stats for every response received:
# Enables AutoThrottle debug mode, which shows stats for every response received so you can watch how the throttling parameters are adjusted in real time
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# Whether to enable the HTTP cache
HTTPCACHE_ENABLED = True
# Cache expiration time in seconds (0 means cached entries never expire)
HTTPCACHE_EXPIRATION_SECS = 0
# Directory where the cache is stored
HTTPCACHE_DIR = 'httpcache'
# HTTP status codes whose responses are not cached
HTTPCACHE_IGNORE_HTTP_CODES = []
# Cache storage backend
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
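The comments in the settings file say that a smaller number means a higher priority. For item pipelines this means items pass through the pipelines in ascending order of that number. A rough sketch of the ordering follows; it is not Scrapy's own code, and the three pipeline names and their numbers are invented for illustration.

```python
# Hypothetical pipeline paths and priorities, for illustration only.
ITEM_PIPELINES = {
    'tutorial.pipelines.SaveToDatabasePipeline': 800,
    'tutorial.pipelines.CleanFieldsPipeline': 300,
    'tutorial.pipelines.DropDuplicatesPipeline': 500,
}

# Sort by the priority value: the lowest number comes first.
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
for path in run_order:
    print(path)
```

With these numbers an item would be cleaned first, then checked for duplicates, and only then saved.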

Using Scrapy commands on Linux

Create a crawler project: scrapy startproject <project name>

Note: spider files should be created inside the spiders directory.

Create a spider file: scrapy genspider <spider name> <domain>

Run a spider: scrapy crawl <spider name>

Create a CrawlSpider spider file: scrapy genspider -t crawl jobbole xxxxx.cn

Debug in the shell without starting the spider: scrapy shell 'http://www.xxxxxxx.cn/xxx/xxx'

Installing Scrapy:

On Ubuntu: sudo pip3 install scrapy

If the installation fails, try installing the dependencies first:

sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
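After installing, it can help to confirm that the interpreter you run your project with can actually find the package. A small standard-library sketch; `is_installed` is a helper name made up here.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


print("scrapy installed:", is_installed("scrapy"))
```

If this prints False even though pip3 reported success, the package was probably installed for a different Python interpreter than the one running the script.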
