
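With the Scrapy development environment set up (if you ran into problems during configuration,
see the previous post, "Setting up a Scrapy crawler development environment",
or leave a comment on the blog), we can walk through a crawling example.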

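We will crawl the threads on the Tokyo board of a forum. The site requires login before the
large images attached to posts can be viewed, which makes it a good fit for demonstrating the login flow.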

1. Define the item

We need to store the title, the post details, the URL of the details page, and the list of images, so the item is defined as follows:

import scrapy


class RentItem(scrapy.Item):
    """Item holding one scraped post."""

    title = scrapy.Field()          # title
    rent_desc = scrapy.Field()      # description
    url = scrapy.Field()            # URL of the details page
    pic_list = scrapy.Field()       # list of images

2. Simulate login with FormRequest

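First we analyze the page to find the login form and the data that has to be submitted (inspecting the requests with Fiddler or Firebug is enough).
Then we use FormRequest.from_response(), provided by Scrapy, to simulate the page's login flow. The main code is: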

    # Login required: simulate it with FormRequest.from_response
    if "id='lsform'" in response.body:
        logging.info("in parse, need to login, url: {0}".format(response.url))
        form_data = {
            "handlekey": "ls",
            "quickforward": "yes",
            "username": "loginname",
            "password": "passwd"
        }
        request = FormRequest.from_response(
                response=response,
                headers=self.headers,
                formxpath="//form[contains(@id, 'lsform')]",
                formdata=form_data,
                callback=self.parse_list
                )
    else:
        logging.info("in parse, NOT need to login, url: {0}"
                     .format(response.url))
        request = Request(url=response.url,
                          headers=self.headers,
                          callback=self.parse_list,
                          )

If the requested page requires login, the form is located via XPath and the login data is submitted as parameters.
The crawling logic that runs after a successful login lives in the method passed as callback.
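
To make the role of formdata more concrete: FormRequest.from_response() pre-populates the request with the field values found in the HTML form and then applies the values passed in formdata on top. The sketch below imitates that merge with the standard library only, so it runs without Scrapy. It is an illustration of the idea, not Scrapy's implementation; the sample HTML, the hidden field formhash, and the helper names are assumptions made up for this example.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect name/value pairs of <input> tags inside one target form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Enter the form whose id contains form_id (like contains(@id, ...))
        if tag == "form" and self.form_id in (attrs.get("id") or ""):
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_login_payload(html, form_id, form_data):
    """Start from the form's own fields, then override with form_data."""
    collector = FormInputCollector(form_id)
    collector.feed(html)
    payload = dict(collector.fields)
    payload.update(form_data)
    return payload


# Hypothetical login form; the real page will differ.
SAMPLE_HTML = """
<form id="lsform" method="post" action="/login">
  <input type="hidden" name="formhash" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password" value="">
</form>
"""

payload = build_login_payload(
    SAMPLE_HTML,
    "lsform",
    {"username": "loginname", "password": "passwd", "quickforward": "yes"},
)
print(payload)
```

The hidden field survives in the payload even though it was never listed by hand, which is the main convenience of building the login request from the response rather than from scratch.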

3. Extract page data with XPath

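Scrapy uses XPath or CSS expressions to analyze the page structure, and an lxml-based Selector
extracts the data. Either XPath or CSS works, and BeautifulSoup is also very convenient for
parsing HTML/XML files. Here we use XPath; see the
zvon XPath 1.0 Tutorial, whose examples are plentiful and easy to follow.
After reading that introductory tutorial, most common crawling needs can be met. A few important points, briefly: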

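/ denotes an absolute path, matching from the root node; ./ denotes a path relative to the current node; // matches nodes at any depth.
* is a wildcard that matches any element.
[] on a node is a predicate: a number n matches the nth element, @ matches an attribute, and functions can be used too,
such as the common contains() for substring matching and starts-with() for prefix matching.
When reading a node's value, text() returns only the text directly under that node and not the text of its child nodes,
whereas . (as in string(.)) includes the text of all child nodes, for example: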

<div>Welcome to <strong>Chengdu</strong></div>

sel.xpath("//div/text()").extract()     # ['Welcome to ']
sel.xpath("//div").xpath("string(.)").extract()     # ['Welcome to Chengdu']
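
The points above can be tried out without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (no contains() or starts-with(), for instance), so it is a stand-in for Scrapy's Selector rather than an equivalent. The sample document is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, extended from the <div> example above.
DOC = """
<root>
  <div class="intro">Welcome to <strong>Chengdu</strong></div>
  <ul>
    <li id="a">first</li>
    <li id="b">second</li>
  </ul>
</root>
"""

root = ET.fromstring(DOC)

# .// searches all descendants of the current node
div = root.find(".//div")

# .text holds only the text before the first child element,
# comparable to what text() returns for this <div>
own_text = div.text

# itertext() walks the node and all its descendants, similar to string(.)
full_text = "".join(div.itertext())

# [n] selects the n-th matching element (1-based)
second_li = root.find("./ul/li[2]")

# [@attr='value'] filters by attribute
li_b = root.find(".//li[@id='b']")

# * matches any child element
child_tags = [el.tag for el in root.findall("./*")]

print(repr(own_text), repr(full_text), second_li.text, li_b.text, child_tags)
```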

4. Use different pipelines for different spiders

A project may have many spiders. Different spiders crawl data with different structures, and the storage formats differ as well,
so we define several pipelines and let each spider use its own.

First we define a decorator: if the spider's pipeline attribute contains the pipeline whose method carries the decorator,
that pipeline runs; otherwise it is skipped:

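import functools
import logging


def check_spider_pipeline(process_item_method):
    """Decorator for a pipeline's process_item method.

    :param process_item_method: the process_item method being wrapped
    :return: a wrapper that runs the method only for registered spiders
    """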
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):

        # message template for debugging
        msg = "{{0}} {0} pipeline step".format(self.__class__.__name__)

        # if class is in the spider's pipeline, then use the
        # process_item method normally.
        if self.__class__ in spider.pipeline:
            logging.info(msg.format("executing"))
            return process_item_method(self, item, spider)

        # otherwise, just return the untouched item (skip this step in
        # the pipeline)
        else:
            logging.info(msg.format("skipping"))
            return item

    return wrapper

Next, add this decorator to the process_item() callback of every pipeline class:

@check_spider_pipeline
def process_item(self, item, spider):

Finally, add a pipeline attribute to the spider class (a set here) that lists all pipelines belonging to that spider, for example:

# which pipelines should handle this spider's items
pipeline = set([
    pipelines.RentMySQLPipeline,
])
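
To see the three pieces working together, here is a minimal, self-contained sketch that runs without Scrapy. The decorator is a condensed version of the one above. RentSpider, JsonFilePipeline, and run_pipelines are stand-ins invented for this example (run_pipelines plays the role of Scrapy's engine), and the handled_by list only exists to make the effect visible.

```python
import functools
import logging


def check_spider_pipeline(process_item_method):
    """Run process_item only if the spider registered this pipeline."""
    @functools.wraps(process_item_method)
    def wrapper(self, item, spider):
        if self.__class__ in spider.pipeline:
            logging.info("executing %s", self.__class__.__name__)
            return process_item_method(self, item, spider)
        logging.info("skipping %s", self.__class__.__name__)
        return item  # pass the item through untouched
    return wrapper


class RentMySQLPipeline:
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["handled_by"].append("mysql")
        return item


class JsonFilePipeline:  # hypothetical second pipeline
    @check_spider_pipeline
    def process_item(self, item, spider):
        item["handled_by"].append("json")
        return item


class RentSpider:  # stand-in for a scrapy.Spider subclass
    name = "rent"
    pipeline = {RentMySQLPipeline}


def run_pipelines(item, spider, pipelines):
    """Stand-in for the engine: pass the item through every pipeline."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


result = run_pipelines(
    {"title": "demo", "handled_by": []},
    RentSpider(),
    [RentMySQLPipeline(), JsonFilePipeline()],
)
print(result["handled_by"])  # prints ['mysql']
```

Both pipelines are called, but only the one registered on the spider touches the item.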

5. Save the crawled data to MySQL

The storage logic is implemented in the pipeline, and twisted adbapi can be used to talk to the database through a thread pool.
First, load the MySQL configuration from settings:

@classmethod
def from_settings(cls, settings):
    """Load the MySQL configuration."""

    dbargs = dict(
        host=settings["MYSQL_HOST"],
        db=settings["MYSQL_DBNAME"],
        user=settings["MYSQL_USER"],
        passwd=settings["MYSQL_PASSWD"],
        charset="utf8",
        use_unicode=True
    )

    dbpool = adbapi.ConnectionPool("MySQLdb", **dbargs)
    return cls(dbpool)

Then, in the process_item callback, use dbpool to save the data to MySQL:

@check_spider_pipeline
def process_item(self, item, spider):
    """Pipeline callback.

    The decorator ties pipelines to spiders: this pipeline only runs for
    spiders that have registered it.
    """

    # run the db query in the thread pool, i.e. in a separate thread
    deferred = self.dbpool.runInteraction(self._do_upsert, item, spider)
    deferred.addErrback(self._handle_error, item, spider)
    # once _do_upsert has finished, run the following callback
    deferred.addCallback(self._get_id_by_guid)

    # at the end, return the item in case of success or failure
    # deferred.addBoth(lambda _: item)
    # return the deferred instead of the item. This makes the engine
    # process the next item (according to CONCURRENT_ITEMS setting) after this
    # operation (deferred) has finished.
    # NOTE: time.sleep() blocks the Twisted reactor thread for 10 seconds.
    time.sleep(10)
    return deferred
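
The post does not show _do_upsert itself. As a rough idea of what such a method might do, the sketch below implements an insert-or-update keyed on the item's url. Everything in it is an assumption: the table name rent, its columns, and the function name are made up, and sqlite3 is swapped in for MySQLdb so the example runs without a database server. runInteraction passes a cursor-like transaction object as the first argument, which is why the function takes a cursor. With MySQLdb the placeholders would be %s instead of ?.

```python
import sqlite3


def do_upsert(cursor, item):
    """Insert the item, or update the stored row if its url already exists.

    Returns the id of the affected row.
    """
    cursor.execute("SELECT id FROM rent WHERE url = ?", (item["url"],))
    row = cursor.fetchone()
    if row:
        cursor.execute(
            "UPDATE rent SET title = ?, rent_desc = ? WHERE id = ?",
            (item["title"], item["rent_desc"], row[0]),
        )
        return row[0]
    cursor.execute(
        "INSERT INTO rent (title, rent_desc, url) VALUES (?, ?, ?)",
        (item["title"], item["rent_desc"], item["url"]),
    )
    return cursor.lastrowid


# In-memory database with a hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE rent ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, rent_desc TEXT, url TEXT UNIQUE)"
)

# The same url twice: the second call updates instead of inserting.
first_id = do_upsert(
    cur, {"title": "old", "rent_desc": "d", "url": "http://example.com/1"}
)
second_id = do_upsert(
    cur, {"title": "new", "rent_desc": "d", "url": "http://example.com/1"}
)
conn.commit()

cur.execute("SELECT COUNT(*), MAX(title) FROM rent")
row_count, stored_title = cur.fetchone()
print(first_id, second_id, row_count, stored_title)
```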

6. Save images to Qiniu Cloud

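See Qiniu's Python API for details. One thing to note: when uploading images, avoid BucketManager's
bucket.fetch() interface, because uploads through it fail frequently. put_data() is recommended instead, for example: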

def upload(self, file_data, key):
    """Upload a file from binary data.

    :param file_data:   binary data
    :param key:         key
    :return:            True on success, False otherwise
    """
    try:
        token = self.auth.upload_token(QINIU_DEFAULT_BUCKET)
        ret, info = put_data(token, key, file_data)
    except Exception as e:
        logging.error("upload error, key: {0}, exception: {1}"
                      .format(key, e))
        # info is undefined if the upload raised, so stop here
        return False

    if info.status_code == 200:
        logging.info("upload data to qiniu ok, key: {0}".format(key))
        return True
    else:
        logging.error("upload data to qiniu error, key: {0}".format(key))
        return False

7. Deploy the project

Deployment can be done with scrapyd and scrapyd-client.
Install them first:

$ pip install scrapyd
$ pip install scrapyd-client

Start scrapyd:

$ sudo scrapyd &

Edit the deployment configuration file scrapy.cfg:

[settings]
default = scrapy_start.settings

[deploy:dev]
url = http://localhost:6800/
project = scrapy_start

Here dev is the target and scrapy_start is the project. Then deploy:

$ scrapyd-deploy dev -p scrapy_start
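
Once the project is deployed, a crawl can be started through scrapyd's JSON API by posting to schedule.json. The sketch below only builds that request with the standard library. The spider name rent is an assumption (the post does not name the spider), and the actual call is left commented out because it needs a running scrapyd.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build a POST request for scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(urljoin(base_url, "schedule.json"), data=data, method="POST")


# URL and project come from the scrapy.cfg above; "rent" is a hypothetical
# spider name.
req = build_schedule_request("http://localhost:6800/", "scrapy_start", "rent")
print(req.full_url)

# Sending the request requires a running scrapyd:
# with urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```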

OK, those are the key points of this introductory example. The project source code is on gitlab.

References

Scrapy 1.0 documentation
XPath 1.0 Tutorial
How can I use different pipelines for different spiders in a single Scrapy project
