Storing Scrapy-Scraped Data in a MySQL Database

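Once Scrapy has scraped page data, saving it to a database is handled by pipelines. Here is how the official documentation describes them.

After an Item has been collected in a Spider, it is passed to the Item Pipeline, where a series of components process it in a defined order.

Typical uses of item pipelines include:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database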
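
As a rough illustration of the validation and duplicate-check uses listed above, here is a minimal sketch of a pipeline. The class name ValidateAndDedupPipeline and the choice of url as the uniqueness key are assumptions made for this example and are not part of the project described in this article. The fallback DropItem class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in used only when Scrapy is not installed."""


class ValidateAndDedupPipeline(object):
    """Hypothetical pipeline: drops items with no title or a repeated URL."""

    def __init__(self):
        # URLs of the items that have already passed through this pipeline
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Validation: the item must carry a non-empty title
        if not item.get('title'):
            raise DropItem('missing title')
        # Duplicate check: assumes url identifies an article uniquely
        url = item.get('url')
        if url in self.seen_urls:
            raise DropItem('duplicate item: %s' % url)
        self.seen_urls.add(url)
        return item
```

Raising DropItem stops the item from reaching later pipelines, so a pipeline like this would normally be given a lower order number in ITEM_PIPELINES than the one that writes to the database.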

1. Parsing page data: the Spider class

This article uses the Jianshu "Reading" (讀書) collection as an example and scrapes the data for every article included in it: http://www.itdecent.cn/collection/yD9GAd
Parse the page data you need, wrap it in an Item object, and yield it.


            item = JsArticleItem()

            # extract() returns a list of the matching strings
            author = info.xpath('p/a/text()').extract()
            pubday = info.xpath('p/span/@data-shared-at').extract()
            author_url = info.xpath('p/a/@href').extract()
            title = info.xpath('h4/a/text()').extract()
            url = info.xpath('h4/a/@href').extract()

            # Keep only the digits of each counter; ''.join() turns the
            # filtered characters back into a string
            reads = info.xpath('div/a[1]/text()').extract()
            reads = ''.join(filter(str.isdigit, str(reads[0])))

            comments = info.xpath('div/a[2]/text()').extract()
            comments = ''.join(filter(str.isdigit, str(comments[0])))

            likes = info.xpath('div/span[1]/text()').extract()
            likes = ''.join(filter(str.isdigit, str(likes[0])))

            rewards = info.xpath('div/span[2]/text()').extract()
            # Check whether the article has any reward data
            if len(rewards) == 1:
                rds = int(''.join(filter(str.isdigit, str(rewards[0]))))
            else:
                rds = 0

            item['author'] = author
            item['url'] = 'http://www.itdecent.cn' + url[0]
            item['reads'] = reads
            item['title'] = title
            item['comments'] = comments
            item['likes'] = likes
            item['rewards'] = rds
            item['pubday'] = pubday

            yield item
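
The digit-filtering step is repeated for reads, comments, likes, and rewards. A minimal sketch of how it could be factored into one helper follows; the function name digits_to_int and its default argument are hypothetical and do not appear in the original spider.

```python
def digits_to_int(text, default=0):
    """Return the ASCII digits found in text as an int, or default if there are none."""
    digits = ''.join(ch for ch in text if ch in '0123456789')
    return int(digits) if digits else default
```

In the spider, a call such as digits_to_int(reads[0]) could then replace each filter expression, and a missing count would fall back to 0 instead of raising an error.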

The Item class is defined in items.py:

from scrapy import Item, Field


class JsArticleItem(Item):

    author = Field()
    url = Field()
    title = Field()
    reads = Field()
    comments = Field()
    likes = Field()
    rewards = Field()
    pubday = Field()

2. Defining a class in pipelines.py to work with the database

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class WebcrawlerScrapyPipeline(object):
    '''Pipeline class that saves items to the database.
       1. Configure it in settings.py.
       2. When your spider yields an item, it runs automatically.'''

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the usual instance method.
           2. A class method's first parameter is cls (short for class, the class itself); an instance method's first parameter is self, an instance of the class.
           3. It can be called on the class, like C.f(), similar to a static method in Java.'''
        # Read the database parameters configured in settings
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the encoding, otherwise Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)  # ** expands the dict into keyword arguments: host=xxx, db=yyy, ...
        return cls(dbpool)  # hands dbpool to the class, so it is available on self

    # Called by Scrapy for every item
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)  # run the insert
        query.addErrback(self._handle_error, item, spider)  # attach the error handler
        return item

    # Write to the database
    # The SQL statement lives here
    def _conditional_insert(self, tx, item):
        sql = "insert into jsbooks(author,title,url,pubday,comments,likes,rewards,views) values(%s,%s,%s,%s,%s,%s,%s,%s)"
        params = (item['author'], item['title'], item['url'], item['pubday'], item['comments'], item['likes'], item['rewards'], item['reads'])
        tx.execute(sql, params)

    # Error handler
    def _handle_error(self, failure, item, spider):
        print(failure)
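
The insert statement expects a jsbooks table to exist already, but the article does not show its definition. The sketch below tries the same parameterized insert against an in-memory SQLite database from Python's standard library, used here only as a stand-in so that it runs without a MySQL server. The column types and the sample item are assumptions, and sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3

# Assumed schema: the real jsbooks table is not shown in the article.
SCHEMA = """
create table jsbooks (
    author text, title text, url text, pubday text,
    comments integer, likes integer, rewards integer, views integer
)
"""


def insert_article(conn, item):
    # Same column order as the pipeline's SQL; item['reads'] goes into views
    sql = ("insert into jsbooks(author,title,url,pubday,comments,likes,rewards,views) "
           "values(?,?,?,?,?,?,?,?)")
    params = (item['author'], item['title'], item['url'], item['pubday'],
              item['comments'], item['likes'], item['rewards'], item['reads'])
    conn.execute(sql, params)


conn = sqlite3.connect(':memory:')
conn.execute(SCHEMA)

# Made-up sample item, for illustration only
sample = {
    'author': 'someone', 'title': 'Example title',
    'url': 'http://example.com/p/1', 'pubday': '2016-01-01',
    'comments': 2, 'likes': 3, 'rewards': 0, 'reads': 10,
}
insert_article(conn, sample)
row = conn.execute('select title, views from jsbooks').fetchone()
```

Passing the values as parameters instead of formatting them into the SQL string lets the driver handle quoting, which is also what tx.execute(sql, params) does in the pipeline.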

3. Pointing settings.py at the database class and enabling the pipeline component

ITEM_PIPELINES = {
    'jsuser.pipelines.WebcrawlerScrapyPipeline': 300,  # save to the MySQL database
}

# MySQL database configuration
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'testdb'         # database name; change as needed
MYSQL_USER = 'root'             # database user; change as needed
MYSQL_PASSWD = '1234567'        # database password; change as needed

MYSQL_PORT = 3306               # database port, used in dbhelper
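
Note that from_settings in the pipeline above reads the host, database name, user, and password, but not MYSQL_PORT, so the connection uses the driver's default port. A minimal sketch of collecting the parameters with the port included follows. The helper name build_dbparams is hypothetical, and a plain dict stands in for Scrapy's settings object here.

```python
def build_dbparams(settings):
    """Collect MySQL connection keyword arguments from a settings-like mapping."""
    return dict(
        host=settings['MYSQL_HOST'],
        port=settings.get('MYSQL_PORT', 3306),  # fall back to MySQL's default port
        db=settings['MYSQL_DBNAME'],
        user=settings['MYSQL_USER'],
        passwd=settings['MYSQL_PASSWD'],
        charset='utf8',
    )


# A plain dict standing in for the values defined in settings.py
example_settings = {
    'MYSQL_HOST': '127.0.0.1',
    'MYSQL_DBNAME': 'testdb',
    'MYSQL_USER': 'root',
    'MYSQL_PASSWD': '1234567',
    'MYSQL_PORT': 3306,
}
dbparams = build_dbparams(example_settings)
```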

Other settings: disguise requests as coming from a browser and add a download delay to avoid being banned

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
ROBOTSTXT_OBEY=False
DOWNLOAD_DELAY = 0.25 # 250 ms of delay

Run the spider by calling cmdline.execute("scrapy crawl zhanti".split()), and that's it!

Storing Scrapy data in MongoDB appears to be more convenient and takes less code.
