Scrapy Experiment Report: Crawling the Douban Books Top 250

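I. Experiment Objective
Target: Douban Books Top 250 (https://book.douban.com/top250)
Task: write a crawler with the Scrapy framework and collect the book information from the Douban Books Top 250 list using both XPath and CSS selectors, including each book's title, author, one-line review, rating, and cover image URL.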

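II. Experiment Procedure

1. Designing the crawl flow

The crawl involves only one level of links: pagination. On inspection, the page URLs follow a regular pattern in which the start parameter increases in multiples of 25:
"https://book.douban.com/top250?start=0"
"https://book.douban.com/top250?start=25"
The page URL can therefore be written as follows, where a is the zero-based page index: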

url = 'https://book.douban.com/top250?start={}'.format(a*25)
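
As a quick check of this pattern, the sketch below builds the full list of page URLs. It assumes the list holds exactly 250 books at 25 per page, which is what the list name and the observed step size suggest; BASE_URL, PAGE_SIZE, TOTAL_BOOKS and page_urls are names introduced here for illustration only.

```python
# Build the page URLs for the Top 250 list.
# Assumption: 250 books at 25 per page, so the offsets are 0, 25, ..., 225.
BASE_URL = 'https://book.douban.com/top250?start={}'
PAGE_SIZE = 25
TOTAL_BOOKS = 250

page_urls = [BASE_URL.format(offset)
             for offset in range(0, TOTAL_BOOKS, PAGE_SIZE)]

for url in page_urls:
    print(url)
```

Stepping the range by the page size gives the same offsets as multiplying a page index by 25.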

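2. Analyzing the paths of the target elements

2.1 XPath method
Approach: use the browser's developer tools to locate the element, then copy its XPath directly.

[Figure: 整本書.png (the element that contains a whole book)]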

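Path of each book: //*[@id="content"]/div/div[1]/div/table
Let book = response.xpath('//*[@id="content"]/div/div[1]/div/table'); the remaining fields can then be expressed relative to it:
Title path: book.xpath("./tr/td[2]/div[1]/a/@title")
One-line review path: book.xpath("./tr/td[2]/p[2]/span/text()")
The paths for the other fields follow the same pattern and are not listed in detail here.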
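
To illustrate how the relative paths work, here is a minimal sketch that uses only the standard library. It is a stand-in, not Scrapy: xml.etree.ElementTree supports just a subset of XPath, so attributes and text are read with .get() and .text rather than with @title or text() steps. The SAMPLE markup is a made-up, heavily simplified imitation of the page structure, with invented values.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one <table> per book (values invented).
SAMPLE = """
<div id="content">
  <table>
    <tr>
      <td><a><img src="https://example.com/cover-a.jpg"/></a></td>
      <td>
        <div><a title="Sample Book A">Sample Book A</a></div>
        <p>Author A</p>
        <div><span/><span>9.1</span><span>(100 ratings)</span></div>
        <p><span>Quote A</span></p>
      </td>
    </tr>
  </table>
  <table>
    <tr>
      <td><a><img src="https://example.com/cover-b.jpg"/></a></td>
      <td>
        <div><a title="Sample Book B">Sample Book B</a></div>
        <p>Author B</p>
        <div><span/><span>8.7</span><span>(50 ratings)</span></div>
        <p><span>Quote B</span></p>
      </td>
    </tr>
  </table>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
books = []
# Select every book first, then resolve each field relative to that book.
for book in root.findall('./table'):
    books.append({
        'title': book.find('./tr/td[2]/div[1]/a').get('title'),
        'score': book.find('./tr/td[2]/div[2]/span[2]').text,
        'quote': book.find('./tr/td[2]/p[2]/span').text,
        'img': book.find('./tr/td[1]/a/img').get('src'),
    })

print(books)
```

Because every field path starts from the current book element, the title, score and quote of one book cannot be mixed up with those of another.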

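2.2 CSS method
Approach: use the browser's developer tools to locate the element, then read the chain of CSS nodes shown in the bar beneath it.

[Figure: css.png]

Path of each book: 'tr.item'
Let book = response.css('tr.item'); the remaining fields can then be expressed relative to it:
Title path: book.css('div.pl2 a::text')
Image path: book.css('a.nbg img').xpath('@src')
The other paths are not listed in detail here.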

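3. Writing the spider file douban.py

First create the Scrapy project, then create a new spider file douban.py in the project's spiders folder and start writing the crawler code.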

scrapy startproject douban

3.1 XPath version

# -*- coding: utf-8 -*-

import scrapy
from douban.items import DoubanItem


class geyan(scrapy.Spider):
    # The name passed to "scrapy crawl"; set to match the command used in step 5
    name = "douban"

    def start_requests(self):
        # 10 pages of 25 books each: start=0, 25, ..., 225
        for a in range(10):
            url = 'https://book.douban.com/top250?start={}'.format(a*25)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        # Each matching <table> holds one book
        for book in response.xpath('//*[@id="content"]/div/div[1]/div/table'):
            item = DoubanItem()
            # default='' keeps the chained string calls from failing when a field is missing
            item['title'] = book.xpath("./tr/td[2]/div[1]/a/@title").extract_first(default='').replace('\n', '').strip()
            item['score'] = book.xpath("./tr/td[2]/div[2]/span[2]/text()").extract_first(default='').replace('\n', '').strip()
            item['scrible'] = book.xpath("./tr/td[2]/p[2]/span/text()").extract_first(default='').replace('\n', '').strip()
            item['num'] = book.xpath("./tr/td[2]/div[2]/span[3]/text()").extract_first(default='').strip("(").strip(")").replace('\n', '').strip()
            item['img'] = book.xpath("./tr/td[1]/a/img/@src").extract_first(default='').replace('\n', '').strip()
            items.append(item)

        print(items)
        return items

3.2 CSS version
Apart from the extraction paths inside the parse function, the rest of the code is the same as in the XPath version.

    def parse(self, response):
        items = []
        # Each tr.item row holds one book
        for book in response.css('tr.item'):
            item = DoubanItem()
            # default='' keeps the chained string calls from failing when a field is missing
            item['title'] = book.css('div.pl2 a::text').extract_first(default='').replace('\n', '').strip()
            item['score'] = book.css('div.star.clearfix span.rating_nums::text').extract_first(default='').replace('\n', '').strip()
            item['scrible'] = book.css('p.quote span.inq::text').extract_first(default='').replace('\n', '').strip()
            item['num'] = book.css('div.star.clearfix span.pl::text').extract_first(default='').strip("(").strip(")").replace('\n', '').strip()
            item['img'] = book.css('a.nbg img').xpath('@src').extract_first(default='').replace('\n', '').strip()
            items.append(item)

        print(items)
        return items

4. Modifying the project files settings.py and pipelines.py

4.1 Modifying settings.py
Uncomment the ITEM_PIPELINES entry in the file and set the user agent.

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

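4.2 Modifying pipelines.py
Define a custom class in the file. Scrapy only runs pipeline classes that are listed in ITEM_PIPELINES, so JsonPipeline needs its own entry there (for example 'douban.pipelines.JsonPipeline') to take effect.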

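import codecs
import json


class JsonPipeline(object):
    def __init__(self):
        # Output file, one JSON object per line
        self.file = codecs.open('daa.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on each pipeline when the spider finishes
        self.file.close()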
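
The pipeline life cycle can be tried outside Scrapy, because a pipeline is an ordinary class whose open_spider, process_item and close_spider methods are called by the framework. The following is a minimal sketch under that assumption; JsonLinesPipeline, its path argument and the sample item are hypothetical names and values introduced only for this illustration.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write each item as one JSON object per line."""

    def __init__(self, path='data.jsonl'):  # illustrative default path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()


# Drive the hooks by hand, in the order Scrapy would call them.
out_path = os.path.join(tempfile.mkdtemp(), 'data.jsonl')
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': '示例书名', 'score': '9.0'}, spider=None)  # made-up item
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file in open_spider rather than in the constructor is a design choice in this sketch: the file is only created when a crawl actually starts.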

5. Running the code

scrapy crawl douban

Part of the output is shown in the figure:

[Figure: 結果報告.png (crawl output)]

6. Storing the data

scrapy crawl douban -o douban.json

[Figure: 結果.png (the stored results)]

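III. Problems Encountered

1. 403 Forbidden: the crawler is denied access to the page

Solution: change the user agent in settings.py so that the crawler identifies itself as a browser.
To find Chrome's user agent, enter chrome://version in the address bar and look at the User Agent row.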

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

Reference: http://30daydo.com/article/245
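
At the HTTP level, the setting simply changes the User-Agent header sent with each request. The sketch below shows this with the standard library instead of Scrapy; it builds a request object without sending it, so no network access is involved. The variable names are illustrative.

```python
from urllib.request import Request

USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')

# Build (but do not send) a request that carries the browser-like header.
request = Request('https://book.douban.com/top250?start=0',
                  headers={'User-Agent': USER_AGENT})

# urllib stores header names capitalized as 'User-agent'.
print(request.get_header('User-agent'))
```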

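2. IndexError: list index out of range

Solution: the extraction paths originally ended with .extract()[0], which raises this error whenever a path matches nothing. Changing it to extract_first() fixes the problem, because extract_first() returns None when there is no match instead of raising.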

item['title']=book.css('div.pl2 a::text').extract_first()
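
The difference can be reproduced with a plain list. This is an analogue in ordinary Python, not Scrapy's implementation; first_or_default is a hypothetical helper that mirrors how extract_first() and its default argument behave.

```python
def first_or_default(values, default=None):
    """Return the first element, or `default` when the list is empty."""
    return values[0] if values else default


matches = []  # stands in for a selector result with no matches

# Indexing an empty result raises the error from the report.
try:
    matches[0]
except IndexError as exc:
    error = str(exc)

# The helper returns None instead, like extract_first().
title = first_or_default(matches)

# A string default makes follow-up calls such as .strip() safe.
safe_title = first_or_default(matches, default='').strip()

print(error, title, repr(safe_title))
```

Note that calling a string method on the None result would fail with an AttributeError, which is why a default of '' is useful when the value is cleaned immediately afterwards.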

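3. Result data processing issues

3.1 The collected data contains a lot of whitespace, line breaks, and parentheses
For example, the "num" field contains parentheses, spaces, and line breaks; these can be removed with replace('\n', '') and strip().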

item['num']=book.xpath("./tr/td[2]/div[2]/span[3]/text()").extract_first().strip("(").strip(")").replace('\n', '').strip()
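
An alternative to chaining strip() and replace() is to pull the digits out directly with a regular expression. The sketch below assumes the raw text looks like a count wrapped in parentheses and whitespace; the exact sample string, including the 人评价 suffix and the number, is made up, and parse_rating_count is a hypothetical helper name.

```python
import re


def parse_rating_count(raw):
    """Return the first run of digits in `raw` as an int, or None."""
    if raw is None:
        return None
    match = re.search(r'\d+', raw)
    return int(match.group()) if match else None


# Made-up sample imitating the raw field: parentheses, newlines, padding.
sample = '(\n                    12345人评价\n                )'
print(parse_rating_count(sample))
```

Returning an integer also makes the field easier to sort or aggregate later than a cleaned string would be.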

3.2 Formatting the collected data as JSON
