
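Step 1:

Requirement: scrape the title, author, dynasty, and content of each poem starting from https://www.gushiwen.cn/default_1.aspx, follow the pagination through all pages, and save the results. Two ways of saving are shown below: first as CSV, then as TXT.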

Step 2:

Create the Scrapy project. In the PyCharm Terminal:
cd D:\myproject\20210530
scrapy startproject gsw

Step 3:

Create the spider file:
cd gsw
scrapy genspider gs gushiwen.cn

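Step 4

Analyze the page structure:
All the poems on a page sit inside the div with class="left", and each individual poem sits inside a div with class="sons". Scrapy can therefore extract the data with XPath: tags = response.xpath('//div[@class="left"]/div[@class="sons"]') returns the 10 poems on a page, and iterating over tags gives each poem in turn.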
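The nesting described above can be tried out without Scrapy. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) as a stand-in for Scrapy selectors. The SAMPLE markup is a hypothetical, simplified page written for this illustration, not the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one listing page (not the real markup).
SAMPLE = """
<html><body>
  <div class="left">
    <div class="sons">
      <div class="cont">
        <p><a><b>Poem A</b></a></p>
        <p class="source"><a>Author A</a><a>Dynasty A</a></p>
        <div class="contson">First line. Second line.</div>
      </div>
    </div>
    <div class="sons">
      <div class="cont">
        <p><a><b>Poem B</b></a></p>
        <p class="source"><a>Author B</a><a>Dynasty B</a></p>
        <div class="contson">Another line.</div>
      </div>
    </div>
  </div>
  <div class="right"><div class="sons">sidebar, not a poem</div></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same shape as the Scrapy expression: only "sons" divs directly under "left".
poems = root.findall(".//div[@class='left']/div[@class='sons']")

records = []
for poem in poems:
    # Relative paths start from the current poem, like './...' in the spider.
    title = poem.findtext("./div[@class='cont']/p/a/b")
    source = [a.text for a in poem.findall(".//p[@class='source']/a")]
    records.append((title, source))
```

The "sons" div in the sidebar is not selected, because the path requires a parent with class="left".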
Page URL analysis:
https://www.gushiwen.cn/default_1.aspx
https://www.gushiwen.cn/default_2.aspx
https://www.gushiwen.cn/default_3.aspx
These are page 1, page 2, and page 3 respectively, and so on for later pages.
The pages are static.
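Because the page number is the only part of the URL that changes, the address of any page can be built from a template. A small sketch, where BASE and page_url are names chosen for this illustration:

```python
# URL template taken from the three addresses listed above.
BASE = "https://www.gushiwen.cn/default_{}.aspx"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE.format(page)


first_three = [page_url(n) for n in range(1, 4)]
```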

Step 5

settings.py configuration

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
LOG_LEVEL = 'WARNING'
ITEM_PIPELINES = {
    'gsw.pipelines.GswPipeline': 300,
}

Step 6

items.py
Declare the fields. Every field the spider fills in must be defined in items.py; assigning to a field that is not declared there raises a KeyError.

import scrapy
class GswItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    dynasty = scrapy.Field()
    content = scrapy.Field()

Step 7

The spider uses the GswItem(scrapy.Item) class from items.py: <class 'gsw.items.GswItem'>
To make the import resolve in PyCharm, right-click the new gsw project folder and choose Mark Directory as > Sources Root.
Import the class: from gsw.items import GswItem
Instantiate it (either of the two ways works):
Option 1:
item = GswItem(title=title, author=author, dynasty=dynasty, content=content)
Option 2:
item = GswItem()
item['title'] = title
item['author'] = author
item['dynasty'] = dynasty
item['content'] = content

Step 8

Spider main program
gs.py

Data extraction: get() returns the first match as a string (or None when nothing matches), and getall() returns every match as a list.

    for tag in tags:
        title = tag.xpath('./div[@class="cont"]/p/a/b/text()').get()
        source = tag.xpath('.//p[@class="source"]/a/text()').getall()
        try:
            author = source[0]
            dynasty = source[1]
        except IndexError:
            # Skip entries that lack an author or dynasty link
            continue
        contents = tag.xpath('.//div[@class="contson"]/text()').getall()
        content = ''.join(contents).strip()

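List to string: ''.join() (or ' '.join() to put a space between the pieces)
String to list: .split()
When the crawl spans more than one domain, list each of them, using the broadest domain that covers the pages you need, for example allowed_domains = ['gushiwen.org','gushiwen.cn']. Several entries are allowed; keep the domain strings themselves free of spaces.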
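For the join and split notes above, here is a short sketch of what the two calls do to the text nodes of one poem. The contents list is hypothetical sample data shaped like the result of getall().

```python
# Hypothetical text nodes, shaped like what getall() returns for one poem.
contents = ["\n床前明月光，", "疑是地上霜。\n", "举头望明月，", "低头思故乡。\n"]

# List -> string: join with an empty separator, then trim the outer whitespace.
joined = "".join(contents).strip()

# String -> list: split() with no argument breaks on any run of whitespace.
lines = joined.split()

# Joining with a space instead keeps a visible gap between the pieces.
spaced = " ".join(piece.strip() for piece in contents)
```

The spider uses the empty separator because the poem text should not gain extra spaces between fragments.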

allowed_domains = ['gushiwen.cn','gushiwen.org']

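How to handle an empty list: either check that the list is non-empty before indexing it (as in the Douban example), or wrap the access in a try statement.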
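Both options can be sketched side by side. The function names below are hypothetical helpers written for this illustration; source stands for the list returned by getall().

```python
def split_source_checked(source):
    """Option 1: test the length first; return None when data is missing."""
    if len(source) >= 2:
        return source[0], source[1]
    return None


def split_source_try(source):
    """Option 2: index directly and catch the IndexError."""
    try:
        return source[0], source[1]
    except IndexError:
        return None


complete = ["Author A", "Dynasty A"]
empty = []
```

The two helpers return the same result for every input; which one reads better is mostly a matter of taste.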

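Step 9: Pagination

Pagination is handled inside the spider, but it is important enough to get its own step.
There are two approaches:
1. Use XPath to locate the pagination element and read the URL from its @href attribute:
url_tag = response.xpath('//a[@id="amore"]/@href').get()
If the URL is relative, complete it with next_url = response.urljoin(url_tag).
Then yield scrapy.Request(next_url, callback=self.parse) so that the next page is sent back to self.parse() through the callback.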

        # Pagination option 1
        # url_tag = response.xpath('//a[@id="amore"]/@href').get()
        # if url_tag:
        #     next_url = response.urljoin(url_tag)
        #     yield scrapy.Request(next_url, callback=self.parse)
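To see what completing a relative URL means, the sketch below uses urllib.parse.urljoin from the standard library as a stand-in. The assumption here is that response.urljoin resolves the link against the URL of the current response in roughly the same way; the href values are hypothetical examples.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed.
current_page = "https://www.gushiwen.cn/default_1.aspx"

# A root-relative href is resolved against the scheme and host of the page.
from_root = urljoin(current_page, "/default_2.aspx")

# A bare file name replaces the last path segment of the page URL.
from_sibling = urljoin(current_page, "default_2.aspx")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(current_page, "https://www.gushiwen.cn/default_3.aspx")
```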

2. Build the page URLs with a list comprehension, then iterate over the list. Page 1 already comes from start_urls, so the list covers pages 2 to 5:

        # Pagination option 2
        urls = ['https://www.gushiwen.cn/default_{}.aspx'.format(x) for x in range(2, 6)]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

Putting steps 7 to 9 together, the complete gs.py is:

import scrapy
from gsw.items import GswItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn','gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    def parse(self, response):
        tags = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for tag in tags:
            title = tag.xpath('./div[@class="cont"]/p/a/b/text()').get()
            source = tag.xpath('.//p[@class="source"]/a/text()').getall()
            try:
                author = source[0]
                dynasty = source[1]
            except IndexError:
                # Skip entries that lack an author or dynasty link
                continue
            contents = tag.xpath('.//div[@class="contson"]/text()').getall()
            content = ''.join(contents).strip()
            yield GswItem(title=title, author=author, dynasty=dynasty, content=content)

        # Pagination option 1
        # url_tag = response.xpath('//a[@id="amore"]/@href').get()
        # if url_tag:
        #     next_url = response.urljoin(url_tag)
        #     yield scrapy.Request(next_url, callback=self.parse)

        # Pagination option 2: pages 2 to 5 (page 1 comes from start_urls).
        # Every parsed page yields these again; Scrapy's duplicate filter drops the repeats.
        urls = ['https://www.gushiwen.cn/default_{}.aspx'.format(x) for x in range(2, 6)]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

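Step 10: Saving the data

yield turns parse() into a generator: each item the spider yields is handed to the pipeline file, where it is saved. (Note on the item type inside the pipeline: if the spider built the item from the class in items.py, the pipeline receives a GswItem object rather than a dict; if the spider yielded a plain dict, the pipeline receives a dict.)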
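The hand-off can be pictured with a toy generator. This is only an illustration of how yield produces items one at a time, not Scrapy's actual engine; parse, save, and events are hypothetical names for this sketch.

```python
events = []  # records the order in which things happen


def parse(titles):
    """Toy stand-in for a spider callback: yields one dict per title."""
    for title in titles:
        events.append(f"yield {title}")
        yield {"title": title}


def save(item):
    """Toy stand-in for a pipeline's process_item."""
    events.append(f"save {item['title']}")


# Nothing runs until the generator is iterated; each item is saved
# before the next one is produced.
for item in parse(["A", "B"]):
    save(item)
```

After the loop, events holds the interleaved order: yield A, save A, yield B, save B.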

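In the pipeline file, def open_spider(self, spider): and def close_spider(self, spider): work much like __init__(): Scrapy calls them automatically, once when the spider opens and once when it closes. The names must be spelled exactly, because a misspelled method is never called.
In the CSV approach you cannot call writer.writerows(item) directly, because writerows expects a sequence of rows rather than a single item. Collect the items with lst.append(item) and write the list instead.
lst and headers can be defined as globals directly above class GswPipeline:, or as instance attributes in the class constructor.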
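The difference between one row and a list of rows can be sketched with the standard csv module alone. The example writes into an in-memory buffer instead of a file, and the two rows are hypothetical placeholder data.

```python
import csv
import io

headers = ["title", "author", "dynasty", "content"]
rows = []  # one dict per item, filled as items arrive


def collect(item):
    """Append one item as a plain dict, as process_item would."""
    rows.append(dict(item))


collect({"title": "Poem A", "author": "Author A", "dynasty": "Dynasty A", "content": "Line one."})
collect({"title": "Poem B", "author": "Author B", "dynasty": "Dynasty B", "content": "Line two."})

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(rows)  # writerows takes the whole list of dicts
csv_text = buffer.getvalue()
```

To write a single item immediately, writer.writerow(item) (singular) is the matching call.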
Pipeline: saving to CSV

import csv


class GswPipeline:

    def __init__(self):
        self.headers = ['title', 'author', 'dynasty', 'content']
        self.lst = []

    def open_spider(self, spider):
        print('Spider started')

    def write_data(self, lst):
        # Write the header row and all collected rows in one pass
        with open('古詩文.csv', 'w', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, self.headers)
            writer.writeheader()
            writer.writerows(lst)

    def process_item(self, item, spider):
        # Collect each item; the file is written once in close_spider
        self.lst.append(dict(item))
        return item

    def close_spider(self, spider):
        self.write_data(self.lst)
        print('Spider finished')
Writing to TXT: item is an Item object, not a string, so it has to be converted to a JSON string before it can be written to the text file.
item_json = json.dumps(dict(item), ensure_ascii=False)
Pipeline: saving to TXT


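import json


# Register this class in ITEM_PIPELINES (settings.py) to enable it
class WsyjectPipeline:

    def open_spider(self, spider):
        self.fp = open('gsw.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()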
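The effect of ensure_ascii=False is easiest to see by comparing both settings on the same data. The item below is a hypothetical sample record, written as a plain dict.

```python
import json

# Hypothetical sample record standing in for dict(item).
item = {"title": "静夜思", "author": "李白"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False: the characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

# Both strings decode back to the same data.
same_data = json.loads(escaped) == json.loads(readable) == item
```

Either form is valid JSON; the readable form is simply easier to inspect in the saved text file, provided the file is opened with a UTF-8 encoding as in the pipeline above.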

Scrapy Case Study Summary (1)
