Scraping Dangdang Book Data

Goal: practice scraping book data for a specific keyword on the Dangdang website and store the scraped data in a MySQL database.

1. Create a new project for Dangdang:

scrapy startproject dd
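
This creates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions); later steps edit items.py, the spider under spiders/, pipelines.py, and settings.py:

dd/
    scrapy.cfg
    dd/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py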

2. cd into the project directory

cd dd

3. Create the Dangdang spider from the basic spider template

scrapy genspider -t basic dd_spider dangdang.com
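
The basic template writes a spider skeleton to dd/spiders/dd_spider.py that looks roughly like this (the exact stub depends on the Scrapy version); step 6 onward fills in the parse method:

# -*- coding: utf-8 -*-
import scrapy


class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass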


4. Open the dd project in PyCharm



5. Open Dangdang, search for books with the chosen keyword, analyze the result page and the fields to scrape, then define those fields in items.py:

# -*- coding: utf-8 -*-

import scrapy


class DdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # book title
    link = scrapy.Field()         # product page URL
    now_price = scrapy.Field()    # current price as scraped (a string)
    comment_num = scrapy.Field()  # number of comments (a string)
    detail = scrapy.Field()       # short description

6. Open the spider file, import the item class just written, and change the start URL

from dd.items import DdItem

Populate the item in parse():

        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

Then add the loop that requests the remaining result pages:

        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)

Complete code:

# -*- coding: utf-8 -*-
import scrapy
from dd.items import DdItem
from scrapy.http import Request


class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']

    def parse(self, response):
        # each field is a list with one entry per book on the result page
        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

        # queue result pages 2-26; pass the method itself, not the result of calling it
        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)


7. In settings.py, uncomment the ITEM_PIPELINES block and set ROBOTSTXT_OBEY to False:

ITEM_PIPELINES = {
   'dd.pipelines.DdPipeline': 300,
}

ROBOTSTXT_OBEY = False


8. Open pipelines.py. Loop over the values of the scraped item with a for loop and print them to check the results:

class DdPipeline(object):
    def process_item(self, item, spider):
        # the item fields are parallel lists; index i corresponds to one book
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            print(title)
            print(link)
            print(now_price)
            print(comment_num)
            print(detail)
        return item

9. Run the spider to check the results. In PyCharm's Terminal or a macOS terminal, go into the dd project directory and run:

scrapy crawl dd_spider --nolog
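
Dropping --nolog shows the full Scrapy log, which helps when something fails silently. To eyeball the scraped fields, Scrapy's feed export can also write them to a file, for example:

scrapy crawl dd_spider -o books.json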


10. The scraping works, so the next step is to store the scraped data in a MySQL database. This uses the third-party library PyMySQL; install it in advance with pip install pymysql.
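
A quick sanity check before wiring up the pipeline (a common cause of the ModuleNotFoundError seen later is that pymysql was installed into a different Python interpreter than the one Scrapy runs under):

# run this with the same interpreter that runs Scrapy;
# an ImportError here means pymysql landed in a different environment
import pymysql
print(pymysql.VERSION)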

11. Open a terminal, connect to MySQL, create the dd database, and switch to it:

create database dd;

use dd;

Create the books table with the fields to store: an auto-increment id, plus title, link, now_price, comment_num, and detail.

create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255));
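
To confirm the table exists and to see its default character set (useful later when debugging encoding problems), check its definition:

show create table books;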

12. Import pymysql in pipelines.py and write the data into MySQL

import pymysql

# -*- coding: utf-8 -*-

import pymysql


class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # build the SQL insert by string concatenation
            # (note: pymysql does not autocommit by default, so without conn.commit()
            # these inserts are never persisted)
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('"+title+"','"+link+"','"+now_price+"','"+comment_num+"','"+detail+"')"
            conn.query(sql)
        # close the connection
        conn.close()
        return item

At this point the data could not be written to the database correctly; running the spider reported ModuleNotFoundError: No module named 'pymysql', and I had not yet found a solution.



Fix: change how the SQL statement is written, using a cursor and parameterized queries:

        # connection setup (once per item):
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')     # force utf8 for this session
        cursor.execute('set autocommit=1')   # enable autocommit

        # inside the per-row loop:
        sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
        param = (title, link, now_price, comment_num, detail)
        cursor.execute(sql, param)
        conn.commit()

Complete code:

# -*- coding: utf-8 -*-

import pymysql


class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')     # force utf8 for this session
        cursor.execute('set autocommit=1')   # enable autocommit
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
            conn.commit()
        cursor.close()
        # close the connection
        conn.close()
        return item


Takeaway: most of the problems I hit were data-encoding problems. If the encoding of the table columns does not match the encoding of the data being inserted, the insert may fail or the stored text may come out garbled.
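
One way to rule out such a mismatch, assuming the server's default character set differs from the utf8 connection used in the pipeline, is to set the table's character set explicitly:

alter table books convert to character set utf8;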

Possible improvements:

1. The prices and comment counts scraped from Dangdang are strings; converting them to numbers makes sorting possible (see the helper below).
2. Wrapping the database writes in try/except makes the code more robust (see the sketch after the helper).

import re

def getNumber(string):
    # pull the first numeric value out of strings like "¥48.60" or "1234條評(píng)論"
    matches = re.findall(r"\d+\.?\d*", string)
    return float(matches[0]) if matches else 0
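
A minimal sketch of improvement 2: wrap each insert in try/except so one bad row is skipped instead of aborting the whole batch (the rollback-and-print handling here is just one possible choice):

# inside the per-row loop of process_item
try:
    cursor.execute(sql, param)
    conn.commit()
except pymysql.MySQLError as err:
    conn.rollback()
    print("skipped one row:", err)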

Reference: http://blog.csdn.net/think_ma/article/details/78900218
