Scraping Dangdang Book Data

Goal: practice scraping book data for a specific keyword on the Dangdang website and store the scraped data in a MySQL database.

1. Create a new project for Dangdang:

scrapy startproject dd
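
This creates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions); later steps edit items.py, the spider under spiders/, pipelines.py, and settings.py:

dd/
    scrapy.cfg
    dd/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py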

2. cd into the project directory

cd dd

3. Create the Dangdang spider from the basic spider template

scrapy genspider -t basic dd_spider dangdang.com
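
The basic template writes a spider skeleton to dd/spiders/dd_spider.py that looks roughly like this (the exact stub depends on the Scrapy version); step 6 onward fills in the parse method:

# -*- coding: utf-8 -*-
import scrapy


class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        pass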


4. Open the dd project in PyCharm



5. Open Dangdang, search for books with the chosen keyword, analyze the result page and the fields to scrape, then define those fields in items.py:

# -*- coding: utf-8 -*-

import scrapy


class DdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # book title
    link = scrapy.Field()         # product page URL
    now_price = scrapy.Field()    # current price as scraped (a string)
    comment_num = scrapy.Field()  # number of comments (a string)
    detail = scrapy.Field()       # short description

6. Open the spider file, import the item class just written, and change the start URL

from dd.items import DdItem

Populate the item in parse():

        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

Then add the loop that requests the remaining result pages:

        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)

Complete code:

# -*- coding: utf-8 -*-
import scrapy
from dd.items import DdItem
from scrapy.http import Request


class DdSpiderSpider(scrapy.Spider):
    name = 'dd_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']

    def parse(self, response):
        # each field is a list with one entry per book on the result page
        item = DdItem()
        item["title"] = response.xpath("//p[@class='name']/a/@title").extract()
        item["link"] = response.xpath("//p[@class='name']/a/@href").extract()
        item["now_price"] = response.xpath("//p[@class='price']/span[@class='search_now_price']/text()").extract()
        item["comment_num"] = response.xpath("//p/a[@class='search_comment_num']/text()").extract()
        item["detail"] = response.xpath("//p[@class='detail']/text()").extract()
        yield item

        # queue result pages 2-26; pass the method itself, not the result of calling it
        for i in range(2, 27):
            url = "http://search.dangdang.com/?key=python&act=input&page_index=" + str(i)
            yield Request(url, callback=self.parse)


7. In settings.py, uncomment the ITEM_PIPELINES block and set ROBOTSTXT_OBEY to False:

ITEM_PIPELINES = {
   'dd.pipelines.DdPipeline': 300,
}

ROBOTSTXT_OBEY = False


8. Open pipelines.py. Loop over the values of the scraped item with a for loop and print them to check the results:

class DdPipeline(object):
    def process_item(self, item, spider):
        # the item fields are parallel lists; index i corresponds to one book
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            print(title)
            print(link)
            print(now_price)
            print(comment_num)
            print(detail)
        return item

9. Run the spider to check the results. In PyCharm's Terminal or a macOS terminal, go into the dd project directory and run:

scrapy crawl dd_spider --nolog
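
Dropping --nolog shows the full Scrapy log, which helps when something fails silently. To eyeball the scraped fields, Scrapy's feed export can also write them to a file, for example:

scrapy crawl dd_spider -o books.json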


10. The scraping works, so the next step is to store the scraped data in a MySQL database. This uses the third-party library PyMySQL; install it in advance with pip install pymysql.
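
A quick sanity check before wiring up the pipeline (a common cause of the ModuleNotFoundError seen later is that pymysql was installed into a different Python interpreter than the one Scrapy runs under):

# run this with the same interpreter that runs Scrapy;
# an ImportError here means pymysql landed in a different environment
import pymysql
print(pymysql.VERSION)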

11. Open a terminal, connect to MySQL, create the dd database, and switch to it:

create database dd;

use dd;

Create the books table with the fields to store: an auto-increment id, plus title, link, now_price, comment_num, and detail.

create table books(id int AUTO_INCREMENT PRIMARY KEY, title char(200), link char(100) unique, now_price int(10), comment_num char(100), detail char(255));
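
To confirm the table exists and to see its default character set (useful later when debugging encoding problems), check its definition:

show create table books;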

12. Import pymysql in pipelines.py and write the data into MySQL

import pymysql

# -*- coding: utf-8 -*-

import pymysql


class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd")
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            # build the SQL insert by string concatenation
            # (note: pymysql does not autocommit by default, so without conn.commit()
            # these inserts are never persisted)
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES ('"+title+"','"+link+"','"+now_price+"','"+comment_num+"','"+detail+"')"
            conn.query(sql)
        # close the connection
        conn.close()
        return item

At this point the data could not be written to the database correctly; running the spider reported ModuleNotFoundError: No module named 'pymysql', and I had not yet found a solution.



Fix: change how the SQL statement is written, using a cursor and parameterized queries:

        # connection setup (once per item):
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')     # force utf8 for this session
        cursor.execute('set autocommit=1')   # enable autocommit

        # inside the per-row loop:
        sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
        param = (title, link, now_price, comment_num, detail)
        cursor.execute(sql, param)
        conn.commit()

Complete code:

# -*- coding: utf-8 -*-

import pymysql


class DdPipeline(object):
    def process_item(self, item, spider):
        # create the connection
        conn = pymysql.connect(host="127.0.0.1", user="root", passwd="654321", db="dd", charset='utf8')
        cursor = conn.cursor()
        cursor.execute('set names utf8')     # force utf8 for this session
        cursor.execute('set autocommit=1')   # enable autocommit
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            now_price = item["now_price"][i]
            comment_num = item["comment_num"][i]
            detail = item["detail"][i]
            sql = "insert into books(title,link,now_price,comment_num,detail) VALUES (%s,%s,%s,%s,%s)"
            param = (title, link, now_price, comment_num, detail)
            cursor.execute(sql, param)
            conn.commit()
        cursor.close()
        # close the connection
        conn.close()
        return item


Takeaway: most of the problems I hit were data-encoding problems. If the encoding of the table columns does not match the encoding of the data being inserted, the insert may fail or the stored text may come out garbled.
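
One way to rule out such a mismatch, assuming the server's default character set differs from the utf8 connection used in the pipeline, is to set the table's character set explicitly:

alter table books convert to character set utf8;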

Possible improvements:

1. The prices and comment counts scraped from Dangdang are strings; converting them to numbers makes sorting possible (see the helper below).
2. Wrapping the database writes in try/except makes the code more robust (see the sketch after the helper).

import re

def getNumber(string):
    # pull the first numeric value out of strings like "¥48.60" or "1234條評(píng)論"
    matches = re.findall(r"\d+\.?\d*", string)
    return float(matches[0]) if matches else 0
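
A minimal sketch of improvement 2: wrap each insert in try/except so one bad row is skipped instead of aborting the whole batch (the rollback-and-print handling here is just one possible choice):

# inside the per-row loop of process_item
try:
    cursor.execute(sql, param)
    conn.commit()
except pymysql.MySQLError as err:
    conn.rollback()
    print("skipped one row:", err)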

Reference: http://blog.csdn.net/think_ma/article/details/78900218
