
Image-heavy warning: I hope you are on Wi-Fi when you open this. That line probably belonged in the title...

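I'm a Python beginner. A few days ago I read through some basic syntax and found that reading further was doing me little good (it was dry, lonely work).
Everyone on Zhihu recommends writing a crawler, so I'll start learning by building one.
Let's begin with a simple, static page.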

"Everything has to follow a routine," as a friend of mine likes to say. There's a routine for everything: follow me, left hand, right hand, one slow motion.

Without the Scrapy framework,
once we have the page source,
we have to extract the data we want ourselves with regular expressions.

I'm using Qiushibaike for this experiment. Why Qiushibaike? Because the material I read used it (poor innocent Qiushibaike, don't cry).

Paste_Image.png

First, inspect the page to work out how to extract the data we want.

Paste_Image.png

The jokes all sit inside this tag: <div class="content">text</div>
So the code looks like this:

#coding=utf-8
# Python 2 code (urllib2, print statement, raw_input)
__author__ = 'Daemon'

import urllib2,re,time

class CB_Spider:
    def __init__(self):
        self.page=1
        self.enable=True

    # Extract the joke text with a regular expression
    def getPageContent(self):
        myUrl='http://www.qiushibaike.com/hot/page/'+str(self.page)
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)  # pretend to be a browser
        myResponse=urllib2.urlopen(req)
        myPage=myResponse.read()
        unicodePage=myPage.decode('utf-8')
        # Grab every content block with the regular expression
        myItems=re.findall('<div.*?class="content">(.*?)</div>',unicodePage,re.S)

        print u'Page %s' %self.page

        # Matches the trailing <!--12345678910--> timestamp comment
        pattern=re.compile(r'<!--(.*?)-->')
        for item in myItems:
            content=item.strip()
            timeContent=pattern.findall(content)
            # Strip the timestamp comment from the text
            content=pattern.sub('',content)
            if len(timeContent)>0:
                timeContent=time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(int(timeContent[0])))
            else:
                timeContent=''
            # Print the publish time before each joke
            print timeContent +'\n'+content

        self.page+=1

print u'''

-----------------------------
Usage: type daemon to quit
Function: press Enter to page through today's hot Qiushibaike posts
-----------------------------

'''

cbSpider=CB_Spider()
while cbSpider.enable:
    myInput=raw_input()
    if 'daemon'==myInput:
        cbSpider.enable=False
        break
    else:
        cbSpider.getPageContent()

xbspider.gif

That is all the code needed to extract the page content ourselves with regular expressions.
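As a rough Python 3 sketch of the same steps (the listing above is Python 2), the request construction, regex extraction, and timestamp formatting can be tried offline. The names `build_request`, `extract_posts`, `format_timestamp`, and `SAMPLE_HTML` are made up for this illustration, and the sample markup is an assumption about the page layout, not a copy of the real page. Nothing is sent over the network.

```python
import re
import time
import urllib.request

# URL prefix and User-Agent string taken from the Python 2 listing.
BASE_URL = 'http://www.qiushibaike.com/hot/page/'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Same two patterns as the original code, compiled once.
CONTENT_RE = re.compile(r'<div.*?class="content">(.*?)</div>', re.S)
COMMENT_RE = re.compile(r'<!--(.*?)-->')


def build_request(page):
    """Build (but do not send) the request for one page of hot posts."""
    return urllib.request.Request(BASE_URL + str(page),
                                  headers={'User-Agent': USER_AGENT})


def extract_posts(html):
    """Return a list of (timestamp, text) pairs found in the HTML.

    timestamp is the integer inside the first <!--...--> comment of a post,
    or None when the post has no such comment. A non-numeric comment raises
    ValueError, as int() does in the original listing.
    """
    posts = []
    for raw in CONTENT_RE.findall(html):
        stamps = COMMENT_RE.findall(raw)
        text = COMMENT_RE.sub('', raw).strip()
        timestamp = int(stamps[0]) if stamps else None
        posts.append((timestamp, text))
    return posts


def format_timestamp(timestamp):
    """Format a Unix timestamp in local time, or return '' when missing."""
    if timestamp is None:
        return ''
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(timestamp))


# Hypothetical sample markup, invented for this sketch.
SAMPLE_HTML = '''
<div class="article">
  <div class="content">
    First joke
    <!--1453358353-->
  </div>
</div>
<div class="article">
  <div class="content">Second joke</div>
</div>
'''

if __name__ == '__main__':
    for timestamp, text in extract_posts(SAMPLE_HTML):
        print(format_timestamp(timestamp))
        print(text)
```

To fetch a live page, the request object would be passed to `urllib.request.urlopen` and the response bytes decoded as UTF-8 before calling `extract_posts`; whether the site still serves this markup is not something the sketch checks.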

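Most people don't work this way, though. In a real project you would still use a framework; I'm an Android developer, and I don't write everything from scratch either. So let's see how a simple use of Scrapy gets the same result as above.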

First, install the required environment. I followed the steps on the Scrapy installation page linked below; of course, not everyone will get through them without a hitch.

QQ圖片20160121143913.gif

If you run into problems, Google will solve most of them; that's not the focus here.
Scrapy installation

Once everything is installed, the first step is to create a Scrapy project. See the command in the CMD window in the screenshot:

Paste_Image.png

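I open the project in PyCharm here, purely because it makes the structure easier to see and the code easier to write. In the end it still has to be run from cmd, since I haven't configured PyCharm to work with Scrapy yet; I'll follow the mainstream approach as I learn more.
The project structure looks like this: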

Paste_Image.png

That covers creating the project. Now here is our project, ScrapyDemo1, with the code written:

Paste_Image.png

Let's see what these files do.
items.py: define the class for the data your crawler needs, along with its fields. It acts as the receiver for the data you parse.

Paste_Image.png

settings.py: as the name suggests, this is the configuration file. It looks like this; these are the defaults.

Paste_Image.png

The remaining files aren't used yet, so I'll skip them for now.

xb_spider.py is our implementation file:

#coding=utf-8
from scrapy import Spider
from scrapy.selector import Selector

from ScrapyDemo1.items import XBItem

__author__ = 'Daemon'

class XBSpider(Spider):
    name='xb'  # required
    #allowed_domains = ["qiushibaike.com"]
    # required
    start_urls=[
        'http://www.qiushibaike.com/8hr/page/1'
    ]

    # required
    def parse(self, response):
        sel=Selector(response)
        # Every <div class="content"> holds one joke
        sites=sel.xpath('//div[@class="content"]')
        #
        # filename = 'xbdata'
        # open(filename, 'w+').write(str(len(sites)))

        items=[]
        for site in sites:
            item=XBItem()
            # extract() returns the list of text nodes directly under the div
            item['content']=site.xpath('text()').extract()
            items.append(item)
        return items

This uses Selector and XPath.
Selector documentation
XPath documentation

Screenshot from W3C
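To get a feel for the XPath expression used in the spider without installing Scrapy, here is a small sketch that uses the standard library's `xml.etree.ElementTree` instead of Scrapy's `Selector`. This is a substitution for illustration only: ElementTree supports just a subset of XPath and needs well-formed markup, so the expression is written as `.//div[@class='content']` rather than `//div[@class="content"]`. The sample document and the `extract_contents` helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup invented for this sketch.
SAMPLE = '''
<html>
  <body>
    <div class="article">
      <div class="content">First joke</div>
    </div>
    <div class="article">
      <div class="content">Second joke</div>
      <div class="stats">12 votes</div>
    </div>
  </body>
</html>
'''


def extract_contents(markup):
    """Return the text of every <div class="content"> in the markup."""
    root = ET.fromstring(markup)
    # './/' searches all descendants of the root element;
    # [@class='content'] keeps only divs whose class attribute matches.
    divs = root.findall(".//div[@class='content']")
    return [(div.text or '').strip() for div in divs]


if __name__ == '__main__':
    for text in extract_contents(SAMPLE):
        print(text)
```

Real pages are rarely well-formed XML, which is one reason to prefer Scrapy's `Selector` for actual crawling; the sketch only shows how the predicate `[@class=...]` narrows the match.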