Scraping Lianjia Hangzhou with Scrapy

After brushing up on the concept of classes, I have become a bit more comfortable with the Scrapy crawling framework, so I updated the code I wrote a while ago that scraped Lianjia with BeautifulSoup.

The target is still Lianjia's Hangzhou second-hand housing listings, except that this time I scraped the sold-transactions section instead of the for-sale section.

For learning Scrapy itself, the following material can be consulted along the way:

Reference material for the Scrapy crawling framework

Now, back to the topic.

The first step is to analyze the page structure. Open any listing page of Lianjia's second-hand housing section: counting shows 30 listings per page (the screenshot captures only 4), and 100 pages in total.
Page structure

From this, the crawling strategy follows directly:

1. From each listing page, collect the URL of every one of its 30 second-hand housing entries
2. From each entry URL, scrape the title, price, and other fields

Note: breaking this down further, step 1 first obtains the URLs of all listing pages and then the 30 entry URLs on each page; step 2 parses each entry URL obtained in step 1 to extract the concrete fields such as title and price.
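A quick offline sketch of step 1's link extraction, run against a tiny, hypothetical HTML fragment shaped like Lianjia's listing markup (the real spider uses Scrapy selectors, but the XPath shape is the same; the URLs here are made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mirroring the structure targeted by
# the spider's XPath: //div[@class='info']/div[@class='title']/a
html = (
    "<root>"
    "<div class='info'><div class='title'>"
    "<a href='https://hz.lianjia.com/chengjiao/1031000001.html'>listing 1</a>"
    "</div></div>"
    "<div class='info'><div class='title'>"
    "<a href='https://hz.lianjia.com/chengjiao/1031000002.html'>listing 2</a>"
    "</div></div>"
    "</root>"
)
root = ET.fromstring(html)
# Pull the href of each listing link, one per entry on the page
links = [a.get('href')
         for a in root.findall(".//div[@class='info']/div[@class='title']/a")]
print(links)
```

Each collected link then becomes the input to step 2.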

However, it turns out Lianjia does not expose all of its transaction records: no matter how many entries a section holds, it serves at most 100 pages of 30 entries each, i.e. 3,000 listings.


Filter conditions

The only workaround is to partition the listings by applying different filter conditions, keep the count under each filter below 3,000, and then merge the results from all filters to recover the complete set.

Here I chose to filter by total price.

# Under the 0-500,000 CNY filter, the URL is
# url = "https://hz.lianjia.com/chengjiao/pg1/ea10000bp0ep50/"
# where pg1 is page 1 under the current filter, bp0 is the lower and ep50
# the upper total-price bound (in units of 10,000 CNY)

# 1. Set the filter conditions to
# page_group_list = ['ea10000bp0ep50/',
#                    'ea10000bp50ep100/',
#                    'ea10000bp100ep120/',
#                    'ea10000bp120ep140/',
#                    'ea10000bp140ep160/',
#                    'ea10000bp160ep180/',
#                    'ea10000bp180ep200/',
#                    'ea10000bp200ep250/',
#                    'ea10000bp250ep300/',
#                    'ea10000bp300ep10000/']

# 2. Within each filter, iterate over pages via the number after pg
# pg(1, 2, 3, 4, 5 ...)

# 3. The maximum page count under each filter must also be obtained, since
# not every filter spans the full 100 pages
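These three rules boil down to a one-line URL builder; a minimal sketch, mirroring the string concatenation used in the spider code that follows:

```python
# Base listing URL + page number + price filter, concatenated the same way
# the spider does (baseURL + str(page) + filter)
base_url = 'https://hz.lianjia.com/chengjiao/pg'

def build_url(page, price_filter):
    return base_url + str(page) + price_filter

url = build_url(1, 'ea10000bp0ep50/')
print(url)  # https://hz.lianjia.com/chengjiao/pg1ea10000bp0ep50/
```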

With the URL analysis done, it is time to write the code. A Scrapy project consists roughly of items, pipelines, settings, and spiders: items define the fields to scrape, pipelines handle the output of the scraped items, settings tune the crawler's parameters, and spiders are the core, implementing the actual crawling logic.
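For orientation, a project created with scrapy startproject Lianjia has roughly the following layout (file names inferred from the imports used below, e.g. "from Lianjia.items import LianjiaItem"):

```python
# Approximate layout of the Scrapy project described in this post:
#
# Lianjia/
#     scrapy.cfg            - project configuration
#     Lianjia/
#         items.py          - fields to scrape (section a)
#         spiders/
#             chengjiao.py  - the crawling logic (section b)
#         settings.py       - crawler parameters (section c)
#         pipelines.py      - output of scraped items (section d)
```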

a. Define items
import scrapy


class LianjiaItem(scrapy.Item):
    # House name
    housename = scrapy.Field()
    # Property-right term
    propertylimit = scrapy.Field()
    # Link
    houselink = scrapy.Field()
    # Listing total price
    totalprice = scrapy.Field()
    # Unit price
    unitprice = scrapy.Field()
    # Floor plan
    housetype = scrapy.Field()
    # Construction area
    constructarea = scrapy.Field()
    # Interior area
    housearea = scrapy.Field()
    # Floor
    housefloor = scrapy.Field()
    # Intended use
    house_use = scrapy.Field()
    # Transaction attributes
    tradeproperty = scrapy.Field()
    # Number of followers
    guanzhu = scrapy.Field()
    # Number of viewings
    daikan = scrapy.Field()
    # Administrative district
    district = scrapy.Field()
    # Sold total price
    selltotalprice = scrapy.Field()
    # Sold unit price
    sellunitprice = scrapy.Field()
    # Sale date
    selltime = scrapy.Field()
    # Days on market
    sellperiod = scrapy.Field()
    # Neighborhood average unit price
    villageunitprice = scrapy.Field()
    # Neighborhood construction year
    villagetime = scrapy.Field()
b. Define the spider
# -*- coding: utf-8 -*-
import scrapy
import requests
from lxml import etree
import json
from Lianjia.items import LianjiaItem
import re


class ChengjiaoSpider(scrapy.Spider):
    name = 'chengjiao'
    # allowed_domains = ['lianjia.com']
    baseURL = 'https://hz.lianjia.com/chengjiao/pg'
    offset_page = 1
    offset_list = 0
    page_group_list = ['ea10000bp0ep50/',
                       'ea10000bp50ep100/',
                       'ea10000bp100ep120/',
                       'ea10000bp120ep140/',
                       'ea10000bp140ep160/',
                       'ea10000bp160ep180/',
                       'ea10000bp180ep200/',
                       'ea10000bp200ep250/',
                       'ea10000bp250ep300/',
                       'ea10000bp300ep10000/']

    url = baseURL + str(offset_page) + page_group_list[offset_list]

    start_urls = [url]

    # Get the maximum page count under the current filter
    def getmax(self, url):
        requ = requests.get(url, allow_redirects=False)
        if requ.status_code == 200:
            resp = requ.text
            tree = etree.HTML(resp)
            str_max = tree.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            dic_max = json.loads(str_max)
            maxnum = dic_max['totalPage']
            return maxnum
        else:
            print('Open Page Error')

    # Collect the listing URLs on the current page.
    # callback passes the response on to the given method; meta passes the
    # item variable along with the request.
    def parse(self, response):
        node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
        for node in node_list:
            item = LianjiaItem()
            item['houselink'] = node.xpath("./@href").extract()[0]
            yield scrapy.Request(item['houselink'], callback=self.parse_content, meta={'key': item})
        # If the page number is below this filter's maximum, increment it and
        # crawl the next page; once it reaches the maximum, every page under
        # this filter is done, so reset the page number to 1 and move on to
        # the next filter.
        if self.offset_page < self.getmax(response.url):
            self.offset_page += 1
            nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
            yield scrapy.Request(nexturl, callback=self.parse)
        else:
            if self.offset_list < len(self.page_group_list) - 1:
                self.offset_page = 1
                self.offset_list += 1
                nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
                yield scrapy.Request(nexturl, callback=self.parse)

    # Scrape the concrete fields.
    # The item passed by the previous method is received through meta.
    def parse_content(self, response):
        item = response.meta['key']
        # House name
        try:
            item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
        except:
            item['housename'] = 'None'
        # Property-right term
        try:
            item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
        except:
            item['propertylimit'] = 'None'
        # Listing total price
        try:
            item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
        except:
            item['totalprice'] = 'None'
        # Floor plan
        try:
            item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
        except:
            item['housetype'] = 'None'
        # Construction area
        try:
            item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
        except:
            item['constructarea'] = 'None'
        # Interior area
        try:
            item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
        except:
            item['housearea'] = 'None'
        # Intended use
        try:
            item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
        except:
            item['house_use'] = 'None'
        # Transaction attributes
        try:
            item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
        except:
            item['tradeproperty'] = 'None'
        # Number of followers
        try:
            item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
        except:
            item['guanzhu'] = 'None'
        # Number of viewings
        try:
            item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
        except:
            item['daikan'] = 'None'
        # Administrative district: strip the "二手房成交价格" suffix from the breadcrumb
        try:
            pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
            pattern = '(.*?)二手房成交价格'
            item['district'] = re.search(pattern, pre_district).group(1)
        except:
            item['district'] = 'None'
        # Sold total price
        try:
            item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
        except:
            item['selltotalprice'] = 'None'
        # Sold unit price
        try:
            item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
        except:
            item['sellunitprice'] = 'None'
        # Sale date
        try:
            item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(',')[-1]
        except:
            item['selltime'] = 'None'

        yield item
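The page-data attribute that getmax() parses is a small JSON string. A hypothetical value (the real one is read from the pagination div of a listing page) illustrates the parsing:

```python
import json

# Hypothetical value of the page-data attribute on
# <div class="page-box house-lst-page-box" page-data="...">
page_data = '{"totalPage":100,"curPage":1}'

info = json.loads(page_data)
maxnum = int(info['totalPage'])
print(maxnum)  # 100
```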
c. Define settings
# -*- coding: utf-8 -*-

# Scrapy settings for Lianjia project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Lianjia'

SPIDER_MODULES = ['Lianjia.spiders']
NEWSPIDER_MODULE = 'Lianjia.spiders'

#LOG_FILE = r"C:\test\CHENGJ_pro.doc"
#LOG_LEVEL = 'INFO'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Lianjia (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
   'Accept-Language': 'zh-CN,zh;q=0.9',
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Lianjia.middlewares.LianjiaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Lianjia.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Lianjia.pipelines.LianjiaPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
d. Define pipelines
import json


class LianjiaPipeline(object):
    def __init__(self):
        # Open the output file once, when the spider starts
        self.f = open('c:\\test\\ceshi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines format)
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
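The pipeline writes JSON Lines: one object per line. A small round-trip sketch, with made-up field values standing in for scraped items, shows the format and how it reads back:

```python
import json

# Made-up items standing in for scraped LianjiaItem dicts
items = [{'housename': 'Example Court', 'selltotalprice': '300'},
         {'housename': 'Sample Garden', 'selltotalprice': '450'}]

# Write: one JSON object per line, as process_item does
dump = ''.join(json.dumps(d, ensure_ascii=False) + '\n' for d in items)

# Read back line by line (the appendix below does the same with a file)
records = [json.loads(line) for line in dump.splitlines()]
assert records == items
```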
Appendix: converting the JSON output to CSV for Excel
import json
import pandas as pd

path = r"C:\test\ceshi.json"
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

df = pd.DataFrame(records)
# gb18030 keeps the Chinese text readable when the CSV is opened in Excel
df.to_csv(r"C:\test\chengjiao.csv", encoding='gb18030')

After reading the 静觅 tutorials, I rewrote the spider code (the other parts are unchanged). The code is cleaner overall, with far fewer conditionals and manual counters.

import scrapy
import json
from Lianjia.items import LianjiaItem
import re
from scrapy.http import Request


class ChengjiaoSpider(scrapy.Spider):
    name = 'chengjiao_pro'
    baseURL = 'https://hz.lianjia.com/chengjiao/pg'
    offset_page = 1
    page_group_list = ['ea10000bp0ep50/',
                       'ea10000bp50ep100/',
                       'ea10000bp100ep120/',
                       'ea10000bp120ep140/',
                       'ea10000bp140ep160/',
                       'ea10000bp160ep180/',
                       'ea10000bp180ep200/',
                       'ea10000bp200ep250/',
                       'ea10000bp250ep300/',
                       'ea10000bp300ep10000/']

    # Issue the first page of every price filter up front
    def start_requests(self):
        for i in self.page_group_list:
            url = self.baseURL + str(self.offset_page) + i
            yield Request(url, callback=self.parse)

    # Read the filter's total page count from the page-data attribute, then
    # generate every page URL under the same filter directly
    def parse(self, response):
        maxnum_dict = json.loads(response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract()[0])
        maxnum = int(maxnum_dict['totalPage'])
        for num in range(1, maxnum + 1):
            # Recover the price-filter suffix from the page-1 URL
            split_str = self.baseURL + str(num)
            url = split_str + response.url.split(self.baseURL + str(self.offset_page))[1]
            yield Request(url, self.get_link, dont_filter=True)

    # Collect the listing links on each page
    def get_link(self, response):
        node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
        for node in node_list:
            item = LianjiaItem()
            item['houselink'] = node.xpath("./@href").extract()[0]
            yield scrapy.Request(item['houselink'], callback=self.parse_content, meta={'key': item})

    def parse_content(self, response):
        item = response.meta['key']
        # House name
        try:
            item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
        except:
            item['housename'] = 'None'
        # Property-right term
        try:
            item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
        except:
            item['propertylimit'] = 'None'
        # Listing total price
        try:
            item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
        except:
            item['totalprice'] = 'None'
        # Floor plan
        try:
            item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
        except:
            item['housetype'] = 'None'
        # Construction area
        try:
            item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
        except:
            item['constructarea'] = 'None'
        # Interior area
        try:
            item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
        except:
            item['housearea'] = 'None'
        # Intended use
        try:
            item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
        except:
            item['house_use'] = 'None'
        # Transaction attributes
        try:
            item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
        except:
            item['tradeproperty'] = 'None'
        # Number of followers
        try:
            item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
        except:
            item['guanzhu'] = 'None'
        # Number of viewings
        try:
            item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
        except:
            item['daikan'] = 'None'
        # Administrative district: strip the "二手房成交价格" suffix from the breadcrumb
        try:
            pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
            pattern = '(.*?)二手房成交价格'
            item['district'] = re.search(pattern, pre_district).group(1)
        except:
            item['district'] = 'None'
        # Sold total price
        try:
            item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
        except:
            item['selltotalprice'] = 'None'
        # Sold unit price
        try:
            item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
        except:
            item['sellunitprice'] = 'None'
        # Sale date
        try:
            item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(',')[-1]
        except:
            item['selltime'] = 'None'
        yield item
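The core simplification in parse() above: once totalPage is known, every page URL under the same price filter can be generated in a single loop instead of being discovered one page at a time. In isolation (total_pages is a hypothetical value standing in for the parsed totalPage):

```python
# Generate every page URL under one price filter in a single loop
base = 'https://hz.lianjia.com/chengjiao/pg'
price_filter = 'ea10000bp0ep50/'
total_pages = 3  # hypothetical value parsed from the page-data attribute

urls = [base + str(n) + price_filter for n in range(1, total_pages + 1)]
print(urls[0])  # https://hz.lianjia.com/chengjiao/pg1ea10000bp0ep50/
```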