Scraping rental listings from Lianjia with a Python 3 crawler

Design approach

Crawler designs are all broadly similar; here is mine:
1. Fetch the data while mimicking a browser
2. Clean the data
3. Store it in a database or an Excel file
4. Analyze and process the data

Required libraries

requests: simulates a browser sending requests to the site
BeautifulSoup: cleans up the scraped HTML
html5lib: the parser BeautifulSoup uses on the HTML
openpyxl: writes the cleaned data to an Excel file
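If any of these are missing, they can be installed with pip (note that BeautifulSoup is published on PyPI as beautifulsoup4):

```shell
pip install requests beautifulsoup4 html5lib openpyxl
```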

Fetching the data

[Screenshot: Lianjia listing data]

Inspecting the Network panel turned up no JSON endpoint carrying Lianjia's data, so the strategy is to download the page itself and parse it. We use Python's requests module to imitate a browser visit and fetch the HTML. One thing to note: when requests imitates a browser visiting Lianjia, the headers should carry the complete set of browser fields.

headers

Sending complete header information keeps Lianjia's server from recognizing us as a scraping program and blocking further requests.

Code snippet

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'lianjia_uuid=9615f3ee-0865-4a66-b674-b94b64f709dc; logger_session=d205696d584e350975cf1d649f944f4b; select_city=110000; all-lj=144beda729446a2e2a6860f39454058b; _smt_uid=5871c8fd.2beaddb7; CNZZDATA1253477573=329766555-1483847667-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851778; CNZZDATA1254525948=58093639-1483848060-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483853460; CNZZDATA1255633284=1668427390-1483847993-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851644; CNZZDATA1255604082=1041799577-1483850582-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483850582; _ga=GA1.2.430968090.1483852019; lianjia_ssid=05e8ddcc-b863-4ff6-9f1d-f283e82edd4f',
    'Host': 'bj.lianjia.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
html = requests.get(realurl, headers=headers)
# set the response encoding before reading the text, to avoid mojibake
html.encoding = "utf-8"
soup = BeautifulSoup(html.text, "html5lib")
info_ul = soup.find(id="house-lst")

Before reading the HTML we set the response's encoding attribute, so that the text decodes without garbled characters.

Cleaning the data

The data that requests hands back still carries its HTML tags, which are obviously useless for analysis, so we need to clean it. This is where the BeautifulSoup module comes in.
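As a minimal, self-contained illustration of this cleaning step: the HTML below is a made-up fragment that mimics the structure of one Lianjia listing (not real page data), and I use Python's built-in html.parser here so the snippet runs even without html5lib installed.

```python
from bs4 import BeautifulSoup

# made-up fragment mimicking one entry on the listing page
html = '''
<ul id="house-lst">
  <li data-index="0">
    <div class="info-panel">
      <h2>Bright two-bedroom near Line 10</h2>
      <span class="region">Chaoyang</span>
      <span class="price">6500</span>
    </div>
  </li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")
# drill down exactly as the crawler does: list -> entry -> info panel
panel = soup.find(id="house-lst") \
            .find(attrs={"data-index": "0"}) \
            .find(attrs={"class": "info-panel"})
print(panel.h2.text)                               # Bright two-bedroom near Line 10
print(panel.find(attrs={"class": "region"}).text)  # Chaoyang
```

The .text accessor is what strips the tags away and leaves the plain text we want to keep.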

    # collect all listings on this page
    house_all = []
    # the list page shows 30 entries, indexed 0-29
    for i in range(0, 30):
        house_one = []
        info_li = info_ul.find(attrs={"data-index": str(i)})  # explicit str(), since HTML attribute values are strings
        info_panel = info_li.find(attrs={"class": "info-panel"})
        # clean the data: pull the plain text out of each field
        title = info_panel.h2.text
        region = info_panel.find(attrs={"class": "region"}).text
        zone = info_panel.find(attrs={"class": "zone"}).text
        meters = info_panel.find(attrs={"class": "meters"}).text
        con = info_panel.find(attrs={"class": "con"}).text
        # commented out: these fields do not appear on every listing
        # subway = info_panel.find(attrs={"class":"fang-subway-ex"}).text
        # visit = info_panel.find(attrs={"class":"haskey-ex"}).text
        # warm = info_panel.find(attrs={"class":"heating-ex"}).text
        price = info_panel.find(attrs={"class": "price"}).text
        update = info_panel.find(attrs={"class": "price-pre"}).text
        lookman = info_panel.find(attrs={"class": "square"}).text
        house_one.append(title)
        house_one.append(region)
        house_one.append(zone)
        house_one.append(meters)
        house_one.append(con)
        # house_one.append(subway)
        # house_one.append(visit)
        # house_one.append(warm)
        house_one.append(price)
        house_one.append(update)
        house_one.append(lookman)
        house_all.append(house_one)
    return house_all

For the details of these methods, see the official BeautifulSoup documentation (a Chinese translation is available). Above I simply pull each field out in turn and put it into a list.

Saving to Excel

Lianjia obviously won't let us scrape hundreds of thousands of records, so storing what we do grab in an Excel file is enough. This uses the openpyxl module, which can read and write xlsx files.
The main program iterates over the pages and appends each row of data to the spreadsheet.
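A minimal sketch of the openpyxl round trip, separate from the crawler (the row contents here are invented sample data):

```python
from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active          # every new workbook starts with one active sheet
ws.title = "beijing"
# each append() call writes one row, left to right
ws.append(["title", "region", "price"])
ws.append(["Sample listing", "Chaoyang", "6500"])
wb.save("demo.xlsx")

# read the file back to confirm the rows landed
ws2 = load_workbook("demo.xlsx")["beijing"]
print(ws2["A2"].value)  # Sample listing
```

This append-one-row-at-a-time pattern is exactly what the main program below does with the scraped listings.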

def main():
    url = "http://bj.lianjia.com/zufang/"
    house_result = []
    for i in range(1, 101):
        # page URLs follow the pattern pg<N>l2 ('l2' being the two-bedroom filter)
        params = "pg" + str(i) + "l2"
        realurl = url + params
        result = get_house(realurl)
        house_result = house_result + result

    wb = Workbook()
    ws1 = wb.active
    ws1.title = "beijing"
    for row in house_result:
        ws1.append(row)
    wb.save('北京兩室一廳租房信息.xlsx')

if __name__ == '__main__':
    main()

Code walkthrough

1. Import the required libraries
2. Define a function get_house() that takes a single parameter, realurl
3. Clean the scraped data down to plain text
4. In the main function, loop over the page numbers, call get_house() once per page, and write the rows into Excel
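The page URLs built in step 4 follow Lianjia's pg<N>l2 pattern (judging by the output filename, l2 is the site's filter for two-bedroom listings); building them is plain string concatenation:

```python
url = "http://bj.lianjia.com/zufang/"
# first three page URLs, same concatenation as in main()
pages = [url + "pg" + str(i) + "l2" for i in range(1, 4)]
print(pages[0])  # http://bj.lianjia.com/zufang/pg1l2
```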

Complete code

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

def get_house(realurl):
    # browser-like request headers
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Connection':'keep-alive',
        'Cookie':'lianjia_uuid=9615f3ee-0865-4a66-b674-b94b64f709dc; logger_session=d205696d584e350975cf1d649f944f4b; select_city=110000; all-lj=144beda729446a2e2a6860f39454058b; _smt_uid=5871c8fd.2beaddb7; CNZZDATA1253477573=329766555-1483847667-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851778; CNZZDATA1254525948=58093639-1483848060-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483853460; CNZZDATA1255633284=1668427390-1483847993-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851644; CNZZDATA1255604082=1041799577-1483850582-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483850582; _ga=GA1.2.430968090.1483852019; lianjia_ssid=05e8ddcc-b863-4ff6-9f1d-f283e82edd4f',
        'Host':'bj.lianjia.com',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
    }
    html = requests.get(realurl,headers=headers)
    # set the response encoding before reading the text, to avoid mojibake
    html.encoding="utf-8"
    soup = BeautifulSoup(html.text,"html5lib")
    info_ul = soup.find(id="house-lst")
    # collect all listings on this page
    house_all = []
    # the list page shows 30 entries, indexed 0-29
    for i in range(0, 30):
        house_one = []
        info_li = info_ul.find(attrs={"data-index": str(i)})  # explicit str(), since HTML attribute values are strings
        info_panel = info_li.find(attrs={"class":"info-panel"})
        # clean the data: pull the plain text out of each field
        title = info_panel.h2.text
        region = info_panel.find(attrs={"class":"region"}).text
        zone = info_panel.find(attrs={"class":"zone"}).text
        meters = info_panel.find(attrs={"class":"meters"}).text
        con = info_panel.find(attrs={"class":"con"}).text
        # commented out: these fields do not appear on every listing
        # subway = info_panel.find(attrs={"class":"fang-subway-ex"}).text
        # visit = info_panel.find(attrs={"class":"haskey-ex"}).text
        # warm = info_panel.find(attrs={"class":"heating-ex"}).text
        price = info_panel.find(attrs={"class":"price"}).text
        update = info_panel.find(attrs={"class":"price-pre"}).text
        lookman = info_panel.find(attrs={"class":"square"}).text
        house_one.append(title)
        house_one.append(region)
        house_one.append(zone)
        house_one.append(meters)
        house_one.append(con)
        # house_one.append(subway)
        # house_one.append(visit)
        # house_one.append(warm)
        house_one.append(price)
        house_one.append(update)
        house_one.append(lookman)
        house_all.append(house_one)
    return house_all

def main():
    url = "http://bj.lianjia.com/zufang/"
    house_result = []
    for i in range(1, 101):
        # page URLs follow the pattern pg<N>l2 ('l2' being the two-bedroom filter)
        params = "pg" + str(i) + "l2"
        realurl = url + params
        result = get_house(realurl)
        house_result = house_result + result

    wb = Workbook()
    ws1 = wb.active
    ws1.title = "beijing"
    for row in house_result:
        ws1.append(row)
    wb.save('北京兩室一廳租房信息.xlsx')

if __name__ == '__main__':
    main()