Program design approach
Most crawler programs follow much the same design. Here is mine:
1. Simulate a browser and fetch the data
2. Clean the data
3. Store it in a database or an Excel file
4. Analyze and process the data
Required libraries
requests: used to send requests to the website while mimicking a browser
BeautifulSoup: used to clean the scraped HTML
html5lib: the parser BeautifulSoup uses to process the HTML
openpyxl: used to write the cleaned data to Excel
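All four can be installed with pip, for example: pip install requests beautifulsoup4 html5lib openpyxl (beautifulsoup4 is the package that provides the bs4/BeautifulSoup import).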
Fetching the data

Inspecting the Network panel turned up no JSON endpoint through which Lianjia delivers its data, so the strategy here is to fetch the pages themselves and parse the HTML.
We use Python's requests module to mimic a browser visit and retrieve the HTML.
Note that when requests impersonates a browser visiting the Lianjia site, the headers must reproduce the complete set of fields a real browser would send.

We mimic the full header set so that Lianjia's server does not recognize us as a scraping program and block further requests.
Code snippet
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'lianjia_uuid=9615f3ee-0865-4a66-b674-b94b64f709dc; logger_session=d205696d584e350975cf1d649f944f4b; select_city=110000; all-lj=144beda729446a2e2a6860f39454058b; _smt_uid=5871c8fd.2beaddb7; CNZZDATA1253477573=329766555-1483847667-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851778; CNZZDATA1254525948=58093639-1483848060-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483853460; CNZZDATA1255633284=1668427390-1483847993-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851644; CNZZDATA1255604082=1041799577-1483850582-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483850582; _ga=GA1.2.430968090.1483852019; lianjia_ssid=05e8ddcc-b863-4ff6-9f1d-f283e82edd4f',
    'Host': 'bj.lianjia.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
html = requests.get(realurl, headers=headers)
# set the response encoding so the text decodes without garbling
html.encoding = "utf-8"
soup = BeautifulSoup(html.text, "html5lib")
info_ul = soup.find(id="house-lst")
Before reading the HTML, set the encoding attribute on the requests response object so the text decodes correctly instead of coming out garbled.
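If you are not sure which charset a page actually uses, requests can guess one from the response body instead of hard-coding utf-8. A minimal sketch, not part of the original program:

# A sketch: let requests guess the charset from the raw bytes.
resp = requests.get(realurl, headers=headers)
resp.encoding = resp.apparent_encoding  # guessed from the response body
soup = BeautifulSoup(resp.text, "html5lib")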
Cleaning the data
The data the requests module hands back still carries its HTML tags, which are obviously useless for data analysis, so the data has to be cleaned. This is where the BeautifulSoup module comes in.
# list holding every listing on this page
house_all = []
# walk the 30 listings on the page
for i in range(0, 30):
    house_one = []
    # data-index values are strings in the HTML, so match with str(i)
    info_li = info_ul.find(attrs={"data-index": str(i)})
    info_panel = info_li.find(attrs={"class": "info-panel"})
    # clean the data: pull the plain text out of each field
    title = info_panel.h2.text
    region = info_panel.find(attrs={"class": "region"}).text
    zone = info_panel.find(attrs={"class": "zone"}).text
    meters = info_panel.find(attrs={"class": "meters"}).text
    con = info_panel.find(attrs={"class": "con"}).text
    # subway = info_panel.find(attrs={"class": "fang-subway-ex"}).text
    # visit = info_panel.find(attrs={"class": "haskey-ex"}).text
    # warm = info_panel.find(attrs={"class": "heating-ex"}).text
    price = info_panel.find(attrs={"class": "price"}).text
    update = info_panel.find(attrs={"class": "price-pre"}).text
    lookman = info_panel.find(attrs={"class": "square"}).text
    house_one.append(title)
    house_one.append(region)
    house_one.append(zone)
    house_one.append(meters)
    house_one.append(con)
    # house_one.append(subway)
    # house_one.append(visit)
    # house_one.append(warm)
    house_one.append(price)
    house_one.append(update)
    house_one.append(lookman)
    house_all.append(house_one)
return house_all
For the details of these methods, see the official BeautifulSoup manual (a Chinese edition is available). Above I simply pull the fields out one by one and put them into a list.
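The three commented-out fields (subway, visit, warm) do not appear on every listing, and calling .text on a missing tag raises AttributeError. A minimal sketch of a tolerant lookup; the helper name safe_text is my own, not part of the original code:

# A sketch, not from the original program: return a tag's text,
# or an empty string when the listing lacks that tag.
def safe_text(panel, cls):
    tag = panel.find(attrs={"class": cls})
    return tag.text.strip() if tag is not None else ""

# e.g. inside the loop:
# subway = safe_text(info_panel, "fang-subway-ex")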
Saving to Excel
Lianjia obviously will not let us scrape hundreds of thousands of records, so storing what we do grab in an Excel file is enough. This uses the openpyxl module, a Python module for reading and writing xlsx files.
The main program walks the pages and appends each page's rows to the Excel sheet.
def main():
    url = "http://bj.lianjia.com/zufang/"
    house_result = []
    # walk result pages 1-100; the "l2" suffix filters for two-bedroom listings
    for i in range(1, 101):
        params = "pg" + str(i) + "l2"
        realurl = url + params
        result = get_house(realurl)
        house_result = house_result + result
    wb = Workbook()
    ws1 = wb.active
    ws1.title = "beijing"
    # one listing per row
    for row in house_result:
        ws1.append(row)
    # the filename means "Beijing two-bedroom rental listings"
    wb.save('北京兩室一廳租房信息.xlsx')
if __name__ == '__main__':
    main()
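If you want column headings in the spreadsheet, append a header row before the data rows. A small sketch; the English column names are mine, not from the original:

# A sketch: write an illustrative header row before the listings.
ws1.append(["title", "region", "zone", "meters", "con",
            "price", "update", "lookman"])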
Code outline
1. Import the required libraries
2. Define a function get_house() that takes a single parameter, realurl
3. Clean the fetched data down to plain text
4. In the main function, loop over the page numbers, call get_house() for each page, and write the results row by row into the Excel file (a politeness tweak is sketched just below)
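Since the whole point of the full header set is to avoid being blocked, it also helps to pause between page fetches rather than fire 100 requests in a tight loop. A minimal sketch, not part of the original program:

import time

house_result = []
for i in range(1, 101):
    realurl = "http://bj.lianjia.com/zufang/pg" + str(i) + "l2"
    house_result = house_result + get_house(realurl)  # get_house is defined below
    time.sleep(1)  # wait one second between pages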
Complete code for reference
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
def get_house(realurl):
    # request headers that mimic a real browser
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'lianjia_uuid=9615f3ee-0865-4a66-b674-b94b64f709dc; logger_session=d205696d584e350975cf1d649f944f4b; select_city=110000; all-lj=144beda729446a2e2a6860f39454058b; _smt_uid=5871c8fd.2beaddb7; CNZZDATA1253477573=329766555-1483847667-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851778; CNZZDATA1254525948=58093639-1483848060-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483853460; CNZZDATA1255633284=1668427390-1483847993-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483851644; CNZZDATA1255604082=1041799577-1483850582-http%253A%252F%252Fbj.fang.lianjia.com%252F%7C1483850582; _ga=GA1.2.430968090.1483852019; lianjia_ssid=05e8ddcc-b863-4ff6-9f1d-f283e82edd4f',
        'Host': 'bj.lianjia.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36'
    }
    html = requests.get(realurl, headers=headers)
    # set the response encoding so the text decodes without garbling
    html.encoding = "utf-8"
    soup = BeautifulSoup(html.text, "html5lib")
    info_ul = soup.find(id="house-lst")
    # list holding every listing on this page
    house_all = []
    # walk the 30 listings on the page
    for i in range(0, 30):
        house_one = []
        # data-index values are strings in the HTML, so match with str(i)
        info_li = info_ul.find(attrs={"data-index": str(i)})
        info_panel = info_li.find(attrs={"class": "info-panel"})
        # clean the data: pull the plain text out of each field
        title = info_panel.h2.text
        region = info_panel.find(attrs={"class": "region"}).text
        zone = info_panel.find(attrs={"class": "zone"}).text
        meters = info_panel.find(attrs={"class": "meters"}).text
        con = info_panel.find(attrs={"class": "con"}).text
        # subway = info_panel.find(attrs={"class": "fang-subway-ex"}).text
        # visit = info_panel.find(attrs={"class": "haskey-ex"}).text
        # warm = info_panel.find(attrs={"class": "heating-ex"}).text
        price = info_panel.find(attrs={"class": "price"}).text
        update = info_panel.find(attrs={"class": "price-pre"}).text
        lookman = info_panel.find(attrs={"class": "square"}).text
        house_one.append(title)
        house_one.append(region)
        house_one.append(zone)
        house_one.append(meters)
        house_one.append(con)
        # house_one.append(subway)
        # house_one.append(visit)
        # house_one.append(warm)
        house_one.append(price)
        house_one.append(update)
        house_one.append(lookman)
        house_all.append(house_one)
    return house_all
def main():
    url = "http://bj.lianjia.com/zufang/"
    house_result = []
    # walk result pages 1-100; the "l2" suffix filters for two-bedroom listings
    for i in range(1, 101):
        params = "pg" + str(i) + "l2"
        realurl = url + params
        result = get_house(realurl)
        house_result = house_result + result
    wb = Workbook()
    ws1 = wb.active
    ws1.title = "beijing"
    # one listing per row
    for row in house_result:
        ws1.append(row)
    # the filename means "Beijing two-bedroom rental listings"
    wb.save('北京兩室一廳租房信息.xlsx')
if __name__ == '__main__':
    main()
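Step 4 of the design, analysis, is not shown above. As one possibility, here is a sketch that reads the saved workbook back and averages the rents. It assumes the price field is the sixth column and is a string beginning with digits (e.g. "4500元/月"); the parsing is my own, not from the original program.

import re
from openpyxl import load_workbook

# A sketch for step 4 (analysis): average the rents in the saved file.
wb = load_workbook('北京兩室一廳租房信息.xlsx')
ws = wb.active
prices = []
for row in ws.iter_rows(values_only=True):
    m = re.search(r'\d+', str(row[5]))  # pull the first number out of the price text
    if m:
        prices.append(int(m.group()))
if prices:
    print("listings:", len(prices), "average rent:", sum(prices) / len(prices))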