Scraping Shanghai Lianjia Listings with Python and Storing Them in MongoDB

These are my study notes from scraping second-hand housing listings for Baoshan District from the Shanghai Lianjia site.

Preparation

Python modules used:

  • requests
  • bs4
  • pymongo
  • datetime
  • time
  • random

Analyzing the Page

Open http://sh.lianjia.com/ershoufang/baoshan in Chrome and launch the developer tools.


Each listing sits in its own li element. Next, look at the pagination links.

Clicking through to the next page shows that the URL in the browser follows a regular pattern:

http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
.........
http://sh.lianjia.com/ershoufang/baoshan/d100

Now let's try requesting the first 10 pages:

import requests
for i in range(1, 11):
    r = requests.get('http://sh.lianjia.com/ershoufang/baoshan/d' + str(i))
    print(r.url)
  

Output:

http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
http://sh.lianjia.com/ershoufang/baoshan/d4
http://sh.lianjia.com/ershoufang/baoshan/d5
http://sh.lianjia.com/ershoufang/baoshan/d6
http://sh.lianjia.com/ershoufang/baoshan/d7
http://sh.lianjia.com/ershoufang/baoshan/d8
http://sh.lianjia.com/ershoufang/baoshan/d9
http://sh.lianjia.com/ershoufang/baoshan/d10

Parsing the Page

The fields to extract are:

  • Title: room_title = room.find('div', attrs={'class': 'prop-title'})
  • Listing details: room_info = room.find('span', attrs={'class': 'info-col row1-text'})
  • Location: room_location = room.find('span', attrs={'class': 'info-col row2-text'})
  • Extra info: extra_info = room.find('div', attrs={'class': 'property-tag-container'})
  • Total price: room_price = room.find('span', attrs={'class': 'total-price strong-num'})
  • Unit price: room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'})

from bs4 import BeautifulSoup

# r is the response returned by requests.get() in the loop above
soup = BeautifulSoup(r.text, 'html.parser')
# The ul with class js_fang_list holds one li per listing
rooms = soup.find('ul', attrs={'class': 'js_fang_list'})
for room in rooms.find_all('li'):
    room_title = room.find('div', attrs={'class': 'prop-title'}).get_text()
    room_info = room.find('span', attrs={'class': 'info-col row1-text'}).get_text()
    # The location name is the text of the link inside the row2-text span
    room_location = room.find('span', attrs={'class': 'info-col row2-text'}).find('a').get_text()
    room_price = room.find('span', attrs={'class': 'total-price strong-num'}).get_text()
    room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'}).get_text()
    extra_info = room.find('div', attrs={'class': 'property-tag-container'}).get_text()

    print(room_title, room_info, room_location, room_price, room_unit_price, extra_info)
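
One thing to watch in the loop above: find() returns None when an element is missing, and calling get_text() on None raises AttributeError. A minimal sketch of a guard is shown below. The helper name safe_text and the FakeTag stand-in are hypothetical and not part of the original script. FakeTag only exists so the example runs without bs4 or a network request.

```python
def safe_text(node, default=''):
    """Return the stripped text of a node, or default when the node is None."""
    if node is None:
        return default
    return node.get_text(strip=True)


class FakeTag:
    """Hypothetical stand-in for a bs4 Tag, used only to run this sketch offline."""

    def __init__(self, text):
        self.text = text

    def get_text(self, strip=False):
        return self.text.strip() if strip else self.text


print(repr(safe_text(FakeTag('  255 '))))  # '255'
print(repr(safe_text(None)))               # ''
```

In the scraper this could be used as safe_text(room.find('span', attrs={'class': 'total-price strong-num'})), so one malformed li does not stop the whole page.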

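Below is one listing as parsed from the page (text translated from Chinese; the whitespace is what get_text() returns):


Kitchen and bathroom both have windows, bedroom with balcony, near the metro, bright high floor
 
                            1 bedroom, 1 living room | 44.73 sqm
                            
                                | High floor / 6 storeys
                            
                            
                                | South-facing
                            
                         Fengrun Huating 255 
                            Unit price 57008 yuan/sqm
                         
698 m from Qihua Road station on Line 7
Owned for over two years
Keys available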
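
The parsed text carries a lot of layout whitespace, so it helps to clean it before storing. The sketch below is one possible approach, not part of the original script. The helper names clean_text, split_fields and first_number are hypothetical, and the sample string mirrors the room_info text of the listing above in the site's original Chinese.

```python
import re


def clean_text(raw):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(raw.split())


def split_fields(raw, sep="|"):
    """Split a separator-delimited string and drop blank parts."""
    return [part.strip() for part in raw.split(sep) if part.strip()]


def first_number(text):
    """Return the first integer or decimal found in text as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# Sample shaped like the room_info text of one listing
raw_info = """
    1室1厅 | 44.73平

        | 高区/6层

        | 朝南
"""

fields = split_fields(raw_info)
print(clean_text(raw_info))              # 1室1厅 | 44.73平 | 高区/6层 | 朝南
print(fields)                            # ['1室1厅', '44.73平', '高区/6层', '朝南']
print(first_number(fields[1]))           # 44.73
print(first_number("单价57008元/平"))     # 57008.0
```

Storing the area and unit price as numbers rather than strings would make later sorting and filtering easier.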

Storing the Data in MongoDB

MongoDB stores data as documents made of key-value pairs, {key: value}, much like JSON.


import datetime

from pymongo import MongoClient

# Connect to the local MongoDB server
client = MongoClient('localhost', 27017)
# Select the database (MongoDB creates it on first write)
db = client.tests
# Select the collection
homes = db.homes

rooms_list = []

# Put the scraped fields into a dictionary
rooms_info = {
    'title': room_title,
    'info': room_info,
    'location': room_location,
    'price': room_price,
    'unit_price': room_unit_price,
    'message': extra_info,
    'time': datetime.datetime.now()
}

rooms_list.append(rooms_info)
# Insert into the database
result = homes.insert_many(rooms_list)
print(result)
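
insert_many takes a list, so it is a natural fit for collecting every listing on a page first and writing them in one call. The sketch below shows one way to build that list. The helper name build_documents and the sample row are hypothetical, and the keys follow the dictionary above with the unit price key spelled unit_price. The MongoDB call is left as a comment so the example runs without a database.

```python
import datetime


def build_documents(parsed_rooms, scraped_at=None):
    """Turn (title, info, location, price, unit_price, message) tuples into dicts."""
    scraped_at = scraped_at or datetime.datetime.now()
    keys = ('title', 'info', 'location', 'price', 'unit_price', 'message')
    docs = []
    for row in parsed_rooms:
        doc = dict(zip(keys, row))
        # One shared timestamp marks all listings from the same crawl
        doc['time'] = scraped_at
        docs.append(doc)
    return docs


# Sample row built from the listing shown earlier; the title is a placeholder
rows = [('sample title', '1室1厅 | 44.73平', '葑润华庭', '255', '单价57008元/平', '满二')]
docs = build_documents(rows)
print(len(docs), sorted(docs[0]))

# With the homes collection from the snippet above, one call could store the page.
# insert_many rejects an empty list, hence the check:
# if docs:
#     homes.insert_many(docs)
```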

Running the code prints an InsertManyResult for each insert, showing that the data was written to MongoDB:

<pymongo.results.InsertManyResult object at 0x00000260C536AB8>
<pymongo.results.InsertManyResult object at 0x00000260C536AAC>
<pymongo.results.InsertManyResult object at 0x00000260C536AA0>
<pymongo.results.InsertManyResult object at 0x00000260C536AB4>
<pymongo.results.InsertManyResult object at 0x00000260C536AB0>
<pymongo.results.InsertManyResult object at 0x00000260C536A28>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536888>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A48>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>

A MongoDB GUI makes it easy to check the result. I use Robo 3T, and the stored documents show up there.


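There are 100 pages of data in total. I use time.sleep() to throttle the requests so the crawler does not get blocked, but that makes crawling very slow. Over the next couple of days I plan to learn pandas.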
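
The throttling described above can be sketched as follows, using the time and random modules from the list at the top. This is a minimal illustration and not the original script: the function name crawl, the 1 to 3 second delay range and the stand-in fetcher are assumptions chosen so the example runs offline.

```python
import random
import time


def crawl(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch each URL in order, pausing a random interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause is needed before the very first request
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


base = 'http://sh.lianjia.com/ershoufang/baoshan/d'
urls = [base + str(i) for i in range(1, 4)]

# Stand-in fetcher and sleeper so the sketch runs instantly and offline.
# A real run could pass fetch=lambda url: requests.get(url, timeout=10).text
# and leave sleep at its default.
delays = []
pages = crawl(urls, fetch=lambda url: 'page for ' + url, sleep=delays.append)
print(len(pages), len(delays))  # 3 2
```

A randomized delay makes the request timing less regular than a fixed time.sleep() value, at the cost of the same overall slowness.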

The full code is on GitHub
Jianshu
You are welcome to visit Treehl's blog
