Scrape the first three pages of short-term rental listings, store the data in MongoDB, and print the listings priced at 500 yuan or more.
Code:
import requests
from bs4 import BeautifulSoup
import pymongo

# Connect to the local MongoDB server and select the database and collection.
client = pymongo.MongoClient('localhost')
duanzufang = client['dzf']
listings = duanzufang['list']

# Search result URLs for pages 1 to 3.
urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]
head = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'}

for url in urls:
    wb_data = requests.get(url, headers=head)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.find_all('span', {'class': 'result_title hiddenTxt'})
    infos = soup.find_all('em', {'class': 'hiddenTxt'})
    prices = soup.find_all('span', {'class': 'result_price'})
    for title, info, price in zip(titles, infos, prices):
        # The info text is split on '-' into type, comment count, and address.
        fields = info.get_text().replace('\n', '').replace(' ', '').split('-')
        data = {
            'title': title.get_text(),
            'typo': fields[0],
            'comment_num': fields[1],
            'address': fields[2],
            'price': int(price.i.get_text())
        }
        listings.insert_one(data)

# After all pages are stored, print every listing priced at 500 yuan or more.
for item in listings.find({'price': {'$gte': 500}}):
    print(item)
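The scraper above indexes the result of split('-') directly, which raises an IndexError if a listing has fewer than three parts and drops part of the address if the address itself contains a '-'. A minimal sketch of a more defensive helper follows. The name parse_info and the sample strings are hypothetical, and the "type-comment count-address" layout is an assumption taken from how the code above reads the text.

```python
def parse_info(raw):
    """Split a 'type - comment count - address' string into three fields.

    Hypothetical helper: the three-part layout is assumed from the scraper,
    not confirmed against the live page.
    """
    # Remove newlines and spaces, as the scraper above does.
    cleaned = raw.replace('\n', '').replace(' ', '')
    # maxsplit=2 keeps any '-' inside the address in the last field.
    parts = cleaned.split('-', 2)
    # Pad with empty strings so short input never raises IndexError.
    parts += [''] * (3 - len(parts))
    return {'typo': parts[0], 'comment_num': parts[1], 'address': parts[2]}


if __name__ == '__main__':
    # Made-up sample text, not real site data.
    print(parse_info('Entire home\n - 12 reviews - Chaoyang-East'))
```

The returned dict uses the same keys as the scraper, so it could be merged into data with data.update(parse_info(info.get_text())).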
Summary:
1. Understood the structure of the web page.
2. Learned how to use the find family of functions by reading the Bs4 documentation.
3. Learned how to create a database and insert data into it.
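Question:
If I want to scrape every page rather than only the first three, I noticed that the page numbers shown at the bottom of the page change dynamically. Teacher, how should I scrape the site in this case?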
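One possible approach, sketched here under assumptions rather than verified against the site: ignore the page-number widget and keep requesting consecutive page URLs until a page yields no listings. This assumes that a page number past the last one returns an empty result list, which has not been confirmed. The names crawl_all_pages, fetch_listings, and max_pages are hypothetical; fetch_listings stands for any function that downloads a URL and returns the parsed listings, such as the body of the loop above.

```python
def crawl_all_pages(fetch_listings, max_pages=100):
    """Collect listings page by page until an empty page or max_pages.

    fetch_listings is a hypothetical callable: it takes a URL and returns
    a list of listings (an empty list when the page has none).
    """
    results = []
    for page in range(1, max_pages + 1):
        url = 'http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(page)
        listings = fetch_listings(url)
        # Assumption: an empty page means we have gone past the last page.
        if not listings:
            break
        results.extend(listings)
    return results


if __name__ == '__main__':
    # Stand-in for real network access: only two pages have data.
    fake_site = {
        'http://bj.xiaozhu.com/search-duanzufang-p1-0/': ['a'],
        'http://bj.xiaozhu.com/search-duanzufang-p2-0/': ['b', 'c'],
    }
    print(crawl_all_pages(lambda url: fake_site.get(url, [])))
```

The max_pages cap is a safety limit so the loop still ends if the site never returns an empty page, for example if it repeats the last page instead.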