Scraping 100,000 Product Listings

Problems encountered:

  • When scraping the transaction location, my original selector only returned ['交易地點(diǎn):'], not the 'location - location' pair I actually wanted.

Solution

Change the selector to:

```python
soup.select('#wrapper > div.content.clearfix > div.leftBox > div > div > ul > li:nth-of-type(3) > a')
```

If you only use

```python
soup.select('#wrapper > div.content.clearfix > div.leftBox > div > div > ul > li:nth-of-type(3) > a')[0].stripped_strings
```

the result is just the first location shown on the page, not all of them.
So I use the map() function to walk over every matched tag instead, as sketched below.
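A minimal sketch of that map() call, assuming `soup` is the parsed BeautifulSoup object of one item page; the printed values are purely illustrative:

```python
# Select every <a> inside the "交易地點(diǎn)" list item, then keep only the link text.
links = soup.select('#wrapper > div.content.clearfix > div.leftBox > div > div > ul > li:nth-of-type(3) > a')
address = list(map(lambda x: x.text, links))
print(address)  # e.g. ['朝陽(yáng)', '雙井'] -- example output only, differs per listing
```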

  • During the crawl I ran into this error:

```
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
```

A quick search suggested it is the [chunked-encoding problem when python requests receives a response](http://blog.csdn.net/wangzuxi/article/details/40377467), but the fixes given online were too advanced, and as a complete beginner I was instantly lost.
Solution

Later I noticed an Accept-Encoding field in the page's Request Headers, which looked related to the encoding problem, so I added the following headers to the requests.get call:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate'
}
```

After that the crawl went through smoothly.
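As a quick sanity check, a minimal sketch of one request made with those headers; the listing URL is only a placeholder:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate'
}

# Placeholder URL; any page from the crawl would do here.
web_data = requests.get('http://bj.ganji.com/wu/', headers=headers)
print(web_data.status_code)
```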
  • Later, after about 50,000 records, this error came up:

```
requests.exceptions.ConnectionError: None: Max retries exceeded with url: /qitawupin/o111/ (Caused by None)
```

Solution

Use proxy IPs, as sketched below.
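A minimal sketch of picking a proxy at random from a small pool and passing it to requests; the proxy addresses are the free ones used in the crawl code further down and may well be dead by now:

```python
import random
import requests

# Free proxies collected by hand; assume any of them can stop working at any time.
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]

proxies = {'http': random.choice(proxy_list)}  # rotate by picking one at random
web_data = requests.get('http://bj.ganji.com/wu/', proxies=proxies, timeout=10)
print(web_data.status_code)
```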


# The code

- Scraping the category links

```python
import requests
from bs4 import BeautifulSoup

first_url = 'http://bj.ganji.com/wu/'
base_url = 'http://bj.ganji.com'

# Category links look like http://bj.ganji.com/jiaju/

def get_second_url(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    second_urls = soup.select('dl.fenlei dt a')  # category entries on the index page
    for second_url in second_urls:
        whole_second_url = base_url + second_url.get('href')
        print(whole_second_url)
```

Copy the printed results into a multi-line string named whole_second_url, which the main script later splits into a list (see the sketch below).
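A minimal sketch of what that hand-pasted whole_second_url string could look like; the module name matches the later `from get_second_url import whole_second_url`, and the channel paths are only examples to be replaced with the URLs actually printed by get_second_url():

```python
# get_second_url.py
# Example channel URLs only; paste in the real output of get_second_url(first_url).
whole_second_url = '''
http://bj.ganji.com/jiaju/
http://bj.ganji.com/shouji/
'''
```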

- Scraping the listing pages

```python
import requests, time, pymongo, random
from bs4 import BeautifulSoup

client = pymongo.MongoClient('localhost', 27017)
ganji = client['ganji']
whole_third_url = ganji['whole_third_url']  # collection of item-detail URLs
item_info = ganji['item_info']              # collection of scraped item details

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate'
}
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)
proxies = {'http': proxy_ip}

def get_third_url(whole_second_url, pages):
    whole_url = '{}o{}/'.format(whole_second_url, str(pages))  # listing pages are paged as .../o1/, .../o2/, ...
    web_data = requests.get(whole_url, headers=headers, proxies=proxies)
    # time.sleep(5)
    soup = BeautifulSoup(web_data.text, 'lxml')
    if soup.find_all('a', {'class': 'next'}):  # only real listing pages have a "next" link
        for link in soup.select('li.js-item a.ft-tit'):
            third_url = link.get('href')
            whole_third_url.insert_one({'url': third_url})
            # print(third_url)
    else:
        pass
```
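A quick, hypothetical usage example; the channel URL comes from the category step above:

```python
# Fetch page 3 of the furniture channel and store its item URLs in MongoDB.
get_third_url('http://bj.ganji.com/jiaju/', 3)
```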


- Scraping the details from each item page

```python
def get_item_info(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    title = soup.select('h1.title-name')[0].text if soup.find_all('h1', {'class': 'title-name'}) else None
    # Deleted listings and Zhuanzhuan (转转) listings use different HTML from ordinary item pages;
    # by observation their title markup differs, so the title is used to tell them apart.
    if title == None:
        pass
    else:
        time = list(soup.select('i.pr-5')[0].stripped_strings) if soup.find('i', {'class': 'pr-5'}) else None
        type = soup.select('#wrapper > div.content.clearfix > div.leftBox > div:nth-of-type(3) > div > ul > li:nth-of-type(1) > span > a')[0].text if soup.find_all('ul', {'class': 'det-infor'}) else None
        price = soup.select('i.f22.fc-orange.f-type')[0].text if soup.find_all('i', {'class': 'f22 fc-orange f-type'}) else None
        address = list(map(lambda x: x.text, soup.select('#wrapper > div.content.clearfix > div.leftBox > div > div > ul > li:nth-of-type(3) > a'))) if soup.find_all('li') else None
        old_new = soup.select('ul.second-det-infor.clearfix > li:nth-of-type(2) > label')[0].text if soup.select('ul.second-det-infor.clearfix > li:nth-of-type(2) > label') else None
        item_info.insert_one({'title': title, 'time': time, 'type': type, 'price': price, 'address': address, 'old_new': old_new})
        print(title, time, type, price, address, old_new)
```
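A minimal sketch of feeding the stored item URLs back into get_item_info, assuming it runs in the same module where whole_third_url is defined:

```python
# Walk the item URLs saved by get_third_url() and scrape each detail page in turn.
for record in whole_third_url.find():
    get_item_info(record['url'])
```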
- Starting the crawl

```python
from multiprocessing import Pool
from get_second_url import whole_second_url
from get_third_url import get_third_url
from get_third_url import get_item_info

def get_all_links_from(whole_second_url):
    # Walk listing pages 1-120 of one channel
    for i in range(1, 121):
        get_third_url(whole_second_url, i)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, whole_second_url.split())
```


- Counting

The following script keeps a running count of the records stored in the database:

```python
import time
from get_third_url import whole_third_url

while True:
    print(whole_third_url.find().count())  # number of item URLs collected so far
    time.sleep(5)
```
