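For a simple site like this one, if crawl speed is what you are after, Scrapy is the tool to reach for. It is a crawling framework written in pure Python on top of Twisted. You only need to customize a few modules to build a crawler that fetches page content and all kinds of images, which makes it very convenient.
1. Scrapy architecture
Scrapy uses the asynchronous networking library Twisted to handle network communication. Its architecture is clean, and it exposes a range of middleware interfaces, so it can be adapted flexibly to many requirements. The overall architecture is shown in the figure below: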

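2. Workflow
<ul><li>Create a Scrapy project</li><li>Define the Item to extract</li><li>Write a spider that crawls the site and extracts Items</li><li>Write an Item Pipeline to store the extracted Items (that is, the data)</li></ul>
For a detailed walkthrough, see the Scrapy getting-started tutorial; I won't repeat it here.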
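As a rough illustration of the last step in the list, an Item Pipeline is a plain Python class with a `process_item(self, item, spider)` method that Scrapy calls for every item the spider yields. The sketch below writes each item as one JSON line. The class name `JsonLinesPipeline` and the file name `items.jl` are assumptions made for this example and are not from the original project, which stores its data in MySQL.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every scraped item to a JSON Lines file."""

    def __init__(self, path="items.jl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once by Scrapy when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once by Scrapy when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline, if any
```

To enable a pipeline like this, it would be listed under `ITEM_PIPELINES` in the project's `settings.py`.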
3. Implementation

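The requirement is as follows: scrape the data from the ZZ91 site shown in the figure (the parts marked in red). Looking at the URL, the pattern is easy to spot: the final number is the page number. Assuming the earlier steps are already done, we now write a spider. When implementing it, first analyze the page's tags, then parse the page with BeautifulSoup, XPath, selectors, or regular expressions; any of them will do.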
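The URL pattern can be sketched as a small helper. The function name `page_urls` is hypothetical; the base URL is the one used in the spider code in this section, and the page range 1 to 3 is only an example.

```python
BASE = "http://jiage.zz91.com/s/e5ba9fe794b5e793b6-1/"


def page_urls(first, last, base=BASE):
    """Return the listing URLs for pages first..last (inclusive)."""
    return [f"{base}{page}.html" for page in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```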

The core of the spider looks like this:
<pre><code>import urllib.request

import scrapy
from bs4 import BeautifulSoup

# Assumed location of the Item class; adjust to your project layout.
from ..items import Zz91Item


class DemoSpider(scrapy.spiders.Spider):
    name = "fdp1"
    start_urls = [
        "http://jiage.zz91.com/s/e5ba9fe794b5e793b6-1/1.html",
    ]
    base = "http://jiage.zz91.com/s/e5ba9fe794b5e793b6-1/"
    # Pages 2 and 3 follow the same pattern; only the trailing number changes.
    for i in range(2, 4):
        url = base + str(i) + '.html'
        start_urls.append(url)

    def parse(self, response):
        soup1 = BeautifulSoup(response.body, "lxml")
        divs = soup1.find_all('div', {'class': 'l-main'})
        for div in divs:
            items1 = div.find_all('div', {'class': 'm-item'})    # listing entries
            items2 = div.find_all('div', {'class': 'm-item_2'})  # contact entries
            for entry, contact in zip(items1, items2):
                item = Zz91Item()
                link = entry.find('a')['href']
                print(link)
                item['fdp1_url'] = link
                item['fdp1_title'] = entry.find('a').get_text()
                # Fetch the detail page (a blocking call) and parse its body text.
                response2 = urllib.request.urlopen(link)
                soup2 = BeautifulSoup(response2, "lxml")
                fdp1_content = soup2.find('div', {'class': 'p_content p_contentA'})
                fdp1_content = fdp1_content.get_text()
                # Collapse all runs of whitespace into single spaces.
                fdp1_content = ' '.join(fdp1_content.split())
                item['fdp1_content'] = fdp1_content
                contact_link = contact.find('a')['href']
                item['fdp1_contact_url'] = contact_link
                item['fdp1_company'] = contact.find('a').get_text()
</code></pre>
<p>I store the data in a MySQL database. To avoid storing duplicates, the spider class needs one more method, check, which tests whether a record has already been crawled. The code is as follows:</p>
<pre><code>def check(self, article_url):
    """Return True if article_url has not been stored yet."""
    # Database is the project's own MySQL helper class.
    # This method also needs `import traceback` at module level.
    self.database = Database()
    self.database.connect('crawl_data')
    sql = "SELECT * FROM zz91_feidianping_1 where fdp1_url=%s order by fdp1_url"
    data = (article_url,)
    try:
        search_result = self.database.query(sql, data)
        if search_result == ():
            return True
    except Exception as e:
        print(e)
        traceback.print_exc()
    finally:
        # Close the connection on every path.
        self.database.close()
    return False
</code></pre>
<p>Then add the following two lines at the end of the parse function, inside the loop that builds each item:</p>
<pre><code>if self.check(item['fdp1_url']):
    yield item
</code></pre>
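The same check-before-yield idea can be tried without a MySQL server. The following is a minimal sketch that swaps in Python's built-in `sqlite3` module for the `Database` helper, whose implementation is not shown in this post. The table name `pages` and the function names are assumptions made for this illustration.

```python
import sqlite3


def make_store():
    """Create an in-memory SQLite table standing in for the MySQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (fdp1_url TEXT PRIMARY KEY)")
    return conn


def is_new(conn, url):
    """Return True if the URL has not been stored yet."""
    row = conn.execute(
        "SELECT 1 FROM pages WHERE fdp1_url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is None


def remember(conn, url):
    """Record a URL; INSERT OR IGNORE makes repeated calls harmless."""
    conn.execute("INSERT OR IGNORE INTO pages (fdp1_url) VALUES (?)", (url,))
    conn.commit()


conn = make_store()
for url in ["http://example.com/1", "http://example.com/2", "http://example.com/1"]:
    if is_new(conn, url):
        remember(conn, url)
        print("new:", url)
    else:
        print("duplicate:", url)
```

Note that `sqlite3` uses `?` as its placeholder, whereas the MySQL query above uses `%s`. In both cases the value is passed separately from the SQL text rather than formatted into it.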
<p>The scraped data is shown in the figure:




The detail endpoint is http://apptest.zz91.com/detail/?id=%s&appsystem=%s&company_id=%s&datatype=%s&usertoken=%s , but the id in this URL is the pdt_id returned by
http://apptest.zz91.com/offerlist/?clientid=867450021846562&company_id=%s&appsystem=%s&keywords=%s&page=%s&orderflag=&datatype=%s&usertoken=%s , so the list JSON has to be fetched first, and only then can the detail JSON be requested.
The implementation is as follows:
<pre><code># coding: utf-8
import urllib.parse

import requests

appsystem = 'XXXXXXXXXXXXXXXX'
company_id = 'XXXXXXXXXXXXXXXX'
datatype = 'XXXXXXXXXXXX'
usertoken = 'XXXXXXXXXXXXXX'
keystr = "XXXXXXXXXXXXXXXX"
keywords = urllib.parse.quote(keystr)

# Request headers built from the captured traffic
headers = {
    'Host': 'apptest.zz91.com',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4.4; 2014112 Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip',
    'Connection': 'Keep-Alive',
    'Charset': 'UTF-8',
    'Cookie': 'sessionid=XXXXXXXXXXXXXXXXX',
}
r = requests.session()
count = 1
for i in range(1, 100):
    url = 'http://apptest.zz91.com/offerlist/?clientid=867450021846562' \
          '&company_id=%s&appsystem=%s&keywords=%s' \
          '&page=%s&orderflag=&datatype=%s&usertoken=%s' \
          % (company_id, appsystem, keywords, str(i), datatype, usertoken)
    response = r.post(url=url, headers=headers)
    rjson = response.json()
    productList = rjson['productList']
    # print(productList)
    for product in productList:
        print('Record %s' % (str(count)))
        count += 1
        com_name = product['com_name']  # company name
        pdt_price = product['pdt_price']  # price
        pdt_id = product['pdt_id']  # listing id
        isshowcontact = product['isshowcontact']
        isbuycontact = product['isbuycontact']
        com_id = product['com_id']  # company id
        pdt_time_en = product['pdt_time_en']  # publication date
        pdt_name = product['pdt_name']  # listing title
        com_province = product['com_province']  # location
        pdt_detail = product['pdt_detail']  # listing details, though fairly brief
        ldbtel = product['ldbtel']
        pdt_kind = product['pdt_kind']
        kindtxt = pdt_kind['kindtxt']  # supply or wanted
        kindclass = pdt_kind['kindclass']  # e.g. buy
        com_subname = product['com_subname']
        pdt_images = product['pdt_images']
        phone_rate = product['phone_rate']
        phone_level = product['phone_level']
        vippaibian = product['vippaibian']
        pdt_name1 = product['pdt_name1']
        wordsrandom = product['wordsrandom']
        # The detail endpoint is keyed by the pdt_id from the list JSON.
        detail_url = 'http://apptest.zz91.com/detail/?' \
                     'id=%s&appsystem=%s&company_id=%s' \
                     '&datatype=%s&usertoken=%s' \
                     % (pdt_id, appsystem, company_id, datatype, usertoken)
        detail_pdt = r.post(url=detail_url, headers=headers)
        detail_json = detail_pdt.json()
        detailList = detail_json['list']
        address = detailList['address']
        compname = detailList['compname']
        business = detailList['business']
        contact = detailList['contact']
        details = detailList['details']
        email = detailList['email']
        mobile = detailList['mobile']
        mobile1 = detailList['mobile1']
        expire_time = detailList['expire_time']  # listing expiry date
        title = detailList['title']
        price = detailList['price']
        price_unit = detailList['price_unit']
        quantity = detailList['quantity']
        quantity_unit = detailList['quantity_unit']
        print(title)
        print(details)
        print(address)
        print(compname)
        print(contact)
        print(mobile)
        print('Price: ' + str(price) + str(price_unit))
        print('Quantity: ' + str(quantity) + str(quantity_unit))
        print('Published: ' + pdt_time_en)  # publication date
        print('Valid until: ' + expire_time)  # listing expiry date
</code></pre>The result is shown in the figure below:

</p>
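The two-step lookup in the script above (fetch the list JSON, then use each `pdt_id` to build the detail URL) can be separated from the network calls, which also makes it easy to stop paging once a page comes back empty instead of always requesting pages 1 through 99. This is only a sketch: `iter_products`, `detail_url`, and the fake page data are hypothetical, and it assumes the list endpoint returns an empty `productList` past the last page, which has not been verified against the real API.

```python
from urllib.parse import urlencode

DETAIL_ENDPOINT = "http://apptest.zz91.com/detail/"


def detail_url(pdt_id, appsystem, company_id, datatype, usertoken):
    """Build the detail URL for one product id taken from the list JSON."""
    query = urlencode({
        "id": pdt_id,
        "appsystem": appsystem,
        "company_id": company_id,
        "datatype": datatype,
        "usertoken": usertoken,
    })
    return f"{DETAIL_ENDPOINT}?{query}"


def iter_products(fetch_page, max_pages=99):
    """Yield products page by page, stopping at the first empty page.

    fetch_page(page) must return the parsed list JSON for that page.
    """
    for page in range(1, max_pages + 1):
        products = fetch_page(page).get("productList") or []
        if not products:
            break
        yield from products


# Made-up payloads standing in for the real responses.
fake_pages = {
    1: {"productList": [{"pdt_id": 101}, {"pdt_id": 102}]},
    2: {"productList": [{"pdt_id": 103}]},
}

for product in iter_products(lambda page: fake_pages.get(page, {})):
    print(detail_url(product["pdt_id"], "APP", "COMPANY", "TYPE", "TOKEN"))
```

In a real run, `fetch_page` would wrap the `requests` call to the list endpoint and return `response.json()`.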
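4. Summary
With that, all the functionality is in place. One thing worth mentioning: capturing the mobile client's traffic with Fiddler is really convenient; see the link for the configuration tutorial. The next goal is to recognize the phone numbers that appear in images.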
