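Crawler project
Zhongchou.com (眾籌網), projects currently crowdfunding: http://www.zhongchou.com/brow... We use this site as the example: we crawl every project that is currently crowdfunding, collect the URL of each project's detail page, and write the URLs to a txt file.
Hands-on comparison
Python, original version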
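All of the listings below build the listing-page URLs the same way: page 1 is `.../browse/di` and page N is `.../browse/di-pN`. As a minimal sketch of that rule on its own (written for Python 3; the names `BASE` and `page_urls` are made up for illustration and do not appear in the original code):

```python
BASE = 'http://www.zhongchou.com/browse/di'

def page_urls(allpage):
    """Return the listing-page URLs for pages 1..allpage."""
    urls = []
    for page in range(1, allpage + 1):
        # Page 1 has no suffix; later pages append "-p<number>".
        urls.append(BASE if page == 1 else '%s-p%d' % (BASE, page))
    return urls

if __name__ == '__main__':
    for url in page_urls(3):
        print(url)
```

Building the list once up front also makes it easy to hand slices of it to worker threads later.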
# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time
from BeautifulSoup import BeautifulSoup # HTML
# Request headers
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Host':'www.zhongchou.com',
    'Upgrade-Insecure-Requests':'1',  # header values should be strings
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}
# Get the list of project URLs
def getItems(allpage):
    no = 0
    items = open('pystandard.txt','a')
    for page in range(allpage):
        # Page 1 has no suffix; page N is .../di-pN
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        # Each project card is an element with class "ssCardItem"
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            items.write(href+"\n")
            no +=1
    items.close()
    return no
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))
Results of 5 runs:
it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items
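A note on the timing calls: `time.clock()` measured wall-clock time on Windows but CPU time on Unix, and it was removed in Python 3.8. On Python 3, `time.perf_counter()` is the usual substitute. The following is a minimal sketch of the same measurement on Python 3; the helper name `timed` is made up for illustration and is not part of the original scripts.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

if __name__ == '__main__':
    # Any callable can be timed; sum() stands in for getItems(allpage) here.
    result, seconds = timed(sum, range(1000000))
    print('it takes %s Seconds to get %s' % (seconds, result))
```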
Python, multithreaded version
# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time,threading
from BeautifulSoup import BeautifulSoup # HTML
# Request headers
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Host':'www.zhongchou.com',
    'Upgrade-Insecure-Requests':'1',  # header values should be strings
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}
items = open('pymulti.txt','a')
no = 0
lock = threading.Lock()
# Get the list of project URLs
def getItems(urllist):
    # print urllist #①
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            # The output file and the counter are shared between threads
            lock.acquire()
            items.write(href+"\n")
            no +=1
            # print no
            lock.release()
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    allthread = 30
    per = (int)(allpage/allthread)  # pages per thread
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))
Results of 5 runs:
it takes 45.5222291114 Seconds to get 720 items
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items
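This multithreaded version shows no advantage. Toggling the commented-out print marked #① shows that the so-called multithreaded code is in fact running like single-threaded code.
Python, improved
Single-threaded
First, improve the HTML parsing step. Analysis shows that
lists = soup.findAll('a',attrs={"class":"siteCardICH3"})
is better than
lists = soup.findAll(attrs={"class":"ssCardItem"})
because it looks up the a elements directly, instead of finding each div first and then the a under it.
Results of 5 runs after the change; there is a modest improvement: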
it takes 41.0018861912 Seconds to get 720 items
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items
it takes 41.295524133 Seconds to get 720 items
it takes 42.9022894154 Seconds to get 720 items
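The idea of matching the anchor directly by tag and class can also be tried without BeautifulSoup. The sketch below swaps in the standard library's `html.parser` (Python 3) purely for illustration; it is not the author's method. The class `LinkCollector`, the function `collect_links`, and the sample HTML (including the id `id-000001`) are made up.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if self.css_class in classes and attrs.get('href'):
            self.hrefs.append(attrs['href'])

# Made-up sample standing in for one project card on a listing page.
SAMPLE = '''
<div class="ssCardItem">
  <a class="siteCardICH3" href="http://www.zhongchou.com/deal-show/id-000001">A</a>
  <a class="other" href="http://example.com/ignored">B</a>
</div>
'''

def collect_links(html, css_class='siteCardICH3'):
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.hrefs

if __name__ == '__main__':
    print(collect_links(SAMPLE))
```

This makes a single pass over the markup and never builds a tree, which is one reason a direct match can be cheaper than locating a container element first.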
Multithreaded
Change getItems(urllist) to getItems(urllist,thno),
and add print thno," begin at",time.clock() at the start of the function and print thno," end at",time.clock() at its end. Result:
0 begin at 0.00100631078628
0 end at 1.28625832936
1 begin at 1.28703230691
1 end at 2.61739476075
2 begin at 2.61801291642
2 end at 3.92514717937
3 begin at 3.9255829208
3 end at 5.38870235361
4 begin at 5.38921134066
4 end at 6.670658786
5 begin at 6.67125734731
5 end at 8.01520989534
6 begin at 8.01566383155
6 end at 9.42006780585
7 begin at 9.42053340537
7 end at 11.0386755513
8 begin at 11.0391565464
8 end at 12.421359168
9 begin at 12.4218294329
9 end at 13.9932716671
10 begin at 13.9939957256
10 end at 15.3535799145
11 begin at 15.3540870354
11 end at 16.6968289314
12 begin at 16.6972665389
12 end at 17.9798803157
13 begin at 17.9804714125
13 end at 19.326706238
14 begin at 19.3271438455
14 end at 20.8744308886
15 begin at 20.8751017624
15 end at 22.5306500245
16 begin at 22.5311450156
16 end at 23.7781693541
17 begin at 23.7787245279
17 end at 25.1775114499
18 begin at 25.178350742
18 end at 26.5497330734
19 begin at 26.5501776789
19 end at 27.970799259
20 begin at 27.9712727895
20 end at 29.4595075375
21 begin at 29.4599959972
21 end at 30.9507299602
22 begin at 30.9513989679
22 end at 32.2762763982
23 begin at 32.2767182045
23 end at 33.6476256057
24 begin at 33.648137392
24 end at 35.1100517711
25 begin at 35.1104907783
25 end at 36.462657099
26 begin at 36.4632234696
26 end at 37.7908515759
27 begin at 37.7912845182
27 end at 39.4359928956
28 begin at 39.436448698
28 end at 40.9955021593
29 begin at 40.9960871912
29 end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items
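Clearly the threads did not run concurrently at all. They ran one after another, which defeats the purpose of multithreading. Where is the problem? It turns out that in my loop the two lines
    th.start()
    th.join()
are adjacent, so each new thread is only started after the previous one has finished. Change the loop to
for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],i))
    ths.append(th)
# Start every thread first, then wait for all of them
for th in ths:
    th.start()
for th in ths:
    th.join()
Tail of the output: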
7 end at 69.1060433231
22 end at 69.2743398214
2 end at 69.5523713152
14 end at 69.6454986837
15 end at 69.8333400981
12 end at 69.9508018062
10 end at 70.2860348602
26 end at 70.3670659719
13 end at 70.3847232972
27 end at 70.3941635841
11 end at 70.5132838156
1 end at 70.7272351926
0 end at 70.9115253609
6 end at 71.0876563409
8 end at 71.112480539825
end at 71.1145248855
3 end at 71.4606034226
19 end at 71.6103622486
18 end at 71.6674453096
20 end at 71.725601862
17 end at 71.7778992318
9 end at 71.7847479301
28 end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items
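The effect of the start/join ordering can be isolated from the crawler. The sketch below (Python 3) uses `time.sleep` as a stand-in for a blocking HTTP request, so it only demonstrates the ordering issue and does not reproduce the slowdown measured above, which involves real downloads and parsing. The function names are made up for illustration.

```python
import threading
import time

def worker(delay):
    time.sleep(delay)  # stands in for a blocking HTTP request

def run_sequential(n, delay):
    """start() immediately followed by join(): threads run one at a time."""
    begin = time.perf_counter()
    for _ in range(n):
        th = threading.Thread(target=worker, args=(delay,))
        th.start()
        th.join()  # blocks here before the next thread is even created
    return time.perf_counter() - begin

def run_concurrent(n, delay):
    """Start every thread first, then join them all."""
    begin = time.perf_counter()
    ths = [threading.Thread(target=worker, args=(delay,)) for _ in range(n)]
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    return time.perf_counter() - begin

if __name__ == '__main__':
    print('sequential: %.2f s' % run_sequential(5, 0.1))
    print('concurrent: %.2f s' % run_concurrent(5, 0.1))
```

With five workers the sequential variant should take roughly five times the delay, while the concurrent variant should take roughly one delay.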
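Reflection
The multithreaded version above now runs concurrently, yet it takes far longer than the single-threaded one (about 72 s versus about 42 s). I have not found the cause yet. My guess: does BeautifulSoup not cope well with multithreading? Advice is welcome. To test this idea, I drop BeautifulSoup and use plain string searching instead. Start again with the single-threaded version: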
# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time
# Request headers
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Host':'www.zhongchou.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}
# Get the list of project URLs
def getItems(allpage):
    data = set()  # a set removes duplicate URLs
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        # print url #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode('utf8')
        start = 5000  # begin searching at offset 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n"
            # time.sleep(100)
            # index+10 skips "deal-show/"; the next 9 characters are the project id
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000  # jump ahead before searching again
    items = open('pystandard.txt','a')
    items.write("".join(data))
    items.close()
    return len(data)
if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,no))
it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items
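The slice arithmetic in the string search is easy to get wrong, so here is a minimal offline sketch of it (Python 3). The offsets 5000 and 1000 in the listing above appear to be shortcuts tuned to this site's page layout; the sketch uses 0 and 1 instead so that it works on any input. The function name `extract_urls` and the sample HTML with its ids are made up for illustration.

```python
MARKER = 'deal-show'
PREFIX = 'http://www.zhongchou.com/deal-show/'

def extract_urls(html, first=0, skip=1):
    """Collect unique detail-page URLs by plain substring search."""
    found = set()
    start = first
    while True:
        index = html.find(MARKER, start)
        if index == -1:
            break
        # index + 10 skips 'deal-show' (9 chars) and the '/' after it;
        # the following 9 characters are taken as the project id.
        found.add(PREFIX + html[index + 10:index + 19])
        start = index + skip
    return found

# Made-up sample: two distinct projects, one of them linked twice.
SAMPLE = ('<a href="http://www.zhongchou.com/deal-show/id-000001">x</a>'
          '<a href="http://www.zhongchou.com/deal-show/id-000002">y</a>'
          '<a href="http://www.zhongchou.com/deal-show/id-000001">again</a>')

if __name__ == '__main__':
    for url in sorted(extract_urls(SAMPLE)):
        print(url)
```

The fixed-width slice is the fragile part: it silently returns wrong ids if the id length ever changes, which is the price paid for skipping a real parser.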
Then modify the multithreaded version in the same way:
# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests,time,threading
# Request headers
header = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
    'Accept-Language':'zh-CN,zh;q=0.8',
    'Connection':'keep-alive',
    'Host':'www.zhongchou.com',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}
data = set()
no = 0
lock = threading.Lock()
# Get the list of project URLs
def getItems(urllist,thno):
    # print urllist
    # print thno," begin at",time.clock()
    global no,lock,data
    for url in urllist:
        r1 = requests.get(url,headers=header)
        html = r1.text.encode('utf8')
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # The result set is shared between threads
            lock.acquire()
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n")
            start = index + 1000
            lock.release()
    # print thno," end at",time.clock()
if __name__ == '__main__':
    start = time.clock()
    allpage = 30 # number of pages
    allthread = 10 # number of threads
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p'+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # Note the form: multiply first, then integer-divide, so the slices
        # cover every page even when allpage is not a multiple of allthread
        low = i*allpage//allthread
        high = (i+1)*allpage//allthread
        # print low,' ',high
        th = threading.Thread(target = getItems,args= (urllist[low:high],i))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open('pymulti.txt','a')
    items.write("".join(data))
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items '%(end-start,len(data)))
Results of 3 runs:
it takes 1.4781525123 Seconds to get 720 items
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items
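The slice bounds `low` and `high` in the listing above are worth a closer look. Multiplying before the integer division spreads any remainder across the threads, whereas a fixed `per = allpage/allthread` chunk size would drop trailing pages whenever the division is not exact. The following is a small sketch of the arithmetic (Python 3; the function name `chunk_bounds` is made up for illustration):

```python
def chunk_bounds(total, parts):
    """Return (low, high) slice bounds that split range(total) into parts chunks."""
    return [(i * total // parts, (i + 1) * total // parts) for i in range(parts)]

if __name__ == '__main__':
    print(chunk_bounds(30, 10))  # even split: 3 pages per thread
    print(chunk_bounds(10, 3))   # uneven split: the last chunk takes the remainder
```

For 10 pages and 3 threads, a fixed chunk size of 3 would cover only pages 0 to 8, while these bounds end at 10 and so include the last page.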
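Multithreading is clearly many times faster than the single-threaded version here (about 1.5 s versus about 11.5 s). For a simple crawling task, the built-in string methods are also much faster than parsing the HTML with BeautifulSoup.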
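On Python 3, the manual slicing, shared set, and lock can be replaced by a thread pool. This is a swapped-in technique, not what the scripts above do. The sketch below uses `concurrent.futures.ThreadPoolExecutor`; to stay runnable offline it uses a hypothetical `fake_fetch` that sleeps instead of calling `requests.get`, and the page contents it returns are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page):
    """Hypothetical stand-in for requests.get(url).text; sleeps instead of downloading."""
    time.sleep(0.05)
    return '<a href="http://www.zhongchou.com/deal-show/id-%06d">' % page

def parse(html):
    """Same substring trick as above: skip 'deal-show/' and take a 9-character id."""
    index = html.find('deal-show')
    return 'http://www.zhongchou.com/deal-show/' + html[index + 10:index + 19]

def crawl(pages, workers=10):
    # pool.map runs fake_fetch in worker threads and yields results in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages_html = list(pool.map(fake_fetch, pages))
    # Results are combined in the main thread, so no lock is needed.
    return set(parse(html) for html in pages_html)

if __name__ == '__main__':
    begin = time.perf_counter()
    urls = crawl(range(1, 31))
    print('it takes %s Seconds to get %s items' % (time.perf_counter() - begin, len(urls)))
```

Because each worker only returns a value and the main thread does the merging, there is no shared mutable state to protect.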
NodeJs
// npm install request -g  # does not seem to work; run it in the code's directory instead: npm install --save request
// npm install cheerio -g  # npm install --save cheerio
var request = require("request");
var cheerio = require('cheerio');
var fs = require('fs');
var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array();
var urldata = "";
var mark = 0; // number of pages processed so far
var no = 0;   // number of URLs collected
for (var i=0; i<allpage; i++) {
    if (i==0)
        urllist[i] = 'http://www.zhongchou.com/browse/di';
    else
        urllist[i] = 'http://www.zhongchou.com/browse/di-p'+(i+1).toString();
    // Requests are asynchronous, so all pages are fetched concurrently
    request(urllist[i],function(error,resp,body){
        if (!error && resp.statusCode==200) {
            getUrl(body);
        }
    });
}
function getUrl(data) {
    var $ = cheerio.load(data); // parse data with cheerio
    var href = $("a.siteCardICH3").toArray();
    for (var i = href.length - 1; i >= 0; i--) {
        // console.log(href[i].attribs["href"]);
        urldata += (href[i].attribs["href"]+"\n");
        no += 1;
    }
    mark += 1;
    if (mark==allpage) { // all pages have been processed
        // console.log(urldata);
        fs.writeFile('./nodestandard.txt',urldata,function(err){
            if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}
it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items
So with the same approach of parsing the HTML, Node.js beats Python by a wide margin on speed. What about string searching?
// Note: this version needs urldata to be an array, so declare it as
// var urldata = []; instead of var urldata = "";
function getUrl(data) {
    mark += 1;
    var start = 5000;
    while (true) {
        var index1 = data.indexOf("deal-show", start);
        if (index1 == -1)
            break;
        var url = "http://www.zhongchou.com/deal-show/"+data.substring(index1+10,index1+19)+"\n";
        // console.log(url);
        // Skip URLs that have already been collected
        if (urldata.indexOf(url)==-1) {
            urldata.push(url);
        }
        start = index1 + 1000;
    }
    if (mark==allpage) { // all pages have been processed
        // console.log(urldata);
        no = urldata.length;
        fs.writeFile('./nodestandard.txt',urldata.join(""),function(err){
            if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}
Results of 5 runs:
it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items
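So the time is about the same as with HTML parsing.
Summary
Based on what I know and on this experiment, my conclusion is this: with multithreading, Python's download speed can beat Node.js, but Python is not as fast as Node.js at parsing web pages. After all, JavaScript was made for writing web pages, and a complex crawler can hardly do everything with string searches.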
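The totals above mix download time and parsing time. One way to check the conclusion would be to time the two phases separately. The following is a minimal sketch of such a measurement (Python 3); `measure_phases`, `fake_fetch`, and `fake_parse` are made-up names, and the fake functions only stand in for a real download and a real parser.

```python
import time

def measure_phases(pages, fetch, parse):
    """Return (results, fetch_seconds, parse_seconds) for a sequential crawl."""
    fetch_s = parse_s = 0.0
    results = []
    for page in pages:
        t0 = time.perf_counter()
        html = fetch(page)
        t1 = time.perf_counter()
        results.append(parse(html))
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        parse_s += t2 - t1
    return results, fetch_s, parse_s

def fake_fetch(page):
    time.sleep(0.01)  # stands in for network latency
    return '<html>page %d</html>' % page

def fake_parse(html):
    return len(html)  # stands in for real parsing work

if __name__ == '__main__':
    _, fetch_s, parse_s = measure_phases(range(1, 6), fake_fetch, fake_parse)
    print('fetch: %.3f s, parse: %.3f s' % (fetch_s, parse_s))
```

Plugging in the real download and parsing functions would show which phase dominates in each language.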