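This post introduces multiprocessing and exception handling. Parallel programs in Python have plenty of pitfalls, and as a beginner I had to study them for quite a while before I dared to write one.
The code here is only a simple application; you can look up the advantages of parallel crawlers yourself on Baidu or Google.
I picked Xici (xicidaili.com) mainly to gather material for building a proxy pool later.
BTW, catching exceptions under multiprocessing is also something we need to pay attention to, and it deserves careful study!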
import traceback
from multiprocessing import Pool, freeze_support

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}

# One URL template per listing section on Xici; {page} is the page number.
xiciip = [
    'http://www.xicidaili.com/nn/{page}',
    'http://www.xicidaili.com/nt/{page}',
    'http://www.xicidaili.com/wn/{page}',
    'http://www.xicidaili.com/wt/{page}',
]


def getIPList(starturl):
    iplist = []
    try:
        # range(1, 2) fetches only page 1; raise the upper bound for more pages.
        for page in range(1, 2):
            resp = requests.get(starturl.format(page=page), headers=headers, timeout=10)
            resp.raise_for_status()
            genIPitem(resp.text, iplist)
    except Exception as e:
        # Catch inside the worker so one failing URL does not abort pool.map.
        print('error raised', e)
        traceback.print_exc()
    print(iplist)
    return iplist


def genIPitem(html, iplist):
    bs = BeautifulSoup(html, 'html.parser')
    # Skip the first row, which is the table header.
    for line in bs.find_all('tr')[1:]:
        item = {}
        # Cells 1 to 5 hold ip, port, location, stype and protocol, in that order.
        details = line.find_all('td')[1:6]
        item['ip'] = details[0].string
        item['port'] = details[1].string
        item['location'] = details[2].a.string if details[2].a is not None else details[2].string.strip()
        item['protocol'] = details[-1].string
        item['stype'] = details[-2].string
        iplist.append(item)


# Single process
# for url in xiciip:
#     getIPList(url)

# Multiprocess version
if __name__ == '__main__':
    freeze_support()
    pool = Pool()
    pool.map(getIPList, xiciip)
    pool.close()
    pool.join()
    print('bug completed')
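On the point about catching exceptions under multiprocessing: one common pattern is to have each worker return its failure as data, so the parent process can see which URL failed and why. Below is a minimal sketch of that idea, not the code used above. The names `fetch`, `safe_fetch` and `crawl_all` are hypothetical, and `fetch` is a stand-in that fails on purpose instead of touching the network.

```python
import traceback
from multiprocessing import Pool


def fetch(url):
    # Hypothetical stand-in for a real crawler function such as getIPList.
    # It fails on purpose for any URL containing 'bad'.
    if 'bad' in url:
        raise ValueError('cannot fetch ' + url)
    return [{'source': url}]


def safe_fetch(url):
    """Run fetch() and report a failure as data instead of raising."""
    try:
        return {'url': url, 'ok': True, 'items': fetch(url), 'error': None}
    except Exception:
        # format_exc() is called inside the worker, where the original
        # stack still exists, so the parent receives the full traceback text.
        return {'url': url, 'ok': False, 'items': [], 'error': traceback.format_exc()}


def crawl_all(urls, processes=2):
    # map() blocks until every URL is processed; the with-block then
    # shuts the pool down.
    with Pool(processes) as pool:
        return pool.map(safe_fetch, urls)


if __name__ == '__main__':
    for result in crawl_all(['http://example.com/good', 'http://example.com/bad']):
        print(result['url'], 'ok' if result['ok'] else 'failed')
        if result['error']:
            print(result['error'])
```

Without the wrapper, an exception raised in a worker is re-raised by `pool.map` in the parent and the results of the other URLs are lost, which is why the catch sits inside the worker function.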
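Crawling Xici is simple and poses no real difficulty. The site does impose a limit, though: too many requests within one minute get your IP banned, and you have to wait a minute before the ban is lifted. For the exact crawling rules, check the official site.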
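One simple way to stay under a per-minute limit like this is to pause between requests. The sketch below is only an illustration: `fetch_politely` is a hypothetical helper, and the 5-second default delay is an assumed value, since this post does not state the site's actual threshold. The `fetch` and `sleep` parameters are passed in so the helper can be tried without any network access.

```python
import time


def fetch_politely(urls, fetch, delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls.

    `delay` is an assumed value; tune it to the site's published rules.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


if __name__ == '__main__':
    # Demo with a fake fetch function and a short delay.
    pages = fetch_politely(['page-1', 'page-2'], fetch=str.upper, delay=0.1)
    print(pages)
```

Note that with a `Pool`, every worker process pauses independently, so the overall request rate is roughly the per-process rate multiplied by the number of processes.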