For work I needed to scrape job postings from Zhaopin (智聯招聘); these are my notes.
1. Overview

Zhaopin no longer requires a login to browse, so the request works even with the cookie removed from the headers. The listing page is loaded dynamically, however, so open the browser's developer tools, find the XHR request that actually returns the job data in the Network panel, and copy its URL. Opening that URL directly returns the JSON data.
在此之前要構(gòu)造請(qǐng)求頭
說(shuō)明一下url的組成
kw 搜索內(nèi)容
cityId 城市ID
kt 不知道為啥一定要為3,其他的關(guān)聯(lián)度有問(wèn)題。。
其他的無(wú)關(guān)緊要
# 根據(jù)第一頁(yè)的URL,抓取“python”崗位的信息
url = r'https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22jl%22:%22489%22,%22kw%22:%22%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88%22,%22kt%22:%223%22%7D&at=9c5682b1a4f54de89c899fb7efc7e359&rt=54eaf1be1b8845c089439d53365ea5dd&_v=0.84300214&x-zp-page-request-id=280f6d80d733447fbebafab7b8158873-1541403039080-617179'
# 構(gòu)造請(qǐng)求的頭信息,防止反爬蟲(chóng)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
2. Scraping
Send the request with requests.get and parse the JSON from the response.
The extraction rules are in the code below.
import requests
import pandas as pd

# Generate the paged URLs in a loop, send a request for each page, and parse the results
jobs = []
for start in range(0, 20001, 60):
    url = 'https://fe-api.zhaopin.com/c/i/sou?start=' + str(start) + r'&pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:5,%22jl%22:%22489%22,%22kw%22:%22%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88%22,%22kt%22:%223%22%7D&at=17a95e7000264c3898168b11c8f17193&rt=57a342d946134b66a264e18fc60a17c6&_v=0.02365098&x-zp-page-request-id=a3f1b317599f46338d56e5d080a05223-1541300804515-144155'
    response = requests.get(url, headers=headers)
    print('Downloading:', 'https://fe-api.zhaopin.com/c/i/sou?start=' + str(start) + '&pageSize=60', '......')
    results = response.json()['data']['results']  # parse the JSON once per page
    jobs.append(pd.DataFrame({
        'name': 'python',
        'company': [r['company']['name'] for r in results],
        'size': [r['company']['size']['name'] for r in results],
        'type': [r['company']['type']['name'] for r in results],
        'positionURL': [r['positionURL'] for r in results],
        'workingExp': [r['workingExp']['name'] for r in results],
        'eduLevel': [r['eduLevel']['name'] for r in results],
        'salary': [r['salary'] for r in results],
        'jobName': [r['jobName'] for r in results],
        'welfare': [r['welfare'] for r in results],
        'city': [r['city']['items'][0]['name'] for r in results],
        'createDate': [r['createDate'] for r in results]}))
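The field paths used above can be tried offline against a hand-made sample payload. The values below are invented for illustration; only the nesting structure mirrors the real response, and only three of the columns are shown:

```python
import pandas as pd

# A made-up payload with the same nesting as Zhaopin's JSON response
sample = {'data': {'results': [
    {'company': {'name': 'ACME', 'size': {'name': '100-499人'}, 'type': {'name': '民營'}},
     'positionURL': 'https://jobs.zhaopin.com/xxx.htm',
     'workingExp': {'name': '1-3年'}, 'eduLevel': {'name': '本科'},
     'salary': '10K-15K', 'jobName': 'Python工程師', 'welfare': ['五險一金'],
     'city': {'items': [{'name': '廣州'}]}, 'createDate': '2018-11-05'},
]}}

results = sample['data']['results']
df = pd.DataFrame({
    'company': [r['company']['name'] for r in results],
    'salary': [r['salary'] for r in results],
    'city': [r['city']['items'][0]['name'] for r in results],
})
print(df.shape)  # → (1, 3)
```

Checking the paths this way catches a renamed or missing field before you burn requests on the live site.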
將數(shù)據(jù)導(dǎo)出到Excel文件中,也可以存到數(shù)據(jù)庫(kù)
拼接所有頁(yè)碼下的招聘信息
jobs2 = pd.concat(jobs)
將數(shù)據(jù)導(dǎo)出到Excel文件中
jobs2.to_excel(r'G:\python.xlsx', index = False)
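The concatenation step can be sketched with two dummy "page" DataFrames (the data is invented; `ignore_index=True` renumbers the rows, which is why `index=False` is also passed to `to_excel` above so the index stays out of the file):

```python
import pandas as pd

# Two made-up pages of results, as they would accumulate in the jobs list
page1 = pd.DataFrame({'jobName': ['a', 'b'], 'salary': ['10K', '12K']})
page2 = pd.DataFrame({'jobName': ['c'], 'salary': ['15K']})

# Stack the per-page frames into one table
all_jobs = pd.concat([page1, page2], ignore_index=True)
print(len(all_jobs))  # → 3
```

From here `all_jobs.to_excel(...)` or `all_jobs.to_sql(...)` covers both export targets mentioned above.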
Done. The code above could certainly be optimized (parts of it were adapted from others' work), but the boss needed it urgently, so this is how it stands. I'll clean it up when I have time.