Scrapy Spider (1) - Zhihu
My blog
This is the first spider I built with Scrapy, so I'm recording the process in detail.
The full code is available here, though it may differ slightly from what's shown below.
0. Validating the idea
Before creating the project, I used a short piece of code to check that the basic functionality works.
Zhihu lets you fetch user information without logging in, which makes things much easier for me. The "following" page also embeds the user's own profile data, so simply visiting https://www.zhihu.com/people/(token)/following
gives me everything I need (the following list only shows 20 entries, but that doesn't matter).

The data to extract is shown above. Pasting it into a JSON viewer for analysis shows that what I need lives in two places:
one part under ['people']['followingByUser'][urltoken]['ids']
the other under ['entities']['users'][urltoken]
With that located, I could start writing the scraping code.
One caveat: Zhihu's JSON omits empty fields. When a user has no school value, for example, the corresponding node simply doesn't exist, and accessing it directly raises an error. I didn't find a particularly elegant workaround, so I wrote try...except blocks (a reusable helper method on an object would have ended up at about the same code size, so I dropped that idea).
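Incidentally, the repeated try...except blocks could be folded into one small helper that walks the nested keys and returns a default when any level is missing. A quick sketch (`safe_get` and the sample data are my own illustration, not code from the project):

```python
def safe_get(data, *keys, default=''):
    """Walk nested dict/list keys, returning default if any level is missing."""
    for key in keys:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return default
    return data

# Hypothetical fragment shaped like Zhihu's embedded JSON
user_json = {'educations': [{'school': {'name': '某大學(xué)'}}]}
print(safe_get(user_json, 'educations', 0, 'school', 'name'))   # → 某大學(xué)
print(safe_get(user_json, 'employments', 0, 'company', 'name'))  # → ''
```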
The code:
```python
import json

import requests
from bs4 import BeautifulSoup


def user(urltoken):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    url = 'https://www.zhihu.com/people/' + urltoken + '/following'
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    # The profile data is embedded as JSON in the data-state attribute
    json_text = soup.body.contents[1].attrs['data-state']
    ob_json = json.loads(json_text)
    followinglist = ob_json['people']['followingByUser'][urltoken]['ids']
    tempset = set(followinglist)
    tempset.discard(None)  # the ids list may contain None entries
    followinglist = list(tempset)
    user_json = ob_json['entities']['users'][urltoken]
    user_info = user_json['headline']
    try:
        school = user_json['educations'][0]['school']['name']
    except (KeyError, IndexError):
        school = ''
    try:
        major = user_json['educations'][0]['major']['name']
    except (KeyError, IndexError):
        major = ''
    try:
        job = user_json['employments'][0]['job']['name']
    except (KeyError, IndexError):
        job = ''
    try:
        company = user_json['employments'][0]['company']['name']
    except (KeyError, IndexError):
        company = ''
    try:
        description = user_json['description']
    except KeyError:
        description = ''
    try:
        business = user_json['business']['name']
    except KeyError:
        business = ''
    try:
        zhihu_name = user_json['name']
    except KeyError:
        zhihu_name = ''
    try:
        location = user_json['locations'][0]['name']
    except (KeyError, IndexError):
        location = ''
    gender = user_json['gender']
    if gender == 1:
        gender = '男'
    elif gender == 0:
        gender = '女'
    else:
        gender = '未知'
    user_list = [user_info, job, company, description, business, zhihu_name, location, gender, school, major]
    print(user_list)
    return followinglist


urltoken = 'sgai'
for urltoken in user(urltoken):
    print(user(urltoken))
```
Part of the output:
['喜歡用數(shù)據(jù)講故事。', '數(shù)據(jù)挖掘工程師', '物流', '有關(guān)于我的工作和生活都在微信公眾號:一個程序員的日常;會接數(shù)據(jù)采集、爬蟲定制、數(shù)據(jù)分析相關(guān)的單子,請直接私信我。', '互聯(lián)網(wǎng)', '路人甲', '上海', '男', '', '']
['不追先圣腳步,我行處,皆道路!', '二當(dāng)家', '游戲公司', '此心光明,亦復(fù)何言。我已委托“維權(quán)騎士”(<a class=" external" target="_blank" rel="nofollow noreferrer"><span class="invisible">http://</span><span class="visible">rightknights.com</span><span class="invisible"></span><i class="icon-external"></i></a>)為我的文章進(jìn)行維權(quán)行動', '互聯(lián)網(wǎng)', '斷罪小學(xué)趙日天', '深圳/南京/上海', '男', '', '']
['yang-ze-yong-3', 'zhong-wen-40-43', 'ni-ke-ri-xiang-ji', 'miloyip', 'zuo-wen-jie', 'lxghost', 'mukadas-kadir', 'justin-99-9', 'wang-ruo-shan-88', 'zhaoyanbo0098', 'guo-tao-45-48', 'mtjj', 'satanzhangdi', 'wang-hong-hao-99', 'bei-mang-4', 'water-five', 'li-ji-ren', 'he-jing-92-23', 'wei-lan-tian-4', 'yang-da-bao-32']
['個人微信:Maekcurtain', '414632028', 'UI設(shè)計交流群', '', '互聯(lián)網(wǎng)', '莫若', '', '男', '微信公眾號', 'imui1060']
['chenqin', 'xia-ji-ji-28', 'yunshu', 'sizhuren', 'sgai', 'hua-sha-94', 'guahu', 'sun-lin-li-72-11', 'luo-pan-57', 'wang-ni-ma-94', 'si-tu-ying-ying-18', 'zheng-gong-si', 'cao-rui-ting-18', 'tian-ji-shun', 'ding-xiang-yi-sheng', 'jue-qiang-de-nu-li', 'ma-bo-yong', 'xiaoxueli', 'ai-an-dong-ni-tu-zi-73', 'guo-zi-501']
['孤獨享受者', '', '', '愿你夢里有喝不完的酒', '', '奧LiVia', '北京', '女', '', '']
['metrodatateam', 'jllijlli', 'zao-meng-zhe-62-62', 'kaiserwang730', 'olivia-60-10', 'qi-e-chi-he-zhi-nan', 'fandaidai', 'an-cheng-98', 'zhou-zuo', 'yang-ru-55-52', 'wang-tiao-tiao-91', 'EDASP', 'ma-ke-28', 'shirley-shan-63', 'lens-27', 'mo-zhi-xian-sheng', 'hu-yang-zi', 'tu-si-ji-da-lao-ye', 'summer-world', 'liusonglin']
['非正經(jīng)演繹派廚藝新鮮人,公眾號:餐桌奇談', '吃貨擔(dān)當(dāng)', '美食圈', '我有一顆饞嘴痣~~所有文章及答案均需付費轉(zhuǎn)載,不允許擅自搬運~我已委托“維權(quán)騎士”(<a class=" external" target="_blank" rel="nofollow noreferrer"><span class="invisible">http://</span><span class="visible">rightknights.com</span><span class="invisible"></span><i class="icon-external"></i></a>)為我的文章進(jìn)行維權(quán)行動', '互聯(lián)網(wǎng)', '芊芊吶小桌兒', '北京', '女', '', '']
['feiyucz', 'richard-45-75', 'nusbrant', 'sgai', 'zhou-dong-yu-55-93', 'easteregg', 'cai-lan-80-17', 'zhao-yu-67-63', 'MrBurning', 'zhouzhao', 'excited-vczh', 'justjavac.com', 'mu-se-wan-sheng-ge', 'simona-wen', 'wstcbh', 'BoomberLiu', 'qing-yuan-zi-84', 'cocokele', 'hei-bai-hui-11-79', 'wangxiaofeng']
This confirmed the idea works, so I moved on to creating the project.
1. Creating the project
The overall flow: parse the user's info from the start URL, then enter the following and followers pages, extract the related user IDs and new user links, store the user info and relation IDs in MongoDB, and hand the new user links back to the user-info parsing step, and so on, forming a continuous crawl loop.
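In outline, this loop is a breadth-first traversal over users. A rough sketch, where `fetch_user` is a hypothetical stand-in for the real fetching, parsing, and MongoDB storage:

```python
from collections import deque

def crawl(seed, fetch_user, max_users=100):
    """Breadth-first crawl: visit each user once, follow who they follow."""
    queue = deque([seed])
    seen = {seed}
    while queue and len(seen) <= max_users:
        token = queue.popleft()
        info, following = fetch_user(token)  # parse profile + following ids
        # ...store info and the relation ids to MongoDB here...
        for nxt in following:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Stub: a tiny fake social graph for demonstration
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
visited = crawl('a', lambda t: ({}, graph[t]))
print(sorted(visited))  # → ['a', 'b', 'c']
```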
Run in a terminal:
$ scrapy startproject zhihuSpider
This generates the following structure in the current directory:
zhihuSpider
|- scrapy.cfg        project deployment file
|- zhihuSpider       the project's Python module; add your code here
   |- __init__.py
   |- items.py       turns unstructured scraped data into structured data
   |- middlewares.py
   |- pipelines.py    persists the scraped data
   |- __pycache__
   |- settings.py    project configuration
   |- spiders        directory holding the spider code
      |- __init__.py
      |- __pycache__
2. Creating the spider module
Run in a terminal:
$ cd zhihuSpider
$ scrapy genspider -t crawl zhihu.com zhihu.com
This uses Scrapy's genspider command.
The syntax is scrapy genspider [-t template] <name> <domain>,
which creates a spider from a template.
Afterwards a new file, zhihu_com.py, appears under the spiders directory, containing:
name = 'zhihu.com'
allowed_domains = ['zhihu.com']
start_urls = ['http://zhihu.com/']
where:
name is the string identifying the spider; it is required and must be unique.
allowed_domains is optional: a list of domains the spider is allowed to crawl. When the OffsiteMiddleware is enabled, URLs whose domain is not in this list are not followed.
start_urls is the list of URLs the spider starts crawling from when no start_requests() method is defined.
Looking back at the top of the file:
class ZhihuComSpider(CrawlSpider):
Scrapy provides three main spider classes. The most basic is Spider, whose attributes include name, allowed_domains, start_urls, custom_settings, crawler, and the start_requests() method. Besides Spider there are CrawlSpider and XMLFeedSpider.
CrawlSpider adds one new attribute on top of those inherited from Spider: rules, a collection of one or more Rule objects, each defining a specific rule for crawling the site. If several rules match the same link, the first one defined in rules is used.
The Rule class signature:
scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
Parameters:
link_extractor: a LinkExtractor object that defines how links are extracted from each crawled page.
callback: a callable that receives the response as an argument; avoid naming it parse, which CrawlSpider uses internally.
cb_kwargs: a dict of keyword arguments passed to the callback.
follow: a boolean specifying whether links extracted from the response by this rule should be followed.
However, Zhihu's links to followed users are not absolute: they look like /people/xxx/following, and I couldn't get CrawlSpider to pick them up, so rules were out (or maybe I was just using CrawlSpider wrong? I'd already spent half a day on it).
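For reference, relative links like these can be resolved against the page URL with urljoin (Scrapy's response.urljoin wraps the same stdlib function), which is one way to turn them into followable absolute URLs:

```python
from urllib.parse import urljoin

page_url = 'https://www.zhihu.com/people/sgai/following'
relative = '/people/excited-vczh/following'
# Resolve the relative path against the page it was found on
print(urljoin(page_url, relative))
# → https://www.zhihu.com/people/excited-vczh/following
```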
After all that reading I ended up not using rules anyway, because my links aren't extracted from the page markup; I construct them myself.
I simply commented the rules out. Since CrawlSpider is a subclass of Spider, falling back to plain parse() works fine.
```python
class ZhihuComSpider(CrawlSpider):
    name = 'zhihu.com'
    allowed_domains = ['zhihu.com']
    start_urls = ['https://www.zhihu.com/people/sgai/following']
    #rules = (
    #    Rule(LinkExtractor(allow=r'/people/(\w+)/following$', process_value='my_process_value', unique=True, deny_domains=deny), callback='parse_item', follow=True),
    #)

    def parse(self, response):
```
Then change three values in settings.py.
The request header:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
Disable robots.txt compliance:
ROBOTSTXT_OBEY = False
Disable cookie tracking:
COOKIES_ENABLED = False
3. Parsing the page
The parsing code is carried over from the earlier test script, with a few changes:
- keep a list of URLs already scheduled, for deduplication
- capture the user's urltoken from the page itself
- in the tempset step, some users' id lists contain no None at all, so the removal error is caught and passed
- some users' description field contains HTML code, and I don't know how to strip it down to just the text...
- pack the extracted data into an item
- check the captured id list for False entries (these appear when a user follows fewer than 20 people; since False is falsy, a plain if test filters them out)
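On the HTML-in-description issue: since BeautifulSoup is already imported, its get_text() can strip the markup and keep only the visible text. A sketch (the sample string mimics the descriptions seen in the test output above):

```python
from bs4 import BeautifulSoup

def strip_html(fragment):
    # Parse the fragment and return only the text content, tags removed
    return BeautifulSoup(fragment, 'html.parser').get_text()

desc = '我已委托“維權(quán)騎士”(<a><span class="invisible">http://</span><span class="visible">rightknights.com</span></a>)為我的文章進(jìn)行維權(quán)行動'
print(strip_html(desc))
# → 我已委托“維權(quán)騎士”(http://rightknights.com)為我的文章進(jìn)行維權(quán)行動
```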
```python
def parse(self, response):
    deny = []  # URLs already scheduled from this page (Scrapy's dupefilter also dedupes requests globally)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    # Recover the user's urltoken from the tab link, e.g. .../people/<token>/activities
    token = soup.find("a", {"class": "Tabs-link"})
    pattern = r'e/(.+)/ac'
    urltoken = re.findall(pattern, str(token))[0]
    json_text = soup.body.contents[1].attrs['data-state']
    ob_json = json.loads(json_text)
    followinglist = ob_json['people']['followingByUser'][urltoken]['ids']
    tempset = set(followinglist)
    try:
        tempset.remove(None)
    except KeyError:
        pass
    followinglist = list(tempset)
    user_json = ob_json['entities']['users'][urltoken]
    user_info = user_json['headline']
    try:
        school = user_json['educations'][0]['school']['name']
    except (KeyError, IndexError):
        school = '該用戶尚未填寫'
    try:
        major = user_json['educations'][0]['major']['name']
    except (KeyError, IndexError):
        major = '該用戶尚未填寫'
    try:
        job = user_json['employments'][0]['job']['name']
    except (KeyError, IndexError):
        job = '該用戶尚未填寫'
    try:
        company = user_json['employments'][0]['company']['name']
    except (KeyError, IndexError):
        company = '該用戶尚未填寫'
    try:
        description = user_json['description']
    except KeyError:
        description = '該用戶尚未填寫'
    try:
        business = user_json['business']['name']
    except KeyError:
        business = '該用戶尚未填寫'
    try:
        zhihu_name = user_json['name']
    except KeyError:
        zhihu_name = '該用戶尚未填寫'
    try:
        location = user_json['locations'][0]['name']
    except (KeyError, IndexError):
        location = '該用戶尚未填寫'
    gender = user_json['gender']
    if gender == 1:
        gender = '男'
    elif gender == 0:
        gender = '女'
    else:
        gender = '未知'
    item = UserInfoItem(urltoken=urltoken, user_info=user_info, job=job, company=company,
                        description=description, business=business, zhihu_name=zhihu_name,
                        location=location, gender=gender, school=school, major=major)
    yield item
    #print(followinglist)
    for following in followinglist:
        if following:  # entries can be False when the user follows fewer than 20 people
            url = 'https://www.zhihu.com/people/' + following + '/following'
            #else:
            #    url = 'https://www.zhihu.com/people/' + urltoken + '/following'
            if url in deny:
                pass
            else:
                deny.append(url)
                yield scrapy.Request(url=url, callback=self.parse)
```
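One caveat worth noting: the deny list is recreated on every parse call, so it only dedupes within a single page; Scrapy's scheduler has its own request dupefilter that covers the rest. A cross-page version would keep the set on the spider instance. A sketch with a minimal hypothetical class:

```python
class SeenFilter:
    """Remember URLs across calls so each is scheduled at most once."""
    def __init__(self):
        self.seen = set()

    def new_urls(self, candidates):
        # Return only URLs not seen before (skipping falsy entries),
        # remembering them as we go
        fresh = []
        for url in candidates:
            if url and url not in self.seen:
                self.seen.add(url)
                fresh.append(url)
        return fresh

f = SeenFilter()
print(f.new_urls(['a', 'b', 'a']))  # → ['a', 'b']
print(f.new_urls(['a', 'c']))       # → ['c']
```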
4. Defining the items
Two items are defined; the user-info one is shown here:
```python
import scrapy

class UserInfoItem(scrapy.Item):
    # id (url token)
    urltoken = scrapy.Field()
    # headline
    user_info = scrapy.Field()
    # name
    zhihu_name = scrapy.Field()
    # location
    location = scrapy.Field()
    # industry / field
    business = scrapy.Field()
    # gender
    gender = scrapy.Field()
    # company
    company = scrapy.Field()
    # job title
    job = scrapy.Field()
    # school
    school = scrapy.Field()
    # major
    major = scrapy.Field()
    # bio / description
    description = scrapy.Field()
```
5. Pipeline
Define a pipeline so Scrapy stores the item data into MongoDB; the matching configuration lives in settings.py.
```python
import pymongo

class ZhihuspiderPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one: Collection.insert is deprecated in modern pymongo
        self.db.UserInfo.insert_one(dict(item))
        return item
```
6. Other additions
6.1 IP proxies via middlewares.py
The matching configuration goes in settings.py.
```python
import random

class RandomProxy(object):
    def __init__(self, iplist):
        self.iplist = iplist

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('IPLIST'))

    def process_request(self, request, spider):
        # Pick a random proxy from the configured list for each request
        proxy = random.choice(self.iplist)
        request.meta['proxy'] = proxy
```
6.2 Settings
Add the request header:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
Activate the item pipeline and the downloader middleware:
DOWNLOADER_MIDDLEWARES = {
'zhihuSpider.middlewares.RandomProxy':543
}
ITEM_PIPELINES = {
'zhihuSpider.pipelines.ZhihuspiderPipeline': 300,
}
Enable AutoThrottle (adaptive download delays):
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
MongoDB configuration:
MONGO_URL = 'mongodb://localhost:27017/'
MONGO_DATABASE='zhihu'
The proxy IP list:
I took this IP from a free-proxy site just to test that proxying works at all; my proxy pool can hardly scrape together any working IPs these days...
IPLIST=["https://14.112.76.235:65309"]
7. Results
I ran a quick test with a few records to confirm they are being saved; crawling everything will have to wait until I swap in a fresh proxy pool.

8. Optimization and follow-ups
- Scrapy custom settings: simplifying spider setup and speeding up the crawl
- v2ex discussion: 《scrapy 抓取速度問題》
Deployment with scrapyd and scrapyd-client was covered in the previous post.