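- When I'm bored I tend to scroll through Zhihu, but there isn't much new content of real value; what there is plenty of is a steady influx of marketing accounts, promotions, and certain Zhihu Lives. So I figured I might as well browse my own collections instead. Many excellent answers are forgotten soon after reading and then sit quietly in a collection, never opened again. I don't bookmark often, but over several years a fair amount has piled up, so rereading it can fill plenty of idle time, and I can even dignify it as "reviewing the old to learn the new". Zhihu has redesigned its front end, but its collections still feel awkward to use. So I built my own.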
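Result
- I crawled all of my collections with a Python crawler, then used Flask for the back-end API and Vue.js for the front-end display, keeping the front end and back end separate. The result looks like this: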

Desktop view 1

Desktop view 2

Desktop view 3

Mobile view
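Crawler
- At first I assumed the many open-source Zhihu crawlers on GitHub would save me the trouble. After looking around, though, most of the highly starred ones are no longer maintained and Zhihu has since been redesigned; there are a few newer projects, but their features are incomplete. So I wrote my own. The requirement is simple and well defined, after all: collect the contents of all of my own collections. (Python 3 is used.)
- For this requirement the crawler logic is simple. On a machine you use regularly, Zhihu lets you log in by POSTing the username and password directly, with no CAPTCHA. A requests.Session keeps the login state across requests. The crawler walks the collection-list pages using the ?page=num rule in the URL, parses out the URL of every collection, then requests each collection in turn to fetch its question-and-answer list and parse out the relevant fields. Because there is not much data, the results are simply saved as JSON files, and for the same reason a single-threaded crawl with the requests library is enough.
- The crawler code follows these notes. It generates two JSON files: 知乎收藏文章.json holds the information about every collection and the Q&A entries under it, and url_answer.json holds the answer text for every question. With this split, the front end can fetch the former first and then, when a particular answer is to be read, asynchronously request just that answer from the latter.
- The requests_cache library is used, and it takes only two lines. If the crawl is interrupted unexpectedly and has to be restarted, pages that were already requested are served straight from the cache database, which saves time and spares me from writing my own handling for failed requests.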
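The ?page=num paging rule described above can be sketched independently of Zhihu. This is a minimal illustration, not the script's actual code: `crawl_pages`, `fetch_page`, `max_pages`, and the fake data are hypothetical names introduced here. The fetch function is injected so the loop runs without a network; in the real crawler it would wrap a `get` call on the `requests.Session` plus the HTML parsing.

```python
from typing import Callable, List


def crawl_pages(base_url: str,
                fetch_page: Callable[[str], List[str]],
                max_pages: int = 1000) -> List[str]:
    """Collect items from base_url?page=1, ?page=2, ... until a page is empty."""
    items: List[str] = []
    for page_num in range(1, max_pages + 1):
        url = '%s?page=%d' % (base_url, page_num)
        page_items = fetch_page(url)
        if not page_items:  # an empty page means we ran past the last one
            break
        items.extend(page_items)
    return items


# Hypothetical stand-in for the network: two pages of results, then nothing.
FAKE_SITE = {
    'https://example.com/collections?page=1': ['collection-a', 'collection-b'],
    'https://example.com/collections?page=2': ['collection-c'],
}


def fake_fetch(url: str) -> List[str]:
    return FAKE_SITE.get(url, [])


all_items = crawl_pages('https://example.com/collections', fake_fetch)
print(all_items)
```

The `max_pages` cap is an added safeguard that the original script does not have; it only guards against a site that never returns an empty page. The full script follows.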
import os
import json
from bs4 import BeautifulSoup
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# See http://stackoverflow.com/questions/27981545/suppress-insecurerequestwarning-unverified-https-request-is-being-made-in-pytho
import requests_cache
requests_cache.install_cache('demo_cache')
Cookie_FilePlace = r'.'
Default_Header = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36",
    'Host': "www.zhihu.com",
    'Origin': "http://www.zhihu.com",
    'Pragma': "no-cache",
    'Referer': "http://www.zhihu.com/",
}
Zhihu_URL = 'https://www.zhihu.com'
Login_URL = Zhihu_URL + '/login/email'
Profile_URL = 'https://www.zhihu.com/settings/profile'
Collection_URL = 'https://www.zhihu.com/collection/%d'
Cookie_Name = 'cookies.json'
os.chdir(Cookie_FilePlace)
r = requests.Session()
#--------------------Prepare--------------------------------#
r.headers.update(Default_Header)
if os.path.isfile(Cookie_Name):
    with open(Cookie_Name, 'r') as f:
        cookies = json.load(f)
    r.cookies.update(cookies)
def login(r):
    '''Log in with email and password; on success, save the cookies for later runs.'''
    print('====== zhihu login =====')
    email = input('email: ')
    password = input("password: ")
    print('====== logging.... =====')
    data = {'email': email, 'password': password, 'remember_me': 'true'}
    value = r.post(Login_URL, data=data).json()
    print('====== result:', value['r'], '-', value['msg'])
    if int(value['r']) == 0:
        with open(Cookie_Name, 'w') as f:
            json.dump(r.cookies.get_dict(), f)
def isLogin(r):
    '''The profile page redirects when the session is not logged in.'''
    url = Profile_URL
    value = r.get(url, allow_redirects=False, verify=False)
    status_code = int(value.status_code)
    if status_code == 301 or status_code == 302:
        print("Not logged in")
        return False
    elif status_code == 200:
        return True
    else:
        print("Network error")
        return False
if not isLogin(r):
    login(r)
#---------------------------------------------------------------------#
url_answer_dict = {}
# A separate dict mapping each answer's URL to its text, so the back end can serve
# answers through the API; it is filled in by getQaDictListFromOneCollection
#-----------------------get collections-------------------------------#
def getCollectionsList():
    '''Walk the paginated list of my collections and return their titles and URLs.'''
    collections_list = []
    content = r.get(Profile_URL).content
    soup = BeautifulSoup(content, 'lxml')
    own_collections_url = 'http://' + soup.select('#js-url-preview')[0].text + '/collections'
    page_num = 0
    while True:
        page_num += 1
        url = own_collections_url + '?page=%d' % page_num
        content = r.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        data = soup.select_one('#data').attrs['data-state']
        collections_dict_raw = json.loads(data)['entities']['favlists'].values()
        # An empty page means there are no more collections
        # if len(collections_dict_raw) == 0:
        if not collections_dict_raw:
            break
        for i in collections_dict_raw:
            # print(i['id'],' -- ', i['title'])
            collections_list.append({
                'title': i['title'],
                'url': Collection_URL % i['id'],
            })
    print('====== prepare Collections Done =====')
    return collections_list
#-------------------------
def getQaDictListFromOneCollection(collection_url='https://www.zhihu.com/collection/71534108'):
    '''Walk one collection page by page and return its Q&A entries.'''
    qa_dict_list = []
    page_num = 0
    while True:
        page_num += 1
        url = collection_url + '?page=%d' % page_num
        content = r.get(url).content
        soup = BeautifulSoup(content, 'lxml')
        titles = soup.select('.zm-item-title a')  # .text ; ['href']
        if len(titles) == 0:
            break
        votes = soup.select('.js-vote-count')  # .text
        answer_urls = soup.select('.toggle-expand')  # ['href']
        answers = soup.select('textarea')  # .text
        authors = soup.select('.author-link-line .author-link')  # .text ; ['href']
        for title, vote, answer_url, answer, author \
                in zip(titles, votes, answer_urls, answers, authors):
            author_img = getAthorImage(author['href'])
            qa_dict_list.append({
                'title': title.text,
                'question_url': title['href'],
                'answer_vote': vote.text,
                'answer_url': answer_url['href'],
                #'answer': answer.text,
                'author': author.text,
                'author_url': author['href'],
                'author_img': author_img,
            })
            # Keep the answer text in the separate lookup, keyed by the URL without its leading slash
            url_answer_dict[
                answer_url['href'][1:]
            ] = answer.text
            # print(title.text, ' - ', author.text)
    return qa_dict_list
def getAthorImage(author_url):
    '''Fetch the author's profile page and return the avatar image URL.'''
    url = Zhihu_URL + author_url
    content = r.get(url).content
    soup = BeautifulSoup(content, 'lxml')
    return soup.select_one('.AuthorInfo-avatar')['src']
def getAllQaDictList():
    '''The final result must be nested lists and dicts so that the front end can parse it.'''
    all_qa_dict_list = []
    collections_list = getCollectionsList()
    for collection in collections_list:
        all_qa_dict_list.append({
            'ctitle': collection['title'],
            'clist': getQaDictListFromOneCollection(collection['url'])
        })
        print('====== getQa from %s Done =====' % collection['title'])
    return all_qa_dict_list
with open(u'知乎收藏文章.json', 'w', encoding='utf-8') as f:
    json.dump(getAllQaDictList(), f)
with open(u'url_answer.json', 'w', encoding='utf-8') as f:
    json.dump(url_answer_dict, f)
#---------------------utils------------------------------#
# with open('1.html', 'w', encoding='utf-8') as f:
# f.write(soup.prettify())
# import os
# Cookie_FilePlace = r'.'
# os.chdir(Cookie_FilePlace)
# import json
# dict_ = {}
# with open(u'知乎收藏文章.json', 'r', encoding='utf-8') as f:
# dict_ = json.load(f)
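To make the two-file layout concrete, here is a small sketch of how one crawled record is divided between the listing and the URL-to-answer lookup. The sample record and the helper name `split_record` are made up for illustration; only the field names and the stripping of the leading slash mirror the script above.

```python
import json


def split_record(record, url_answer):
    """Move the bulky answer text into url_answer; return the light listing entry."""
    key = record['answer_url'][1:]  # drop the leading '/', as the crawler does
    url_answer[key] = record['answer']
    return {k: v for k, v in record.items() if k != 'answer'}


# Hypothetical sample data, not real crawl output.
record = {
    'title': 'Example question',
    'answer_url': '/question/1/answer/2',
    'answer': '<p>Long answer HTML...</p>',
    'author': 'someone',
}

url_answer = {}
listing_entry = split_record(record, url_answer)

# The listing stays small; the answer is fetched later by its key.
print(json.dumps(listing_entry, ensure_ascii=False))
print(url_answer['question/1/answer/2'])
```

Keeping the answer bodies out of the listing means the page can load every collection title quickly and pull a single answer only when it is opened.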
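Front end
- The front-end requirements are modest: a single page that looks clean and makes it easy for me to find and browse questions and answers. Beyond that, for someone like me who is hopeless at HTML and CSS and has to Google how to loop over a list in JavaScript, it has to be simple and easy to work with. I chose the Vue.js framework (and, to keep things simple, did not use webpack).
- The front-end world moves fast, and the stream of frameworks and tools is hard to keep up with. From my own experience, the first thing is not to be intimidated. Frameworks and tools exist to help us solve problems, that is, to make development simpler and faster. Many useful ones are not expensive to learn: once you have the basics, plus some open-source code to lean on, you can solve a good number of problems without much trouble. Knowing how to find good tools is also an essential skill. Everyone runs into similar difficulties, so someone may well have already built a tool for your pain point.
- The basic page layout comes from a basic Bootstrap template, which saved a lot of trouble. Vue.js components let me take open-source UI components and snap them together like building blocks to make up the page. On awesome-vue I found iView, a UI framework that suits my taste and is simple to use. For now it only supports Vue 1.x, but since my application is simple the difference hardly matters, so I went with it.
- The HTML code follows. It uses vue-resource to request data asynchronously and sync it to the page. To make development easier, it simply uses JSONP cross-origin requests. Treat the code quality as for reference only. The templates inside the components are hard to read inline; to inspect one, copy it out, strip the surrounding single quotes and the escaping of the inner single quotes, and run it through an HTML beautifier. Writing them this way is a stopgap.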
<!DOCTYPE html>
<html lang="zh-CN">
<!--view-source:http://v3.bootcss.com/examples/jumbotron-narrow/#-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Zhihu Personal Collections</title>
<link rel="stylesheet" >
<link rel="stylesheet" >
<link rel="stylesheet" type="text/css" >
</head>
<body>
<div id="app">
<div class="container">
<div class="header clearfix">
<h3 class="text-muted">Zhihu Personal Collections</h3>
</div>
<div class="jumbotron">
<h1>Collections overview</h1>
<p class="lead">{{ description }}</p>
<my-carousel></my-carousel>
</div>
<div class="row marketing">
<div class="col-lg-6">
<my-card :collection="collection" v-for="collection in left"></my-card>
</div>
<div class="col-lg-6">
<my-card :collection="collection" v-for="collection in right"></my-card>
</div>
</div>
<i-button @click="showLeave" long>That's all!</i-button>
<Modal :visible.sync="visible" :title="modalTitle"> {{ modalMessage }}
<div v-html="rawHtml" id="inner-content"></div>
</Modal>
<footer class="footer">
<p>© 2017 treelake.</p>
</footer>
</div>
<!-- /container -->
</div>
<script type="text/javascript" src="http://v1.vuejs.org/js/vue.min.js"></script>
<script src="https://cdn.jsdelivr.net/vue.resource/1.2.0/vue-resource.min.js"></script>
<script type="text/javascript" src="http://unpkg.com/iview/dist/iview.min.js"></script>
<script>
Vue.component('my-carousel', {
template: '<div class="showimage"><Carousel arrow="never" autoplay><Carousel-item></Carousel-item><Carousel-item></Carousel-item></Carousel></div>'
})
Vue.component('my-ul', {
template: '<ul id="list"><li v-for="item in items | limitBy limitNum limitFrom"><Badge :count="item.answer_vote" overflow-count="9999"> <a @click="simpleContent(item)" class="author-badge" :style="{ background: \'url(\'+ item.author_img +\') no-repeat\', backgroundSize:\'cover\'}"></a></Badge> <a :href=" \'https://www.zhihu.com\' + item.answer_url" target="_blank" style="font-size: 10px">     {{ item.title }}</a><a :href=" \'https://www.zhihu.com\' + item.question_url" target="_blank"><Icon type="chatbubbles"></Icon></a><hr> </li></ul>',
props: ['items'],
methods: {
changeLimit() {
if (this.limitFrom > this.items.length - this.limitNum) {
this.limitFrom = 0;
} else {
this.limitFrom += this.limitNum;
}
if (this.limitFrom == this.items.length) {
this.limitFrom = 0
}
console.log(this.limitFrom)
},
simpleContent(msg) {
this.$dispatch('child-msg', msg)
// $dispatch() dispatches the event, which bubbles up the parent chain
},
},
data() {
return {
limitNum: 5,
limitFrom: 0,
}
},
events: {
'parent-msg': function () {
this.changeLimit()
}
},
})
Vue.component('my-card', {
template: '<Card style="width:auto; margin-bottom:15px" ><p slot="title"><Icon type="ios-pricetags"></Icon>{{ collection.ctitle }}</p><a v-if="collection.clist.length>5" slot="extra" @click="notify"><Icon type="ios-loop-strong"></Icon>Shuffle</a> <my-ul :items="collection.clist"></my-ul> </Card>',
props: ['collection'],
methods: {
notify: function () {
this.$broadcast('parent-msg')
// $broadcast() broadcasts the event, which propagates down to all descendants
}
}
})
var shuju, answer;
new Vue({
el: '#app',
data: {
description: '',
visible: false,
// ctitle: '',
allqa: [],
collection: {
'clist': [],
'ctitle': '',
},
left: [],
right: [],
modalMessage: 'That is the end of the trip down memory lane!',
modalTitle: 'Welcome!',
rawHtml: '<a > treelake </a>'
},
methods: {
show() {
this.visible = true;
},
showLeave() {
this.rawHtml = '';
this.modalMessage = 'That is the end of the trip down memory lane!';
this.show();
}
},
events: {
'child-msg': function (msg) {
this.$http.jsonp('/find' + msg.answer_url, {}, { // for single-file testing use http://localhost:5000/find
headers: {},
emulateJSON: true
}).then(function (response) {
// success callback
answer = response.data;
this.rawHtml = answer.answer;
}, function (response) {
// error callback
console.log(response);
});
this.modalMessage = '';
this.modalTitle = msg.title;
this.show();
}
},
ready: function () {
this.$http.jsonp('/collections', {}, { // for single-file testing use http://localhost:5000/collections/
headers: {},
emulateJSON: true
}).then(function (response) {
// success callback
shuju = response.data
for (var i in shuju) {
this.description += (shuju[i].ctitle + ' ');
// console.log(shuju[i])
}
// this.ctitle = shuju[0].ctitle
// this.collection = shuju[0]
this.allqa = shuju
// split the collections into a left and a right column
var half = parseInt(shuju.length / 2) + 1
this.left = shuju.slice(0, half)
this.right = shuju.slice(half, shuju.length)
console.log(this.collection)
}, function (response) {
// error callback
console.log(response);
});
}
})
</script>
<style>
#list {
padding: 10px
}
#list li {
margin-bottom: 10px;
padding-bottom: 10px;
}
.jumbotron img {
width: 100%;
}
.author-badge {
width: 38px;
height: 38px;
border-radius: 6px;
display: inline-block;
}
#inner-content img {
width: 100%;
}
</style>
</body>
</html>
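The link in each card's header (shown only when a collection has more than five entries) advances a five-item window through that collection and wraps around at the end. As a reading aid, the offset arithmetic of `changeLimit` in the `my-ul` component can be restated as the Python sketch below. `next_offset` and `windows` are hypothetical names used only here; the page itself runs the JavaScript version.

```python
def next_offset(offset, total, size=5):
    """Return the start of the next window, wrapping to 0 past the end."""
    if offset > total - size:
        return 0
    nxt = offset + size
    return 0 if nxt == total else nxt


def windows(total, size=5):
    """List the (start, end) slices shown on successive clicks, one full cycle."""
    result, offset = [], 0
    while True:
        result.append((offset, min(offset + size, total)))
        offset = next_offset(offset, total, size)
        if offset == 0:
            return result


print(windows(12))
print(windows(10))
```

With 12 entries the last window is shorter than five items; with exactly 10 the second check in `next_offset` prevents an empty third window.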
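Back end
- The back end mainly provides the API. I used Flask, which is simple and easy to use. Returning JSONP needs an extra layer of wrapping, but the open-source world delivers: I found the Flask-Jsonpify library, which handles it in one line. The main logic is to load the previously crawled data from local files and then serve it through the API. The /find/<path:answer_url> route looks up the text of an answer by the answer's URL.
- Finally, I wanted Flask to serve the HTML file at the root path so that the page can be used on a phone just by visiting the machine's IP. To keep Flask's own template rendering from clashing with Vue.js template syntax, the handler returns the original HTML file as-is, bypassing Flask's template rendering.
- The server code follows. Put it alongside the two files above and, once the crawl has finished, run python xxx.py to start the service.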
# -*- coding: utf-8 -*-
from flask import Flask
import json
from flask_jsonpify import jsonpify
app = Flask(__name__)
# Load the crawled data once at start-up
collections = []
with open(u'知乎收藏文章.json', 'r', encoding='utf-8') as f:
    collections = json.load(f)
qa_dict = {}
with open('url_answer.json', 'r', encoding='utf-8') as f:
    qa_dict = json.load(f)
# print(qa_dict['question/31116099/answer/116025931'])
# Read the page as plain text so that Flask's template engine never sees the Vue.js {{ }} syntax
index_html = ''
with open('zhihuCollection.html', 'r', encoding='utf-8') as f:
    index_html = f.read()
@app.route('/')
def index():
    return index_html
@app.route('/collections')
def collectionsApi():
    # every collection together with its Q&A listing
    return jsonpify(collections)
@app.route('/find/<path:answer_url>')  # the path converter allows slashes in the value, see http://flask.pocoo.org/snippets/76/
def answersApi(answer_url):
    # look up the answer text by the answer's URL path
    return jsonpify({'answer': qa_dict[answer_url]})
@app.route('/test')
def test():
    # debugging aid: dump the whole URL-to-answer lookup
    return jsonpify(qa_dict)
if __name__ == '__main__':
    app.run(host='0.0.0.0')
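For readers unfamiliar with JSONP, the "extra layer of wrapping" amounts to returning JavaScript that calls a client-chosen function with the JSON payload as its argument. The sketch below shows the idea with the standard library only. It is not Flask-Jsonpify's implementation; `to_jsonp` and the callback-name check are assumptions made for this illustration.

```python
import json
import re

# Accept only identifier-like callback names (dots allowed), so the callback
# parameter cannot be used to inject arbitrary script. This check is an
# assumption of the sketch, not a documented behaviour of any library.
_CALLBACK_RE = re.compile(r'[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*')


def to_jsonp(callback, payload):
    """Wrap a JSON-serializable payload in a call to the named callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError('invalid callback name: %r' % callback)
    return '%s(%s);' % (callback, json.dumps(payload))


body = to_jsonp('handleAnswer', {'answer': '<p>hi</p>'})
print(body)
```

The browser loads such a response through a script tag, which is what lets the page read data from another origin during development.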