Preface
I've been grinding algorithm problems lately and wanted to organize them into a summary, but copying and pasting every problem statement and solution was far too time-consuming. In the spirit of saving labor, I looked into building a LeetCode crawler that scrapes the problems and code and automatically generates Markdown documents. I've pushed the project to GitHub; stars and forks are welcome.
The code posted here has been slightly modified from the source for readability.
Simulating a LeetCode Login
Page Analysis
To build a crawler, you first need to understand the page logic. If there is an existing API you can call directly to get the data you want, then simulate calling that API.
Here we take logging in to LeetCode as an example. Open the login page in Chrome, press F12 to open the developer tools, and analyze LeetCode's login flow.
Fill in the username and password, click Sign In, and observe how the account information is submitted. (Note: after a successful login the page redirects by default, which clears the previous requests from Chrome's history, so check Preserve log under the Network tab to keep the record.)
Searching the records shows that the account information is submitted to https://leetcode.com/accounts/login, a common RESTful-style form login. So our next step is to simulate the browser submitting form data to this endpoint.

The form carries four key-value pairs, each of which we need to forge. One of them, csrfmiddlewaretoken, is generated by LeetCode and placed in a cookie. Before simulating the login, we must visit the LeetCode home page, read that cookie value, and fill it into our form.
Since we need to maintain the login state, we use the requests Session object. Below is the introduction to Session from the official Requests documentation:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase. (See HTTP persistent connection.)
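A minimal sketch of how a requests Session persists cookies and headers across requests (the User-Agent string and cookie value below are made up for illustration; in the real crawler the cookie arrives via LeetCode's Set-Cookie response header):

```python
import requests

# One Session shared by all requests: cookies received from any response
# are stored in session.cookies and sent automatically on later requests,
# and connections to the same host are reused via urllib3's pool.
session = requests.Session()

# Headers set on the session are sent with every request it makes.
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # made-up UA

# Simulate a server having set a cookie; the real crawler receives
# csrftoken from LeetCode's response.
session.cookies.set('csrftoken', 'abc123', domain='leetcode.com')

# The cookie persists on the session and accompanies future requests.
print(session.cookies.get('csrftoken'))  # abc123
```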
With the form data forged, we still need to forge the request headers: some anti-crawler logic distinguishes crawlers by header fields. For example, User-Agent identifies your browser type and version, operating system and version, browser engine, and so on. The default User-Agent in Requests is python-requests/*.*, and if you don't forge it the request is likely to be blocked by anti-crawling measures.
We can simply copy the values from our own machine; they can be found under Request Headers in the debugger. (Note: you don't need all of them; a few of the more distinctive fields are enough.)
With the data forged, just submit the form. Since a successful login redirects automatically, set allow_redirects to False in the post call to suppress the unnecessary redirect.
Code
import requests, json
from requests_toolbelt import MultipartEncoder

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def login(username, password):
    url = 'https://leetcode.com'
    cookies = session.get(url).cookies
    csrftoken = ''
    for cookie in cookies:
        if cookie.name == 'csrftoken':
            csrftoken = cookie.value
    url = 'https://leetcode.com/accounts/login'
    params_data = {
        'csrfmiddlewaretoken': csrftoken,
        'login': username,
        'password': password,
        'next': 'problems'
    }
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Origin': 'https://leetcode.com'}
    m = MultipartEncoder(params_data)
    headers['Content-Type'] = m.content_type
    session.post(url, headers=headers, data=m, timeout=10, allow_redirects=False)
    is_login = session.cookies.get('LEETCODE_SESSION') is not None
    return is_login
Crawling the Problems
Fetching All Problem Information
Page Analysis
Open the problem list page; debugging shows that https://leetcode.com/api/problems/all/ is the endpoint that returns all problems. Fetching and parsing its data yields every problem's name, number, difficulty, and so on. This endpoint does not include the detailed problem statement, which requires further analysis below.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_problems():
    url = 'https://leetcode.com/api/problems/all/'
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
    resp = session.get(url, headers=headers, timeout=10)
    question_list = json.loads(resp.content.decode('utf-8'))
    for question in question_list['stat_status_pairs']:
        # problem number
        question_id = question['stat']['question_id']
        # problem title slug
        question_slug = question['stat']['question__title_slug']
        # problem status
        question_status = question['status']
        # difficulty level: 1 = easy, 2 = medium, 3 = hard
        level = question['difficulty']['level']
        # skip paid-only problems
        if question['paid_only']:
            continue
Crawling a Problem's Details
Page Analysis
Open a problem's page and check whether there is an API request for the problem data (parsing the page source for data is relatively inefficient, so it is not the first choice).
On the problem page there are several requests to https://leetcode.com/graphql, one of which fetches the problem's details.
This introduces a new concept: GraphQL.
The following excerpt is from 以LeetCode為例——如何發(fā)送GraphQL Query獲取數(shù)據(jù) ("Taking LeetCode as an Example: How to Send a GraphQL Query to Fetch Data").
GraphQL is a query language for APIs, an abstract framework open-sourced by Facebook for providing data-query services. In server-side API development, the data an endpoint returns is often relatively fixed, so a RESTful API is not very flexible when you want more information, or only part of it. GraphQL provides a complete, understandable description of the data in an API, letting clients request exactly the data they need and nothing more; it also makes APIs easier to evolve over time and enables powerful developer tools.
Most of LeetCode's endpoints now go through GraphQL. We won't dig deeply into GraphQL here; roughly speaking, it uses an SQL-like query language so that a single endpoint can serve many kinds of queries. We simply copy the query that fetches problem details to achieve our goal.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_problem_by_slug(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'getQuestionDetail',
              'variables': {'titleSlug': slug},
              'query': '''query getQuestionDetail($titleSlug: String!) {
                question(titleSlug: $titleSlug) {
                  questionId
                  questionFrontendId
                  questionTitle
                  questionTitleSlug
                  content
                  difficulty
                  stats
                  similarQuestions
                  categoryTitle
                  topicTags {
                    name
                    slug
                  }
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Content-Type': 'application/json',
               'Referer': 'https://leetcode.com/problems/' + slug}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    # problem details
    question = content['data']['question']
    print(question)
Crawling the Latest AC Code
To fetch AC code you must first be logged in; the code below assumes the session is already authenticated.
Crawling a Problem's Submission List
Page Analysis
LeetCode currently has both a new and an old layout. I found no relevant request in the old one, but the new layout uses GraphQL to request the current account's list of submissions. The list records each submission's id, language, timestamp, status, and so on, but not the code itself, so we need to dig one level deeper.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_submissions(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'Submissions',
              'variables': {'offset': 0, 'limit': 20, 'lastKey': '', 'questionSlug': slug},
              'query': '''query Submissions($offset: Int!, $limit: Int!, $lastKey: String, $questionSlug: String!) {
                submissionList(offset: $offset, limit: $limit, lastKey: $lastKey, questionSlug: $questionSlug) {
                  lastKey
                  hasNext
                  submissions {
                    id
                    statusDisplay
                    lang
                    runtime
                    timestamp
                    url
                    isPending
                    __typename
                  }
                  __typename
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Content-Type': 'application/json'}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    for submission in content['data']['submissionList']['submissions']:
        print(submission)
Fetching the Code of a Single Submission
Page Analysis
Open a submission's page; its URL looks like https://leetcode.com/submissions/detail/123456789/, where the trailing number is the id of that submission.
We can already list summary information for every submission, so find the id of the latest AC submission and fetch that page's source.
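As a sketch, picking out the latest accepted submission's id from the submissionList results of the previous section could look like the following (the endpoint appears to return submissions newest-first; the helper name and sample entries are fabricated for illustration):

```python
# Pick the id of the most recent accepted submission from the list
# returned by the Submissions query. Field names (id, statusDisplay)
# match that query; the sample data below is made up.
def latest_ac_id(submissions):
    for submission in submissions:
        # submissions arrive newest-first, so the first "Accepted" wins
        if submission['statusDisplay'] == 'Accepted':
            return submission['id']
    return None

sample = [
    {'id': '103', 'statusDisplay': 'Wrong Answer'},
    {'id': '102', 'statusDisplay': 'Accepted'},
    {'id': '101', 'statusDisplay': 'Accepted'},
]
print(latest_ac_id(sample))  # 102
```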
Analyzing the page shows no data request being made; the data is rendered into the page itself, so we fall back to the traditional approach of fetching the page source and parsing it with a regular expression.
Viewing the page source in Chrome shows that the code is assigned to a JavaScript variable named submissionCode.
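The extraction can be tried against a fabricated fragment of such a page (the snippet below is made up; the real page wraps the variable in much more markup):

```python
import re

# Fabricated fragment mimicking the submission detail page, where the
# submitted code sits in a JavaScript variable named submissionCode.
page = "var pageData = {\n  submissionCode: 'print(1)',\n editCodeUrl: '/x/'\n};"

# Capture everything between submissionCode: '...' and the editCodeUrl
# key that follows it; re.S lets .* span newlines inside the code.
pattern = re.compile(r"submissionCode: '(?P<code>.*)',\n editCodeUrl", re.S)
m = pattern.search(page)
code = m.group('code') if m else None
print(code)  # print(1)
```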

Code
import requests, json, re

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_submission_by_id(submission_id):
    url = 'https://leetcode.com/submissions/detail/' + submission_id
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive', 'Content-Type': 'application/json'}
    code_content = session.get(url, headers=headers, timeout=10)
    pattern = re.compile(r'submissionCode: \'(?P<code>.*)\',\n editCodeUrl', re.S)
    m1 = pattern.search(code_content.text)
    code = m1.groupdict()['code'] if m1 else None
    print(code)
Full Code
import requests, json, re
from requests_toolbelt import MultipartEncoder

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def login(username, password):
    url = 'https://leetcode.com'
    cookies = session.get(url).cookies
    csrftoken = ''
    for cookie in cookies:
        if cookie.name == 'csrftoken':
            csrftoken = cookie.value
    url = 'https://leetcode.com/accounts/login'
    params_data = {
        'csrfmiddlewaretoken': csrftoken,
        'login': username,
        'password': password,
        'next': 'problems'
    }
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Origin': 'https://leetcode.com'}
    m = MultipartEncoder(params_data)
    headers['Content-Type'] = m.content_type
    session.post(url, headers=headers, data=m, timeout=10, allow_redirects=False)
    is_login = session.cookies.get('LEETCODE_SESSION') is not None
    return is_login

def get_problems():
    url = 'https://leetcode.com/api/problems/all/'
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
    resp = session.get(url, headers=headers, timeout=10)
    question_list = json.loads(resp.content.decode('utf-8'))
    for question in question_list['stat_status_pairs']:
        # problem number
        question_id = question['stat']['question_id']
        # problem title slug
        question_slug = question['stat']['question__title_slug']
        # problem status
        question_status = question['status']
        # difficulty level: 1 = easy, 2 = medium, 3 = hard
        level = question['difficulty']['level']
        # skip paid-only problems
        if question['paid_only']:
            continue
        print(question_slug)

def get_problem_by_slug(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'getQuestionDetail',
              'variables': {'titleSlug': slug},
              'query': '''query getQuestionDetail($titleSlug: String!) {
                question(titleSlug: $titleSlug) {
                  questionId
                  questionFrontendId
                  questionTitle
                  questionTitleSlug
                  content
                  difficulty
                  stats
                  similarQuestions
                  categoryTitle
                  topicTags {
                    name
                    slug
                  }
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Content-Type': 'application/json',
               'Referer': 'https://leetcode.com/problems/' + slug}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    # problem details
    question = content['data']['question']
    print(question)

def get_submissions(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'Submissions',
              'variables': {'offset': 0, 'limit': 20, 'lastKey': '', 'questionSlug': slug},
              'query': '''query Submissions($offset: Int!, $limit: Int!, $lastKey: String, $questionSlug: String!) {
                submissionList(offset: $offset, limit: $limit, lastKey: $lastKey, questionSlug: $questionSlug) {
                  lastKey
                  hasNext
                  submissions {
                    id
                    statusDisplay
                    lang
                    runtime
                    timestamp
                    url
                    isPending
                    __typename
                  }
                  __typename
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Content-Type': 'application/json'}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    for submission in content['data']['submissionList']['submissions']:
        print(submission)

def get_submission_by_id(submission_id):
    url = 'https://leetcode.com/submissions/detail/' + submission_id
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive', 'Content-Type': 'application/json'}
    code_content = session.get(url, headers=headers, timeout=10)
    pattern = re.compile(r'submissionCode: \'(?P<code>.*)\',\n editCodeUrl', re.S)
    m1 = pattern.search(code_content.text)
    code = m1.groupdict()['code'] if m1 else None
    print(code)

print(login('u', 'p'))
# get_problems()
# get_problem_by_slug('two-sum')
get_submissions('two-sum')
get_submission_by_id('')