Preface
I've been grinding algorithm problems lately and wanted to organize them into a summary, but copying and pasting every problem statement and solution was far too time-consuming. In the spirit of saving labor, I looked into building a LeetCode crawler that scrapes the problems and code and automatically generates Markdown documents. I've pushed the project to GitHub; stars and forks are welcome.
The code posted here has been slightly modified from the source for readability.
Simulating a LeetCode Login
Page Analysis
To build a crawler, you first need to understand the page logic. If there is an existing API you can call directly to get the data you want, then simulate calling that API.
Here we take logging in to LeetCode as an example. Open the login page in Chrome, press F12 to open the developer tools, and analyze LeetCode's login flow.
Fill in the username and password, click Sign In, and observe how the account information is submitted. (Note: after a successful login the page redirects by default, which clears the previous requests from Chrome's history, so check Preserve log under the Network tab to keep the record.)
Searching the records shows that the account information is submitted to https://leetcode.com/accounts/login, a common RESTful-style form login. So our next step is to simulate the browser submitting form data to this endpoint.

The form carries four key-value pairs, each of which we need to forge. One of them, csrfmiddlewaretoken, is generated by LeetCode and placed in a cookie. Before simulating the login, we must visit the LeetCode home page, read that cookie value, and fill it into our form.
Since we need to maintain the login state, we use the requests Session object. Below is the introduction to Session from the official Requests documentation:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase. (See HTTP persistent connection.)
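A minimal sketch of how a requests Session persists cookies and headers across requests (the User-Agent string and cookie value below are made up for illustration; in the real crawler the cookie arrives via LeetCode's Set-Cookie response header):

```python
import requests

# One Session shared by all requests: cookies received from any response
# are stored in session.cookies and sent automatically on later requests,
# and connections to the same host are reused via urllib3's pool.
session = requests.Session()

# Headers set on the session are sent with every request it makes.
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # made-up UA

# Simulate a server having set a cookie; the real crawler receives
# csrftoken from LeetCode's response.
session.cookies.set('csrftoken', 'abc123', domain='leetcode.com')

# The cookie persists on the session and accompanies future requests.
print(session.cookies.get('csrftoken'))  # abc123
```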
With the form data forged, we still need to forge the request headers: some anti-crawler logic distinguishes crawlers by header fields. For example, User-Agent identifies your browser type and version, operating system and version, browser engine, and so on. The default User-Agent in Requests is python-requests/*.*, and if you don't forge it the request is likely to be blocked by anti-crawling measures.
We can simply copy the values from our own machine; they can be found under Request Headers in the debugger. (Note: you don't need all of them; a few of the more distinctive fields are enough.)
With the data forged, just submit the form. Since a successful login redirects automatically, set allow_redirects to False in the post call to suppress the unnecessary redirect.
Code
import requests, json
from requests_toolbelt import MultipartEncoder

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def login(username, password):
    url = 'https://leetcode.com'
    cookies = session.get(url).cookies
    csrftoken = ''
    for cookie in cookies:
        if cookie.name == 'csrftoken':
            csrftoken = cookie.value
    url = 'https://leetcode.com/accounts/login'
    params_data = {
        'csrfmiddlewaretoken': csrftoken,
        'login': username,
        'password': password,
        'next': 'problems'
    }
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Origin': 'https://leetcode.com'}
    m = MultipartEncoder(params_data)
    headers['Content-Type'] = m.content_type
    session.post(url, headers=headers, data=m, timeout=10, allow_redirects=False)
    is_login = session.cookies.get('LEETCODE_SESSION') is not None
    return is_login
Crawling the Problems
Fetching All Problem Information
Page Analysis
Open the problem list page; debugging shows that https://leetcode.com/api/problems/all/ is the endpoint that returns all problems. Fetching and parsing its data yields every problem's name, number, difficulty, and so on. This endpoint does not include the detailed problem statement, which requires further analysis below.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_problems():
    url = 'https://leetcode.com/api/problems/all/'
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
    resp = session.get(url, headers=headers, timeout=10)
    question_list = json.loads(resp.content.decode('utf-8'))
    for question in question_list['stat_status_pairs']:
        # problem number
        question_id = question['stat']['question_id']
        # problem title slug
        question_slug = question['stat']['question__title_slug']
        # problem status
        question_status = question['status']
        # difficulty level: 1 = easy, 2 = medium, 3 = hard
        level = question['difficulty']['level']
        # skip paid-only problems
        if question['paid_only']:
            continue
Crawling a Problem's Details
Page Analysis
Open a problem's page and check whether there is an API request for the problem data (parsing the page source for data is relatively inefficient, so it is not the first choice).
On the problem page there are several requests to https://leetcode.com/graphql, one of which fetches the problem's details.
This introduces a new concept: GraphQL.
The following excerpt is from 以LeetCode為例——如何發(fā)送GraphQL Query獲取數(shù)據(jù) ("Taking LeetCode as an Example: How to Send a GraphQL Query to Fetch Data").
GraphQL is a query language for APIs, an abstract framework open-sourced by Facebook for providing data-query services. In server-side API development, the data an endpoint returns is often relatively fixed, so a RESTful API is not very flexible when you want more information, or only part of it. GraphQL provides a complete, understandable description of the data in an API, letting clients request exactly the data they need and nothing more; it also makes APIs easier to evolve over time and enables powerful developer tools.
Most of LeetCode's endpoints now go through GraphQL. We won't dig deeply into GraphQL here; roughly speaking, it uses an SQL-like query language so that a single endpoint can serve many kinds of queries. We simply copy the query that fetches problem details to achieve our goal.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_problem_by_slug(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'getQuestionDetail',
              'variables': {'titleSlug': slug},
              'query': '''query getQuestionDetail($titleSlug: String!) {
                question(titleSlug: $titleSlug) {
                  questionId
                  questionFrontendId
                  questionTitle
                  questionTitleSlug
                  content
                  difficulty
                  stats
                  similarQuestions
                  categoryTitle
                  topicTags {
                    name
                    slug
                  }
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Content-Type': 'application/json',
               'Referer': 'https://leetcode.com/problems/' + slug}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    # problem details
    question = content['data']['question']
    print(question)
Crawling the Latest AC Code
To fetch AC code you must first be logged in; the code below assumes the session is already authenticated.
Crawling a Problem's Submission List
Page Analysis
LeetCode currently has both a new and an old layout. I found no relevant request in the old one, but the new layout uses GraphQL to request the current account's list of submissions. The list records each submission's id, language, timestamp, status, and so on, but not the code itself, so we need to dig one level deeper.
Code
import requests, json

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_submissions(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'Submissions',
              'variables': {'offset': 0, 'limit': 20, 'lastKey': '', 'questionSlug': slug},
              'query': '''query Submissions($offset: Int!, $limit: Int!, $lastKey: String, $questionSlug: String!) {
                submissionList(offset: $offset, limit: $limit, lastKey: $lastKey, questionSlug: $questionSlug) {
                  lastKey
                  hasNext
                  submissions {
                    id
                    statusDisplay
                    lang
                    runtime
                    timestamp
                    url
                    isPending
                    __typename
                  }
                  __typename
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Content-Type': 'application/json'}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    for submission in content['data']['submissionList']['submissions']:
        print(submission)
Fetching the Code of a Single Submission
Page Analysis
Open a submission's page; its URL looks like https://leetcode.com/submissions/detail/123456789/, where the trailing number is the id of that submission.
We can already list summary information for every submission, so find the id of the latest AC submission and fetch that page's source.
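As a sketch, picking out the latest accepted submission's id from the submissionList results of the previous section could look like the following (the endpoint appears to return submissions newest-first; the helper name and sample entries are fabricated for illustration):

```python
# Pick the id of the most recent accepted submission from the list
# returned by the Submissions query. Field names (id, statusDisplay)
# match that query; the sample data below is made up.
def latest_ac_id(submissions):
    for submission in submissions:
        # submissions arrive newest-first, so the first "Accepted" wins
        if submission['statusDisplay'] == 'Accepted':
            return submission['id']
    return None

sample = [
    {'id': '103', 'statusDisplay': 'Wrong Answer'},
    {'id': '102', 'statusDisplay': 'Accepted'},
    {'id': '101', 'statusDisplay': 'Accepted'},
]
print(latest_ac_id(sample))  # 102
```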
Analyzing the page shows no data request being made; the data is rendered into the page itself, so we fall back to the traditional approach of fetching the page source and parsing it with a regular expression.
Viewing the page source in Chrome shows that the code is assigned to a JavaScript variable named submissionCode.
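The extraction can be tried against a fabricated fragment of such a page (the snippet below is made up; the real page wraps the variable in much more markup):

```python
import re

# Fabricated fragment mimicking the submission detail page, where the
# submitted code sits in a JavaScript variable named submissionCode.
page = "var pageData = {\n  submissionCode: 'print(1)',\n editCodeUrl: '/x/'\n};"

# Capture everything between submissionCode: '...' and the editCodeUrl
# key that follows it; re.S lets .* span newlines inside the code.
pattern = re.compile(r"submissionCode: '(?P<code>.*)',\n editCodeUrl", re.S)
m = pattern.search(page)
code = m.group('code') if m else None
print(code)  # print(1)
```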

Code
import requests, json, re

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def get_submission_by_id(submission_id):
    url = 'https://leetcode.com/submissions/detail/' + submission_id
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive', 'Content-Type': 'application/json'}
    code_content = session.get(url, headers=headers, timeout=10)
    pattern = re.compile(r'submissionCode: \'(?P<code>.*)\',\n editCodeUrl', re.S)
    m1 = pattern.search(code_content.text)
    code = m1.groupdict()['code'] if m1 else None
    print(code)
Full Code
import requests, json, re
from requests_toolbelt import MultipartEncoder

session = requests.Session()
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

def login(username, password):
    url = 'https://leetcode.com'
    cookies = session.get(url).cookies
    csrftoken = ''
    for cookie in cookies:
        if cookie.name == 'csrftoken':
            csrftoken = cookie.value
    url = 'https://leetcode.com/accounts/login'
    params_data = {
        'csrfmiddlewaretoken': csrftoken,
        'login': username,
        'password': password,
        'next': 'problems'
    }
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Origin': 'https://leetcode.com'}
    m = MultipartEncoder(params_data)
    headers['Content-Type'] = m.content_type
    session.post(url, headers=headers, data=m, timeout=10, allow_redirects=False)
    is_login = session.cookies.get('LEETCODE_SESSION') is not None
    return is_login

def get_problems():
    url = 'https://leetcode.com/api/problems/all/'
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive'}
    resp = session.get(url, headers=headers, timeout=10)
    question_list = json.loads(resp.content.decode('utf-8'))
    for question in question_list['stat_status_pairs']:
        # problem number
        question_id = question['stat']['question_id']
        # problem title slug
        question_slug = question['stat']['question__title_slug']
        # problem status
        question_status = question['status']
        # difficulty level: 1 = easy, 2 = medium, 3 = hard
        level = question['difficulty']['level']
        # skip paid-only problems
        if question['paid_only']:
            continue
        print(question_slug)

def get_problem_by_slug(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'getQuestionDetail',
              'variables': {'titleSlug': slug},
              'query': '''query getQuestionDetail($titleSlug: String!) {
                question(titleSlug: $titleSlug) {
                  questionId
                  questionFrontendId
                  questionTitle
                  questionTitleSlug
                  content
                  difficulty
                  stats
                  similarQuestions
                  categoryTitle
                  topicTags {
                    name
                    slug
                  }
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Content-Type': 'application/json',
               'Referer': 'https://leetcode.com/problems/' + slug}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    # problem details
    question = content['data']['question']
    print(question)

def get_submissions(slug):
    url = 'https://leetcode.com/graphql'
    params = {'operationName': 'Submissions',
              'variables': {'offset': 0, 'limit': 20, 'lastKey': '', 'questionSlug': slug},
              'query': '''query Submissions($offset: Int!, $limit: Int!, $lastKey: String, $questionSlug: String!) {
                submissionList(offset: $offset, limit: $limit, lastKey: $lastKey, questionSlug: $questionSlug) {
                  lastKey
                  hasNext
                  submissions {
                    id
                    statusDisplay
                    lang
                    runtime
                    timestamp
                    url
                    isPending
                    __typename
                  }
                  __typename
                }
              }'''
             }
    json_data = json.dumps(params).encode('utf8')
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive',
               'Referer': 'https://leetcode.com/accounts/login/',
               'Content-Type': 'application/json'}
    resp = session.post(url, data=json_data, headers=headers, timeout=10)
    content = resp.json()
    for submission in content['data']['submissionList']['submissions']:
        print(submission)

def get_submission_by_id(submission_id):
    url = 'https://leetcode.com/submissions/detail/' + submission_id
    headers = {'User-Agent': user_agent, 'Connection': 'keep-alive', 'Content-Type': 'application/json'}
    code_content = session.get(url, headers=headers, timeout=10)
    pattern = re.compile(r'submissionCode: \'(?P<code>.*)\',\n editCodeUrl', re.S)
    m1 = pattern.search(code_content.text)
    code = m1.groupdict()['code'] if m1 else None
    print(code)

print(login('u', 'p'))
# get_problems()
# get_problem_by_slug('two-sum')
get_submissions('two-sum')
get_submission_by_id('')