Scraping Toutiao street-snap data: dealing with the slider CAPTCHA anti-scraping measure

Page being scraped: https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D

Toutiao is a site whose content is loaded dynamically with JavaScript.

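I first tried the requests library against the site's API, but noticed that the request URL carries a timestamp parameter. After some searching on Baidu, my guess (a beginner's guess!) is that this is one of Toutiao's newer anti-scraping measures. I had never run into this before and could not get past it, so I switched to automated scraping with the selenium library.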
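To make the timestamp parameter concrete, here is a minimal sketch of how such a request URL can be assembled. The parameter names and values are copied from the API URL that appears later in this post; `build_search_url` is a hypothetical helper, and treating the timestamp as "current time in milliseconds" is an assumption based on its 13-digit length. Whether a fresh timestamp alone is enough to satisfy the server is not something this post establishes.

```python
import time
from urllib.parse import urlencode

# Endpoint taken from the API URL used later in this post.
API = "https://www.toutiao.com/api/search/content/"


def build_search_url(keyword, offset, timestamp_ms=None):
    """Build the search API URL for one page of results (hypothetical helper)."""
    if timestamp_ms is None:
        # Assumption: the timestamp is the current Unix time in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    params = {
        "aid": 24,
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,  # urlencode percent-encodes this as UTF-8
        "autoload": "true",
        "count": 20,
        "en_qc": 1,
        "cur_tab": 1,
        "from": "search_tab",
        "pd": "synthesis",
        "timestamp": timestamp_ms,
    }
    return API + "?" + urlencode(params)


print(build_search_url("街拍", 0))
```

Building the query with `urlencode` keeps the keyword encoding and the paging offset in one place instead of editing a long URL string by hand.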

Next I scraped the page with the selenium library.

The result was disappointing: a CAPTCHA appeared. This must be Toutiao's anti-scraping measure, and the data I want is only reachable once this CAPTCHA has been solved.

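Slider CAPTCHA

This was my first encounter with an anti-scraping CAPTCHA. After a round of reading up on Baidu, my approach is as follows:

Because the CAPTCHA pops up on its own, we can capture it directly.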

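Step 1: Get the image without the gap, i.e. the CAPTCHA before it has been touched.

Step 2: Get the image with the gap.

Step 3: Compare the two images and find the x value of the first differing pixel; that is the distance to move.

Step 4: Imitate human behaviour (drag with uniform acceleration first, then uniform deceleration) by splitting the total distance into a series of small track segments.

Step 5: Perform the drag and complete the verification.

Step 6: Fetch the data.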
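Step 3 is the heart of the approach, so here is a minimal, browser-free sketch of the column scan. It works on plain nested lists of RGB tuples instead of screenshots so it can be tried on its own; `find_gap_x` and the tiny synthetic images are hypothetical, and the threshold of 60 is the value used in this post's code.

```python
def find_gap_x(full, gapped, start=0, threshold=60):
    """Return the x of the first column where the images differ noticeably.

    Both images are lists of rows, each row a list of (r, g, b) tuples,
    so a pixel is addressed as image[y][x]. Returns None if no column differs.
    """
    height = len(full)
    width = len(full[0])
    for x in range(start, width):          # scan column by column, left to right
        for y in range(height):
            p1, p2 = full[y][x], gapped[y][x]
            # A pixel counts as different if any channel differs by >= threshold.
            if any(abs(c1 - c2) >= threshold for c1, c2 in zip(p1, p2)):
                return x
    return None


# Synthetic 10x4 images: the "gap" is a dark 2x2 patch starting at column 6.
WHITE, DARK = (255, 255, 255), (40, 40, 40)
full = [[WHITE] * 10 for _ in range(4)]
gapped = [row[:] for row in full]
for y in (1, 2):
    for x in (6, 7):
        gapped[y][x] = DARK

print(find_gap_x(full, gapped))  # 6
```

Scanning columns in the outer loop is what makes the first hit the left edge of the gap; scanning rows first would return some pixel of the gap, but not necessarily its leftmost column.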

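import json
import re
import time
from io import BytesIO

from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver  # added: main() uses webdriver.Chrome()
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC  # added: used in wait.until(...)
from selenium.webdriver.support.ui import WebDriverWait  # added: used in main()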

def get_snap(driver):
    '''
    Take a screenshot of the whole page, save it to a file, then open it with PIL.Image.

    :return: image object
    '''
    driver.get_screenshot_as_file('snap.png')
    page_snap_obj = Image.open('snap.png')
    return page_snap_obj

def get_image(wait, driver):
    '''
    Crop the CAPTCHA image out of the full-page screenshot.

    :return: CAPTCHA image
    '''
    img = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'validate-main')))
    time.sleep(2)  # give the image time to finish rendering
    print(img)
    location = img.location
    size = img.size
    # Bounding box of the CAPTCHA element inside the page screenshot
    top = location['y']
    bottom = location['y'] + size['height']
    left = location['x']
    right = location['x'] + size['width']
    page_snap_obj = get_snap(driver)
    crop_imag_obj = page_snap_obj.crop((left, top, right, bottom))
    return crop_imag_obj

def get_distance(image1, image2):
    '''
    Work out how far the slider has to move.

    :param image1: image object without the gap
    :param image2: image object with the gap
    :return: distance to move
    '''
    start = 57       # first column to scan
    threshold = 60   # per-channel difference below which two pixels count as equal
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(start, image1.size[0]):
        for j in range(image1.size[1]):
            rgb1 = pixels1[i, j]
            rgb2 = pixels2[i, j]
            res1 = abs(rgb1[0] - rgb2[0])
            res2 = abs(rgb1[1] - rgb2[1])
            res3 = abs(rgb1[2] - rgb2[2])
            # print(res1, res2, res3)
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                return i - 7  # the original applies a 7-pixel correction here
    return i - 7

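def get_tracks(distance):
    '''
    Build the movement track, imitating a human drag: uniform acceleration, then uniform deceleration.
    Basic formulas for uniformly accelerated motion:
    ① v = v0 + a*t
    ② s = v0*t + (1/2)*a*t^2
    ③ v^2 - v0^2 = 2*a*s

    :param distance: distance to move
    :return: list holding the distance moved in each 0.3 s step
    '''
    # initial velocity
    v = 0
    # the track is sampled in steps of 0.3 s; each entry is the displacement within one step
    t = 0.3
    # displacement/track list; one element is the displacement of one 0.3 s step
    tracks = []
    # current displacement
    current = 0
    # start decelerating once mid is reached
    mid = distance * 4 / 5
    while current < distance:
        if current < mid:
            # the smaller the acceleration, the smaller each step and the finer the track
            a = 2
        else:
            a = -3
        # initial velocity of this step
        v0 = v
        # displacement within this 0.3 s step
        s = v0 * t + 0.5 * a * (t ** 2)
        # current position
        current += s
        # append to the track list
        tracks.append(round(s))
        # the velocity reached now becomes the initial velocity of the next step
        v = v0 + a * t
    return tracks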

def main():
    driver = webdriver.Chrome()
    driver.get('https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D')
    wait = WebDriverWait(driver, 20)
    # Get the image without the gap
    image1 = get_image(wait, driver)
    # Click the drag button so that the image with the gap appears
    button = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'validate-drag-button')))
    button.click()
    # Get the image with the gap
    image2 = get_image(wait, driver)
    print(image1, image1.size)
    print(image2, image2.size)
    # Compare the RGB pixels of the two images; the x value of the first differing pixel is the distance to move
    distance = get_distance(image1, image2)
    print(distance)
    # Imitate a human drag (accelerate, then decelerate) by splitting the total distance into small track segments
    tracks = get_tracks(distance)
    print(tracks)
    print(image1.size)
    print(distance, sum(tracks))
    # Drag along the track to complete the verification
    # (fixed: the class name had a stray leading 'o', and find_elements returned a list)
    button = driver.find_element(By.CLASS_NAME, 'validate-drag-button')
    ActionChains(driver).click_and_hold(button).perform()
    for track in tracks:
        ActionChains(driver).move_by_offset(xoffset=track, yoffset=0).perform()
    ActionChains(driver).move_by_offset(xoffset=3, yoffset=0).perform()   # overshoot a little
    ActionChains(driver).move_by_offset(xoffset=-3, yoffset=0).perform()  # then come back, which looks more human
    time.sleep(0.5)  # release the mouse after 0.5 s
    ActionChains(driver).release().perform()
    shixian(driver)
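The track generation is the part most easily taken on faith, so here is a small sketch that can be run without a browser. Each list entry is the displacement s = v0*t + (1/2)*a*t^2 covered in one 0.3 s step, with a = 2 until four fifths of the distance and a = -3 afterwards, as in get_tracks above. Because every step is rounded, the steps need not add up to the requested distance, which is presumably why main() prints distance and sum(tracks) side by side. The variant below, under the hypothetical name `get_tracks_exact`, appends the remainder as a final step; this correction is my assumption about a reasonable fix, not something the original code does.

```python
def get_tracks_exact(distance, t=0.3, accel=2, decel=-3):
    """Accelerate-then-decelerate track whose integer steps sum to `distance`.

    Hypothetical variant of get_tracks; `distance` is expected to be an int.
    """
    tracks = []
    current = 0.0          # exact (unrounded) position
    v = 0.0                # velocity at the start of the current step
    mid = distance * 4 / 5
    while current < distance:
        a = accel if current < mid else decel
        s = v * t + 0.5 * a * t ** 2   # displacement in this time step
        if s <= 0:
            break          # safety stop: never drift backwards or loop forever
        current += s
        tracks.append(round(s))
        v += a * t         # end velocity becomes the next step's start velocity
    remainder = distance - sum(tracks)
    if remainder:
        tracks.append(remainder)       # final correction, may be negative
    return tracks


tracks = get_tracks_exact(100)
print(tracks, sum(tracks))
```

A negative final step simply means the slider overshot slightly and is pulled back, which resembles the manual "+3, then -3" nudge in main().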

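That is the verification code (to be honest, I do not fully understand the algorithm myself). After it, the shixian() function runs and scrapes the dynamically loaded JSON pages.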

def shixian(driver):
    # The offset advances by 20 per page, so start at offset=0 (the first page) and step by 20 up to 1000.
    # The data runs out long before 1000 (there are really only 7 pages, i.e. up to offset=140),
    # which used to raise an error; the guard below stops at the first page without data.
    for j in range(0, 1000, 20):
        # This is Toutiao's JSON endpoint; all the data we want is in its response
        url = "https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset={}&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1556371760933".format(j)
        driver.get(url=url)
        # page_source is not a clean JSON string: it is wrapped in HTML tags,
        # so passing it straight to json.loads raises an error
        # (much more data follows in the response; not shown)
        text = driver.page_source
        # Regular expression matching the HTML prefix
        pattern1 = re.compile(r'<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">')
        # re.sub replaces the matched text with '', which strips the HTML wrapper; same idea below
        out1 = re.sub(pattern1, '', text)
        # print(out1)
        pattern = re.compile(r'</pre></body></html>')
        data = re.sub(pattern, '', out1)
        datad = json.loads(data)
        # print(datad)
        shuju = datad['data']  # pick out the contents of 'data'
        if not shuju:  # added guard: no more pages
            break
        # 'data' holds small dictionaries nested inside the big one, so loop over each of them
        # and look up keys with .get('key', 'default'); the default ('' here) is returned when
        # the key is missing, so the program does not stop with an error
        for i in shuju:
            print(i.get('abstract', ''))
            print(i.get('image_list', ''))


if __name__ == '__main__':
    main()
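The two hard-coded patterns above only match if the browser wraps the JSON in exactly that HTML. As a hedged alternative, the sketch below pulls out whatever sits inside the first `<pre>` element and stops paging at the first empty page. `extract_json`, `iter_items` and `fake_fetch` are hypothetical names; the assumption that the browser HTML-escapes characters such as `&` inside page_source is mine, and the sample data is made up purely for illustration.

```python
import html
import json
import re

# Matches the body of the first <pre ...> element, whatever its attributes.
PRE_BODY = re.compile(r"<pre[^>]*>(.*?)</pre>", re.DOTALL)


def extract_json(page_source):
    """Parse the JSON that a browser displays wrapped in <html>...<pre>...</pre>."""
    match = PRE_BODY.search(page_source)
    raw = match.group(1) if match else page_source
    # Assumption: entities such as &amp; were added by the browser, so undo them.
    return json.loads(html.unescape(raw))


def iter_items(fetch_page, step=20, limit=1000):
    """Yield result items page by page, stopping at the first empty page."""
    for offset in range(0, limit, step):
        items = extract_json(fetch_page(offset)).get("data")
        if not items:
            return
        yield from items


def fake_fetch(offset):
    """Stand-in for driver.get(url) + driver.page_source, with made-up data."""
    pages = {0: [{"abstract": "a & b"}], 20: [{"abstract": "c"}]}
    body = html.escape(json.dumps({"data": pages.get(offset, [])}), quote=False)
    return ('<html><head></head><body><pre style="word-wrap: break-word;">'
            + body + "</pre></body></html>")


for item in iter_items(fake_fetch):
    print(item.get("abstract", ""))
```

In the real script, `fetch_page` would be a small function that loads the API URL for the given offset in the driver and returns `driver.page_source`.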

2019-04-29
