Originally published at https://old-panda.com/2020/01/02/python-selenium-download-weibo-video/
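It's 2020 and I'm still writing scraper code, which feels a bit like joining the Nationalist army in 1949. But the code is already written, and this blog is meant to be a personal knowledge base, so I'll jot it down briefly in case it turns out to be useful someday.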
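As I mentioned before, to earn a few points on Reddit I wrote a helper tool that posts automatically, one panda video per day, under the motto "a panda a day keeps the sorrow away". One of these posts, for example, drew plenty of views and upvotes from panda fans. Every post credits the source of the video, which is Weibo. To practice scraping videos with Selenium, I wrote this simple function to find the real link to the video: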
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Request headers that mimic a desktop Chrome browser.
HEADERS = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "referer": "https://passport.weibo.com/visitor/visitor?entry=miniblog&a=enter&url=https%3A%2F%2Fweibo.com%2Ftv%2Fv%2FI9cdSBVBP&domain=.weibo.com&ua=php-sso_sdk_client-0.6.28&_rand=1569807841.8018"
}


def parse_weibo_video(url):
    # Run Chrome headless so that no browser window opens.
    option = webdriver.ChromeOptions()
    option.add_argument('headless')
    # executable_path works on Selenium 3; Selenium 4 passes the driver path
    # through selenium.webdriver.chrome.service.Service instead.
    driver = webdriver.Chrome(executable_path="/path/to/chromedriver", options=option)
    driver.get(url)
    try:
        # Wait up to 60 seconds for the <video> tag to appear.
        element = WebDriverWait(driver, 60).until(
            EC.presence_of_element_located((By.TAG_NAME, "video"))
        )
        # Wait for the element in the player info block that holds the user name.
        user = WebDriverWait(driver, 60).until(
            EC.presence_of_element_located((By.XPATH, "//div[@class='player_info']//div[@class='clearfix']/a/span"))
        )
        return element.get_property("src"), user.text
    finally:
        # Always shut the browser down, even if a wait times out.
        driver.quit()
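The url parameter here is the link to the Weibo video page, for example https://www.weibo.com/tv/v/In7Oce2uO. The code uses headless Chrome to simulate the loading process a normal user's browser goes through when viewing the page. The following statement is the core of getting the video link: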
element = WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.TAG_NAME, "video"))
)
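This tells the Selenium driver to wait for an HTML video tag to appear on the page, timing out after 60 seconds (from overseas, sites hosted in China can sometimes take quite a while to load). Once the video tag appears, we simply return the value of its src property, which is the video link, and from there you can do whatever you like with it.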
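To make the waiting behavior more concrete, here is a minimal sketch of the idea behind an explicit wait: call a condition repeatedly until it returns something truthy or the timeout runs out. This is a simplified illustration and not Selenium's actual implementation; the names wait_until, WaitTimeout, and find_video are made up for this example, and the fake find_video function stands in for a page whose video tag only shows up on the third check.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=60.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose <video> tag shows up on the third check.
attempts = {"count": 0}


def find_video():
    attempts["count"] += 1
    return "<video>" if attempts["count"] >= 3 else None


found = wait_until(find_video, timeout=5, poll_interval=0.01)
print(found, "found after", attempts["count"], "checks")
```

In the real function, WebDriverWait(driver, 60).until(...) plays the role of wait_until, and the presence_of_element_located condition plays the role of find_video. If the tag never appears, Selenium raises a TimeoutException, which is why the driver.quit() call sits in a finally block.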