

The site we are scraping is http://xwxmovie.cn/

The tools used are selenium and BeautifulSoup.

As usual, we start with the basic initialization code:

baseURL = "http://xwxmovie.cn/"
# Note: these headers are defined here but are not passed to Selenium in this script.
headers = {
    'Host': 'xwxmovie.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/64.0.3282.167 Safari/537.36'
}

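def browser_get():
    browser = webdriver.Chrome()
    try:
        browser.get(baseURL)
        html_text = browser.page_source
    finally:
        # Always close the browser, even if loading the page fails
        browser.quit()
    page_count = get_page_count(html_text)  # total pages; only the first page is parsed here
    get_page_data(html_text)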

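At first I wanted to extract the fragments with BeautifulSoup alone, but since I had only just started learning it and did not yet know many of its APIs, I ended up using a regular expression to match the region I wanted first, and then BeautifulSoup to pull out the movie title and other details: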

# re.S lets '.' match newlines; each match is a tuple (image, category, copy)
items = re.findall(re.compile('<div id="post-.*?class="post-.*?style="position:.*?>'
                              '.*?<div class="pinbin-image">(.*?)</div>'
                              '.*?<div class="pinbin-category">(.*?)</div>'
                              '.*?<div class="pinbin-copy">(.*?)</div>'
                              '.*?</div>', re.S), html)

Now we loop over the matches and pick out what we want from each one:

    for item in items:
        if item[0].strip():
            soup = BeautifulSoup(item[0].strip(), 'html.parser')
            img = soup.find('img', attrs={'class': 'attachment-detail-image wp-post-image'})
            # Poster image (skipped when the <img> tag or its src is missing)
            if img and img.get('src'):
                print("Poster: " + img.get('src'))
        if item[1].strip():
            soup = BeautifulSoup(item[1].strip(), 'html.parser')
            categorys = soup.find_all('a')
            for category in categorys:
                print(category.get_text())
        if item[2].strip():
            soup = BeautifulSoup(item[2].strip(), 'html.parser')
            title = soup.find('a', attrs={'class': 'front-link'})
            print("Movie title: " + title.get_text())
            print("Link: " + title.get('href'))
            date = soup.find('p', attrs={'class': 'pinbin-date'})
            print("Date: " + date.get_text())
            brief = soup.find_all('p')
            # .string is None when a tag has nested children, so use get_text()
            if len(brief) > 1:
                print("Synopsis: " + brief[1].get_text())

That gives us the data for one page.
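For comparison, here is a minimal sketch of the same extraction done with BeautifulSoup alone, using CSS selectors instead of the regular expression. The sample markup is hypothetical: it only mirrors the class names that appear in the regex above, and the real page may be structured differently. The function name `parse_movies` and the dictionary keys are likewise made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the class names used in the regex above.
SAMPLE_HTML = """
<div id="post-1" class="post-1 post" style="position: absolute;">
  <div class="pinbin-image">
    <a href="http://example.com/movie-1"><img class="attachment-detail-image wp-post-image" src="http://example.com/poster-1.jpg"></a>
  </div>
  <div class="pinbin-category"><a href="#">Action</a><a href="#">Drama</a></div>
  <div class="pinbin-copy">
    <h2><a class="front-link" href="http://example.com/movie-1">Example Movie</a></h2>
    <p class="pinbin-date">2018-03-01</p>
    <p>A short synopsis.</p>
  </div>
</div>
"""


def parse_movies(html):
    """Return one dict per post found in the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # Select every <div> whose id starts with "post-"
    for post in soup.select('div[id^="post-"]'):
        title = post.select_one("div.pinbin-copy a.front-link")
        if title is None:
            continue  # not a movie entry, skip it
        img = post.select_one("div.pinbin-image img")
        date = post.select_one("div.pinbin-copy p.pinbin-date")
        paragraphs = post.select("div.pinbin-copy p")
        movies.append({
            "title": title.get_text(strip=True),
            "link": title.get("href"),
            "poster": img.get("src") if img else None,
            "categories": [a.get_text(strip=True)
                           for a in post.select("div.pinbin-category a")],
            "date": date.get_text(strip=True) if date else None,
            # As in the original loop, the second <p> is taken to be the synopsis
            "synopsis": paragraphs[1].get_text(strip=True) if len(paragraphs) > 1 else None,
        })
    return movies


if __name__ == "__main__":
    for movie in parse_movies(SAMPLE_HTML):
        print(movie)
```

Returning dictionaries instead of printing makes the results easier to reuse later, and the `None` checks keep one malformed entry from stopping the whole run.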


To get everything, we first need the total number of pages, and then loop over the pages to fetch each one:

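# Get the total number of pages
def get_page_count(html):
    soup = BeautifulSoup(html, 'html.parser')
    page_count = soup.find('span', attrs={'class': 'pages'})
    # Takes the two characters before the last two characters of the pager text,
    # so this only works for page counts of up to two digits in that layout
    return int(page_count.get_text()[-4:-2])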
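The fixed slice `[-4:-2]` depends on the exact width of the pager text. A slightly more tolerant sketch is shown below. It assumes that the total page count is the last number in the text of the `pages` span, which is an assumption about the site and not something confirmed here; the helper name `parse_page_count` and the sample strings are hypothetical.

```python
import re


def parse_page_count(pages_text):
    """Return the last integer found in the pager text, or 1 if there is none."""
    numbers = re.findall(r"\d+", pages_text)
    return int(numbers[-1]) if numbers else 1


if __name__ == "__main__":
    # Hypothetical pager strings, for illustration only
    print(parse_page_count("第 1 页，共 23 页"))
    print(parse_page_count("Page 1 of 120"))
```

Inside `get_page_count`, this could be called as `parse_page_count(page_count.get_text())` in place of the slice.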

The final code is as follows:

# -*- coding: UTF-8 -*-

from selenium import webdriver
from bs4 import BeautifulSoup
import re

baseURL = "http://xwxmovie.cn/"
# Note: these headers are defined here but are not passed to Selenium in this script.
headers = {
    'Host': 'xwxmovie.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/64.0.3282.167 Safari/537.36'
}


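def browser_get():
    browser = webdriver.Chrome()
    try:
        browser.get(baseURL)
        html_text = browser.page_source
    finally:
        # Always close the browser, even if loading the page fails
        browser.quit()
    page_count = get_page_count(html_text)  # total pages; only the first page is parsed here
    get_page_data(html_text)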


# Get the total number of pages
def get_page_count(html):
    soup = BeautifulSoup(html, 'html.parser')
    page_count = soup.find('span', attrs={'class': 'pages'})
    return int(page_count.get_text()[-4:-2])


def get_page_data(html):
    items = re.findall(re.compile('<div id="post-.*?class="post-.*?style="position:.*?>'
                                  '.*?<div class="pinbin-image">(.*?)</div>'
                                  '.*?<div class="pinbin-category">(.*?)</div>'
                                  '.*?<div class="pinbin-copy">(.*?)</div>'
                                  '.*?</div>', re.S), html)
    for item in items:
        if item[0].strip():
            soup = BeautifulSoup(item[0].strip(), 'html.parser')
            img = soup.find('img', attrs={'class': 'attachment-detail-image wp-post-image'})
            # Poster image (skipped when the <img> tag or its src is missing)
            if img and img.get('src'):
                print("Poster: " + img.get('src'))
        if item[1].strip():
            soup = BeautifulSoup(item[1].strip(), 'html.parser')
            categorys = soup.find_all('a')
            for category in categorys:
                print(category.get_text())
        if item[2].strip():
            soup = BeautifulSoup(item[2].strip(), 'html.parser')
            title = soup.find('a', attrs={'class': 'front-link'})
            print("Movie title: " + title.get_text())
            print("Link: " + title.get('href'))
            date = soup.find('p', attrs={'class': 'pinbin-date'})
            print("Date: " + date.get_text())
            brief = soup.find_all('p')
            # .string is None when a tag has nested children, so use get_text()
            if len(brief) > 1:
                print("Synopsis: " + brief[1].get_text())

if __name__ == '__main__':
    browser_get()
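The script above computes `page_count` but only parses the first page. A rough sketch of how the remaining page addresses might be generated is shown below. The `page/N/` URL scheme is purely an assumption (it is a common pattern on WordPress-style sites) and has not been checked against this site; the helper name `build_page_urls` is also hypothetical.

```python
from urllib.parse import urljoin


def build_page_urls(base_url, page_count):
    """Build one URL per page.

    Assumed (unverified) scheme: page 1 is base_url itself,
    page n is base_url + 'page/n/'.
    """
    urls = [base_url]
    for n in range(2, page_count + 1):
        urls.append(urljoin(base_url, f"page/{n}/"))
    return urls


if __name__ == "__main__":
    for url in build_page_urls("http://xwxmovie.cn/", 3):
        print(url)
```

If the scheme holds, `browser_get` could keep the browser open, call `browser.get(url)` for each URL in turn, and pass each `browser.page_source` to `get_page_data`.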

Learning Python: Scraping a Movie Resource Site
