Scraping Douban reviews of 《囧媽》 with Python and drawing word clouds

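I have been learning web scraping with Python recently, so I used Douban movie reviews as a practice exercise.

Goal: scrape the 20 most-liked reviews of the film 《囧媽》 (Lost in Russia) and generate a word cloud for each review
Review page: https://movie.douban.com/subject/30306570/reviews
Main Python libraries: requests, BeautifulSoup, jieba, wordcloud (plus scikit-learn for the word counts)

[Figure: the 《囧媽》 reviews page]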

I. Scraping the reviews

1. Import the required Python libraries

import csv

import jieba
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

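2. Inspect the page's HTML

Open the page in Google Chrome's developer tools (right-click, then Inspect) to view its HTML. As the figure below shows, the reviews in the list are not fully expanded, so we first need the link to each review and then fetch the full text from that link.

[Figure: the reviews page in Chrome's developer tools]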
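Before pointing any code at the live site, it can help to try the "find the title link inside each review block" idea on a hand-written fragment. The sketch below is only an illustration: the HTML is hypothetical (only the `main-bd` class name comes from this post, and the `example.com` URLs are placeholders), and it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without any third-party package.

```python
from html.parser import HTMLParser

# Hypothetical fragment; only the "main-bd" class name comes from this post.
SAMPLE_HTML = """
<div class="main-bd">
  <h2><a href="https://example.com/review/1/">First review title</a></h2>
  <div class="short-content">Collapsed preview text ...</div>
</div>
<div class="main-bd">
  <h2><a href="https://example.com/review/2/">Second review title</a></h2>
</div>
"""


class ReviewLinkParser(HTMLParser):
    """Collect the first link (title, href) inside each div with class main-bd."""

    def __init__(self):
        super().__init__()
        self.links = []       # collected (title, href) tuples
        self._depth = 0       # div nesting depth inside the current main-bd block
        self._found = False   # True once the block's first link has been stored
        self._href = None     # href of the link currently being read, if any
        self._text = []       # text pieces of the link currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth > 0:
                self._depth += 1
            elif "main-bd" in (attrs.get("class") or "").split():
                self._depth = 1
                self._found = False
        elif tag == "a" and self._depth > 0 and not self._found and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
            self._found = True
        elif tag == "div" and self._depth > 0:
            self._depth -= 1


parser = ReviewLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.links:
    print(title, href)
```

The collapsed preview text in the first block is ignored on purpose: only the title link matters at this stage, because the full review is fetched from that link later.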

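3. Get and save each review's title, link, and like count

The code below defines four functions. The first returns the parsed content of a page given its URL. The second collects the like count of each review in a list. The third collects the title and link of each review in a dictionary. The fourth saves the text of one review to a .txt file. Together they cover everything needed to scrape the reviews.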

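# Path template for the saved reviews (a raw string, so the backslashes are literal);
# the movies directory must already exist
SAVE_PATH = r".\movies\{}.txt"
# Fetch a page and return its parsed content
def get_soup(url):
    headers = {
        'user-agent': 'Mozilla/5.0'
    }
    r = requests.get(url, headers=headers, timeout=10)
    # Fail early on an HTTP error instead of parsing an error page
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    return soup
# Get the like ("useful") count of each review, as a list of strings
def get_useful_numbers(soup):
    useful_numbers = soup.find_all(title="有用")
    useful_numbers_list = []
    for s in useful_numbers:
        # Remove the newlines and spaces around the number
        useful_numbers_list.append(s.span.text.replace("\n", "").replace(" ", ""))
    return useful_numbers_list
# Get the title and link of each review, as a {title: url} dictionary
def get_url_dict(soup):
    reviews = soup.find_all(class_="main-bd")
    url_dict = {}
    for review in reviews:
        # The first link in each review block is the title link
        link = review.find("a", href=True)
        title = link.get_text(strip=True)
        url_dict[title] = link['href']
    return url_dict
# Save the full text of one review to a text file
def write_to_file(title, url):
    soup = get_soup(url)
    # The full review text sits in the element with id "link-report"
    pp = soup.find(id="link-report")
    contents = pp.find_all("p")
    with open(SAVE_PATH.format(title), 'w', encoding="utf-8") as f:
        for content in contents:
            f.write("{}\n".format(content.text))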
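Because the review title becomes part of the file name, a title containing characters such as `/`, `?` or `:` would make `open` fail, and the target folder has to exist beforehand. The helpers below are a hypothetical sketch of one way to guard against both; `safe_filename` and `review_path` are names introduced here for illustration and are not part of the original code.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, max_length=100):
    """Return `name` with characters unsafe for file names replaced by "_"."""
    cleaned = _INVALID_CHARS.sub("_", name).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "untitled"


def review_path(base_dir, filename, suffix=".txt"):
    """Build base_dir/<safe filename><suffix>, creating base_dir if needed."""
    directory = Path(base_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / (safe_filename(filename) + suffix)


print(safe_filename('1024_What? A "road" movie: part 1/2'))
```

With helpers like these, `write_to_file` could open `review_path("movies", title)` instead of formatting the path template directly.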

II. Generating a word cloud for each review

Now that each review is saved as a .txt file, we define a function that turns the contents of a file into a word cloud.

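# Generate a word cloud (and a word-frequency csv) for the given file name
def generate_wordcloud(filename):
    # Read the review text
    with open(r".\movies\{}.txt".format(filename), encoding="utf-8") as f:
        contents = f.read()

    # Segment the text with jieba and join the words with spaces
    contents_cut = jieba.cut(contents)
    contents_list = " ".join(contents_cut)

    # Build the word cloud. collocations=False keeps word pairs out of the cloud, which
    # avoids repeated words. A mask image (with a plain background) could set the shape
    # of the cloud; none is used here. The stopwords are in simplified Chinese to match
    # the review text, and font_path must point to a font that contains Chinese glyphs.
    wc = WordCloud(stopwords={"电影", "徐峥", "一个"}, collocations=False,
                   background_color="white",
                   font_path=r"C:\Windows\Fonts\simhei.ttf",
                   width=400, height=300, random_state=42)
    wc.generate(contents_list)
    # The movies\images directory must already exist
    wc.to_file(r".\movies\images\{}.png".format(filename))

    # Count word frequencies with CountVectorizer
    # (its default token pattern drops single-character words)
    cv = CountVectorizer()
    contents_count = cv.fit_transform([contents_list])
    # The words; get_feature_names() was removed in scikit-learn 1.2
    list1 = cv.get_feature_names_out()
    # The frequency of each word
    list2 = contents_count.toarray().tolist()[0]
    # Map each word to its frequency
    contents_dict = dict(zip(list1, list2))
    # Write the csv file; newline="" avoids blank rows between records,
    # and the explicit encoding avoids errors with the platform default
    with open(r".\movies\csvfiles\{}.csv".format(filename), 'w', newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for key, value in contents_dict.items():
            writer.writerow([key, value])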
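The word-count step can also be checked in isolation, without scikit-learn, on text that is already segmented. The sketch below is an assumption-laden stand-in rather than the code used above: `word_frequencies` is a hypothetical helper, the sample sentence is made up, and `min_length=2` is chosen to mirror the fact that `CountVectorizer`'s default token pattern only keeps tokens of two or more characters.

```python
from collections import Counter


def word_frequencies(segmented_text, stopwords=(), min_length=2):
    """Count tokens in a space-separated string, skipping stopwords and short tokens."""
    tokens = [
        token for token in segmented_text.split()
        if len(token) >= min_length and token not in stopwords
    ]
    return Counter(tokens)


# Hypothetical pre-segmented input, in the same "words joined by spaces" form
# that " ".join(jieba.cut(...)) produces.
sample = "电影 很 好看 电影 里 的 妈妈 很 真实 妈妈 妈妈"
counts = word_frequencies(sample, stopwords={"电影"})
for word, count in counts.most_common(3):
    print(word, count)
```

A `Counter` also makes it easy to write only the most frequent words to the csv file, via `most_common(n)`.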

III. Main program and results

With all the required functions defined, we can now write the main program that generates the word clouds.

url = "https://movie.douban.com/subject/30306570/reviews"
soup = get_soup(url)
url_dict = get_url_dict(soup)
useful_numbers_list = get_useful_numbers(soup)
# Pair each review with its like count; the file name is "<likes>_<title>"
for (title, review_url), useful in zip(url_dict.items(), useful_numbers_list):
    filename = useful + "_" + title
    write_to_file(filename, review_url)
    generate_wordcloud(filename)

The output looks like this:

[Figure: word clouds generated from the 《囧媽》 reviews]
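The main program takes the reviews in the order the page lists them and treats the like counts as text. If a strict ranking by like count is wanted, the counts can be converted to integers and sorted. The sketch below assumes the two inputs have the shapes returned by `get_url_dict` and `get_useful_numbers`; `parse_count` and `rank_reviews` are hypothetical helpers, and the sample data is made up.

```python
import re


def parse_count(text):
    """Convert a scraped count string such as ' 1,024 ' to an int (0 if no digits)."""
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else 0


def rank_reviews(url_dict, useful_numbers):
    """Pair each (title, url) with its like count and sort by count, highest first."""
    if len(url_dict) != len(useful_numbers):
        raise ValueError("titles and like counts are out of step")
    paired = [
        (parse_count(count), title, url)
        for (title, url), count in zip(url_dict.items(), useful_numbers)
    ]
    return sorted(paired, key=lambda item: item[0], reverse=True)


# Hypothetical data in the shapes returned by get_url_dict and get_useful_numbers.
urls = {"Review A": "https://example.com/a", "Review B": "https://example.com/b"}
votes = ["87", "1024"]
for count, title, url in rank_reviews(urls, votes):
    print(count, title, url)
```

The length check matters because two reviews with the same title would collapse into a single dictionary key, which would silently shift every later title against the wrong like count.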

