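The Island (《一出好戏》) has been in theaters for 17 days. As Huang Bo's directorial debut, it has already grossed 1.2 billion yuan at the box office and still holds a Douban score of 7.3, which is a very strong result. As someone learning web scraping, the author wanted to find out how viewers actually feel about the film. This article uses Python to scrape a little over four thousand reviews, stores the scraped data in a database, builds a word cloud from the review titles, and finally analyzes the ratings viewers gave.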
01 What to scrape

02 Main scraper code (data scraping, data storage, and rating visualization)
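The script in this section inserts rows into a table named ylxx, which has to exist before the script runs; the article does not show its definition. The following is a minimal sketch of one possible schema, not the author's actual table. The column names are taken from the INSERT statement in the script, while the TEXT types, the helper names `make_demo_db` and `insert_review`, and the sample row are assumptions made for illustration. It uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server.

```python
import sqlite3

# Hypothetical schema: the column names come from the article's INSERT
# statement; the TEXT types are an assumption (the scraper stores str values).
CREATE_YLXX = """
CREATE TABLE IF NOT EXISTS ylxx (
    Id      TEXT,
    time    TEXT,
    liked   TEXT,
    title   TEXT,
    useful  TEXT,
    useless TEXT,
    respond TEXT
)
"""


def make_demo_db():
    """Create an in-memory SQLite database containing an empty ylxx table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE_YLXX)
    return conn


def insert_review(conn, row):
    """Insert one 7-field review row using a parameterized statement."""
    # sqlite3 uses ? placeholders; pymysql uses %s, as in the article's script.
    conn.execute(
        "INSERT INTO ylxx (Id, time, liked, title, useful, useless, respond) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        row,
    )


demo = make_demo_db()
# Made-up sample row, shaped like the str() values the scraper would store.
insert_review(demo, ("['1']", "['2018-08-10']", "['推荐']", "sample title", "3", "0", "1"))
demo.commit()
row_count = demo.execute("SELECT COUNT(*) FROM ylxx").fetchone()[0]
print(row_count)
```

With MySQL, a similar CREATE TABLE statement (for example with VARCHAR columns) would be executed once through the pymysql cursor before scraping starts.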
# Import the required libraries
from lxml import etree
import requests
import time
import pymysql
import matplotlib as mpl
from matplotlib import pyplot as plt
from numpy import arange
# Connect to the database and create a cursor
conn = pymysql.connect(host='localhost', user='root', passwd='123', db='sys', port=3306, charset='utf8')
cursor = conn.cursor()
# liked_gather holds every rating that gets scraped
liked_gather = []
# Define the function that scrapes one review-list page
def get_info(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)
    # Note the two trailing spaces inside the class value
    infos = selector.xpath('//div[@class="review-list  "]/div')
    for info in infos:
        Id = info.xpath('div/@id')
        # Named review_time so it does not shadow the time module
        review_time = info.xpath('div/header/span[2]/@content')
        liked = info.xpath('div/header/span[1]/@title')
        title = info.xpath('div/div/h2/a/text()')[0]
        useful = info.xpath('div/div/div[3]/a[1]/span[1]/text()')[0].strip()
        useless = info.xpath('div/div/div[3]/a[2]/span/text()')[0].strip()
        respond = info.xpath('div/div/div[3]/a[3]/text()')[0]
        # liked is an xpath result list such as ['推荐']; it is stored as a list
        liked_gather.append(liked)
        # Insert the extracted fields into the database
        cursor.execute(
            "insert into ylxx (Id,time,liked,title,useful,useless,respond) values(%s,%s,%s,%s,%s,%s,%s)",
            (str(Id), str(review_time), str(liked), str(title), str(useful), str(useless), str(respond)))
# Program entry point
if __name__ == '__main__':
    # Build the page URLs and call get_info on each one
    # (i starts at 1, so the first page, start=0, is not requested)
    urls = ['https://movie.douban.com/subject/26985127/reviews?start={}'.format(str(i*20)) for i in range(1,216)]
    for url in urls:
        get_info(url)
        # Sleep for 2 seconds between requests
        time.sleep(2)
    conn.commit()
    # Tally the ratings by category and draw a bar chart with matplotlib
    # Each label is wrapped in a list because liked_gather holds xpath result lists
    like_labs = [['很差'],['较差'],['还行'],['推荐'],['力荐']]
    frequencies = []
    for like_lab in like_labs:
        frequency = liked_gather.count(like_lab)
        frequencies.append(frequency)
    label_list = ['很差','较差','还行','推荐','力荐']
    # Use a font that can render the Chinese labels
    mpl.rcParams['font.family']=['Microsoft YaHei']
    plt.xticks(arange(5),label_list)
    plt.bar(arange(5),frequencies,facecolor='#9999ff',edgecolor='white')
    plt.title('一出好戏评分分布',fontsize='large',fontweight='bold')
    plt.show()
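The tallying step at the end of the script counts how often each one-element list occurs in liked_gather. As an alternative sketch (not the author's code), the same counts can be produced by flattening the xpath results into plain strings and using collections.Counter. The function name `tally_ratings` and the sample data are made up for illustration.

```python
from collections import Counter

# Douban's five rating labels, ordered from worst to best.
LABELS = ['很差', '较差', '还行', '推荐', '力荐']


def tally_ratings(liked_gather):
    """Count ratings per label; each item is an xpath result list like ['推荐']."""
    # Flatten: take the first element of each non-empty list. An empty list
    # arises when the xpath matched nothing, so those items are skipped.
    flat = [item[0] for item in liked_gather if item]
    counts = Counter(flat)
    # Counter returns 0 for labels that never occurred.
    return [counts[label] for label in LABELS]


sample = [['推荐'], ['力荐'], [], ['推荐'], ['还行']]  # made-up data
frequencies = tally_ratings(sample)
print(frequencies)  # [0, 0, 1, 2, 1]
```

The returned list has the same order as the bar chart's labels, so it could be passed to plt.bar in place of the loop-built frequencies.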
The scraped data looks like this:


