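This post uses Python, combining web scraping with visualization, to run some simple statistics and analysis on the contestants of 《青春有你2》 (Youth With You 2).

I. Data collection

I use the scrapy framework, introduced in an earlier post, to crawl iQiyi's official fan-voting site. The pages are rendered dynamically, and the page analysis below suggests two options: (1) drive Chrome in headless mode through selenium and parse the rendered HTML by its tags, or (2) request the page's underlying dynamic endpoint directly, which returns the contestant data as JSON. Either way, the crawler can collect each contestant's name, date of birth, height, weight, and other details, along with her photos.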

image

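1. Page analysis

A quick tour of the voting site: the home page is the voting page, and clicking an avatar opens that contestant's page, which has gift unlocks and highlight videos. From there, the Paopao (泡泡) fan circle shows her detailed profile.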

image

Personal profile page

image

Main site to parse: http://www.iqiyi.com/h5act/generalVotePlat.html?activityId=373

Option 1: load the page through selenium and parse the page source it returns.

image

The XPath Helper extension for Chrome can help with working out the XPath expressions.

image

Option 2: find the URL the page actually requests (it is listed under Headers for the request) and fetch the JSON response directly.

image
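As a rough sketch of option 2, the snippet below fetches a JSON endpoint with the standard library and pulls out the per-contestant option list. The nesting (`data`, then the vote id, then `childs[0]`, then `options`) mirrors what the spider later in this post reads. The helper names `fetch_json` and `extract_options` and the sample payload are made up for illustration, and the real endpoint may need extra parameters or may have changed since.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode the body as JSON (hypothetical helper)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_options(payload, vote_id):
    """Return the list of contestant dicts from a vote-info payload.

    Assumes the nesting data -> vote_id -> childs[0] -> options.
    """
    return payload["data"][vote_id]["childs"][0]["options"]


# Made-up payload with the assumed shape, so the sketch runs offline.
sample = {
    "data": {
        "demo-vote": {
            "childs": [
                {"options": [{"text": "Contestant A"}, {"text": "Contestant B"}]}
            ]
        }
    }
}
names = [opt["text"] for opt in extract_options(sample, "demo-vote")]
print(names)
```

Against the live site, `fetch_json` would be called with the request URL copied from the browser, and its result passed to `extract_options` with the real vote id.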

2. Fetching the data

Preparation

Install the Python libraries scrapy and selenium.

scrapy is a crawling framework, covered in an earlier post:

pip install scrapy

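Option 1 additionally requires selenium and a browser driver. I used option 2 and fetched the JSON directly, but the setup for option 1 is recorded briefly here as well.

selenium is a tool for testing web applications. It runs directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Firefox, Safari, Chrome, and Opera. In a crawler, it simulates a normal user visiting a page and retrieves the data.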

pip install selenium

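Installing the driver

I chose Chrome as the browser for simulating user actions, so the matching driver, chromedriver, has to be installed so that selenium can control Chrome.

Check the browser version under Help > About Google Chrome.

Download the driver from the Taobao mirror:
http://npm.taobao.org/mirrors/chromedriver/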

image

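Find the driver that matches your browser version and download it. Unzip it and put chromedriver.exe in a directory that is on the PATH environment variable so that programs can find it. I put mine in the Python installation directory.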
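With the driver in place, option 1 could look roughly like the sketch below. It is a minimal outline, not the code used for this post: `chrome_arguments` and `fetch_rendered_html` are hypothetical helper names, the Chrome switches are just a common choice, and it assumes chromedriver is reachable on PATH. A dynamic page may also need an explicit wait before its content shows up in `page_source`.

```python
def chrome_arguments(headless=True):
    """Command-line switches passed to Chrome (hypothetical helper)."""
    args = ["--disable-gpu", "--window-size=1280,800"]
    if headless:
        args.insert(0, "--headless")
    return args


def fetch_rendered_html(url, headless=True):
    """Open the page in Chrome via chromedriver and return the rendered HTML."""
    # Imported here so this module still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Usage (starts a real browser, so it is left commented out):
# html = fetch_rendered_html(
#     "http://www.iqiyi.com/h5act/generalVotePlat.html?activityId=373")
print(chrome_arguments())
```

The returned HTML can then be parsed with XPath expressions in the same way as a normal scrapy response.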

Crawling the data

Create the scrapy project

scrapy startproject youngni
cd youngni
scrapy genspider star "replace with the iQiyi official voting site"

Writing the code

Define the Item fields

# -*- coding: utf-8 -*-
import scrapy

class StarItem(scrapy.Item):
    order = scrapy.Field()      # rank order
    name = scrapy.Field()       # name
    page_url = scrapy.Field()   # personal page
    option_id = scrapy.Field()  # id of the detailed profile page
    photo_url = scrapy.Field()  # photo
    discover_image = scrapy.Field()  # photo with transparent background

    num_fans = scrapy.Field()        # number of fans
    num_content = scrapy.Field()     # number of posts

    birth_place = scrapy.Field()  # place of birth
    birth_day = scrapy.Field()    # birthday
    zodiac = scrapy.Field()       # zodiac sign
    height = scrapy.Field()       # height
    weight = scrapy.Field()       # weight
    occupation = scrapy.Field()   # occupation
    hobby = scrapy.Field()        # hobbies
    profile = scrapy.Field()      # short bio
    school = scrapy.Field()       # school graduated from
    fannick = scrapy.Field()      # fan nickname

Handling the returned JSON and parsing the pages

# -*- coding: utf-8 -*-
import ast
import json

import scrapy
from youngni.items import StarItem


class StarSpider(scrapy.Spider):
    name = 'star'
    allowed_domains = ['iqiyi.com']
    # start_urls = ['https://www.iqiyi.com/h5act/generalVotePlat.html?activityId=373']
    start_urls = ['https://vote.iqiyi.com/vote-api/r/getMergeVoteInfo?voteIds=0463831125010981&sourceId=1&uid=2033788978&sign=5a8981521f3c13377f1d0843363d3652']

    # Labels on the profile page -> StarItem field names.
    # The keys must match the (Simplified Chinese) text on the page exactly.
    info_map = {'出生地': 'birth_place', '生日': 'birth_day', '星座': 'zodiac', '身高': 'height',
                '体重': 'weight', '职业': 'occupation', '简介': 'profile', '爱好': 'hobby',
                '毕业院校': 'school', '粉丝昵称': 'fannick'}

    def parse_count(self, text):
        """Convert a count to an int; a trailing 万 means units of 10,000."""
        text = text.strip()
        if '万' in text:
            return int(round(float(text.replace('万', '')) * 10000))
        return int(text)

    def parse(self, response):
        json_data = json.loads(response.body)
        data_dict = json_data['data']['0463831125010981']['childs'][0]['options']
        for index, item_dict in enumerate(data_dict):
            star_item = StarItem()
            star_item['order'] = index                         # current rank (0-based)
            star_item['name'] = item_dict['text']              # name
            star_item['page_url'] = item_dict['pageUrl']       # personal page
            star_item['option_id'] = item_dict['optionId']     # id of the detailed profile page
            star_item['photo_url'] = [item_dict['picUrl']]     # photo (a list, as ImagesPipeline expects)
            # kv is a string holding a dict literal; literal_eval parses it
            # without executing arbitrary code, unlike eval.
            kv = ast.literal_eval(item_dict['kv'])
            star_item['discover_image'] = [kv['discover_image']]  # photo with transparent background

            url = 'https://m.iqiyi.com/m5/bubble/circleInfo_w{page_url}_p15.html'.format(page_url=star_item['option_id'])
            yield scrapy.Request(url, meta={'star_item': star_item}, callback=self.parse_circleInfo)

    def parse_circleInfo(self, response):
        star_item = response.meta['star_item']

        # The first span holds the fan count, the second the post count.
        infos = response.xpath('//div[@class="topic-info"]/span[@class="c-topic-info"]/text()').extract()
        star_item['num_fans'] = self.parse_count(infos[0].replace('粉丝：', ''))
        star_item['num_content'] = self.parse_count(infos[1].replace('内容：', ''))

        url = 'https://m.iqiyi.com/m5/bubble/star_s{}_p15.html'.format(star_item['option_id'])
        yield scrapy.Request(url, meta={'star_item': star_item}, callback=self.parse_star_page)

    def parse_star_page(self, response):
        star_item = response.meta['star_item']
        infos = response.xpath('//div[@class="m-starFile-Pinfo"]/div[@class="c-info-wape"]')

        # An XPath starting with '//' searches the whole document, not only
        # infos[0]; the slice trims the values to the number of labels.
        star_infors_tag = infos[0].xpath('//h2/text()').extract()
        star_infors = infos[0].xpath('//div[@class="info-right"]/text()').extract()[:len(star_infors_tag)]
        info_dict = dict(zip(star_infors_tag, star_infors))
        for key, value in info_dict.items():
            field = self.info_map.get(key)
            if field is None:
                continue  # label that info_map does not cover
            # A value containing a newline is treated as missing.
            star_item[field] = value if '\n' not in value else None
        yield star_item
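To make the last step of the spider easier to follow, here is a small standalone sketch of the label-to-field mapping: labels and values scraped from the profile page are zipped together, labels missing from the map are skipped, and values containing a newline are stored as None. The function name `map_profile`, the shortened `INFO_MAP`, and the sample values are all invented for illustration.

```python
# Shortened, illustrative copy of the spider's label -> field mapping.
INFO_MAP = {"出生地": "birth_place", "生日": "birth_day", "身高": "height"}


def map_profile(labels, values, info_map=INFO_MAP):
    """Pair page labels with values and rename them to item field names."""
    profile = {}
    for label, value in zip(labels, values):
        field = info_map.get(label.strip())
        if field is None:
            continue  # label the map does not know about
        # A value containing a newline is treated as missing.
        profile[field] = None if "\n" in value else value.strip()
    return profile


# Made-up labels and values, not scraped data.
labels = ["出生地", "生日", "身高", "血型"]
values = ["Example City", "\n  ", "168cm", "O"]
print(map_profile(labels, values))
```

Because `zip` stops at the shorter sequence, extra values beyond the last label are ignored, which is the same effect the slice in the spider aims for.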

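The storage pipeline mainly writes the collected information to a CSV file. No custom pipeline is needed for the images: scrapy ships scrapy.pipelines.images.ImagesPipeline, which saves images directly. The photo URL has to be wrapped in a list, and the URL field and the storage path are both configured in settings.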

# -*- coding: utf-8 -*-
import pandas as pd


class YoungniPipeline(object):
    def __init__(self):
        self.info_list = []

    def process_item(self, item, spider):
        # Store a plain dict so pandas can build the rows directly.
        self.info_list.append(dict(item))
        return item

    def close_spider(self, spider):
        # Write everything out once the crawl has finished.
        df = pd.DataFrame(self.info_list)
        df.to_csv('star_info.csv', encoding='utf-8', index=False)

settings configuration

BOT_NAME = 'youngni'

SPIDER_MODULES = ['youngni.spiders']
NEWSPIDER_MODULE = 'youngni.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'youngni.pipelines.YoungniPipeline': 300,
   'scrapy.pipelines.images.ImagesPipeline': 1
}
# photo_url is the image URL field defined in items.py
IMAGES_URLS_FIELD = 'photo_url'
# directory where the downloaded images are stored
IMAGES_STORE = './images'

Run the program and wait for the contestants' data to roll in.

import os
os.system("scrapy crawl star")

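II. Visualization and analysis

01. Data visualization

This part computes statistics over the data collected above and visualizes them, using pandas for the statistics and pyecharts for the charts. There is a lot of code, so it is not shown here.

pyecharts documentation: https://pyecharts.org
pyecharts gallery: https://gallery.pyecharts.org

Age: most contestants are between 21 and 25, so it seems we post-90s are still pretty young.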

image
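Since the analysis code is not shown, here is one way the age grouping could be computed. It is a dependency-free sketch using only the standard library rather than pandas, and it assumes the birthday field is an ISO `YYYY-MM-DD` string, which may not match the scraped format. The function names, the five-year bucket layout, and the sample birthdays are all made up.

```python
from collections import Counter
from datetime import date


def age_on(birth_day, today):
    """Full years between an ISO 'YYYY-MM-DD' birthday and `today`."""
    born = date.fromisoformat(birth_day)
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def age_bucket(age, width=5, start=16):
    """Label such as '21-25' for the width-year bucket containing `age`."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


# Made-up birthdays, not real contestant data.
birthdays = ["1995-08-12", "1998-01-30", "2001-11-05", "1996-04-02"]
today = date(2020, 4, 1)  # fixed reference date so the result is reproducible
ages = [age_on(b, today) for b in birthdays]
buckets = Counter(age_bucket(a) for a in ages)
print(ages, dict(buckets))
```

The resulting bucket counts are the kind of data a bar or pie chart would be drawn from.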

Region: Shandong has the most contestants, and most of the other well-represented provinces lie along the Yangtze River.

image

Height: the average is 167 cm, the tallest is 175 cm, and the shortest is 158 cm.

image
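The height summary can be reproduced along these lines. This is only a sketch with the standard library: it assumes the scraped height is text such as `167cm`, the helper name `parse_cm` is hypothetical, and the sample values are invented rather than taken from the crawl.

```python
import re
from statistics import mean


def parse_cm(text):
    """Extract a height in centimetres from text like '167cm'; None if absent."""
    if not text:
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s*cm", text, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None


# Made-up values, not real contestant data.
raw_heights = ["167cm", "175 CM", None, "158cm", "unknown"]
heights = [h for h in map(parse_cm, raw_heights) if h is not None]
summary = {
    "mean": round(mean(heights), 1),
    "max": max(heights),
    "min": min(heights),
}
print(summary)
```

Missing or unparseable entries are dropped before the statistics are computed, so they do not skew the average.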

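Weight: the most common weight is 48 kg. No wonder others get to eat hotpot while anyone over 48 kg is stuck with cucumbers.

Zodiac: Leo, Capricorn, Libra, and Aries are relatively common. I wonder whether debuting has anything to do with star signs?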

image

Hobby word cloud: let's see what hobbies the contestants have. Number one turns out to be watching films.

image
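A word cloud is driven by word frequencies, which could be counted roughly as follows. The sketch assumes each hobby field is a single string with hobbies separated by `、`, commas, or slashes. That separator set is a guess, and the function name and sample values are invented.

```python
import re
from collections import Counter


def hobby_counts(hobby_fields):
    """Count how often each hobby appears across all hobby fields."""
    counts = Counter()
    for field in hobby_fields:
        if not field:
            continue  # skip missing values
        for word in re.split(r"[、,，/]+", field):
            word = word.strip()
            if word:
                counts[word] += 1
    return counts


# Made-up values, not real contestant data.
hobbies = ["films、singing", "dancing, films", None, "singing/films"]
counts = hobby_counts(hobbies)
print(counts.most_common(2))
```

The (word, count) pairs from `most_common` are the usual input format for a word cloud chart.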

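Fan activity: ranked by the fan counts in the Paopao circles, 虞書欣 (Yu Shuxin) is in first place with a commanding lead.

02. Image operations

I ran Baidu PaddleHub's pyramidbox_face_detection face detection model on the group photo (a perfect test sample), and it still missed some faces. The lightweight Ultra-Light-Fast-Generic-Face-Detector-1MB model did even worse.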

image

Portrait to cartoon: done by calling Baidu's portrait cartoon API.

Cartoon girl: 費沁源 (Fei Qinyuan)

image

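People say the contestants look like half the entertainment industry. I used Baidu's face recognition to test the most famous lookalike pair, 乃萬 (Nai Wan) and 王源 (Wang Yuan), and it seems to be confirmed.

乃萬 vs. 王源

I thought the similarity would go up.

乃萬 vs. a female version of 王源

Let's try face fusion as well. The result looks almost completely natural.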

image

Finally, to debut you need the looks, so I again used Baidu's face recognition API for face detection, landmark localization, and attractiveness scoring.

image

Then, since I like stirring things up with AI, of course I had to rank the scores and see what the Baidu model considers good-looking.

image

The model clearly isn't good enough. I think every contestant deserves a score above 90.

The top three:

image

That wraps up all the tricks. Now go and vote for your favourite contestant!
