
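A while back a friend of mine was working on a stylistic analysis of authors: process an author's texts, then analyze the writing style and the small habits in the details. The obvious first step is to split every text into individual words and count them. Segmenting by hand would be a huge job. I remembered reading articles about word segmentation in search engines, so I went looking for Python segmentation tools and, sure enough, found a very handy library: jieba.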

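As usual, start with the official documentation, which covers installation in detail and gives a short introduction with demos. Below I use Mo Yan's "Red Sorghum" as an example to see how well jieba segments text. The idea is to break the text apart, store every word in a database, and then analyze the stored words.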

# -*- coding:UTF-8 -*-
import pymysql
# Import jieba's part-of-speech tagging module
from jieba import posseg


db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': 'root',
    'db': 'compword',
    'charset': 'utf8'
}
connection = pymysql.connect(**db_config)

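# Assumption: the text file is UTF-8 encoded; change `encoding` if yours differs.
with open(r'G:\testData\red.txt', 'r', encoding='utf-8') as file:
    for line in file:
        # Segment the stripped line and tag each word with its part of speech
        words = posseg.cut(line.strip())
        with connection.cursor() as cursor:
            sql = 'insert into words(word, flag) values(%s, %s)'
            for word in words:
                cursor.execute(sql, (word.word, word.flag))
        # Commit once per line of the text
        connection.commit()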

connection.close()

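The code above reads the text stored on drive G line by line, strips the leading and trailing whitespace, segments each line with part-of-speech tagging, and stores the results in the database. The key statement is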

words = posseg.cut(line.strip())
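I never showed the definition of the `words` table, so here is a minimal sketch of what it could look like. It assumes two plain text columns, and it uses the standard-library `sqlite3` with an in-memory database as a stand-in for MySQL so it runs without a server. The sample pairs are hand-written for illustration, not real jieba output. Note that `sqlite3` uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# Hypothetical (word, flag) pairs, hand-written for illustration only;
# in the real script they come from posseg.cut().
sample_pairs = [
    ("我", "r"), ("和", "c"), ("高粱", "n"),
    ("但是", "c"), ("跑", "v"), ("和", "c"),
]

# Assumed schema: the post does not show the actual table definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT NOT NULL, flag TEXT NOT NULL)")

# executemany inserts the whole batch in one call instead of one execute per word.
conn.executemany("INSERT INTO words (word, flag) VALUES (?, ?)", sample_pairs)
conn.commit()

# The same kind of aggregation the plotting scripts run against MySQL.
flag_counts = dict(conn.execute("SELECT flag, COUNT(*) FROM words GROUP BY flag"))
conn.close()
print(flag_counts)
```

pymysql cursors also offer `executemany`, so the same batching idea should carry over to the MySQL version if inserting word by word turns out to be slow.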

Then I plotted the results with plotly, which I introduced last time. The code and the resulting charts are below:

import pandas
import plotly as py
import plotly.graph_objs as go
import pymysql

# plotly.offline needs no account, so the credentials call for the online
# service (py.tools.set_credentials_file) is left out here.

db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': 'root',
    'db': 'compword',
    'charset': 'utf8'
}
connection = pymysql.connect(**db_config)

with connection.cursor() as cursor:
    sql = 'select flag,count(*) from words GROUP BY flag'
    cursor.execute(sql)
    rows = cursor.fetchall()

# Load the rows into a pandas DataFrame (a 2-D table), which plotly handles easily
df = pandas.DataFrame(list(rows), columns=['flag', 'count'])

trace1 = go.Bar(
    x=df['flag'],
    y=df['count']
)

data = [trace1]
# Save the chart as a local HTML file (offline mode)
py.offline.plot(data, filename='g:/test2.html')

The result, showing how many words of each part of speech are used:

[Figure: number of words used per part of speech]
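The x-axis of that chart shows jieba's short flag codes, which are not very readable on their own. Below is a small sketch that relabels `(flag, count)` rows with plain names. The mapping covers only a handful of common flags, the helper name `label_flag_counts` is my own, and the sample rows are made-up numbers rather than the real counts.

```python
from collections import Counter

# A few common jieba part-of-speech flags (not the full tag set).
FLAG_NAMES = {
    "n": "noun", "v": "verb", "a": "adjective", "d": "adverb",
    "c": "conjunction", "p": "preposition", "r": "pronoun", "m": "numeral",
}


def label_flag_counts(rows):
    """Sum (flag, count) rows under readable names; unmapped flags keep their code."""
    labeled = Counter()
    for flag, count in rows:
        labeled[FLAG_NAMES.get(flag, flag)] += count
    return labeled


# Hypothetical rows in the shape returned by cursor.fetchall(); numbers are made up.
rows = (("n", 120), ("v", 95), ("c", 12), ("x", 40))
labeled = label_flag_counts(rows)
print(labeled.most_common())
```

Flags missing from the mapping (sub-tags such as `nr`, or `x` here) pass through unchanged, so nothing is silently dropped from the chart.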

Usage counts for conjunctions

import pandas
import plotly as py
import plotly.graph_objs as go
import pymysql

# plotly.offline needs no account, so the credentials call for the online
# service (py.tools.set_credentials_file) is left out here.

db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': 'root',
    'db': 'compword',
    'charset': 'utf8'
}
connection = pymysql.connect(**db_config)

with connection.cursor() as cursor:
    # 'c' is the flag jieba assigns to conjunctions
    sql = "select word,count(*) from words where flag = 'c' GROUP BY word"
    cursor.execute(sql)
    rows = cursor.fetchall()

# Load the rows into a pandas DataFrame (a 2-D table), which plotly handles easily
df = pandas.DataFrame(list(rows), columns=['word', 'count'])

trace1 = go.Scatter(
    x=df['word'],
    y=df['count'],
    mode='markers'
)

data = [trace1]
# Save the chart as a local HTML file (offline mode)
py.offline.plot(data, filename='g:/test.html')

Result:

[Figure: conjunction usage counts]
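Raw counts like these depend on how long the text is, so if the goal is to compare writing styles across texts it probably makes sense to normalize them first. Here is a rough sketch that converts counts into occurrences per 1,000 words. The function name `per_thousand` and all the numbers are my own illustration, not results from the novel.

```python
def per_thousand(word_counts, total_words):
    """Convert raw counts into occurrences per 1,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return {word: 1000 * count / total_words for word, count in word_counts.items()}


# Hypothetical conjunction counts and total word count, for illustration only.
conjunction_counts = {"和": 30, "但是": 12, "因为": 6}
rates = per_thousand(conjunction_counts, total_words=6000)

# Rank from most to least frequent so the heavy hitters show up first.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for word, rate in ranked:
    print(word, rate)
```

In the real pipeline the total would come from something like `select count(*) from words`, and the per-word counts from the conjunction query above.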

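Well, that's about it. I don't know much about Chinese information processing and analysis and am only relaying what my friend told me, so please forgive anything that is off. If you spot mistakes in the post, corrections are welcome O(∩_∩)O.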


jieba: A Handy Python Component for Chinese Word Segmentation