from snownlp import SnowNLP
import pandas as pd
import numpy as np
traindata=pd.read_csv('/Users/xuyizhou/Desktop/trainData.csv')
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 8: inva
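Opening the file in Sublime Text showed garbled characters. After fixing the encoding the text was no longer garbled, but a different error appeared: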
ParserError: Error tokenizing data. C error: Expected 5 fields in line 17077, saw 7
The cause is stray \r characters in the data.
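Both errors above can be narrowed down before calling pandas again. The following is a minimal sketch using only the standard library: it tries a short list of candidate encodings and then reports lines whose field count differs from the expected one. The candidate list (utf-8, then gb18030) is an assumption about how the file was exported, the helper names detect_encoding and find_bad_rows are hypothetical, and the line-by-line check does not handle quoted fields that span several lines.

```python
import csv

def detect_encoding(raw, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

def find_bad_rows(text, expected_fields):
    """Return (line number, field count) for lines with an unexpected width."""
    bad = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue  # skip the empty piece after the final newline
        # Drop stray carriage returns before parsing the single line.
        fields = next(csv.reader([line.replace("\r", "")]))
        if len(fields) != expected_fields:
            bad.append((lineno, len(fields)))
    return bad

# Small made-up sample standing in for the real file.
sample_bytes = "id,content\n1,熱水器\n".encode("gb18030")
sample_text = "a,b,c\r\n1,x,y\r\n2,p,q,extra,z\r\n"

print(detect_encoding(sample_bytes))
print(find_bad_rows(sample_text, 3))
```

If an encoding is found, it can be passed to pd.read_csv through its encoding parameter, and the reported line numbers show which rows to inspect by hand.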
Try another way: read the Excel version of the file.
df=pd.read_xlsx('/Users/xuyizhou/Desktop/trainData.xlsx')
Wrong: pandas has no read_xlsx function. Use read_excel instead:
df=pd.read_excel('/Users/xuyizhou/Desktop/trainData.xlsx')
Basic operations
df.head()
df.head(1)
df.dtypes
df.index
df.describe()
df.iloc[3:5,1:4]
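To make the slicing above concrete, here is a small sketch on a made-up DataFrame. The column names and values are hypothetical and only stand in for the real training data. It shows that iloc selects by position with the end index excluded, and that describe must be called with parentheses to produce a summary.

```python
import pandas as pd

# Hypothetical stand-in for the training data (five columns, six rows).
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5, 6],
    "content": ["a", "b", "c", "d", "e", "f"],
    "theme": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "sentiment_word": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "score": [1, -1, 0, 1, -1, 0],
})

# Rows at positions 3 and 4, columns at positions 1, 2 and 3.
subset = df.iloc[3:5, 1:4]
print(subset)

# describe() returns summary statistics for the numeric columns;
# df.describe without parentheses is only the method object.
summary = df.describe()
print(summary)
```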
NLP
object -> string
e.g.
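import json
data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
# Serialize the Python object to a JSON string
data_string = json.dumps(data)
print('ENCODED:', data_string)
# Parse the string back into Python objects
decoded = json.loads(data_string)
print('DECODED:', decoded)
# The tuple comes back as a list after the round trip
print('ORIGINAL:', type(data[0]['b']))
print('DECODED :', type(decoded[0]['b']))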
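For the text column itself, the object -> string step may also need to cope with empty cells, which pandas typically returns as float NaN. The sketch below is one possible way to normalize such values before segmentation. The helper name to_text is hypothetical, and mapping missing values to an empty string is an assumption, not something the notes specify.

```python
import math

def to_text(value):
    """Convert a cell value to str, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

# Made-up cell values: a review, a missing cell, a number, and None.
cells = ["熱水器加熱時間太長", float("nan"), 5, None]
texts = [to_text(c) for c in cells]
print(texts)
```

With a DataFrame this could be applied through something like df['content'].map(to_text), where the column name is again an assumption. A plain astype(str) would turn missing cells into the literal text "nan".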
Take content[1] as an example:
s.words
Out[68]:
['熱水器',
'加',
'熱',
'時間',
'太',
'長',
',',
'安裝',
'費',
'太',
'貴',
',',
'預留',
'太陽能',
'口',
'擺設',
',',
'根本',
'用',
'不',
'到',
',',
'沒有',
'水位',
'指示器',
',',
'加',
'滿',
'熱水',
'的',
'指示',
'燈',
'放在',
'了',
'最',
'側面',
',',
'不',
'方便',
'用戶',
'看',
'指示',
'燈',
',',
'必須',
'斜',
'著',
'看',
'才',
'能',
'看到',
',']
The training data uses these fields:
theme (topic): 加熱時間;安裝費;用戶;
sentiment_word (sentiment keywords): 太長;太貴;不方便;
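The two fields above are semicolon-delimited strings with a trailing semicolon. A minimal sketch of splitting them and pairing each theme with a sentiment keyword follows. Pairing by position is an assumption based on this one example row, and split_labels is a hypothetical helper name.

```python
def split_labels(field):
    """Split a ';'-delimited label string, dropping empty pieces."""
    return [part for part in field.split(";") if part]

# The example row from the notes.
row = {
    "theme": "加熱時間;安裝費;用戶;",
    "sentiment_word": "太長;太貴;不方便;",
}

themes = split_labels(row["theme"])
sentiments = split_labels(row["sentiment_word"])

# Assumption: the i-th theme belongs with the i-th sentiment keyword.
pairs = list(zip(themes, sentiments))
print(pairs)
```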
Use a loop.
The words were split successfully.
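One thing the segmentation output shows is that a label such as 加熱時間 is spread over several tokens ('加', '熱', '時間'). The loop below is a rough sketch of how a label could be located as a run of consecutive tokens. The function find_span is hypothetical and is not part of SnowNLP, and the token list is just the first part of the output shown earlier.

```python
def find_span(tokens, label):
    """Return (start, end) so that ''.join(tokens[start:end]) == label, else None."""
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined == label:
                return (start, end + 1)
            if not label.startswith(joined):
                break  # this start position cannot match
    return None

# First tokens of the segmentation result shown earlier.
tokens = ['熱水器', '加', '熱', '時間', '太', '長', ',', '安裝', '費', '太', '貴']

for label in ['加熱時間', '太長', '安裝費', '不方便']:
    print(label, find_span(tokens, label))
```

A label that does not appear in the token list, like 不方便 in this shortened sample, yields None.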
...to be continued (1102)