Text Classification Mining and Prediction

A note up front: the content here is fairly simple (⊙o⊙) and is for reference only.

Text prediction data (the full data set is large, so here only the 10,000 records of the test data set are used)

Data set download link:

https://pan.baidu.com/share/init?surl=XIZwRlG4-yynR9fSEAdRiA
Password: kxxa

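First, load the data set to be used and run word segmentation, keyword extraction, the union of all keyword sets, and the conversion to feature vectors. The keyword list and the feature vectors are then saved to text files: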

import jieba
import jieba.analyse

# Path on the author's machine; change it to where cnews.test.txt is stored
f_train = 'C:/Users/Administrator/PycharmProjects/new/練習/第六月/other/cnews/cnews.test.txt'
list_x = []  # article texts, replaced below by their keyword lists
list_y = []  # category labels

with open(f_train, 'r', encoding='utf-8') as file_train:
    for i in file_train:
        line_list = i.split('\t')
        list_x.append(line_list[1])
        list_y.append(line_list[0])

# Segment each article and keep its top 10 keywords
for count, article in enumerate(list_x):
    segment_i = jieba.analyse.extract_tags(article, topK=10, withWeight=False, allowPOS=())
    list_x[count] = segment_i
print(list_x[0])

# Union of the keywords of all articles
set_union = set()  # the original used {}, which creates a dict, not a set
for count, i in enumerate(list_x, 1):
    print(count)
    set_union = set_union | set(i)
print(len(set_union))

# Turn the union into a list; its order defines the positions of the feature vector
list_set_union = list(set_union)
print(list_set_union)
with open('特征集合變換列表00.txt', 'w') as filelist:
    filelist.write(str(list_set_union))

# On later runs the saved list can be read back instead of being rebuilt:
# with open('特征集合變換列表.txt','r') as f:
#     list_set_union=f.read()
#     list_set_union=eval(list_set_union)

print('*' * 100)
# Convert every keyword list to a 0/1 vector and write one vector per line
with open('all00.txt', 'w') as file:
    for count, x in enumerate(list_x, 1):
        print('count:', count)
        list_one = [0] * len(list_set_union)
        for i in x:
            for k, v in enumerate(list_set_union):
                if v == i:
                    list_one[k] = 1
                    break
        file.write(str(list_one) + '\n')
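
The conversion step above scans the whole feature list once for every keyword. As a rough sketch only, the same 0/1 encoding can be done with a dictionary lookup. The helper names `build_index` and `to_vector` and the sample keywords are hypothetical and not part of the original scripts.

```python
def build_index(feature_list):
    """Map each keyword to its position in the feature list."""
    return {word: pos for pos, word in enumerate(feature_list)}


def to_vector(keywords, index):
    """Return a 0/1 list with a 1 at the position of every known keyword."""
    vector = [0] * len(index)
    for word in keywords:
        pos = index.get(word)  # None for keywords outside the feature list
        if pos is not None:
            vector[pos] = 1
    return vector


# Hypothetical sample data standing in for list_set_union and one article
features = ['football', 'stock', 'movie', 'game']
index = build_index(features)
print(to_vector(['stock', 'game', 'unknown'], index))  # [0, 1, 0, 1]
```

Building the dictionary once and reusing it for every article avoids the inner loop, which may matter when the keyword union has many thousands of entries.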

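Then a second Python file reads the files saved above. If everything were in one file, the features would be regenerated on every run, which is slow, so I split the work into two files to make it easier to work with.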
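
The scripts in this post save the lists with `str()` and read them back with `eval()`. One possible alternative, shown here only as a sketch, is to store them as JSON, which avoids evaluating file contents as code. The function names, the file name `features_demo.json`, and the sample data are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_features(path, feature_list, vectors):
    """Write the feature list and the 0/1 vectors to one JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'features': feature_list, 'vectors': vectors}, f, ensure_ascii=False)


def load_features(path):
    """Read back what save_features wrote."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data['features'], data['vectors']


# Hypothetical sample data; the real lists come from the first script
path = os.path.join(tempfile.gettempdir(), 'features_demo.json')
save_features(path, ['股票', '足球'], [[1, 0], [0, 1]])
features, vectors = load_features(path)
print(features, vectors)
```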


import ast
from sklearn.linear_model import LogisticRegression
import jieba
import jieba.analyse

list_all = []
list_y = []
# Path on the author's machine; change it to where cnews.test.txt is stored
f_train = 'C:/Users/Administrator/PycharmProjects/new/練習/第六月/other/cnews/cnews.test.txt'
with open(f_train, 'r', encoding='utf-8') as file_train:
    for i in file_train:
        line_list = i.split('\t')
        list_y.append(line_list[0])

# Number the categories; sorted() keeps the numbering identical on every run
list_set_y = sorted(set(list_y))
print(list_set_y)
dict_set_y = {}  # index -> category name
for k, v in enumerate(list_set_y):
    dict_set_y[k] = v
# Replace every label with its index
for i, j in enumerate(list_y):
    for k, v in enumerate(list_set_y):
        if j == v:
            list_y[i] = k
            break
print(list_y)

# Each line is the string form of a list; convert it back to a list.
# Note: the first script wrote 'all00.txt' and '特征集合變換列表00.txt';
# make sure the file names used here match the files you actually saved.
with open('all0.txt', 'r') as f:
    for i in f:
        list_all.append(ast.literal_eval(i.strip()))
print(len(list_all))

lr_model = LogisticRegression()
lr_model.fit(list_all, list_y)
with open('特征集合變換列表0.txt', 'r') as f:
    list_set_union = ast.literal_eval(f.read())

while True:
    cheshi = input('Test: ')
    segment_i = jieba.analyse.extract_tags(cheshi, topK=10, withWeight=False, allowPOS=())

    # Encode the input with the same feature list used for training
    list_one = [0] * len(list_set_union)
    for x in segment_i:
        for k, v in enumerate(list_set_union):
            if v == x:
                list_one[k] = 1
                break
    s = lr_model.predict([list_one])
    print(dict_set_y[s[0]])
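
The label handling in the script above (number the categories, then map a predicted number back to its name) can be sketched with two dictionaries. The helper name `encode_labels` and the sample labels are hypothetical. This is only one way to write it, not the method the original script uses.

```python
def encode_labels(labels):
    """Return the labels as integers plus an index -> label dictionary."""
    classes = sorted(set(labels))  # sorted for a stable numbering
    label_to_index = {label: k for k, label in enumerate(classes)}
    index_to_label = {k: label for label, k in label_to_index.items()}
    encoded = [label_to_index[label] for label in labels]
    return encoded, index_to_label


# Hypothetical sample labels standing in for list_y
labels = ['sports', 'finance', 'sports', 'tech']
encoded, index_to_label = encode_labels(labels)
print(encoded)                     # [1, 0, 1, 2]
print(index_to_label[encoded[0]])  # sports
```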

At this point simple prediction basically works. Next comes a simple front-end/back-end interaction built with Django, with a brief description of how the project is set up.

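Note: if you created a static directory, register it in settings, usually at the end of the file. If you did not create it, this is not needed. Nothing else has to be changed for now.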

STATICFILES_DIRS = [
    os.path.join(BASE_DIR, 'static'),
]

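# URL configuration (django.conf.urls.url was removed in Django 4.0; use django.urls.re_path there)
from django.conf.urls import url
from . import views
urlpatterns = [
    url(r'^$',views.index),
    url(r'^serach/$',views.serach),
]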

import ast
import time
import jieba
import jieba.analyse
from django.shortcuts import render
from sklearn.linear_model import LogisticRegression

# Create your views here.
def index(request):
    return render(request, 'index.html')

def serach(request):
    cheshi = request.POST.get('cheshi')
    if not cheshi:
        # Nothing was submitted (for example a GET request): show the empty form
        return render(request, 'index.html')
    mysession = request.session.get('mysession0', '')
    list_all = []
    list_y = []
    start = time.time()

    # Read the category label of every record (path on the author's machine)
    f_train = 'C:/Users/Administrator/PycharmProjects/new/練習/第六月/other/cnews/cnews.test.txt'
    with open(f_train, 'r', encoding='utf-8') as file_train:
        for i in file_train:
            line_list = i.split('\t')
            list_y.append(line_list[0])
    # Number the categories; sorted() keeps the numbering identical on every request
    list_set_y = sorted(set(list_y))
    print(list_set_y)
    dict_set_y = {}  # index -> category name
    for k, v in enumerate(list_set_y):
        dict_set_y[k] = v
    for i, j in enumerate(list_y):
        for k, v in enumerate(list_set_y):
            if j == v:
                list_y[i] = k
                break

    if mysession == '':
        # First request of this session: each line is the string form of a list, convert it back
        with open('C:/Users/Administrator/PycharmProjects/new/練習/第六月/other/all0.txt', 'r') as f:
            for i in f:
                list_all.append(ast.literal_eval(i.strip()))

        with open('C:/Users/Administrator/PycharmProjects/new/練習/第六月/other/特征集合變換列表0.txt', 'r') as f:
            list_set_union = ast.literal_eval(f.read())
        request.session['mysession0'] = list_all
        request.session['mysession1'] = list_set_union
    list_all = request.session['mysession0']
    list_set_union = request.session['mysession1']

    # Train the model (this happens on every request) and time it
    s0 = time.time()
    lr_model = LogisticRegression()
    lr_model.fit(list_all, list_y)
    s1 = time.time()
    print('fit time:', s1 - s0)

    # Encode the submitted text with the same feature list used for training
    segment_i = jieba.analyse.extract_tags(cheshi, topK=10, withWeight=False, allowPOS=())
    list_one = [0] * len(list_set_union)
    for x in segment_i:
        for k, v in enumerate(list_set_union):
            if v == x:
                list_one[k] = 1
                break
    s = lr_model.predict([list_one])
    answer = dict_set_y[s[0]]
    print(answer)
    end = time.time()
    print(end - start)
    ctx = {
        'content': answer
    }
    return render(request, 'index.html', ctx)
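
The view above keeps the training vectors in the session and fits the model again on every request. As a hedged sketch of one alternative, the expensive step can be done once per process and reused. Everything below is a framework-free stand-in: `get_model`, `handle_request`, and the placeholder model are hypothetical names, and the real loading and `fit` call would go where the comment indicates.

```python
from functools import lru_cache

load_count = 0  # only here to show how often the expensive step runs


@lru_cache(maxsize=1)
def get_model():
    """Stand-in for reading the saved files and fitting the classifier."""
    global load_count
    load_count += 1
    # In the real view, the vectors would be loaded and lr_model.fit(...) called here
    return {'name': 'placeholder model'}


def handle_request(text):
    """Hypothetical request handler that reuses the cached model."""
    model = get_model()
    return model['name'], len(text)


handle_request('first request')
handle_request('second request')
print(load_count)  # 1: the model was built only once
```

With this pattern only the first request pays the loading cost; later requests in the same process reuse the cached object.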

Finally, write a simple front-end page in the templates folder.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <script src="/static/js/jquery-1.12.4.min.js"></script>

</head>
<body>
{#<img src="/static/img/1.jpg">#}
<div style="text-align: center;margin-top: 100px">
<form action="/serach/" method="post" >
    {% csrf_token %}
    <textarea cols="50" rows="10" name="cheshi" id="tt"></textarea><br/>
    <input type="submit" id="submit"><br>
    <input type="text" value="{{ content }}" name="over">

</form>
    <script>
        $("#submit").click(function () {
            if($("#tt").val()==''){
                alert('The input cannot be empty')
                return false
            }

        })

    </script>
</div>

</body>
</html>

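At this point the front-end/back-end interaction basically works. Before testing, run the migrations first (python manage.py migrate), otherwise the session cannot be stored.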

Then run the project (python manage.py runserver).

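Once it is running, open 127.0.0.1:8000 in the browser. If nothing goes wrong, you should see the page with the input form. For other problems, search online (Baidu, for example). Usually a package is missing, and installing it fixes the problem. Or emmm... (n words omitted here, use your imagination (⊙o⊙))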

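That basically completes the whole process. If you have questions, leave a comment so we can discuss them. This content is the author's original work and is for reference only; any resemblance to other work is purely coincidental.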
