Preprocessing 800 GB of Sogou data

Project location: node2:/home/disk1/xukaituo/expriments/ngram-2016-11/

Step 1. Convert the encoding

iconv -f gbk//IGNORE -t utf-8//IGNORE filename > new_format_file
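The same conversion can be sketched in Python for cases where `iconv` is not available. This is a minimal sketch, not the author's method: the function name `gbk_to_utf8` is hypothetical, and `errors='ignore'` is assumed to play the role of `//IGNORE` by silently dropping bytes that cannot be decoded as GBK.

```python
def gbk_to_utf8(src_path, dst_path):
    """Re-encode a GBK text file as UTF-8, dropping undecodable bytes."""
    # errors='ignore' skips byte sequences that are not valid GBK
    # (assumed here to mirror what //IGNORE does in the iconv call).
    # newline='' leaves the original line endings untouched.
    with open(src_path, 'r', encoding='gbk', errors='ignore', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)
```

Reading line by line keeps memory use flat, which matters for input of this size.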

Step 2. Remove non-Chinese characters

#!/usr/bin/env python3
# coding: utf-8
"""Strip every character outside the range U+4E00..U+9FA5 from each line."""

import re
import sys

# One or more consecutive characters that are not common CJK ideographs.
RE_NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def remove_non_Chinese_word(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as inputf, \
         open(output_file, 'w', encoding='utf-8') as outputf:
        for line in inputf:
            # The trailing newline is non-Chinese too, so it is removed here
            # and written back explicitly below.
            new_line = RE_NON_CHINESE.sub("", line)
            outputf.write(new_line + '\n')


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: python 0-filter_non_chinese.py input-file output-file")
        sys.exit(1)
    remove_non_Chinese_word(sys.argv[1], sys.argv[2])
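
To see what the character class keeps and drops, the filter can be applied to a single string. A small sketch; `keep_chinese` is a hypothetical helper, not part of the script above. The range U+4E00..U+9FA5 covers common CJK ideographs only, so Chinese punctuation, digits, and Latin letters are all removed.

```python
import re

# Same character class as in the filter script.
RE_NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Return text with everything outside U+4E00..U+9FA5 deleted."""
    return RE_NON_CHINESE.sub("", text)


print(keep_chinese("Sogou搜狗，2016年的数据！"))  # prints 搜狗年的数据
```

A line with no Chinese characters at all becomes empty, which is why blank lines are deleted in Step 3.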

Step 3. Delete blank lines

sed -i '/^$/d' filename
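
`/^$/d` deletes only lines that are completely empty; a line holding just spaces would survive, although after Step 2 no such lines should remain. A rough Python equivalent follows, with the hypothetical name `drop_empty_lines`. Unlike `sed -i`, it writes to a new file instead of editing in place.

```python
def drop_empty_lines(src_path, dst_path):
    """Copy src to dst, skipping lines that contain nothing but a newline."""
    # newline='\n' disables newline translation, so only '\n' ends a line,
    # which is how sed sees the file on Linux.
    with open(src_path, 'r', encoding='utf-8', newline='\n') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='\n') as dst:
        for line in src:
            if line != '\n':
                dst.write(line)
```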

Step 4. Word segmentation

The LTP word segmentation tool is used:
[1] GitHub: https://github.com/HIT-SCIR/ltp
[2] Documentation: http://ltp.readthedocs.io/zh_CN/latest/api.html#id2
[3] Model: https://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569
Part of the bash script:

cd /home/disk1/xukaituo/expriments/ngram-2016-11/utils
CWSTOOL=/home/disk1/xukaituo/projects/Chinese-word-segmentation
# $2 is passed to cws as the input file and $3 as the output file.
1-Chinese-word-segmentor/cws "${CWSTOOL}/ltp_data/cws.model" "$2" "$3"
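
When many files have to be segmented, the call above can be driven from Python. The sketch below assumes only the argument order used by `cws` (model, input, output); the helper names `build_cws_command` and `segment_file` are hypothetical, and all paths are supplied by the caller. Because `cws` opens its output file in append mode, the sketch removes a stale output file first so that a rerun does not duplicate lines.

```python
import os
import subprocess


def build_cws_command(cws_bin, model_path, input_path, output_path):
    """Return the argv list for: cws [model path] [input file path] [output file path]."""
    return [cws_bin, model_path, input_path, output_path]


def segment_file(cws_bin, model_path, input_path, output_path):
    """Run cws on one file, starting from an empty output file."""
    if os.path.exists(output_path):
        os.remove(output_path)
    # check=True raises CalledProcessError if cws exits with a non-zero status.
    subprocess.run(
        build_cws_command(cws_bin, model_path, input_path, output_path),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.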

The segmentation program that calls the LTP API:

// cws.cc

// Copyright 2016 ASLP(Author: Kaituo Xu)

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "segment_dll.h"

int main(int argc, char *argv[])
{
    try {
        if (argc < 4) {
            std::cerr << "cws [model path] [input file path] [output file path]" << std::endl;
            return 1;
        }

        void *engine = segmentor_create_segmentor(argv[1]);
        std::ifstream input(argv[2]);
        std::ofstream output(argv[3], std::ofstream::app);

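        if (!engine || !input || !output) {
            std::cerr << "failed to load the model or open the input/output file" << std::endl;
            // Do not leak the engine when only a file failed to open.
            if (engine) segmentor_release_segmentor(engine);
            return -1;
        }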

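        std::string line;
        // Segment one input line at a time and write the words space-separated.
        while (std::getline(input, line)) {
            std::vector<std::string> words;
            int len = segmentor_segment(engine, line, words);
            for (int i = 0; i < len; ++i) {
                output << words[i] << " ";
            }
            // '\n' avoids the per-line flush that std::endl forces.
            output << '\n';
        }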

        segmentor_release_segmentor(engine);
        return 0;

    } catch(const std::exception &e) {
        std::cerr << e.what();
        return -1;
    }
}
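
The output loop writes each word followed by a single space, with one output line per input line. Below is a small Python sketch of that format, which may help when checking or post-processing the segmented files. `format_segmented_line` and `parse_segmented_line` are hypothetical helpers, and the word list is a made-up example, not real LTP output.

```python
def format_segmented_line(words):
    """Mimic the cws output loop: every word is followed by a space, then a newline."""
    return "".join(word + " " for word in words) + "\n"


def parse_segmented_line(line):
    """Recover the word list; split() with no argument ignores the trailing space and newline."""
    return line.split()


line = format_segmented_line(["搜狗", "数据", "预处理"])
print(repr(line))
print(parse_segmented_line(line))
```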

Step 5. Compress data that is not needed for now to save disk space

# Compress a file with `gzip`
gzip <filename>
# Decompress it
gzip -d <filename>.gz

After compression the original file is gone and, by default, .gz is appended to <filename>; after decompression the .gz file is gone.
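
The same behaviour can be reproduced with Python's standard `gzip` module, for example inside a larger preprocessing script. A minimal sketch; `gzip_file` and `gunzip_file` are hypothetical names, and unlike the `gzip` command they do not preserve file permissions or timestamps.

```python
import gzip
import os
import shutil


def gzip_file(path):
    """Compress path to path + '.gz' and remove the original, like `gzip <filename>`."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    os.remove(path)
    return gz_path


def gunzip_file(gz_path):
    """Decompress a .gz file and remove it, like `gzip -d <filename>.gz`."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")
    path = gz_path[:-3]
    with gzip.open(gz_path, "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return path
```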
