Kaggle Data Challenge, Day 4


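Abstract: In the five days before Easter 2018, Kaggle ran five data-preprocessing challenges. This series reviews the key techniques I learned each day. This post covers day 4, which is about character encodings for text data.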


Introduction

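Python 3 strings are Unicode, and UTF-8 is the default encoding that tools such as pandas assume when reading text, so importing data stored in another encoding can fail. Today's goal is to understand how encoding and decoding work, and to learn how to load data in other encodings into Python and save it as UTF-8.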

Setting up the environment

We need chardet, a character encoding detector.

# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

Understanding encoding and decoding

Text data generally comes in two types. One is the string (str):

# start with a string
before = "This is the euro symbol: €"

The other is bytes:

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")

Once encoded as UTF-8, the string becomes a bytes object, which is printed with a leading "b":
b'This is the euro symbol: \xe2\x82\xac'

Decoding with UTF-8 restores the original string:

# decode the UTF-8 bytes back to a string
print(after.decode("utf-8"))

This is the euro symbol: €
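The same string can turn into different bytes depending on the encoding. The sketch below compares UTF-8 with Windows-1252; the choice of cp1252 for comparison is mine, not part of the original exercise.

```python
# The same text encoded two ways: the euro sign takes three bytes in
# UTF-8 but a single byte in Windows-1252 (cp1252).
before = "This is the euro symbol: €"

as_utf8 = before.encode("utf-8")
as_cp1252 = before.encode("cp1252")

print(len(before), len(as_utf8), len(as_cp1252))
print(as_utf8[-3:], as_cp1252[-1:])
```

This is why the bytes alone are not enough: to get the text back, you also need to know which encoding produced them.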

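One thing to note: bytes encoded as UTF-8 generally cannot be decoded with a different codec. Decoding them as ASCII, for example, raises an error as soon as a non-ASCII byte appears.

ASCII is one of the earliest encodings and covers only English letters, digits, and a limited set of symbols.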
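A minimal sketch of what goes wrong when UTF-8 bytes meet the wrong decoder. The cp1252 case is an extra illustration I added: some codecs do not raise at all and silently produce garbled text instead.

```python
# UTF-8 bytes for a string that ends with a non-ASCII character
after = "This is the euro symbol: €".encode("utf-8")

# Decoding as ASCII fails on the first byte outside the 0-127 range
try:
    after.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)

# Decoding as Windows-1252 "succeeds", but the euro sign comes out as
# three unrelated characters (mojibake)
mojibake = after.decode("cp1252")
print(mojibake)
```

A silent mis-decode like the second case is arguably worse than an exception, because nothing tells you the text is wrong.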

# Your turn! Try encoding and decoding different symbols to ASCII and
# see what happens. I'd recommend $, #, 你好 and नमस्ते but feel free to
# try other characters. What happens? When would this cause problems?

mytext = "£,#,nihao,你好"
encode_utf = mytext.encode("utf-8", errors = "replace")
encode_ascii = mytext.encode("ascii", errors = "replace")
print(encode_utf)
print(encode_ascii)
print(encode_utf.decode("utf-8"))
print(encode_ascii.decode("ascii"))

The four printed results are:
b'\xc2\xa3,#,nihao,\xe4\xbd\xa0\xe5\xa5\xbd'
b'?,#,nihao,??'
£,#,nihao,你好
?,#,nihao,??

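As this shows, once text has been encoded with a codec that cannot represent it (here ASCII with errors = "replace"), the original information is lost and decoding cannot bring it back.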
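One way to check in advance whether an encoding can hold a piece of text is a strict round trip. This is only a sketch: survives_round_trip is a hypothetical helper name, and the list of encodings is arbitrary.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded without loss."""
    try:
        # No errors= argument, so unencodable characters raise
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False


mytext = "£,#,nihao,你好"

results = {
    enc: survives_round_trip(mytext, enc)
    for enc in ("utf-8", "utf-16", "ascii", "cp1252")
}
print(results)

# With errors="replace" nothing is raised, but the result differs
lossy = mytext.encode("ascii", errors="replace").decode("ascii")
print(lossy == mytext)
```

Leaving errors at its strict default while exploring makes data loss visible as an exception instead of a row of question marks.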

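Reading data in different encodings

Back to the problem from the introduction. If the file is not in the expected encoding, pd.read_csv raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
So we need to find out which encoding the original data uses and tell Python to read it that way.
There are far too many encodings to try one by one. Fortunately the chardet library can guess for us. As with yesterday's automatic date-format detection, the guess is not guaranteed to be 100% correct, but it saves a lot of effort. We start by checking only the first 10,000 bytes, for two reasons: (1) scanning the whole file is slow, and (2) the error message reports the problem at position 11, which 10,000 bytes easily cover.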
P.S. A note on the modes accepted by open():

open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it’s omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Note that this passage describes Python 2. In Python 3, text mode decodes the file into str on every platform and binary mode returns raw bytes, which is the input chardet.detect expects.
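To make the difference between the modes concrete in Python 3, here is a small sketch using a temporary file; the file name and contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Write known UTF-8 bytes to disk
    with open(path, "wb") as f:
        f.write("price: £5\n".encode("utf-8"))

    # 'rb' gives back the raw bytes, untouched
    with open(path, "rb") as f:
        raw = f.read()

    # 'r' decodes the bytes into a str using the given encoding
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(type(raw).__name__, raw)
print(type(text).__name__, text)
```

Encoding detection has to work on the raw bytes, because a text-mode read has already committed to an encoding.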

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

Result: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
The detected encoding is Windows-1252, with a confidence of 73%.
So we tell pandas to read the file with that encoding, then save it right away as UTF-8 (the default).

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# save our file (will be saved as UTF-8 by default!)
# the output name now matches the 201612 file that was read above
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")
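The same read-then-resave step can be sketched without pandas for any text file. Here convert_to_utf8 and the file names are hypothetical, and the source encoding is assumed to be known already.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode a text file from source_encoding to UTF-8."""
    # newline="" keeps line endings exactly as they are in the source
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy.csv"
    dst = Path(tmp) / "legacy-utf8.csv"

    # Create a small Windows-1252 file to convert
    src.write_bytes("name,price\nGadget™,£5\n".encode("cp1252"))

    convert_to_utf8(src, dst, "Windows-1252")
    converted = dst.read_bytes()
    round_tripped = converted.decode("utf-8")

print(converted)
print(round_tripped)
```

This reads the whole file into memory, which is fine for a sketch; a very large file would need to be processed in chunks.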

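One last caveat: sometimes chardet reaches a verdict (ascii, for example) from the first 10,000 bytes alone, yet reading the file with that encoding still fails. In that case, give it more bytes; the detection result may change.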
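That retry loop could look like the sketch below. To keep it runnable without the Kaggle data, toy_detect is a stand-in I wrote that only mimics the shape of chardet.detect's return value; detect_with_growing_sample is likewise a hypothetical helper, and in practice you would pass chardet.detect as the detect argument.

```python
def toy_detect(sample: bytes) -> dict:
    """Stand-in for chardet.detect: guesses 'ascii' or 'Windows-1252'."""
    try:
        sample.decode("ascii")
        return {"encoding": "ascii"}
    except UnicodeDecodeError:
        return {"encoding": "Windows-1252"}


def detect_with_growing_sample(raw: bytes, detect, start: int = 10_000):
    """Double the sample size until the guess decodes the whole input."""
    size = start
    while True:
        sample = raw[:size]
        guess = detect(sample)["encoding"]
        if guess is not None:
            try:
                raw.decode(guess)  # check the guess against all the data
                return guess, len(sample)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong guess or unknown codec: try a bigger sample
        if size >= len(raw):
            return None, len(sample)  # whole input examined, no luck
        size *= 2


# 25,000 ASCII bytes followed by a few Windows-1252 characters, so the
# first 10,000 bytes look like plain ASCII
data = b"a" * 25_000 + "café™".encode("cp1252")

encoding, sample_size = detect_with_growing_sample(data, toy_detect)
print(encoding, sample_size)
```

Decoding without an exception does not prove the guess is right, since single-byte encodings accept almost any input, so it is still worth looking at the decoded text.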

