
Abstract: In the five days leading up to Easter 2018, Kaggle ran a five-day data cleaning challenge. This series reviews the key techniques learned each day. This post covers day four: character encodings.
Introduction
Python's default string encoding is UTF-8, so importing data stored in another encoding can raise errors. Today's topic is understanding how encoding and decoding work, and how to load data in other encodings into Python and save it as the default UTF-8.
Setting up the environment
We need chardet, a character encoding detector.
# modules we'll use
import pandas as pd
import numpy as np
# helpful character encoding module
import chardet
# set seed for reproducibility
np.random.seed(0)
Understanding encoding and decoding
Text data generally comes in two forms. One is str, a string:
# start with a string
before = "This is the euro symbol: €"
The other is bytes:
# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")
編碼成utf-8以后的string變成了bytes格式,以「b」開頭。
b'This is the euro symbol: \xe2\x82\xac'
用utf-8解碼以后依然可以變回原狀:
# convert it back to utf-8
print(after.decode("utf-8"))
This is the euro symbol: €
Note: data encoded as UTF-8 cannot reliably be decoded with a different codec (such as ASCII).
ASCII is the oldest of these encodings and can only represent English letters and a limited set of symbols.
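To see concretely why the wrong decoder fails, here is a minimal sketch: the euro sign encodes to the three UTF-8 bytes \xe2\x82\xac, and \xe2 is not a valid ASCII byte, so decoding with ASCII raises an exception rather than returning garbage.

```python
# encode a string containing the euro sign to UTF-8 bytes
after = "This is the euro symbol: €".encode("utf-8")

# trying to decode those bytes as ASCII fails, because byte 0xe2
# falls outside the 0-127 range that ASCII defines
try:
    after.decode("ascii")
except UnicodeDecodeError as err:
    print(err)
```

This is exactly the kind of error pandas surfaces later when it tries to read a non-UTF-8 file with the default codec.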
# Your turn! Try encoding and decoding different symbols to ASCII and
# see what happens. I'd recommend $, #, 你好 and ?????? but feel free to
# try other characters. What happens? When would this cause problems?
mytext = "£,#,nihao,你好"
encode_utf = mytext.encode("utf-8", errors = "replace")
encode_ascii = mytext.encode("ascii", errors = "replace")
print(encode_utf)
print(encode_ascii)
print(encode_utf.decode("utf-8"))
print(encode_ascii.decode("ascii"))
The outputs are:
b'\xc2\xa3,#,nihao,\xe4\xbd\xa0\xe5\xa5\xbd'
b'?,#,nihao,??'
£,#,nihao,你好
?,#,nihao,??
As this shows, once data has been encoded or decoded with the wrong codec, the original information is lost.
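How much is lost depends on the `errors` argument passed to `encode`. A quick sketch of the common handlers, using "你好" as the unencodable input:

```python
s = "你好"

# "replace" substitutes '?' for each unencodable character (lossy)
print(s.encode("ascii", errors="replace"))           # b'??'

# "ignore" silently drops them (even lossier)
print(s.encode("ascii", errors="ignore"))            # b''

# "backslashreplace" keeps a recoverable escape sequence
print(s.encode("ascii", errors="backslashreplace"))  # b'\\u4f60\\u597d'

# "strict" (the default) raises UnicodeEncodeError instead of losing data
```

Only "backslashreplace" (and the strict default, by refusing to proceed) preserves enough to reconstruct the original; "replace" and "ignore" throw the information away for good.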
Reading data in other encodings
Back to the problem from the introduction: if the file uses a different encoding, pd.read_csv raises an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
So we need to figure out what encoding the original data uses, and then tell Python to read it that way.
But there are a great many encodings, and trying them one by one is tedious. Fortunately the chardet library can guess automatically, much like yesterday's automatic date-format inference: not guaranteed to be 100% correct, but a big time saver. We examine only the first 10,000 bytes of the file, because 1. scanning the whole file is slow, and 2. the error message points at position 11, which 10,000 bytes easily covers.
PS:
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it's omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
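The text-vs-binary distinction matters here because chardet needs raw bytes, not already-decoded text. A small sketch (the temporary file path is made up for the example) showing what each mode returns:

```python
import os
import tempfile

# hypothetical demo file for illustration; any path would do
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("€")

# text mode ('r') decodes the bytes for you and returns str
with open(path, "r", encoding="utf-8") as f:
    print(f.read())   # €

# binary mode ('rb') returns the raw, undecoded bytes
with open(path, "rb") as f:
    print(f.read())   # b'\xe2\x82\xac'
```

This is why the chardet snippet below opens the CSV with 'rb': guessing an encoding only makes sense on the raw bytes.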
# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
# check what the character encoding might be
print(result)
The result: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
chardet guesses the encoding is Windows-1252, with 73% confidence.
So we tell Python to read the file that way, then convert it to UTF-8 (the default) and save it.
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')
# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")
One final caveat: sometimes the guess chardet makes from the first 10,000 bytes (say, ASCII) still produces an error when you pass it to Python. In that case, feed chardet more bytes; with a larger sample it can return a different, better guess.
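If chardet keeps guessing wrong, a crude stdlib fallback is to try a short list of likely codecs in order. `sniff_encoding` below is a hypothetical helper written for this sketch, not part of chardet or pandas. Note that latin-1 accepts every possible byte, so it never fails and belongs only at the very end of the list as a last resort:

```python
def sniff_encoding(raw, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return the first candidate codec that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# UTF-8 bytes are accepted by the first candidate
print(sniff_encoding("This is the euro symbol: €".encode("utf-8")))  # utf-8

# 0x99 is invalid as a UTF-8 start byte but valid in Windows-1252
print(sniff_encoding(b"\x99caf\xe9"))  # windows-1252
```

Whichever codec this returns can then be passed straight to pd.read_csv via its encoding argument, just like the chardet result above.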