How to determine the encoding of text?

https://stackoverflow.com/questions/436220/how-to-determine-the-encoding-of-text

Correctly detecting the encoding every time is impossible.

(From chardet FAQ:)
However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library for Python applies this kind of study to guess the encoding of a byte string. chardet is a port of the auto-detection code in Mozilla.
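
As a rough illustration of the chardet approach, the sketch below passes raw bytes to `chardet.detect`, which returns a dictionary containing a guessed `encoding` and a `confidence` score. The helper name `guess_encoding` and the sample text are made up for this example, and the import is guarded on the assumption that chardet may not be installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the helper below then reports "no guess"


def guess_encoding(data):
    """Return chardet's guess as a dict, or None if chardet is unavailable.

    guess_encoding is a hypothetical helper name, not part of chardet.
    """
    if chardet is None:
        return None
    # detect() takes bytes and returns a dict with 'encoding' and 'confidence'.
    return chardet.detect(data)


# Sample bytes in a legacy single-byte encoding (illustrative text only).
sample = "Ceci est un exemple de texte accentu\u00e9.".encode("windows-1252")
guess = guess_encoding(sample)
if guess is not None:
    print(guess["encoding"], guess["confidence"])
```

The result is only a guess: short inputs give the detector little to work with, so it is worth checking the confidence value before trusting the reported encoding.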

python有一個(gè)chardet庫(kù)使用這種方式來(lái)嘗試檢測(cè)編碼。
chardet庫(kù)是Mozilla中自動(dòng)檢測(cè)代碼的一個(gè)移植庫(kù)。

You can also use UnicodeDammit, the encoding detector that ships with Beautiful Soup. It tries the following methods in order:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252
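
The fallback order above can be approximated with the standard library alone. The following is a simplified sketch, not UnicodeDammit's actual implementation: it skips the in-document declaration scan and the chardet step, and the function name `decode_with_fallbacks` and its `declared` parameter are invented for this example.

```python
import codecs

# Byte order marks to sniff. The 4-byte UTF-32 marks are listed before the
# 2-byte UTF-16 marks because the UTF-32 LE mark starts with the UTF-16 LE one.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(data, declared=None):
    """Return (text, encoding_used) for a bytes object.

    Hypothetical helper: tries an explicitly declared encoding first, then an
    encoding implied by a byte order mark, then UTF-8, then Windows-1252.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    for bom, name in _BOMS:
        if data.startswith(bom):
            candidates.append(name)
            break
    candidates.append("utf-8")

    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue

    # Last resort. Windows-1252 leaves a few byte values undefined, so those
    # are replaced rather than raising an error.
    return data.decode("windows-1252", errors="replace"), "windows-1252"


text, used = decode_with_fallbacks(b"caf\xe9")
print(used, text)
```

With Beautiful Soup installed, the real class is used as `UnicodeDammit(data)`; the decoded text is then available as `unicode_markup` and the encoding it settled on as `original_encoding`.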
