1. Encodings supported by C++

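C++ can represent strings whose code units are 1, 2, or 4 bytes wide (a single UTF-8 character takes 1 to 4 bytes), and it already provides a whole family of string types: std::string, std::wstring, std::u8string, std::u16string, and std::u32string.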

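String type	Literal form	Character type	Encoding
std::string	"hello world"	char	ANSI (the execution character set; the active code page on Windows)
std::wstring	L"hello world"	wchar_t	Unicode (UTF-16 on Windows, usually UTF-32 elsewhere)
std::u8string	u8"hello world"	char8_t (C++20; u8 literals are char before C++20)	UTF-8
std::u16string	u"hello world"	char16_t	UTF-16
std::u32string	U"hello world"	char32_t	UTF-32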
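As a rough illustration of the table above, the sketch below counts how many code units the same text occupies in each Unicode literal form. The helper name `code_units` and the constant names are made up for this example, and the characters are written as escapes so the result does not depend on how the source file is saved.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
// (Hypothetical helper, introduced only for this illustration.)
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// U+4F60 U+597D ("你好"), written as universal character names.
constexpr std::size_t kUtf8Units  = code_units(u8"\u4F60\u597D"); // 3 bytes per character
constexpr std::size_t kUtf16Units = code_units(u"\u4F60\u597D");  // 1 unit per character
constexpr std::size_t kUtf32Units = code_units(U"\u4F60\u597D");  // 1 unit per character

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t kAstralUtf8Units  = code_units(u8"\U0001F600"); // 4 bytes
constexpr std::size_t kAstralUtf16Units = code_units(u"\U0001F600");  // surrogate pair
constexpr std::size_t kAstralUtf32Units = code_units(U"\U0001F600");  // single unit

static_assert(kUtf8Units == 6 && kUtf16Units == 2 && kUtf32Units == 2,
              "two BMP characters");
```

The point of the comparison is that length() on any of these string types counts code units, not characters, so the same text reports different lengths depending on the type that holds it.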

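2. Windows operating system encodings

The Windows API exposes two sets of interfaces, multibyte (ANSI) and Unicode, as two families of functions with the A and W suffixes. The character set handled by the multibyte (ANSI) interface depends on the system locale: different countries and regions defined their own standards, which gave rise to encodings such as GB2312, GBK, GB18030, Big5, and Shift_JIS. On Windows these encodings, which use more than one byte to represent a character, are collectively called ANSI encodings. On Simplified Chinese Windows, the ANSI encoding is GBK (CP 936); on Traditional Chinese Windows it is Big5 (CP 950); on Japanese Windows it is Shift_JIS (CP 932).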
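To make "multibyte" concrete, here is a minimal sketch that counts the characters in a GBK byte string. It assumes the usual GBK layout (single bytes below 0x80, lead bytes 0x81 to 0xFE followed by a trail byte 0x40 to 0xFE other than 0x7F); treating 0x80 and 0xFF as malformed and ignoring the 4-byte forms of GB18030 are simplifications. The function name `gbk_char_count` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the characters in a GBK byte string.
// Returns -1 if the bytes do not form a well-formed GBK sequence
// under the simplified rules described above.
long gbk_char_count(const std::string& bytes)
{
    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80)
        {
            i += 1; // ASCII: one byte, one character
        }
        else
        {
            // A double-byte character needs a valid lead and trail byte.
            if (lead == 0x80 || lead == 0xFF || i + 1 >= bytes.size())
            {
                return -1;
            }
            const unsigned char trail = static_cast<unsigned char>(bytes[i + 1]);
            if (trail < 0x40 || trail == 0x7F || trail == 0xFF)
            {
                return -1;
            }
            i += 2;
        }
        ++count;
    }
    return count;
}
```

For example, the four bytes C4 E3 BA C3 are the GBK encoding of "你好", so the function reports 2 characters while the string's size() is 4.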

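3. Character encoding in C/C++ library functions

On Windows, the C/C++ library functions are implemented on top of the Windows API, so their expectations for string encodings match those of the operating system. For example, fopen ultimately opens the file through the Win32 file API and interprets its char path in the ANSI code page, so on Simplified Chinese Windows fopen must be given a GBK-encoded path to open the file correctly.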

4. Character transcoding

Converting among GBK, UTF-8, and Unicode (wide strings) is a very common task. The code below does it with ICU:

#ifndef STRING_UTIL_H_
#define STRING_UTIL_H_
#include <string>

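// "unicode" here means std::wstring: UTF-16 on Windows, UTF-32 where wchar_t is 32 bits.
// "gbk" strings are converted with ICU's gb18030 converter.
std::string unicode2gbk(const std::wstring& ws);
std::wstring gbk2unicode(const std::string& str);

std::wstring utf82unicode(const std::string& str);
std::string unicode2utf8(const std::wstring& ws);

// Narrow-to-narrow conversions go through std::wstring as the intermediate form.
std::string utf82gbk(const std::string& str);
std::string gbk2utf8(const std::string& str);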
#endif

#include "string_util.h"
#include <assert.h>
#include <memory.h>
#include <unicode/ucnv.h>
#include <unicode/ustring.h>

#define  BUFFER_SIZE 8192

#ifdef WIN32_MSVC
#ifdef _DEBUG
#pragma comment(lib, "icuucd.lib")
#pragma comment(lib, "icudtd.lib")
#else
#pragma comment(lib, "icuuc.lib")
#pragma comment(lib, "icudt.lib")
#endif
#endif

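// Convert a wide string to the narrow encoding named by converterName.
// Returns an empty string on failure, including text too long for BUFFER_SIZE.
std::string conv(const std::wstring& ws, const char *converterName)
{
    UErrorCode status = U_ZERO_ERROR;
#ifdef _MSC_VER
    // With MSVC, wchar_t is a 16-bit UTF-16 code unit, the same layout as UChar.
    const UChar* source = (const UChar*)ws.c_str();
    int32_t srcLength = (int32_t)ws.length();
#else
    // Elsewhere wchar_t is assumed to hold UTF-32, so convert to UTF-16 first.
    UChar source[BUFFER_SIZE];
    int32_t srcLength = 0;
    u_strFromUTF32(source, BUFFER_SIZE, &srcLength, (const UChar32*)ws.c_str(), (int32_t)ws.length(), &status);
    if (U_FAILURE(status))
    {
        return "";
    }
#endif

    UConverter* converter = ucnv_open(converterName, &status);
    if (U_FAILURE(status))
    {
        return "";
    }

    char buffer[BUFFER_SIZE];
    int32_t destLength = ucnv_fromUChars(converter, buffer, BUFFER_SIZE, source, srcLength, &status);
    ucnv_close(converter);
    if (U_FAILURE(status))
    {
        return "";
    }

    // Use the returned length rather than relying on NUL termination,
    // which ICU omits when the output exactly fills the buffer.
    return std::string(buffer, destLength);
}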

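// Convert a narrow string in the encoding named by converterName to a wide string.
// Returns an empty string on failure, including text too long for BUFFER_SIZE.
std::wstring conv(const std::string& str, const char *converterName)
{
    UErrorCode status = U_ZERO_ERROR;
    UConverter* converter = ucnv_open(converterName, &status);
    if (U_FAILURE(status))
    {
        return L"";
    }

    UChar dest[BUFFER_SIZE];
    // Keep the number of UTF-16 code units written; it is needed below.
    int32_t destLength = ucnv_toUChars(converter, dest, BUFFER_SIZE, str.c_str(), (int32_t)str.length(), &status);
    ucnv_close(converter);
    if (U_FAILURE(status))
    {
        return L"";
    }

#ifdef _MSC_VER
    // With MSVC, wchar_t is a 16-bit UTF-16 code unit, the same layout as UChar.
    return std::wstring((const wchar_t*)dest, destLength);
#else
    // Elsewhere wchar_t is assumed to hold UTF-32, so convert from UTF-16.
    wchar_t dest32[BUFFER_SIZE];
    int32_t dest32Length = 0;
    u_strToUTF32((UChar32*)dest32, BUFFER_SIZE, &dest32Length, dest, destLength, &status);
    if (U_FAILURE(status))
    {
        return L"";
    }

    return std::wstring(dest32, dest32Length);
#endif
}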

std::string unicode2gbk(const std::wstring& ws)
{
    return conv(ws, "gb18030");
}

std::wstring gbk2unicode(const std::string& str)
{
    return conv(str, "gb18030");
}

std::wstring utf82unicode(const std::string& str)
{
    return conv(str, "utf-8");
}

std::string unicode2utf8(const std::wstring& ws)
{
    return conv(ws, "utf-8");
}

std::string utf82gbk(const std::string& str)
{
    return unicode2gbk(utf82unicode(str));
}

std::string gbk2utf8(const std::string& str)
{
    return unicode2utf8(gbk2unicode(str));
}
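To show what a conversion such as utf82unicode does conceptually, here is a minimal hand-written sketch of UTF-8 to UTF-32 decoding and the reverse, using only the standard library. It is not how ICU implements its converters, and the names `utf8_to_utf32` and `utf32_to_utf8` are invented for this illustration. GBK conversion cannot be written this way because it needs a mapping table, which is what a library like ICU supplies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points. Returns false on malformed input.
bool utf8_to_utf32(const std::string& in, std::u32string& out)
{
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
    out.clear();
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return false; // stray continuation byte or invalid lead byte
        if (i + len > in.size()) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            return false;
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Encode UTF-32 code points as UTF-8 bytes. Returns false on invalid code points.
bool utf32_to_utf8(const std::u32string& in, std::string& out)
{
    out.clear();
    for (char32_t cp : in)
    {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

Because the decoder rejects malformed input, it also doubles as a quick check: the GBK bytes of "你好" (C4 E3 BA C3) are not valid UTF-8 and make it return false.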

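5. No garbled output

For the tests that follow, "source file character set" means the character set the compiler assumes for string literals when it reads the source file.
Typically, if you create a new project in Visual Studio on Simplified Chinese Windows and add the following code, it runs correctly.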

#include <iostream>
using namespace std;

int main()
{
    cout << "你好 世界!" << endl;
    return 0;
}

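Opening the source file in Notepad3 shows that its character set is ANSI (CP-936), that is, GBK:

[Screenshot: Notepad3 reporting the encoding as ANSI (CP-936)]

Because the source file is GBK, the bytes of the literal "你好 世界!" are compiled as GBK, and the console also interprets its output as GBK, so everything displays correctly.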

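6. Garbled output caused by the source file character set

In Notepad3, choose File -> Encoding -> Set Document As -> UTF-8 and save. The source file's character set is now UTF-8.

[Screenshot: Notepad3 reporting the encoding as UTF-8]

Run the program again and the output is garbled.

[Screenshot: garbled console output]

The source file is now UTF-8, but the compiler and the console still assume GBK. The bytes of "你好 世界!" are copied into the program unchanged as UTF-8 (presumably because the file was saved without a BOM, in which case MSVC falls back to the ANSI code page), and the console then interprets those bytes as GBK, so the output is garbled.
The fix is to convert the string to GBK: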

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk("你好 世界!") << endl;
    return 0;
}
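When it is unclear which encoding a literal actually ended up in, dumping its bytes settles the question. The helper below is a small sketch for that purpose; the name `hex_dump` is hypothetical and not part of string_util.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Render each byte of the string as two uppercase hex digits, separated by spaces.
std::string hex_dump(const std::string& bytes)
{
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0)
        {
            out += ' ';
        }
        out += kDigits[b >> 4];   // high nibble
        out += kDigits[b & 0x0F]; // low nibble
    }
    return out;
}
```

Printing hex_dump("你好") gives C4 E3 BA C3 when the literal was compiled as GBK, and E4 BD A0 E5 A5 BD when it holds UTF-8 bytes.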

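7. Garbled output caused by the string literal prefix

First restore the file's encoding with File -> Encoding -> Set Document As -> ANSI, then change the source by adding the u8 prefix to the string. Run it, and the output is the same garbled text as in the previous section. (This example assumes C++17 or earlier; in C++20 a u8 literal has type const char8_t[] and can no longer be streamed to cout.)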

#include <iostream>
using namespace std;

int main()
{
    cout << u8"你好 世界!" << endl;
    return 0;
}

因?yàn)槲谋疚募荊BK,字符串是UTF8，所以"你好世界!"被解釋為了UTF8，所以輸出也是亂碼。
解決方法，將字符串轉(zhuǎn)為GBK即可

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk(u8"你好 世界!") << endl;
    return 0;
}
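One caveat with u8 literals: since C++20 their element type is char8_t, so they no longer convert implicitly to std::string. The sketch below is one possible way to copy the bytes of a u8 literal into a std::string under either standard; the helper name `u8_bytes` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copy the bytes of a u8 string literal into a std::string.
// CharT is char before C++20 and char8_t from C++20 on; both are one byte wide.
template <typename CharT, std::size_t N>
std::string u8_bytes(const CharT (&literal)[N])
{
    static_assert(sizeof(CharT) == 1, "expects a u8 (or narrow) string literal");
    // N includes the terminating NUL, which is not copied.
    return std::string(reinterpret_cast<const char*>(literal), N - 1);
}
```

With such a helper, a call like utf82gbk(u8_bytes(u8"...")) would compile under both C++17 and C++20.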

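The UTF-8 versus GBK mismatch shown in this section is very common in practice. For example, when a string arrives over the network and part of it is extracted into a std::string, that string is very likely UTF-8. Lower-level code that passes it to standard library functions such as fopen then fails to handle it correctly; converting the string to GBK first makes it work.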
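Converting to GBK works only for characters that GBK can represent. As an alternative sketch for the file-opening case, a UTF-8 path can be converted to UTF-16 and opened through the wide-character CRT function instead. The helper names `fopen_utf8` and `widen_utf8` are hypothetical; the Windows branch relies on MultiByteToWideChar and _wfopen_s, and the fallback assumes a platform whose fopen already accepts UTF-8 paths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

#ifdef _WIN32
// Convert UTF-8 to UTF-16 with the Win32 API; returns an empty string on failure.
static std::wstring widen_utf8(const std::string& s)
{
    if (s.empty())
    {
        return std::wstring();
    }
    // First call asks for the required length, second call converts.
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(),
                                      static_cast<int>(s.size()), nullptr, 0);
    if (n <= 0)
    {
        return std::wstring();
    }
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(),
                        static_cast<int>(s.size()), &w[0], n);
    return w;
}
#endif

// Open a file whose path is UTF-8 encoded. Returns nullptr on failure.
// (Hypothetical helper, not part of the string_util.h shown earlier.)
std::FILE* fopen_utf8(const std::string& utf8Path, const char* mode)
{
    assert(mode != nullptr);
#ifdef _WIN32
    std::FILE* file = nullptr;
    if (_wfopen_s(&file, widen_utf8(utf8Path).c_str(),
                  widen_utf8(mode).c_str()) != 0)
    {
        return nullptr;
    }
    return file;
#else
    // Assumption: the platform's fopen takes UTF-8 paths as-is.
    return std::fopen(utf8Path.c_str(), mode);
#endif
}
```

The returned FILE* is used and closed exactly like one obtained from fopen.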

8. Using UTF-8 for both the source file character set and the string literal prefix

Here we convert the file's encoding to UTF-8 and run the following code:

#include <iostream>
using namespace std;

int main()
{
    cout << u8"你好 世界!" << endl;
    return 0;
}

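This time the garbled output differs from the output of the examples in the previous two sections.

[Screenshot: garbled console output, different from the earlier examples]

What is involved here is double transcoding: the string is transcoded once because of the source file's character set and once more by the u8 prefix. Presumably the compiler reads the file's UTF-8 bytes as GBK, and the u8 prefix then converts that misread text to UTF-8 again. Undoing it takes two conversions: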

#include <iostream>
#include "string_util.h"

using namespace std;

int main()
{
    cout << utf82gbk(utf82gbk(u8"你好 世界!")) << endl;
    return 0;
}
