通過一個
emoji表情引發(fā)的報錯來認(rèn)識下MySQL中獨一無二的utf8mb4
1.發(fā)現(xiàn)問題
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
at com.example.demo.tst.Three.main(Three.java:51)
碰到這種四個字節(jié)(\xF0\x9F\x98\xAB)的報錯,說明字符集有問題。
2. 分析問題
第一篇文章中,我們就說過mysql中的字符集轉(zhuǎn)換是怎么玩的,那就一個一個來看下哪里的問題。
表結(jié)構(gòu):
mysql> show create table account;
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table
|
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| account | CREATE TABLE `account` (
`id` int(11) NOT NULL AUTO_INCREMENT COMMENT '自增id',
`name` varchar(100) CHARACTER SET utf8mb4 DEFAULT NULL,
`balance` int(11) DEFAULT NULL COMMENT '余額',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
- 先看庫中字段的字符集
mysql> show full columns from account;
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
| id | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | 自增id |
| name | varchar(100) | utf8_general_ci | YES | | NULL | | select,insert,update,references | |
| balance | int(11) | NULL | YES | | NULL | | select,insert,update,references | 余額 |
+---------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+----------+
3 rows in set (0.01 sec)
上面報錯的字段是name,這里其比較規(guī)則為utf8_general_ci,我們知道這是utf8 的比較規(guī)則,MySQL中的utf8字符集實際是utf8mb3,最大只能放三個字節(jié)。
那就改唄,四個字節(jié)的utf8 -- utf8mb4
MySQL版本5.1.13以后開始支持utf8mb4
3.解決問題
- 檢查列字符集:
mysql> alter table account modify name varchar(100) character set utf8mb4 collate utf8mb4_general_ci;
Query OK, 1 row affected (0.76 sec)
Records: 1 Duplicates: 0 Warnings: 0
- 檢查
MySQL服務(wù)端那三個字符集
mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)
啥?居然都是utf8,沒的說,改:
mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
再查看下:
mysql> show variables like 'character_set_%';
+--------------------------+---------------------------------------+
| Variable_name | Value |
+--------------------------+---------------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | E:\MySQL\mysql-5.7.26\share\charsets\ |
+--------------------------+---------------------------------------+
8 rows in set, 1 warning (0.00 sec)
嗯,改回來了~
- 檢查客戶端發(fā)送字符集(這步有點坑~)
上面兩步都能理解,這一步可能有點不明白為啥還要再來一個字符集?
其實,一個字符從出生(發(fā)送)到安全落庫(落盤),經(jīng)歷的字符集可能有點多,但其第一個字符集就是這個所謂的客戶端發(fā)送字符集。
雖然要經(jīng)過這么多的字符集轉(zhuǎn)換,但并不會有那么多的轉(zhuǎn)換,因為實際情況下會將其中幾個字符集都設(shè)為一致~
我們是通過Java代碼連接MySQL,這里可以在Connector/J中指定(省略其他配置):
jdbc:mysql://localhost:3306/asia?characterEncoding=utf8mb4&useUnicode=true
不過,你會發(fā)現(xiàn)這樣啟動會報錯告訴你不支持這么寫。。
這里先貼下官方說明:
Setting the Character Encoding
The character encoding between client and server is automatically detected upon connection (provided that the Connector/J connection properties
characterEncodingandconnectionCollationare not set). You specify the encoding on the server using the system variablecharacter_set_server(for more information, see Server Character Set and Collation). The driver automatically uses the encoding specified by the server. For example, to use the 4-byte UTF-8 character set with Connector/J, configure the MySQL server withcharacter_set_server=utf8mb4, and leavecharacterEncodingandconnectionCollationout of the Connector/J connection string. Connector/J will then autodetect the UTF-8 setting.
MySQL告訴我們?nèi)缛粢褂?個字節(jié)的utf8,那你需要做兩件事:
-
MySQL服務(wù)端配置character_set_server=utf8mb4 - 不要設(shè)置
characterEncodingandconnectionCollation屬性
網(wǎng)上還有一種說法:
Connector/J did not support utf8mb4 for servers 5.5.2 and newer.
Connector/J now auto-detects servers configured with character_set_server=utf8mb4 or treats the Java encoding utf-8 passed usingcharacterEncoding=... as utf8mb4 in the SET NAMES= calls it makes when establishing the connection. (Bug #54175)
這個意思是Connector/J并不支持utf8mb4,不過有兩種方式等價:
-
MySQL服務(wù)器中配置加上character_set_server=utf8mb4 -
Connector/J中加上characterEncoding=utf8
本文選擇相信官方文檔,那個Bug #54175貼出來也是求同存異, 也可能是我理解錯了[狗頭]
我這里就直接改庫中字符集了:
mysql> set character_set_server=utf8mb4;
Query OK, 0 rows affected (0.00 sec)
嗯,上面吹了這么多牛逼,結(jié)果發(fā)現(xiàn)還是不行:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2487)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1197)
at com.example.demo.tst.Three.main(Three.java:51)
原因很簡單,需要重啟,不過重啟讀的是配置文件中的配置,本次會話手動設(shè)置的字符集將全部失效。
所以建議寫在MySQL配置文件中 my.ini(windows系統(tǒng))
綜上的配置如下(省略其他配置):
[client]
# 服務(wù)端的三個字符集,效果等同 set NAMES utf8mb4
default-character-set=utf8mb4
[mysqld]
# 指定服務(wù)器系統(tǒng)字符集
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
這里的比較規(guī)則是
ci(默認(rèn)是這種比較規(guī)則),是不區(qū)分大小寫的,如果需要可以改成utf8mb4_bin
再把服務(wù)重啟一哈:
C:\Windows\system32>net stop mysql
MySQL 服務(wù)正在停止..
MySQL 服務(wù)已成功停止。
C:\Windows\system32>
C:\Windows\system32>net start mysql
MySQL 服務(wù)正在啟動 ..
MySQL 服務(wù)已經(jīng)啟動成功。
代碼再執(zhí)行就執(zhí)行成功了,查詢看看:
mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
| 1 | ?? | 10 |
+----+------+---------+
1 row in set (0.00 sec)
這個表情真丑。。
這里還可以玩一個東西,我們把服務(wù)端的三個字符集改成utf8,再查詢看看:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql>
mysql> select * from account;
+----+------+---------+
| id | name | balance |
+----+------+---------+
| 1 | ? | 10 |
+----+------+---------+
1 row in set (0.00 sec)
還可以在這個時候來把列name的字符集改回utf8試試:
mysql> alter table account modify name varchar(100) character set utf8 collate utf8_general_ci;
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\xAB' for column 'name' at row 2
挺好玩吧~~~
番外
- 本地開發(fā)環(huán)境測試的時候發(fā)現(xiàn),當(dāng)配置文件中加了這個配置
character-set-server=utf8mb4,那么無論其他的字符集你怎么玩,連接怎么設(shè)置,你的emoji表情都能正常入庫,而生產(chǎn)環(huán)境不行。
推論:
character-set-server這個字符集的作用可能不止于指定發(fā)送字符集為utf8mb4, 甚至猜測不能指定發(fā)送的字符集
如果有大佬知道的,還望留言告知,萬謝!
- 鑒于以上情況,測試環(huán)境
characterEncoding=utf8此配置基本無用. - 對于生產(chǎn)環(huán)境不太好修改庫設(shè)置的,可以在數(shù)據(jù)源中增加
initSql,以druid配置為例:
dataSource = new DruidDataSource();
// 省略其他配置
List<String> sqlList = new ArrayList<>();
sqlList.add("SET NAMES utf8mb4");
dataSource.setConnectionInitSqls(sqlList);
每次連接都會去設(shè)置字符集為uft8mb4,生產(chǎn)環(huán)境可行。
- 最后捋一捋思路
- 庫中字段字符集為
utf8mb4 - 連接的字符集要為
utf8mb4, 就是連接過程中的那三個, 通過SET NAMES UTF8MB4指定,這一步可以走庫配置也可以走數(shù)據(jù)源設(shè)置,具體見上文 - 發(fā)送字符集要為
utf8mb4,也就是這個配置character-set-server=utf8mb4
- 庫中字段字符集為
按照這個思路,BMP中的字符保管你全部都能存。
BMP是啥? 我下篇文章就寫寫Unicode吧,等我~