一、測(cè)試數(shù)據(jù)
該數(shù)據(jù)采用json格式存儲(chǔ):
id:代表當(dāng)前用戶微博的id;
ids:代表當(dāng)前微博用戶關(guān)注其他微博用戶的id列表;
total_number:關(guān)注微博用戶的總量。
{"id": 1701439105,"ids": [2154137571,3889177061,1496915057,1663973284],"total_number": 493}
{"id": 1701439106,"ids": [2154137572,3889177063,1496915058,1663973285],"total_number": 494}
二、數(shù)據(jù)存儲(chǔ)及解析
方式一:
將源系統(tǒng)json數(shù)據(jù)以字符串的數(shù)據(jù)類型寫入Hive表中,然后通過函數(shù)解析獲取json數(shù)據(jù)。
數(shù)據(jù)建表及寫入
hive (myhive)> create table myhive.tbs_json_test(
json_data string
)
stored as textfile;
hive (myhive)> load data local inpath '/install/hivedatas/json_data.json' into table myhive.tbs_json_test;
數(shù)據(jù)解析
hive (myhive)> select get_json_object(json_data,'$.ids') from myhive.tbs_json_test;
hive (myhive)> select t2.* from myhive.tbs_json_test t1 lateral view json_tuple(t1.json_data,'id','ids') t2 as c1,c2;
get_json_object:用來解析json字符串的一個(gè)字段;
json_tuple:用來解析json字符串的多個(gè)字段。
方式二:
將源系統(tǒng)json數(shù)據(jù)按照key拆成多個(gè)Hive數(shù)據(jù)字段,加載對(duì)應(yīng)的jar包后,導(dǎo)入json數(shù)據(jù)到Hive數(shù)據(jù)表對(duì)應(yīng)字段中。
數(shù)據(jù)建表及寫入
hive (myhive)> add jar /install/apache-hive-3.1.2/hcatalog/share/hcatalog/ hive-hcatalog-core-3.1.2.jar;
hive (myhive)>create table myhive.tbs_json_test2(
id string,
ids array<string>,
total_number int)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
hive (myhive)> load data local inpath '/install/hivedatas/json_data.json' into table myhive.tbs_json_test2;
直接select 字段名稱即可查看數(shù)據(jù)表。
注意:
1、如果json數(shù)據(jù)不符合規(guī)范查詢則會(huì)報(bào)錯(cuò)。增加如下配置會(huì)挑過錯(cuò)誤數(shù)據(jù),錯(cuò)誤數(shù)據(jù)將變?yōu)镹ULL.
alter table myhive.tbs_json_test2 set serdeproperties( "ignore.malformed.json" = "true");
2、如果json數(shù)據(jù)中包含Hive關(guān)鍵字,數(shù)據(jù)寫入時(shí)有問題,可通過SerDe使用SerDe屬性將Hive列映射到新名稱的屬性下。
例如:如果ids為Hive關(guān)鍵字,則將ids改為ids_alias創(chuàng)建Hive數(shù)據(jù)表。
hive (myhive)>create table myhive.tbs_json_test3(
id string,
ids_alias array<string>,
total_number int)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties("mapping.ids_alias"="ids")
stored as textfile;