Apache Phoenix 常用操作
基礎(chǔ)知識
1****、****Phoenix 主要技術(shù)點
a、將SQL轉(zhuǎn)化為HBase Scan,將結(jié)果封裝為JDBC Result Set。
b、表的元數(shù)據(jù)保存在HBase表(系統(tǒng)表)中。
c、使用了coprocessor 和 custom filter 保證高效,使得小規(guī)模查詢的延時在毫秒級,百萬行的查詢延時在秒級。
· coprocessors to perform operations on the server-side thus minimizing client/server data transfer
· custom filters to prune data as close to the source as possible In addition, to minimize any startup costs, Phoenix uses native HBase APIs rather than going through the map/reduce framework
2****、****JDBC****連接的****URL
jdbc:phoenix [ :<zookeeper quorum> [ :<port number> ] [ :<root node> ] ]
如:
Connection conn = DriverManager.getConnection("jdbc:phoenix:server1,server2:2181");
屬性對應(yīng)于:hbase.zookeeper.quorum, hbase.zookeeper.property.clientPort, zookeeper.znode.parent
3、Phoenix 不支持特性
a、Full Transaction, 現(xiàn)在只支持 TRANSACTION_READ_COMMITTED,不支持其他類型。
b、關(guān)系操作. Union, Intersect, Minus
c、雜項內(nèi)置函數(shù)
4****、****映射到已存在****HBase****表
使用 CREATE TABLE and CREATE VIEW
區(qū)別:
a、CREATE TABLE 啟用 KEEP_DELETED_CELLS = true, CREATE VIEW 不會
b、CREATE TABLE 能添加HBase 中不存在的列族,CREATE VIEW 不會
加鹽處理
因為HBase 數(shù)據(jù)儲存按照 row key 排序,如果HBase表的 row key 是單調(diào)遞增的,則HBase 容易有RegionServer 的局部熱點問題。加鹽可以緩解這個問題。
create table H3 (id varchar not null primary key, cf1.a varchar, cf2.b varchar) SALT_BUCKETS=20; 只能在創(chuàng)建表格時候加,創(chuàng)建后不可更改。
alter table h1 set salt_buckets=10;
Error: ERROR 1024 (42Y83): Salt bucket number may only be specified when creating a table. tableName=H1
加鹽后的注意事項:
a、sequential scan 返回的結(jié)果可能不是自然排序的,如果sequential scan使用了LIMIT語句,將與不加鹽的情況不一樣。
b、 Spit point:If no split points are specified for the table, the salted table would be pre-split on salt bytes boundaries to ensure load distribution among region servers even during the initial phase of the table. If users are to provide split points manually, users need to include a salt byte in the split points they provide.
c、Row Key 排序:Pre-spliting also ensures that all entries in the region server all starts with the same salt byte, and therefore are stored in a sorted manner. When doing a parallel scan across all region servers, we can take advantage of this properties to perform a merge sort of the client side. The resulting scan would still be return sequentially as if it is from a normal table
實際上是改寫了Row Key,添加了一個prefix
new_row_key = (++index % BUCKETS_NUMBER) + original_key
數(shù)據(jù)存儲到 Buckects_Number 個Bucket中 ,每個Bucket的Prefix 相同,在query的時候,同時在各個Bucket進(jìn)行。
提升效率常用方法
1****、加鹽:
加鹽可以將數(shù)據(jù)存入多個region里,從而提升讀寫性能。
CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) SALT_BUCKETS=42
如果有16臺region server,每臺server有4核CPU,則SALT_BUCKETS 設(shè)置為32~64之間。
即如果集群總的CPU核數(shù)為N,則SALT_BUCKETS為 0.5N ~ N 之間。
2****、****split
如果不想通過加鹽來分區(qū),可以自己手動設(shè)置分區(qū)的方法。這樣可以不引入額外的byte,或者改變row key的順序,例子
CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) SPLIT ON ('CS','EU','NA')
3****、使用多個列族
CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR)
4****、使用壓縮
CREATE TABLE TEST (HOST VARCHAR NOT NULL PRIMARY KEY, DESCRIPTION VARCHAR) COMPRESSION='GZ'
5****、使用二級索引
參考其余博文
6****、優(yōu)化集群
參考其余博文
7****、優(yōu)化****phoenix 參數(shù)
使用任意時間戳
在Property里面設(shè)置屬性 "CurrentSCN"。
ts是一個long。
Properties props = new Properties();
props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB, Long.toString(ts));
Connection conn = DriverManager.connect(myUrl, props);
conn.createStatement().execute("UPSERT INTO myTable VALUES ('a')");
conn.commit();
相當(dāng)于:
myTable.put(Bytes.toBytes('a'),ts);
索引基礎(chǔ)
1、immutable Index
原文:Immutable indexing targets use cases that are write once, append only; this is common in time-series data, where you log once, but read multiple times. In this case, the indexing is managed entirely on the client - either we successfully write all the primary and index data or we return a failure to the client. Since once written, rows are never updated, no incremental index maintenance is required making them perform very well. This reduces the overhead of secondary indexing at write time. However, keep in mind that immutable indexing are only applicable in a limited set of use cases.
One restriction of immutable indexes is that rows from the data table may not be deleted. Instead, the only way to delete rows is to drop the entire data table.
Immutable 索引適用于一次寫入,數(shù)據(jù)只添加不修改的情況,例如時間序列數(shù)據(jù)。因為只需要一次寫入,數(shù)據(jù)行不會更新,不需要額外的索引維護(hù),所以性能非常好。
例子:
<pre style="line-height:15.0pt;background:whitesmoke;white-space:pre-wrap;
word-wrap: break-word">CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR) IMMUTABLE_ROWS=true;</pre>
<pre style="margin-bottom:7.5pt;line-height:15.0pt;background:whitesmoke;
white-space:pre-wrap;word-wrap: break-word">CREATE INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);</pre>
2.Global Indexing (Mutable)
原文:Global indexing targets read heavy, low write uses cases. With global indexes, all the performance penalties for indexes occur at write time. We intercept the data table updates on write (DELETE, UPSERT VALUES and UPSERT SELECT), build the index update and then sent any necessary updates to all interested index tables. At read time, Phoenix will select the index table to use that will produce the fastest query time and directly scan it just like any other HBase table. Note, however, if a column is referenced in a query that isn’t part of the index, the index will not be used for that query.
全局索引適用于 高數(shù)據(jù)量讀取,低數(shù)據(jù)量寫入的情況。全局索引的消耗主要在索引寫入的時候,在對數(shù)據(jù)表進(jìn)行寫操作的時候,同時更新所有的索引表。在讀取的時候,將直接從索引表掃描讀取數(shù)據(jù),如果讀取另外一個表一樣。如果query中某個列在索引表中不存在,全局索引將不會用到。
CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR)
CREATE INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);
3.Local Indexing (Mutable)
原文: Local indexing targets *****write heavy*****, *****space constrained***** use cases. With local indexes index data and table data are co-reside at same server so no network overhead during writes and reads. Local indexes can be used even when the query isn’t fully covered i.e. Phoenix automatically retrieve the columns not in the index through point gets against the data table. Unlike global indexes all local indexes data of a table are stored in a separate shared table. At read time when the local index is used, every region must be examined for the data as the exact region location of index data cannot be predetermined which incurs some overhead.
局部索引適用于大數(shù)據(jù)量寫、空間受限的情況下。使用局部索引,索引數(shù)據(jù)和表數(shù)據(jù)將同時放在同一臺server上,所以在讀寫的時候不會有網(wǎng)絡(luò)通信的開銷。如果query中某個列在索引表中不存在,局部索引也能用到。不同于全局索引,一個表的全部局部索引的數(shù)據(jù)保存在同一個共享表中。在讀取的時候,每一個region 都必須檢查數(shù)據(jù),因為索引數(shù)據(jù)的確切的區(qū)域位置無法預(yù)先,確定這會增加一些系統(tǒng)開銷。
例子:
CREATE TABLE my_table (k VARCHAR PRIMARY KEY, v1 VARCHAR, v2 VARCHAR , v3 VARCHAR)
CREATE LOCAL INDEX my_index ON my_table (v2 DESC, v1) INCLUDE (v3);
4****、不會使用到二級索引的情況
創(chuàng)建表:create table usertable (id varchar primary key, firstname varchar, lastname varchar);
創(chuàng)建全局索引:create index idx_name on usertable (firstname);
檢索:select id, firstname, lastname from usertable where firstname = 'foo';
不會使用到索引 idx_name。若要使用到,必須這樣:
create idx_name on usertable (firstname)include(lastname);
5****、不會使用主鍵索引情況:
創(chuàng)建表:CREATE TABLE TEST (pk1 char(1) not null, pk2 char(1) not null, pk3 char(1) not null, non-pk varchar **CONSTRAINT PK PRIMARY KEY(pk1, pk2, pk3) **);
不會使用到索引的檢索:select * from test where pk2='x' and pk3='y'
會使用到索引的檢索:select * from test where pk1='x' and pk2='y'
子查詢
1、IN 和 Not In 的子查詢
SELECT ItemName
FROM Items
WHERE ItemID IN
(SELECT ItemID
FROM Orders
WHERE Date >= to_date('2013/09/02'));
2、Exists 和Not Exists的子查詢
SELECT ItemName
FROM Items i
WHERE EXISTS
(SELECT *
FROM Orders
WHERE Date >= to_date('2013/09/02')
AND ItemID = i.ItemID);
3、半連接、反連接
見Join
4、比較運算
SELECT ID, Name
FROM Contest
WHERE Score >
(SELECT avg(Score)
FROM Contest)
ORDER BY Score DESC;
5、ANY/SOME/ALL 運算
SELECT OrderID
FROM Orders
WHERE quantity >= ANY
(SELECT max(quantity)
FROM Orders
GROUP BY ItemID);
6、相關(guān)子查詢
SELECT PatentID, Title
FROM Patents p
WHERE FileDate <= ALL
(SELECT FileDate
FROM Patents
WHERE Region = p.Region);
7、多重嵌套
SELECT ItemID, ItemName
FROM Items i
WHERE NOT EXISTS
(SELECT *
FROM Orders
WHERE CustomerID IN
(SELECT CustomerID
FROM Customers
WHERE Country = ‘Belgium’)
AND Quantity < 1000
AND ItemID = i.ItemID)
OR ItemID != ALL
(SELECT ItemID
FROM Orders
WHERE CustomerID IN
(SELECT CustomerID
FROM Customers
WHERE Country = ‘Germany’)
AND Quantity < 2000);
8、衍生表
SELECT m, count(*)
FROM
(SELECT max(x) m
FROM a1
GROUP BY name) AS t
GROUP BY m
ORDER BY count(*) DESC;
多租戶
創(chuàng)建多租戶表:
CREATE TABLE base.event (tenant_id VARCHAR, event_type CHAR(1), created_date DATE, event_id BIGINT)
MULTI_TENANT=true;
連接到某租戶的數(shù)據(jù)庫表:
Properties props = new Properties();
props.setProperty("TenantId", "Acme");
Connection conn = DriverManager.getConnection("localhost", props);
在特定租戶連接的情況下,以下語句只創(chuàng)建特定租戶的視圖
CREATE VIEW acme.event AS
SELECT * FROM base.event;
CREATE VIEW acme.login_event AS
SELECT * FROM base.event
WHERE event_type='L';
Array類型
creating a table:
CREATE TABLE regions (
region_name VARCHAR PRIMARY KEY,
zips VARCHAR ARRAY[10],
CONSTRAINT pk PRIMARY KEY (region_name));
Insert
UPSERT INTO regions(region_name,zips)
VALUES('SF Bay Area',ARRAY['94115','94030','94125']);
select:
SELECT zip[1] FROM regions WHERE region_name = 'SF Bay Area';
SELECT region_name FROM regions WHERE zip[1] = '94030' OR zip[2] = '94030' OR zip[3] = '94030';
SELECT region_name FROM regions WHERE zip[1] = '94030' OR zip[2] = '94030' OR zip[3] = '94030';
分頁查詢
組合使用order by, > , LIMIT :
SELECT title, author, isbn, description
FROM library
WHERE published_date > 2010
AND (title, author, isbn) > (?, ?, ?)
ORDER BY title, author, isbn
LIMIT 20
Skip Scan
SELECT * from T
WHERE ((KEY1 >='a' AND KEY1 <= 'b') OR (KEY1 > 'c' AND KEY1 <= 'e'))
AND KEY2 IN (1, 2)
The List<List<KeyRange>> for SkipScanFilter for the above query would be [ [ [ a - b ], [ d - e ] ], [ 1, 2 ] ] where [ [ a - b ], [ d - e ] ] is the range for KEY1and [ 1, 2 ] keys for KEY2.
跟蹤Tracing
只支持Hadoop2
配置 hadoop-metrics2-phoenix.properties
Sample from all the sources every 10 seconds
*.period=10
Write Traces to Phoenix
##########################
ensure that we receive traces on the server
phoenix.sink.tracing.class=org.apache.phoenix.trace.PhoenixMetricsSink
Tell the sink where to write the metrics
phoenix.sink.tracing.writer-class=org.apache.phoenix.trace.PhoenixTableMetricsWriter
Only handle traces with a context of "tracing"
phoenix.sink.tracing.context=tracing
配置 hadoop-metrics2-hbase.properties
ensure that we receive traces on the server
hbase.sink.tracing.class=org.apache.phoenix.trace.PhoenixMetricsSink
Tell the sink where to write the metrics
hbase.sink.tracing.writer-class=org.apache.phoenix.trace.PhoenixTableMetricsWriter
Only handle traces with a context of "tracing"
hbase.sink.tracing.context=tracing
配置 hbase-site.xml
<configuration>
<property>
<name>phoenix.trace.frequency</name>
<value>always</value>
</property>
</configuration>
<property>
<name>phoenix.trace.statsTableName</name>
<value><your custom tracing table name></value>
</property>
The tracing table is initialized via the ddl:
CREATE TABLE SYSTEM.TRACING_STATS (
trace_id BIGINT NOT NULL,
parent_id BIGINT NOT NULL,
span_id BIGINT NOT NULL,
description VARCHAR,
start_time BIGINT,
end_time BIGINT,
hostname VARCHAR,
tags.count SMALLINT,
annotations.count SMALLINT,
CONSTRAINT pk PRIMARY KEY (trace_id, parent_id, span_id)
統(tǒng)計收集
統(tǒng)計收集有助于提升query性能。
命令:
UPDATE STATISTICS my_table
等效于
UPDATE STATISTICS my_table ALL
如果只收集index或者column
UPDATE STATISTICS my_table INDEX
UPDATE STATISTICS my_table COLUMNS
參數(shù)配置
phoenix.stats.guidepost.width 默認(rèn)104857600 (100 MB)
phoenix.stats.guidepost.per.region
phoenix.stats.updateFrequency 默認(rèn)900000 (15 mins)
phoenix.stats.minUpdateFrequency 默認(rèn)7.5 mins
phoenix.stats.useCurrentTime 默認(rèn)true
mutable 和 immutable 表區(qū)別
分別創(chuàng)建表:
create table** my_mutable** (id varchar not null primary key, cf1.a varchar , cf1.b varchar, cf2.c varchar, cf2.d varchar) ;
create table my_immutable (id varchar not null primary key, cf1.a varchar , cf1.b varchar, cf2.c varchar, cf2.d varchar) immutable_rows=true ;
分別創(chuàng)建索引:
create index index_my_mutable on** my_mutable**(a,c) include (b,d);
create index** index_my_immutable** on my_immutable(a,c) include (b,d);
分別插入數(shù)據(jù)
upsert into my_mutable values ('1000001','a1','b1','c1','d1');
upsert into my_mutable values ('1000001','a2','b2','c2','d2');
upsert into my_mutable values ('1000001','a3','b3','c3','d3');
upsert into my_immutable values ('1000001','a1','b1','c1','d1');
upsert into my_immutable values ('1000001','a2','b2','c2','d2');
upsert into my_immutable values ('1000001','a3','b3','c3','d3');
查看數(shù)據(jù):
select * from my_mutable ;
| ID | A | B | C | D |
| 1000001 | a3 | b3 | c3 | d3 |
select * from my_immutable ;
| ID | A | B | C | D |
| 1000001 | a1 | b1 | c1 | d1 |
| 1000001 | a2 | b2 | c2 | d2 |
| 1000001 | a3 | b3 | c3 | d3 |
select * from** index_my_mutable** ;
| CF1:A | CF2:C | :ID | CF1:B | CF2:D |
| a3 | c3 | 1000001 | b3 | d3 |
select * from** index_my_immutable** ;
| CF1:A | CF2:C | :ID | CF1:B | CF2:D |
| a1 | c1 | 1000001 | b1 | d1 |
| a2 | c2 | 1000001 | b2 | d2 |
| a3 | c3 | 1000001 | b3 | d3 |
Global 和 Local 索引。
1****、創(chuàng)建表:
create table **immutable_local **(id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=true;
create table **immutable_global **(id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=true;
create table **mutable_local **(id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=false;
create table **mutable_global **(id varchar not null primary key, cf1.a varchar, cf1.b varchar, cf2.c varchar, cf2.d varchar ) immutable_rows=false;
2****、創(chuàng)建索引:
create index index_mutable_global on mutable_global(a,c) include(b); 成功
create **local **index index_mutable_local on mutable_local(a,c) include(b); 成功
create index** index_immutable_global** on immutable_global(a,c) include(b); 成功
create local index index_immutable_local on immutable_local(a,c) include(b); 失敗,immutable 表無法創(chuàng)建Local . Error: ERROR 1048 (43A04): Local indexes aren't allowed on tables with immutable rows. tableName=INDEX_IMMUTABLE_LOCAL (state=43A04,code=1048)
3.****插入數(shù)據(jù)
upsert into mutable_global values ('100001','a1','b1','c1','d1');
upsert into mutable_global values ('100002','a2','b2','c2','d2');
upsert into mutable_local values ('100001','a1','b1','c1','d1');
upsert into mutable_local values ('100002','a2','b2','c2','d2');
upsert into immutable_global values ('100001','a1','b1','c1','d1');
upsert into immutable_global values ('100002','a2','b2','c2','d2');
4****、測試檢索
檢索中包含了列d, 此列不包含在索引中。
a****、****immutable 表使用的時全表掃描,沒有使用索引
explain select a,b,c,d from immutable_global where a='a1';
CLIENT PARALLEL 1-WAY **FULL SCAN **OVER IMMUTABLE_GLOBAL
SERVER FILTER BY CF1.A = 'a1'
b、mutable 表使用的時全表掃描,沒有使用****Global****索引
explain select a,b,c,d from mutable_global where a='a1';
CLIENT PARALLEL 1-WAY** FULL SCAN** OVER MUTABLE_GLOBAL
SERVER FILTER BY CF1.A = 'a1'
c、mutable 表使用了 LOCAL 索引
explain select a,b,c,d from mutable_local where a='a1';
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1'] |
CLIENT MERGE SORT
5****、****Local 索引細(xì)節(jié)
索引定義:create local index index_mutable_local on mutable_local(a,c) include(b);
索引內(nèi)容:
a****、使用部分索引:索引組合的第一個
explain select a,b,c,d from mutable_local where a='a1' ;
CLIENT PARALLEL 1-WAY **RANGE SCAN **OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1']
CLIENT MERGE SORT
b****、使用部分索引:索引組合的第二個
explain select a,b,c,d from mutable_local where c='c1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768]
**SERVER FILTER BY **C = 'c1'
CLIENT MERGE SORT
c****、使用部分索引:****include****的部分
explain select a,b,c,d from mutable_local where b='b1' ;
CLIENT PARALLEL 1-WAY** RANGE SCAN** OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768]
SERVER FILTER BY CF1.B = 'b1'
CLIENT MERGE SORT
d****、使用全部索引
explain select a,b,c,d from mutable_local where a='a1' and c='c1' ;
CLIENT PARALLEL 1-WAY **RANGE SCAN **OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
CLIENT MERGE SORT
調(diào)換a和c的位置,phoenix會自動優(yōu)化。
explain select a,b,c,d from mutable_local where c='c1' and a='a1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
CLIENT MERGE SORT
e****、使用索引中全部字段
explain select a,b,c,d from mutable_local where c='c1' and b='b1' and a='a1' ;
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a1','c1']
SERVER FILTER BY CF1.B = 'b1'
CLIENT MERGE SORT
f****、使用了不存在于索引中的字段。
explain select a,b,c,d from mutable_local where a='a1' and d='d1';
CLIENT PARALLEL 1-WAY** FULL SCAN** OVER MUTABLE_LOCAL
SERVER FILTER BY (CF1.A = 'a1' AND CF2.D = 'd1')
g. 在****Select ****中不要使用*********,*********會導(dǎo)致全表掃描
explain select **a,b,c,d **from MUTABLE_LOCAL where a='a';
CLIENT PARALLEL 1-WAY RANGE SCAN OVER _LOCAL_IDX_MUTABLE_LOCAL [-32768,'a']
CLIENT MERGE SORT
explain select * from MUTABLE_LOCAL where a='a';
** CLIENT PARALLEL 1-WAY FULL SCAN OVER MUTABLE_LOCAL
SERVER FILTER BY CF1.A = 'a' **