PostgreSQL Page頁結(jié)構(gòu)解析(6)- B-Tree索引存儲結(jié)構(gòu)#2

本文簡單介紹了在PG數(shù)據(jù)庫B-Tree索引的物理存儲結(jié)構(gòu)中Special space部分,包括根節(jié)點、左右兄弟節(jié)點等相關(guān)索引結(jié)構(gòu)信息,以及初步探討了PG在物理存儲上如何保證索引的有序性。

一、測試數(shù)據(jù)

測試數(shù)據(jù)同上一節(jié),索引文件raw data:

[xdb@localhost utf8db]$ hexdump -C base/16477/26637
00000000  01 00 00 00 20 5d 0e db  00 00 00 00 40 00 f0 1f  |.... ]......@...|
00000010  f0 1f 04 20 00 00 00 00  62 31 05 00 03 00 00 00  |... ....b1......|
00000020  01 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 f0 bf  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001ff0  00 00 00 00 00 00 00 00  00 00 00 00 08 00 00 00  |................|
00002000  01 00 00 00 98 5c 0e db  00 00 00 00 28 00 b0 1f  |.....\......(...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  b0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

二、索引結(jié)構(gòu)信息

索引物理存儲結(jié)構(gòu)在上一節(jié)已大體介紹,這里主要介紹索引的結(jié)構(gòu)信息。通過pageinspect插件的bt_page_stats函數(shù)可以獲得索引結(jié)構(gòu)信息,包括root/leaf page,next & previous page:

testdb=# select * from bt_page_stats('pk_t_index',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3
(1 row)

相關(guān)的數(shù)據(jù)結(jié)構(gòu)如下:

//---------------------- src/include/access/nbtree.h
/*
 *  BTPageOpaqueData -- At the end of every page, we store a pointer
 *  to both siblings in the tree.  This is used to do forward/backward
 *  index scans.  The next-page link is also critical for recovery when
 *  a search has navigated to the wrong page due to concurrent page splits
 *  or deletions; see src/backend/access/nbtree/README for more info.
 *
 *  In addition, we store the page's btree level (counting upwards from
 *  zero at a leaf page) as well as some flag bits indicating the page type
 *  and status.  If the page is deleted, we replace the level with the
 *  next-transaction-ID value indicating when it is safe to reclaim the page.
 *
 *  We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *  processing the index, a nonzero value associated with the VACUUM run is
 *  stored into both halves of the split page.  (If VACUUM is not running,
 *  both pages receive zero cycleids.)  This allows VACUUM to detect whether
 *  a page was split since it started, with a small probability of false match
 *  if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *  ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *  (original) page, and set in the right page, but only if the next page
 *  to its right has a different cycleid.
 *
 *  NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *  instead.
 */

typedef struct BTPageOpaqueData
{
    BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
    BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
    union
    {
        uint32      level;      /* tree level --- zero for leaf pages */
        TransactionId xact;     /* next transaction ID, if deleted */
    }           btpo;
    uint16      btpo_flags;     /* flag bits, see below */
    BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef BTPageOpaqueData *BTPageOpaque;

/* Bits defined in btpo_flags */
#define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
#define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
#define BTP_META        (1 << 3)    /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

查詢結(jié)果中,type=l(字母l),表示這個block(page)是leaf block,在這個block中有4個item(live_items=4),沒有廢棄的items(dead_items=0),沒有l(wèi)eft sibling(btpo_prev =0)也沒有right sibling(btpo_next =0),也就是左右兩邊都沒有同級節(jié)點。btpo是一個union,值為0,表示該page為葉子page,btpo_flags值為3即BTP_LEAF | BTP_ROOT,既是葉子page也是根page。
這些信息物理存儲在先前介紹過的PageHeader中的special space中,共占用16個字節(jié):

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# select 8192+8176;
 ?column? 
----------
    16368
(1 row)

[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 16368 -n 16
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

三、索引有序性

我們都知道,B-Tree索引是有序的,下面我們看看在物理存儲結(jié)構(gòu)上如何保證有序性。
插入數(shù)據(jù),id=18

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# -- 插入數(shù)據(jù),id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6498 |        0 |     0 |    44 |  8096 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000  01 00 00 00 98 64 0e db  00 00 00 00 2c 00 a0 1f  |.....d......,...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  a0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

插入數(shù)據(jù),id=17

testdb=# -- 插入數(shù)據(jù),id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 查看索引數(shù)據(jù)頁頭數(shù)據(jù)
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6808 |        0 |     0 |    48 |  8080 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump  -C base/16477/26637 -s 8192
00002000  01 00 00 00 08 68 0e db  00 00 00 00 30 00 90 1f  |.....h......0...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  90 9f 20 00 a0 9f 20 00  |.. ... ... ... .|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003f90  00 00 00 00 06 00 10 00  11 00 00 00 00 00 00 00  |................|
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

索引的數(shù)據(jù)區(qū),并沒有按照大小順序排序,\x11(17)在\x12(18)的后面(從尾部開始往前),但在索引頁的頭部ItemId區(qū)域是有序的,第5個ItemId(\x00209f90)指向的是17,而第6個ItemId(\x00209fa0)指向的是18,通過索引數(shù)據(jù)指針的有序保證索引有序性。

四、小結(jié)

小結(jié)一下,知識要點:
1、Special Space:介紹了索引PageHeaderData的Special Space的存儲結(jié)構(gòu),通過Special Space可以獲得B-Tree的root、左右sibling等信息;
2、有序性:索引Page通過索引項指針保證有序性。

最后:

Don’t Be Afraid To Explore Beneath The Surface

Captain Nemo at the helm

Professor Aronnax risked his life and career to find the elusive Nautilus and to join Captain Nemo on a long series of amazing underwater adventures. We should do the same: Don’t be afraid to dive underwater – inside and underneath the tools, languages and technologies that you use every day. You may know all about how to use Postgres, but do you really know how Postgres itself works internally? Take a look inside; before you know it, you’ll be on an underwater adventure of your own.
Studying the Computer Science at work behind the scenes of our applications isn’t just a matter of having fun, it’s part of being a good developer. As software development tools improve year after year, and as building web sites and mobile apps becomes easier and easier, we shouldn’t loose sight of the Computer Science we depend on. We’re all standing on the shoulders of giants – people like Lehman and Yao, and the open source developers who used their theories to build Postgres. Don’t take the tools you use everyday for granted – take a look inside them! You’ll become a wiser developer and you’ll find insights and knowledge you could never have imagined before.

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi閱讀 7,871評論 0 10
  • 1.前言 對于非常龐大的數(shù)據(jù)量,經(jīng)常需要分片分區(qū)在分片分區(qū)的模式下:每一個數(shù)據(jù)(一行記錄)只屬于一個分區(qū),每個分區(qū)...
    赤子心_d709閱讀 1,072評論 0 4
  • 昨天的天下足球?qū)n}之《德羅巴,藍橋遺夢》,突然想起魔獸已經(jīng)離開藍軍兩年了,兩年了,魔獸始終沒淡出我們的視野,不管是...
    lampard_xu閱讀 1,454評論 7 8
  • 哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈
    召一閱讀 214評論 0 0
  • 選擇排序的思想是有 n 個數(shù),先假定第一個數(shù)最小,然后從剩下 n-1 個數(shù)中選擇一個最小的與第一個數(shù)交換(如果當(dāng)前...
    lesliefang閱讀 192評論 0 0

友情鏈接更多精彩內(nèi)容