Background
Recently a colleague hit a production problem: the pg_rewind command would sporadically fail with this error message:
... // rewind progress output
The program "initdb" was found by "xxx/bin/pg_rewind"
but was not the same version as pg_rewind.
Check your installation.
Failure,exiting
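After a round of environment checks and verification turned up nothing, he asked me for help. I read the code, suspected a bug, and decided the only way forward was to debug it.
A quick search showed that most posts on tracing PG source code use gdb to attach to a running process. That doesn't fit this case: pg_rewind is a command that runs straight through, with no chance to pause midway and attach.
Reading through gdb's help output, I found that the --args option does the job.
The investigation confirmed a bug, though not in community PG; it was in our customized PG build. The cause isn't important here, but the investigation method seems worth writing up.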
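Two ways to trace PG processes
Scenario 1: tracing a SQL backend process
SQL runs after a connection is established, so between connecting and running the SQL you can attach gdb to the backend process, set breakpoints, and debug. For example, open two windows: one to run SQL, one to debug.
- Window 1: connect and get the pid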
[guqi@localhost ~]$ psql -p 51005
psql (xxxx based on PG 11.6)
Type "help" for help.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
52069
(1 row)
- Window 2: attach gdb to the pid
-- Format: gdb <path to the postgres binary> <pid>
[root@localhost ~]# gdb /data/postgres/app/bin/postgres 52069
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
...
0x00007fe6060bef23 in __epoll_wait_nocancel () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64
(gdb)
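Scenario 2: tracing a PG command
The bin directory of a PG installation contains many ready-made binaries. Unlike SQL, these are not run over a connection you open first, so there is no long-lived process to attach to and set breakpoints on in advance, as in scenario 1.
Tracing with strace
strace traces the system calls a command makes while it runs, and the -tt option prints the timestamp of each call. For example: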
[guqi@localhost ~]$ strace -tt createdb -h 127.1 -p 51005
14:59:41.171926 execve("/data/guqi/postgres/app/bin/createdb", ["createdb", "-h", "127.1", "-p", "51005"], 0x7ffc4ad5d878 /* 31 vars */) = 0
14:59:41.172721 brk(NULL) = 0xa55000
14:59:41.172861 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc1f4f1f000
14:59:41.172982 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:59:41.173238 open("/data/guqi/postgres/app/lib/tls/x86_64/libpq.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
...
14:59:41.207330 sendto(3, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5
14:59:41.207446 close(3) = 0
14:59:41.207641 exit_group(1) = ?
14:59:41.208050 +++ exited with 1 +++
[guqi@localhost ~]$
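The full trace above is long. For an error like the one in the Background, where one program is looked up by another, it can help to log only the calls that launch programs. The following is a minimal sketch, assuming the version check is done by executing the other program (not confirmed for the customized build); trace_execs and the log path in the usage note are hypothetical names.

```shell
#!/bin/sh
# Sketch: log only the execve calls of a command and of its child processes.
# trace_execs is a hypothetical helper name, not part of strace or PG.
trace_execs() {
    log="$1"
    shift
    # -f         follow child processes created by the command
    # -tt        prefix each line with the time of the call
    # -e trace=  restrict the output to the listed system calls
    # -o         write the trace to a file instead of stderr
    strace -f -tt -e trace=execve -o "$log" "$@"
}
```

Usage would look like: trace_execs /tmp/pg_rewind.strace pg_rewind -D /data/guqi/data/master --source-server="host=127.0.0.1 port=51101 user=guqi". Each execve line in the log shows the full path of the binary that was actually run.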
Tracing with gdb
gdb can run a binary directly:
gdb [options] [executable-file [core-file or process-id]]
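By default, though, this executable-file cannot be followed by arguments for the program, or gdb reports an error. gdb provides the --args option for passing them. Once at the gdb prompt, start begins running the command, and gdb automatically sets a temporary breakpoint at the entry of main (a nice touch); from there it is the same as scenario 1.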
[guqi@localhost ~]$ gdb --args /data/guqi/postgres/app/bin/pg_rewind -D /data/guqi/data/master --source-server="host=127.0.0.1 port=51101 user=guqi"
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
...
Reading symbols from /data/guqi/postgres/app/bin/pg_rewind...done.
(gdb) b 433
Breakpoint 1 at 0x40294e: file /data/guqi/src/build_alone/../xxx/src/bin/pg_rewind/pg_rewind.c, line 433.
(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0x000000000040294e in main
at /data/guqi/src/build_alone/../xxx/src/bin/pg_rewind/pg_rewind.c:433
(gdb) start
Temporary breakpoint 2 at 0x4021a3: file /data/guqi/src/build_alone/../xxx/src/bin/pg_rewind/pg_rewind.c, line 125.
Starting program: /data/guqi/postgres/app/bin/pg_rewind -D /data/guqi/data/master --source-server=host=127.0.0.1\ port=51101\ user=guqi
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Temporary breakpoint 2, main (argc=4, argv=0x7fffffffe388)
at /data/guqi/src/build_alone/../xxx/src/bin/pg_rewind/pg_rewind.c:125
125 set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_rewind"));
(gdb) c
Continuing.
The two approaches in scenario 2 can be combined.
Common GDB debugging commands
- Set a breakpoint at a given line of a given file
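For a failure that only shows up occasionally, retyping the interactive session each time gets tedious. As a rough sketch, the same --args invocation can be driven non-interactively with gdb's -batch and -ex options; gdb_bt_at is a hypothetical helper name, and the breakpoint location is whatever line you are interested in.

```shell
#!/bin/sh
# Sketch: run a command under gdb without an interactive session and
# print a backtrace once the given breakpoint is hit.
# gdb_bt_at is a hypothetical helper name.
gdb_bt_at() {
    location="$1"
    shift
    # -batch  exit after the commands below have been processed
    # -ex     execute one gdb command; they run in the order given
    # --args  the program path and everything after it belong to the program
    gdb -batch \
        -ex "break $location" \
        -ex "run" \
        -ex "bt" \
        --args "$@"
}
```

Usage would look like: gdb_bt_at pg_rewind.c:433 /data/guqi/postgres/app/bin/pg_rewind -D /data/guqi/data/master --source-server="host=127.0.0.1 port=51101 user=guqi". If the breakpoint is never reached, the program simply runs to completion and the bt command reports that there is no stack.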
b postgres.c:line_num
- List/delete breakpoints
info b
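delete 1-5 (breakpoint numbers)
- Execution
c: continue running until a breakpoint is hit or the program ends
n: step one line, stepping over function calls
s: step one line, stepping into function calls
- Print a program variable
-- This one is quite flexible
-- The expression can use type casts, pointer dereferences, memory addresses, and so on; very powerful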
p var
- Call a function directly
call func_name(pam_1,pam_2)
- Jump to a given line and execute from there, like a goto
-- After the jump, the program resumes running immediately, as if you had typed continue starting at line xx
jump line_num
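Several of the commands above can also be collected into a gdb command file and replayed with gdb's -x option. This is only a sketch: the file name rewind.gdb is hypothetical, and the location pg_rewind.c:433 and the variable argc are borrowed from the pg_rewind session earlier, where that line is inside main.

```shell
#!/bin/sh
# Sketch: write a gdb command file that replays a short debugging session.
# The file name rewind.gdb is a hypothetical choice.
cmdfile="${TMPDIR:-/tmp}/rewind.gdb"
cat > "$cmdfile" <<'EOF'
# set a breakpoint at file:line, then list the breakpoints
break pg_rewind.c:433
info breakpoints
# start the program; it runs until the breakpoint is hit
run
# print a variable of the current frame, step one line, then resume
print argc
next
continue
EOF
echo "wrote $cmdfile"
```

The file can then be passed on the command line, for example: gdb -x "$cmdfile" --args /data/guqi/postgres/app/bin/pg_rewind -D /data/guqi/data/master --source-server="host=127.0.0.1 port=51101 user=guqi".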