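Sharing some notes I compiled earlier on tracking down the cause of thread deadlocks and hangs.
Note: the server environment is Linux and the notes apply to processes written in C/C++; the principles are similar for Java.
Common symptoms caused by hung threads
The program's processing goes from slow to badly timed out, until every request times out, and restarting the program only repeats the cycle. In roughly 90% of such cases a thread has hung.
Common causes of thread hangs or deadlocks
An infinite loop inside a locked section, so the lock is never released and other threads wait forever;
Taking a lock while already holding it, i.e. double locking;
In multithreaded code, a shared resource that is not protected by a thread lock, so several threads contend for it and hang.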
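The double-locking case in the list above can be made visible during debugging. The following is a minimal sketch, not taken from the original program: it assumes a POSIX system and uses an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), so that locking the same mutex twice from one thread returns an error instead of blocking. The function name second_lock_result is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Locks an error-checking mutex twice from the same thread and returns the
// result of the second pthread_mutex_lock() call.
// With a default mutex the second call would typically block forever;
// with PTHREAD_MUTEX_ERRORCHECK it fails with EDEADLK instead.
int second_lock_result() {
    pthread_mutexattr_t attr;
    pthread_mutex_t mtx;

    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&mtx, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mtx);          // first lock succeeds
    assert(rc == 0);
    (void)rc;                               // silence warnings when NDEBUG is set

    int second = pthread_mutex_lock(&mtx);  // the "lock inside a lock"

    pthread_mutex_unlock(&mtx);             // release the single lock we hold
    pthread_mutex_destroy(&mtx);
    return second;
}
```

Calling second_lock_result() returns EDEADLK rather than hanging. Error-checking mutexes cost a little more per lock, so they are probably best kept to debug builds.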
Determining whether a process is hung
Use the pstree command to see how many threads a process has: pstree -p | grep [process name].
For example:
yuejctest:[/yuejc]pstree -p |grep Ywdeal
|-Ywdeal(8969)-+-{Ywdeal}(9013) [thread 1]
|              `-{Ywdeal}(9016) [thread 2]
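If you run this command again and the thread count keeps growing (when the program enforces a limit, the count stops growing at the limit but never drops), threads are not being released and may be hung.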
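The same thread count can also be read from inside a program, which helps when you want to log it periodically. This is a rough sketch that assumes a Linux /proc filesystem, where each thread of a process appears as a numeric entry under /proc/<pid>/task. The function name count_threads is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <dirent.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Returns the number of threads of `pid` by counting the numeric entries in
// /proc/<pid>/task (Linux only), or -1 if the directory cannot be opened.
int count_threads(pid_t pid) {
    assert(pid > 0);  // precondition: a real process ID
    std::string path = "/proc/" + std::to_string(pid) + "/task";
    DIR* dir = opendir(path.c_str());
    if (dir == nullptr) return -1;

    int count = 0;
    while (dirent* entry = readdir(dir)) {
        // Skip "." and ".."; thread entries are named by their numeric TID.
        if (std::isdigit(static_cast<unsigned char>(entry->d_name[0]))) ++count;
    }
    closedir(dir);
    return count;
}
```

Sampling count_threads(pid) every few seconds and seeing a value that only rises, or stays pinned at the program's limit, matches the symptom described above.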
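What is pstack
This command prints a stack trace for every thread of a process; use pstack to find out where a process is hung. Its only argument is the PID of the process to inspect.
Run pstack pid and you get a lot of information:
For example: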
yuejctest:[/yuejc]pstack 8969
Thread 3 (Thread 0x42e88940 (LWP 9013)):
#0 0x00000039e329a0b1 in nanosleep () from /lib64/libc.so.6
#1 0x00000039e3299f99 in sleep () from /lib64/libc.so.6
#2 0x0000000000406dc9 in pthread_mdb_keepconnect ()
#3 0x00000039e3e064a7 in start_thread () from /lib64/libpthread.so.0
#4 0x00000039e32d3c2d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x44e89940 (LWP 9016)):
#0 0x00000039e329a0b1 in nanosleep () from /lib64/libc.so.6
#1 0x00000039e3299f99 in sleep () from /lib64/libc.so.6
#2 0x0000000000406b49 in pthread_db_keepconnect ()
#3 0x00000039e3e064a7 in start_thread () from /lib64/libpthread.so.0
#4 0x00000039e32d3c2d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b2518adcdd0 (LWP 8969)):
#0 0x00000039e32d4f52 in msgrcv () from /lib64/libc.so.6
#1 0x000000000045eab4 in msgRcv ()
#2 0x00000000004056d0 in main ()
Here is a problem from real life, showing how to find the cause of a hang in pstack output:
yuejcapp2:[/yuejc]pstack 23677
Thread 12 (Thread 0x43d8b940 (LWP 23686)):
#0 0x00000032ec00d91b in read () from /lib64/libpthread.so.0
#1 0x00000000004a3735 in _NetReadSocket ()
#2 0x00000000004a3d1e in _dci_recv_msg ()
#3 0x0000000000493ca9 in _dci_query_buf ()
#4 0x0000000000494365 in _dci_send_query ()
#5 0x0000000000494daf in si_dci_query_p ()
#6 0x000000000049388a in dci_query_p ()
#7 0x000000000041df0d in mdb_stream::excuteSql() ()
#8 0x000000000041f283 in mdb_stream::open(mdb_connect&, char const*, int, int) ()
#9 0x0000000000406bfa in pthread_mdb_keepconnect(void*) ()
#10 0x00000032ec00673d in start_thread () from /lib64/libpthread.so.0
#11 0x00000032eb4d3d1d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x45d8c940 (LWP 23687)):
#0 0x00000032eb49a1a1 in nanosleep () from /lib64/libc.so.6
#1 0x00000032eb49a089 in sleep () from /lib64/libc.so.6
#2 0x0000000000406b3d in pthread_db_keepconnect(void*) ()
#3 0x00000032ec00673d in start_thread () from /lib64/libpthread.so.0
#4 0x00000032eb4d3d1d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x47d8d940 (LWP 19145)):
#0 0x00000032ec00d91b in read () from /lib64/libpthread.so.0
#1 0x00000000004a3735 in _NetReadSocket ()
#2 0x00000000004a3d1e in _dci_recv_msg ()
#3 0x0000000000493ca9 in _dci_query_buf ()
#4 0x0000000000494365 in _dci_send_query ()
#5 0x0000000000494daf in si_dci_query_p ()
#6 0x000000000049388a in dci_query_p ()
#7 0x000000000041df0d in mdb_stream::excuteSql() ()
#8 0x000000000048ee6b in mdb_stream::operator<<(int const&) ()
#9 0x000000000046c345 in mdb_select_userinfo_W ()
#10 0x0000000000426d2c in GetUserInfo(char*, _USER_INFO*) ()
#11 0x0000000000443005 in UserAbilityDeal(int&) ()
#12 0x000000000044f48f in ServiceOPenNewAdd(_SERVICE_OPEN_REQ&) ()
#13 0x0000000000408160 in pthread_service_open(void*) ()
#14 0x00000032ec00673d in start_thread () from /lib64/libpthread.so.0
#15 0x00000032eb4d3d1d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x49d8e940 (LWP 19375)):
#0 0x00000032ec00d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032ec008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032ec008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004304e9 in GetOrderInfo(char*, int, _ORDER_QUERY_INFO*) ()
#4 0x0000000000441007 in GetServiceOpenMoreInfo(_SERVICE_OPEN_REQ&) ()
#5 0x0000000000407ce6 in pthread_service_open(void*) ()
#6 0x00000032ec00673d in start_thread () from /lib64/libpthread.so.0
#7 0x00000032eb4d3d1d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x4dd90940 (LWP 19717)):
#0 0x00000032ec00d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000032ec008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x00000032ec008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004463ed in ServiceOpenChangePayType(_SERVICE_OPEN_REQ&) ()
#4 0x0000000000408444 in pthread_service_open(void*) ()
#5 0x00000032ec00673d in start_thread () from /lib64/libpthread.so.0
#6 0x00000032eb4d3d1d in clone () from /lib64/libc.so.6
yuejcapp2:[/yuejc/log]
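Analysis of the thread information above. Frame #0 is the innermost function, i.e. the one the thread is currently executing:
<1>. Thread 12 is in read(), reading a resource,
Thread 11 is in nanosleep(), sleeping,
Thread 10 is in read(), reading a resource,
Thread 9 is in __lll_lock_wait(), waiting to acquire a lock,
Thread 8 is in __lll_lock_wait(), waiting to acquire a lock,
<2>. Going through these results: Thread 11 is a keepalive (daemon) thread that sleeps on purpose, and its locking is fine.
Thread 9 and Thread 8 are waiting to lock a resource, which in itself is normal waiting. So what keeps these two threads waiting forever?
Looking again at Thread 12 and Thread 10: both are in read() at the same time, which points to the two threads contending for the same resource and hanging.
<3>. Based on this analysis, inspect the functions in the Thread 12 and Thread 10 traces: mdb_stream::open(), called from pthread_mdb_keepconnect(), and mdb_select_userinfo_W().
It turned out that mdb_stream::open(), shown in the Thread 12 trace, was not protected by a thread lock in the code. After adding the lock, the program ran normally and the hang was resolved.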
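The fix described above, putting a thread lock around access to a shared connection, can be sketched roughly as follows. This is only an illustration under assumptions: SharedConnection, query() and run_workers() are hypothetical names, not the original mdb_stream or mdb_connect code, and a counter stands in for the real database work.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared database connection.
struct SharedConnection {
    std::mutex mtx;    // guards every use of the connection
    long queries = 0;  // stands in for connection state touched by each query

    // Serialises access: only one thread at a time may use the connection.
    void query() {
        std::lock_guard<std::mutex> guard(mtx);  // released automatically on return
        ++queries;
    }
};

// Runs `threads` workers that each issue `per_thread` queries on one shared
// connection, and returns the total recorded by the connection.
long run_workers(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    SharedConnection conn;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&conn, per_thread] {
            for (int j = 0; j < per_thread; ++j) conn.query();
        });
    }
    for (auto& t : pool) t.join();
    return conn.queries;
}
```

Because std::lock_guard releases the mutex on every exit path, including exceptions, a lock taken this way cannot be left held by an early return. That guards against one source of the endless __lll_lock_wait() seen in Thread 9 and Thread 8.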