Today I added a concurrent test case to verify that the newly added Copy On Write (COW) logic behaves correctly under concurrency. After running go test -v, the test crashed outright and the terminal showed the following error:
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x9e8b6c]
The key piece of information here is segmentation violation, better known as a segmentation fault.
So what is a segmentation violation, and why does one occur? After some searching I found an article whose explanation I consider the most fitting. Here is a partial quote:
A "segmentation violation" signal is sent to a process of which the memory management unit detected an attempt to use a memory address that does not belong to it.
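Modern hardware includes a memory management unit (MMU) that guards memory accesses so that processes cannot modify each other's memory. When the MMU detects a process trying to access memory that does not belong to it (an invalid memory reference), it raises a fault, the operating system delivers a SIGSEGV signal to that process, and the process fails with a segmentation violation.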
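For comparison, here is a minimal sketch of how the same kind of fault looks in pure Go code. When the invalid access happens in Go code, the runtime converts it into an ordinary panic that can be recovered; the fatal error shown above is what the runtime prints when the signal arrives somewhere it cannot panic, such as during a cgo call. The names node and readValue are made up for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// node is a hypothetical type used only for this illustration.
type node struct{ value int }

// readValue dereferences p. If p is nil, the invalid access is reported by
// the Go runtime as a panic, which is recovered here and returned as an error.
func readValue(p *node) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			if re, ok := r.(runtime.Error); ok {
				err = re
				return
			}
			panic(r) // not a runtime error: re-raise it
		}
	}()
	return p.value, nil
}

func main() {
	_, err := readValue(nil)
	fmt.Println("recovered:", err)
}
```

A fault inside C or C++ code called through cgo cannot be handled this way, which is why the process dies instead of reporting a recoverable panic.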
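At this point, readers who know how goroutines are implemented may ask: why would a test case written in Go hit this error? Go is a garbage-collected language whose runtime manages memory allocation and reclamation. Even with concurrent calls, as long as pointer accesses are safe, the worst you would expect is a race condition, not a memory access error.
Yes, normally that is true. Before really analyzing the problem, though, let me lay out the background so you have an intuitive picture.
Background
Below is the earlier, non-concurrent test case (this one works correctly):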
func TestDispatch_V710(t *testing.T) {
	gen := datacenter.Gen{
		RealOrderCount:         60,
		RelayOrderCount:        20,
		ShortAppointOrderCount: 15,
		LongAppointOrderCount:  5,
	}
	// Generate the data
	gen.Do(nil, nil)
	// Initialize the strategy engine
	if err := strategy.Init("../../conf/strategy_engine_conf.yaml"); err != nil {
		t.Error(err)
		os.Exit(1)
	}
	// Simulate the strategy computation
	dataCenter := gen.GetDataCenter()
	utils.NewSimulationStrategy(dataCenter, nil, strategy.GetStrategyTree()).Do()
	// Let the algorithm engine compute the optimal matching
	dispatch := optimal.Dispatch{}
	dispatch.OptimalDispatch(dataCenter)
}
Below is the test case after adding concurrency:
func TestDispatch_OnApolloChanged_V710(t *testing.T) {
	// Initialize the strategy engine
	if err := strategy.Init("../../conf/strategy_engine_conf.yaml"); err != nil {
		t.Error(err)
		os.Exit(1)
	}
	manager, err := pkgUtils.NewApolloManager(&pkgUtils.ApolloManagerConfig{
		ConfigServerURL: "http://192.168.205.10:8080",
		AppID:           "strategydispatch",
		Cluster:         "default",
		Namespaces:      []string{strategy.ApolloNamespace},
		BackupFile:      "",
		IP:              "",
		AccessKey:       "",
	})
	if err != nil {
		t.Error(err)
		os.Exit(1)
	}
	// Register the callback for strategy engine configuration events
	manager.RegisterHandler(strategy.ApolloNamespace, strategy.ApolloNotifyHandler, pkgUtils.ApolloErrHandler)
	go manager.Run()
	wg := sync.WaitGroup{}
	// Test concurrent execution
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// 20 * 50s, run the computation and test Apollo changes
			for i := 0; i < 3; i++ {
				gen := datacenter.Gen{
					RealOrderCount:         60,
					RelayOrderCount:        20,
					ShortAppointOrderCount: 15,
					LongAppointOrderCount:  5,
				}
				// Generate the data
				gen.Do(nil, nil)
				// Simulate the strategy computation
				dataCenter := gen.GetDataCenter()
				utils.NewSimulationStrategy(dataCenter, nil, strategy.GetStrategyTree()).Do()
				// Let the algorithm engine compute the optimal matching
				dispatch := optimal.Dispatch{}
				dispatch.OptimalDispatch(dataCenter)
				fmt.Println(dataCenter.DispatchResult)
				time.Sleep(time.Second * 5)
			}
		}()
	}
	wg.Wait()
}
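Look closely at the code and you will see that the variables are initialized inside the goroutine, so they are all local to that goroutine. The only shared value is strategy.GetStrategyTree(), and it is shared on purpose in order to test the correctness of COW.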
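To make the COW idea concrete, here is a minimal sketch of one common way to implement copy-on-write in Go. It is not the strategy package's actual code: Config, Store, and their methods are hypothetical stand-ins for the strategy tree and its accessor. Readers always see an immutable snapshot, while a writer copies the current snapshot, modifies the copy, and publishes it atomically.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Config is a hypothetical stand-in for the shared strategy tree.
type Config struct {
	Rules map[string]int
}

// Store publishes immutable Config snapshots.
type Store struct {
	mu  sync.Mutex   // serializes writers only; readers never take it
	cur atomic.Value // always holds a *Config
}

// NewStore returns a Store holding an empty snapshot.
func NewStore() *Store {
	s := &Store{}
	s.cur.Store(&Config{Rules: map[string]int{}})
	return s
}

// Get returns the current snapshot. Callers must treat it as read-only.
func (s *Store) Get() *Config {
	return s.cur.Load().(*Config)
}

// Update copies the current snapshot, applies the change to the copy,
// and then publishes the copy. Snapshots handed out earlier stay untouched.
func (s *Store) Update(key string, val int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.Get()
	next := &Config{Rules: make(map[string]int, len(old.Rules)+1)}
	for k, v := range old.Rules {
		next.Rules[k] = v
	}
	next.Rules[key] = val
	s.cur.Store(next)
}

func main() {
	s := NewStore()
	before := s.Get()
	s.Update("timeout", 30)
	// The old snapshot still has 0 rules; the new one has 1.
	fmt.Println(len(before.Rules), len(s.Get().Rules))
}
```

With this shape, a goroutine that fetched the tree before a configuration change keeps working on a consistent snapshot, which is the property the concurrent test is meant to exercise.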
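This part of the code also involves cgo, which is the only blind spot, because cgo is a black box to its users. So the only place that could produce a segmentation violation should be the cgo part.
The cgo code
The line dispatch.OptimalDispatch(dataCenter) contains a cgo call. The OptimalDispatch function looks like this: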
func (d *Dispatch) OptimalDispatch(dataCenter *common.DataCenter) {
	// ... some code omitted
	degrade := km.Entrance(orderCarPair, dataCenter)
	if degrade {
		subStart := time.Now()
		km.Greedy(orderCarPair, dataCenter)
		// ... some code omitted
	}
	// ... some code omitted
}
km.Entrance(orderCarPair, dataCenter) is where the C++ code actually gets called:
func Entrance(Graphy map[string][]common.OrderWithCarInfo, dataCenter *common.DataCenter) (degrade bool) {
	// ... some code omitted
	// This is where the C++ code is called
	result := C.entrance((*C.double)(unsafe.Pointer(&cArray[0])), C.long(max_v_num))
	// ... some code omitted
}
The C++ interface is declared as follows:
long* entrance(double * input_weight, long input_max_v_num);
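Go passes a slice to C++, and C++ returns a long array to Go.
Locating the problem
At the beginning of the article I did not copy the whole error message, because the computation part uses a goroutine pool. Now let's look at the stack trace portion of the error output: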
=== RUN TestDispatch_OnApolloChanged_V710
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1d8 pc=0xa1328c]
runtime stack:
runtime.throw(0xb9420c, 0x2a)
/usr/local/lib/go/src/runtime/panic.go:1114 +0x72
runtime.sigpanic()
/usr/local/lib/go/src/runtime/signal_unix.go:679 +0x46a
goroutine 90 [syscall]:
runtime.cgocall(0xa12360, 0xc00350ebf8, 0xf7a7d668e8941901)
/usr/local/lib/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00350ebc8 sp=0xc00350eb90 pc=0x4059eb
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0)
_cgo_gotypes.go:48 +0x4e fp=0xc00350ebf8 sp=0xc00350ebc8 pc=0x89e7be
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km.Entrance(0xc0000a2540, 0xc0035fe500, 0x1313ae0)
/mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/unit/km/km.go:108 +0xaad fp=0xc00350f4a8 sp=0xc00350ebf8 pc=0x89f2cd
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/optimal.(*Dispatch).OptimalDispatch(0xc00350ff98, 0xc0035fe500)
/mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/optimal/dispatch.go:32 +0x49b fp=0xc00350ff38 sp=0xc00350f4a8 pc=0x8a07eb
fabu.ai/IntelligentTransport/strategy_dispatch/tests/dispatch.TestDispatch_OnApolloChanged_V710.func1(0xc00358f530)
/mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/dispatch_v710_test.go:67 +0x93 fp=0xc00350ffd8 sp=0xc00350ff38 pc=0xa11f73
runtime.goexit()
/usr/local/lib/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00350ffe0 sp=0xc00350ffd8 pc=0x468e31
created by fabu.ai/IntelligentTransport/strategy_dispatch/tests/dispatch.TestDispatch_OnApolloChanged_V710
/mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/dispatch_v710_test.go:47 +0x2f2
goroutine 1 [chan receive]:
testing.(*T).Run(0xc0035b7c20, 0xb8c69c, 0x21, 0xba8468, 0x48c901)
/usr/local/lib/go/src/testing/testing.go:1044 +0x37e
testing.runTests.func1(0xc0035b7b00)
/usr/local/lib/go/src/testing/testing.go:1285 +0x78
testing.tRunner(0xc0035b7b00, 0xc0035f5e10)
/usr/local/lib/go/src/testing/testing.go:992 +0xdc
testing.runTests(0xc003581960, 0x12c0ec0, 0x4, 0x4, 0x0)
/usr/local/lib/go/src/testing/testing.go:1283 +0x2a7
testing.(*M).Run(0xc0035ae200, 0x0)
/usr/local/lib/go/src/testing/testing.go:1200 +0x15f
main.main()
_testmain.go:54 +0x135
The stack trace in the error output shows frames related to the cgo call. km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0) is intermediate code generated by cgo at build time:
runtime.cgocall(0xa12360, 0xc00350ebf8, 0xf7a7d668e8941901)
/usr/local/lib/go/src/runtime/cgocall.go:133 +0x5b fp=0xc00350ebc8 sp=0xc00350eb90 pc=0x4059eb
fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance(0xc0046c2000, 0x64, 0x0)
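So we can be confident that the cgo part of the code is causing the problem.
Without concurrency the cgo call works, which suggests the cgo code itself is fine.
Under concurrency the cgo call breaks, which may be related to some mechanism in the Go runtime. We therefore need to dig into the runtime and find out which step of a cgo call triggers the segmentation violation.
coredump
Anyone familiar with C/C++ knows that on Linux a program that hits a memory-related fault can produce a core dump file. Following that line of thought: can Go produce a core file too? Yes, it can: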
$ ulimit -c
0
$ ulimit -c unlimited
$ ulimit -c
unlimited
The default core file size limit is 0. I set it to unlimited, but you can also choose a reasonable size instead.
Then build and run the program so that it produces a core dump file:
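$ GOTRACEBACK=crash ./strategy_dispatch_test
Setting the GOTRACEBACK environment variable to crash makes the runtime abort the process on a fatal error (on Unix systems it raises SIGABRT), which allows a core dump to be written.
However, since my code is a test case, setting GOTRACEBACK=crash and then running go test had no effect, so I had to turn the test code into a main program that can be compiled on its own.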
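As a possible alternative to relying on the environment variable, the same setting can be requested from inside the program. The sketch below assumes Linux; coreLimit and enableCrashDumps are names invented for this example. It reads the process's soft core-size limit and asks the runtime for crash-style tracebacks through runtime/debug.SetTraceback, which accepts the same values as GOTRACEBACK.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"syscall"
)

// coreLimit returns the soft RLIMIT_CORE value of the current process.
// A value of 0 means the kernel will not write a core file.
func coreLimit() (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_CORE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

// enableCrashDumps asks the runtime to crash in an OS-specific way
// (SIGABRT on Unix) after printing the traceback of a fatal error.
func enableCrashDumps() {
	debug.SetTraceback("crash")
}

func main() {
	enableCrashDumps()
	limit, err := coreLimit()
	if err != nil {
		fmt.Println("cannot read RLIMIT_CORE:", err)
		return
	}
	fmt.Println("core size limit:", limit)
}
```

In a test package, enableCrashDumps could be called from TestMain. Another route that may be worth trying is building the test binary with go test -c and running that binary directly with GOTRACEBACK=crash.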
Analyzing the core dump
Inspecting a core dump does not crash any process. Once we have the core file, we can load it for further analysis.
I load the core dump with dlv:
dlv core ./strategy_dispatch_test core
Then type stack to print the stack trace:
Type 'help' for list of commands.
(dlv) stack
0 0x0000000000466931 in runtime.raise
at /usr/local/lib/go/src/runtime/sys_linux_amd64.s:165
1 0x00000000004644a2 in runtime.asmcgocall
at /usr/local/lib/go/src/runtime/asm_amd64.s:640
2 0x000000000040593f in runtime.cgocall
at /usr/local/lib/go/src/runtime/cgocall.go:143
3 0x000000000087acae in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance
at _cgo_gotypes.go:48
4 0x000000000087b7bd in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km.Entrance
at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/unit/km/km.go:108
5 0x000000000087ccdb in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/optimal.(*Dispatch).OptimalDispatch
at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/pkg/service/algorithm/optimal/dispatch.go:32
6 0x00000000009e7903 in main.main.func1
at /mnt/d/Workspace/Onedrive/wordspace/code/t3go.cn/strategy_dispatch/tests/dispatch/tmp/tmp.go:68
7 0x0000000000464d91 in runtime.goexit
at /usr/local/lib/go/src/runtime/asm_amd64.s:1373
The stack trace shows that frame 3
3 0x000000000087acae in fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance
at _cgo_gotypes.go:48
is the call into the cgo-generated intermediate code for the C++ function:
//go:cgo_unsafe_args
func _Cfunc_entrance(p0 *_Ctype_double, p1 _Ctype_long) (r1 *_Ctype_long) {
	_cgo_runtime_cgocall(_cgo_743da1d4b169_Cfunc_entrance, uintptr(unsafe.Pointer(&p0)))
	if _Cgo_always_false {
		_Cgo_use(p0)
		_Cgo_use(p1)
	}
	return
}
This makes it even more certain that cgo is at fault. Following the stack trace further, frame 2 points to cgocall.go:143, where the program enters the runtime:
2 0x000000000040593f in runtime.cgocall
at /usr/local/lib/go/src/runtime/cgocall.go:143
Here is the corresponding runtime code:
// Call from Go to C.
//
// This must be nosplit because it's used for syscalls on some
// platforms. Syscalls may have untyped arguments on the stack, so
// it's not safe to grow or scan the stack.
//
//go:nosplit
func cgocall(fn, arg unsafe.Pointer) int32 {
	// ... some error handling omitted
	mp := getg().m
	mp.ncgocall++
	mp.ncgo++
	// Reset traceback.
	mp.cgoCallers[0] = 0
	// Announce we are entering a system call
	// so that the scheduler knows to create another
	// M to run goroutines while we are in the
	// foreign code.
	//
	// The call to asmcgocall is guaranteed not to
	// grow the stack and does not allocate memory,
	// so it is safe to call while "in a system call", outside
	// the $GOMAXPROCS accounting.
	//
	// fn may call back into Go code, in which case we'll exit the
	// "system call", run the Go code (which may grow the stack),
	// and then re-enter the "system call" reusing the PC and SP
	// saved by entersyscall here.
	entersyscall()
	// Tell asynchronous preemption that we're entering external
	// code. We do this after entersyscall because this may block
	// and cause an async preemption to fail, but at this point a
	// sync preemption will succeed (though this is not a matter
	// of correctness).
	osPreemptExtEnter(mp)
	mp.incgo = true
	// This is line 143
	errno := asmcgocall(fn, arg)
	// ... some code omitted
	return errno
}
The program stopped in cgocall at errno := asmcgocall(fn, arg). That function is implemented in assembly, and the stack trace also points to the corresponding location:
1 0x00000000004644a2 in runtime.asmcgocall
at /usr/local/lib/go/src/runtime/asm_amd64.s:640
Looking at asm_amd64.s, line 640 falls inside this piece of assembly:
// func asmcgocall(fn, arg unsafe.Pointer) int32
// Call fn(arg) on the scheduler stack,
// aligned appropriately for the gcc ABI.
// See cgocall.go for more details.
TEXT ·asmcgocall(SB),NOSPLIT,$0-20
MOVQ fn+0(FP), AX
MOVQ arg+8(FP), BX
MOVQ SP, DX
// Figure out if we need to switch to m->g0 stack.
// We get called to create new OS threads too, and those
// come in on the m->g0 stack already.
get_tls(CX)
MOVQ g(CX), R8
CMPQ R8, $0
JEQ nosave
MOVQ g_m(R8), R8
MOVQ m_g0(R8), SI
MOVQ g(CX), DI
CMPQ SI, DI
JEQ nosave
MOVQ m_gsignal(R8), SI
CMPQ SI, DI
JEQ nosave
// Switch to system stack.
MOVQ m_g0(R8), SI
CALL gosave<>(SB) // the program crashes here
MOVQ SI, g(CX)
MOVQ (g_sched+gobuf_sp)(SI), SP
// Now on a scheduling stack (a pthread-created stack).
// Make sure we have enough room for 4 stack-backed fast-call
// registers as per windows amd64 calling convention.
SUBQ $64, SP
ANDQ $~15, SP // alignment for gcc ABI
MOVQ DI, 48(SP) // save g
MOVQ (g_stack+stack_hi)(DI), DI
SUBQ DX, DI
MOVQ DI, 40(SP) // save depth in stack (can't just save SP, as stack might be copied during a callback)
MOVQ BX, DI // DI = first argument in AMD64 ABI
MOVQ BX, CX // CX = first argument in Win64
CALL AX
// Restore registers, g, stack pointer.
get_tls(CX)
MOVQ 48(SP), DI
MOVQ (g_stack+stack_hi)(DI), SI
SUBQ 40(SP), SI
MOVQ DI, g(CX)
MOVQ SI, SP
MOVL AX, ret+16(FP)
RET
nosave:
// Running on a system stack, perhaps even without a g.
// Having no g can happen during thread creation or thread teardown
// (see needm/dropm on Solaris, for example).
// This code is like the above sequence but without saving/restoring g
// and without worrying about the stack moving out from under us
// (because we're on a system stack, not a goroutine stack).
// The above code could be used directly if already on a system stack,
// but then the only path through this code would be a rare case on Solaris.
// Using this code for all "already on system stack" calls exercises it more,
// which should help keep it correct.
SUBQ $64, SP
ANDQ $~15, SP
MOVQ $0, 48(SP) // where above code stores g, in case someone looks during debugging
MOVQ DX, 40(SP) // save original stack pointer
MOVQ BX, DI // DI = first argument in AMD64 ABI
MOVQ BX, CX // CX = first argument in Win64
CALL AX
MOVQ 40(SP), SI // restore original stack pointer
MOVQ SI, SP
MOVL AX, ret+16(FP)
RET
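Line 640 corresponds to CALL gosave<>(SB). Before analyzing that line, though, let's first see what asmcgocall as a whole does (this requires some knowledge of assembly in general and Plan 9 assembly in particular).
Analyzing the asmcgocall assembly
The asmcgocall function as a whole performs the cgo call. So what does it do before line 640 (gosave)?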
TEXT ·asmcgocall(SB),NOSPLIT,$0-20
MOVQ fn+0(FP), AX
MOVQ arg+8(FP), BX
MOVQ SP, DX
get_tls(CX) // load the TLS base, which is used to fetch g
MOVQ g(CX), R8 // R8 = g
CMPQ R8, $0 // if R8 == 0, goto nosave
JEQ nosave
MOVQ g_m(R8), R8 // R8 = g.m
MOVQ m_g0(R8), SI // SI = g.m.g0
MOVQ g(CX), DI // DI = g
CMPQ SI, DI // if g == g.m.g0, goto nosave
JEQ nosave
MOVQ m_gsignal(R8), SI // SI = g.m.gsignal
CMPQ SI, DI // if g.m.gsignal == g, goto nosave
JEQ nosave
The assembly above contains three CMPQ/JEQ pairs, all of which jump to nosave. So what does nosave do when a CMPQ comparison holds and the JEQ is taken?
nosave:
// Running on a system stack, perhaps even without a g.
// Having no g can happen during thread creation or thread teardown
// (see needm/dropm on Solaris, for example).
// This code is like the above sequence but without saving/restoring g
// and without worrying about the stack moving out from under us
// (because we're on a system stack, not a goroutine stack).
// The above code could be used directly if already on a system stack,
// but then the only path through this code would be a rare case on Solaris.
// Using this code for all "already on system stack" calls exercises it more,
// which should help keep it correct.
SUBQ $64, SP
ANDQ $~15, SP
MOVQ $0, 48(SP) // where above code stores g, in case someone looks during debugging
MOVQ DX, 40(SP) // save original stack pointer
MOVQ BX, DI // DI = first argument in AMD64 ABI
MOVQ BX, CX // CX = first argument in Win64
CALL AX
MOVQ 40(SP), SI // restore original stack pointer
MOVQ SI, SP
MOVL AX, ret+16(FP)
RET
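The nosave part is a bit involved. Put simply, it means the current cgo call can run directly on the system stack rather than on a goroutine stack.
With that, the earlier code becomes clear:
- CMPQ R8, $0: there is no g currently running, so there is no goroutine stack either, and the call can run directly on the system stack.
- CMPQ SI, DI (with SI = g0): g0 uses the system stack, so if g == g0 we are already running on g0 and the call can proceed on the system stack.
- CMPQ SI, DI (with SI = gsignal): I have not fully worked out the details of this case. gsignal is the M's signal-handling g, and when g == gsignal the condition for running directly on a system stack is likewise met.
So what happens when the call cannot run directly on the system stack? The second half of asmcgocall gives the answer: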
TEXT ·asmcgocall(SB),NOSPLIT,$0-20
// first half omitted
// Switch to system stack.
MOVQ m_g0(R8), SI // SI = g.m.g0
CALL gosave<>(SB) // the program crashes here
MOVQ SI, g(CX) // g = g.m.g0
MOVQ (g_sched+gobuf_sp)(SI), SP // SP = g0.sched.sp, i.e. switch to the g0 (system) stack
// Now on a scheduling stack (a pthread-created stack).
// Make sure we have enough room for 4 stack-backed fast-call
// registers as per windows amd64 calling convention.
SUBQ $64, SP
ANDQ $~15, SP // alignment for gcc ABI
MOVQ DI, 48(SP) // save g
MOVQ (g_stack+stack_hi)(DI), DI
SUBQ DX, DI
MOVQ DI, 40(SP) // save depth in stack (can't just save SP, as stack might be copied during a callback)
MOVQ BX, DI // DI = first argument in AMD64 ABI
MOVQ BX, CX // CX = first argument in Win64
CALL AX
// Restore registers, g, stack pointer.
get_tls(CX)
MOVQ 48(SP), DI
MOVQ (g_stack+stack_hi)(DI), SI
SUBQ 40(SP), SI
MOVQ DI, g(CX)
MOVQ SI, SP
MOVL AX, ret+16(FP)
When those conditions are not met:
- A stack switch takes place. First gosave saves the state of the goroutine stack. Here is what gosave does:
// func gosave(buf *gobuf)
// save state in Gobuf; setjmp
TEXT runtime·gosave(SB), NOSPLIT, $0-8
MOVQ buf+0(FP), AX // AX = gobuf
LEAQ buf+0(FP), BX // take the address of the argument, i.e. the caller's SP
MOVQ BX, gobuf_sp(AX) // save the caller's SP, the stack top to resume with
MOVQ 0(SP), BX
MOVQ BX, gobuf_pc(AX) // save the caller's PC, the instruction address to resume at
MOVQ $0, gobuf_ret(AX)
MOVQ BP, gobuf_bp(AX)
// Assert ctxt is zero. See func save.
MOVQ gobuf_ctxt(AX), BX
TESTQ BX, BX
JZ 2(PC)
CALL runtime·badctxt(SB)
get_tls(CX) // get the TLS
MOVQ g(CX), BX // BX = address of g
MOVQ BX, gobuf_g(AX) // save the address of g
RET
gosave stores the scheduling information (sp and pc) in a gobuf (the sched field of a g) so that execution can resume on the goroutine stack later. After that, the switch goroutine stack -> system stack is performed.
- The cgo call is executed (after gosave).
Hypotheses about the cause
Goroutine switching
The analysis of asmcgocall leads to one conclusion: the goroutine stack gets switched.
The official Go documentation also says:
calling a C function does not block other goroutines
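Readers familiar with the Go runtime may know that the goroutine implementation relies on TLS. If goroutines switch within a single thread, they stay inside that thread's TLS no matter how they switch. If switching happens across multiple threads, however, this problem becomes very likely.
Suppose there are two goroutines, [G1, G2]:
- G1 is scheduled onto Thread1. G1 creates the variable cArray on its goroutine stack and passes it as an argument to the C call.
- G2 is scheduled onto Thread2. If cArray were a global variable and no cgo call were involved, the program would merely have a race condition. With a cgo call involved, Thread2 would access Thread1's stack space, which produces a segmentation violation.
But our cArray is created locally inside the goroutine, so this possibility can be ruled out.
Out-of-bounds TLS access
There is another possibility: G1 and G2 are scheduled onto threads Thread1 and Thread2. G1 first creates the address needed to run the cgo call, and G2 also uses that address when it executes its cgo call, but the address belongs to Thread1 while G2 is running on Thread2.
In other words, the crash happens after gosave has performed the stack switch, during the cgo call itself.
Verifying with the debugger
To verify the hypothesis, I continued debugging with dlv. Typing grs lists all goroutines, and it shows that Goroutine 71 and Goroutine 72 were indeed executing km._Cfunc_entrance on different threads.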
(dlv) grs
* Goroutine 71 - User: _cgo_gotypes.go:48 fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance (0x87af3e) (thread 11217)
Goroutine 72 - User: _cgo_gotypes.go:48 fabu.ai/IntelligentTransport/strategy_dispatch/pkg/service/algorithm/unit/km._Cfunc_entrance (0x87af3e) (thread 11214)
[324 goroutines]
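If so, would the problem go away when the CPU has only one core, that is, when only one thread is available?
Limiting the CPU cores available to the Go runtime with the following code had no effect; the reported number of CPU cores was still more than one (runtime.NumCPU reports the logical CPUs usable by the process and is not changed by GOMAXPROCS).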
println(runtime.NumCPU())
runtime.GOMAXPROCS(1)
println(runtime.NumCPU())
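A small sketch of what these calls actually report: runtime.NumCPU is unaffected by GOMAXPROCS, while runtime.GOMAXPROCS(0) returns the current setting without changing it. The helper name limitProcs is invented for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// limitProcs sets GOMAXPROCS to n and returns NumCPU before and after the
// change, plus the effective GOMAXPROCS value.
func limitProcs(n int) (cpusBefore, cpusAfter, procs int) {
	cpusBefore = runtime.NumCPU()
	runtime.GOMAXPROCS(n)
	cpusAfter = runtime.NumCPU()
	procs = runtime.GOMAXPROCS(0) // an argument of 0 only queries the setting
	return cpusBefore, cpusAfter, procs
}

func main() {
	before, after, procs := limitProcs(1)
	fmt.Println(before == after, procs) // prints "true 1"
}
```

Note also the runtime comment quoted earlier: a cgo call is treated like a system call and runs outside the $GOMAXPROCS accounting. So even with GOMAXPROCS set to 1, two goroutines that are inside C code may still be running on two different threads at the same time, which would be consistent with this setting not making the crash go away.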
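So I used a Docker container (a VM works too) and limited it to a single CPU core, and sure enough, the program ran normally.
That confirms the earlier hypothesis. The specific cause may not be an out-of-bounds address access in cgo (it may be the return value or something else, but there is no need to dig further into the assembly and the runtime). What is already certain is this: when multiple goroutines are scheduled onto multiple threads to execute the cgo call, one thread ends up accessing another thread's TLS, which produces the segmentation violation.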
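Solution
Limiting the CPU cores is not a real fix. The key to solving the problem is that goroutines on different threads must not execute the cgo call concurrently, and a natural way to guarantee that is sync.Mutex.
After adding a lock around that part of the goroutine, the program runs without problems even when the CPU is not limited.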
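That is basically the end of the story, but we should still ask ourselves: why does sync.Mutex solve the problem?
A mutex makes threads execute serially, and Go is no exception. The Lock path of Go's Mutex uses different mechanisms for mutual exclusion depending on its mode. Interested readers can start from these:
- spin-lock and runtime.procyield, which involves the Intel PAUSE instruction and its pipeline optimization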
- sync_runtime_SemacquireMutex