GoLang - Memory Management

1. Introduction to tcmalloc (references)

Go's memory management closely resembles tcmalloc (thread-caching malloc), so let's first look at how tcmalloc works.

1.1 Overview

tcmalloc is a memory allocator released by Google; other common allocators include glibc's ptmalloc and jemalloc (originally from FreeBSD, widely used at Facebook). Compared with ptmalloc, tcmalloc performs better, especially under high concurrency.

1.2 tcmalloc Allocation Strategy

tcmalloc serves allocations from two places: a global heap cache and a thread-private cache. Small allocations are served from the thread-private cache; when the private cache runs low, it refills a batch from the global cache. Large allocations are served directly from the global cache. The boundary between small and large is 32 KB. Each cache is organized as an array of singly linked lists, where every list holds blocks of one particular size.
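The two-tier strategy can be sketched with a toy model. The 32 KB boundary comes from the text above; the power-of-two size classes below are a simplification (real tcmalloc uses a denser, hand-tuned class table), so treat this as an illustration rather than tcmalloc's actual algorithm.

```go
package main

import "fmt"

const smallLimit = 32 << 10 // 32 KB small/large boundary, per the text

// sizeClass rounds a request up to its (simplified) class size.
func sizeClass(n int) int {
	c := 8
	for c < n {
		c <<= 1
	}
	return c
}

// allocate reports which tier would serve a request of n bytes.
func allocate(n int) string {
	if n <= smallLimit {
		return fmt.Sprintf("thread-local cache, class %d", sizeClass(n))
	}
	return "global heap (whole pages)"
}

func main() {
	fmt.Println(allocate(24))       // small: served by the private cache
	fmt.Println(allocate(40 << 10)) // large: served by the global heap
}
```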

1.2.1 Small Object Allocation

By default, small-object allocation uses roughly 86 distinct block sizes; the exact sizes are not documented and have to be read from the source. The free list for each size is initialized only when first used, i.e. lazily.


Small Object Allocation

1.2.2 Big Object Allocation

Allocations larger than 32 KB are served from the global heap. The global heap is also organized as an array of singly linked lists; the array has 256 entries, corresponding to runs of 1 page, 2 pages, and so on (1 page = 4 KB).


Big Object Allocation

1.2.3 Span

tcmalloc manages memory pages with spans; a span covers one or more contiguous pages. A span is in one of three states: unallocated, allocated as a large object, or allocated (and split up) for small objects.


Span

1.3 Comparison with Golang

Golang's memory allocator is not an exact copy of tcmalloc:

  • The local cache is attached to a P rather than to a process or thread (this ties into Go's goroutine scheduler, described below).
  • Go's GC is stop-the-world; collection is not performed independently by each thread.
  • Span management is more efficient.

2. Basic Concepts

2.1 Golang Runtime Scheduling

Golang has three fundamental scheduling concepts: G, M, and P.

  • G (Goroutine): the coroutine we usually talk about, a user-level lightweight thread; each Goroutine's sched field saves its context.
  • M (Machine): a wrapper around a kernel-level thread; the worker that actually executes code.
  • P (Processor): the scheduling context that links Gs to Ms; the number of Ps can be set via GOMAXPROCS() and defaults to the number of CPU cores.
    Running a Goroutine requires all three parts together: G + P + M.

2.2 Escape Analysis

For escape analysis, see GoLang - Escape Analysis.

3. Key Data Structures

A few key pieces:

  • mcache: a per-P cache; think of it as the local cache.
  • mcentral: the global cache; mcache refills from mcentral when it runs short.
  • mheap: when mcentral is also exhausted, memory is requested from the operating system through mheap.
    Together these form a multi-level memory allocator.
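The refill chain can be sketched as a toy cascade. All stock counts below are made up purely for illustration; real refills move whole spans and batches of objects, not single units.

```go
package main

import "fmt"

// level models one tier of the mcache -> mcentral -> mheap cascade:
// it serves from its own stock and refills from its parent when empty.
type level struct {
	name   string
	stock  int
	parent *level
}

func (l *level) get() string {
	if l.stock > 0 {
		l.stock--
		return l.name
	}
	if l.parent == nil {
		return "OS" // mheap would ask the operating system
	}
	src := l.parent.get() // refill from the level above
	l.stock += 7          // keep a batch cached locally (arbitrary number)
	return src
}

func main() {
	mheap := &level{name: "mheap", stock: 1}
	mcentral := &level{name: "mcentral", stock: 1, parent: mheap}
	mcache := &level{name: "mcache", stock: 1, parent: mcentral}

	for i := 0; i < 3; i++ {
		fmt.Println("allocation", i, "served via", mcache.get())
	}
}
```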

3.1 mcache

Every Goroutine runs bound to a P, and mcache is that P's cache. The benefit of this arrangement is that allocating from it needs no locking. The mcache structure is shown below.

//go version 1.10.8
//file runtime/mcache.go

// Per-thread (in Go, per-P) cache for small objects.
// No locking needed because it is per-thread (per-P).
//
// mcaches are allocated from non-GC'd memory, so any heap pointers
// must be specially handled.
//
//go:notinheap
type mcache struct {
    // The following members are accessed on every malloc,
    // so they are grouped here for better caching.
    next_sample int32   // trigger heap sample after allocating this many bytes
    local_scan  uintptr // bytes of scannable heap allocated

    // Allocator cache for tiny objects w/o pointers.
    // See "Tiny allocator" comment in malloc.go.

    // tiny points to the beginning of the current tiny block, or
    // nil if there is no current tiny block.
    //
    // tiny is a heap pointer. Since mcache is in non-GC'd memory,
    // we handle it by clearing it in releaseAll during mark
    // termination.
    tiny             uintptr
    tinyoffset       uintptr
    local_tinyallocs uintptr // number of tiny allocs not counted in other stats

    // The rest is not accessed on every malloc.

    alloc [numSpanClasses]*mspan // spans to allocate from, indexed by spanClass

    stackcache [_NumStackOrders]stackfreelist

    // Local allocator stats, flushed during GC.
    local_nlookup    uintptr                  // number of pointer lookups
    local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
    local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
    local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
}

For now we can focus on alloc [numSpanClasses]*mspan. This is an array of pointers to mspan, indexed by span class; numSpanClasses is 134 (= _NumSizeClasses << 1: each of the 67 size classes has a scan and a noscan variant). Each element serves blocks of one particular size: when memory is requested, the object picks the matching element of alloc to allocate from. The 67 class sizes are 0, 8 bytes, 16 bytes, …, which differs slightly from tcmalloc.

//go version 1.10.8
//file runtime/sizeclasses.go

var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}

The alloc array plays the role of a memory pool's freelist array: in a conventional implementation each element would be a linked list of blocks of one specific size. Here every element is uniformly an mspan, which can only mean that the mspan itself records the block size it is carved into. Let's look at the mspan structure.
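The size-to-class mapping implied by the class_to_size table can be sketched with a linear scan. The real runtime uses precomputed lookup tables instead of scanning, but the mapping is the same; classToSize below copies only the first few entries of the table above.

```go
package main

import "fmt"

// First entries of runtime's class_to_size table (Go 1.10),
// copied from runtime/sizeclasses.go.
var classToSize = []int{0, 8, 16, 32, 48, 64, 80, 96, 112, 128}

// sizeToClass finds the smallest class whose blocks fit n bytes.
func sizeToClass(n int) int {
	for c := 1; c < len(classToSize); c++ {
		if classToSize[c] >= n {
			return c
		}
	}
	return -1 // beyond the copied entries (or a large allocation)
}

func main() {
	fmt.Println(sizeToClass(1))  // class 1: the 8-byte class
	fmt.Println(sizeToClass(24)) // class 3: rounded up to 32 bytes
	fmt.Println(sizeToClass(33)) // class 4: rounded up to 48 bytes
}
```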

3.2 mspan

In tcmalloc, the span is the basic unit of memory management. Golang's mspan structure looks like this.

//go version 1.10.8
//file runtime/mheap.go

//go:notinheap
type mspan struct {
    next *mspan     // next span in list, or nil if none
    prev *mspan     // previous span in list, or nil if none
    list *mSpanList // For debugging. TODO: Remove.

    startAddr uintptr // address of first byte of span aka s.base()
    npages    uintptr // number of pages in span

    manualFreeList gclinkptr // list of free objects in _MSpanManual spans

    // freeindex is the slot index between 0 and nelems at which to begin scanning
    // for the next free object in this span.
    // Each allocation scans allocBits starting at freeindex until it encounters a 0
    // indicating a free object. freeindex is then adjusted so that subsequent scans begin
    // just past the newly discovered free object.
    //
    // If freeindex == nelem, this span has no free objects.
    //
    // allocBits is a bitmap of objects in this span.
    // If n >= freeindex and allocBits[n/8] & (1<<(n%8)) is 0
    // then object n is free;
    // otherwise, object n is allocated. Bits starting at nelem are
    // undefined and should never be referenced.
    //
    // Object n starts at address n*elemsize + (start << pageShift).
    freeindex uintptr
    // TODO: Look up nelems from sizeclass and remove this field if it
    // helps performance.
    nelems uintptr // number of object in the span.

    // Cache of the allocBits at freeindex. allocCache is shifted
    // such that the lowest bit corresponds to the bit freeindex.
    // allocCache holds the complement of allocBits, thus allowing
    // ctz (count trailing zero) to use it directly.
    // allocCache may contain bits beyond s.nelems; the caller must ignore
    // these.
    allocCache uint64

    // allocBits and gcmarkBits hold pointers to a span's mark and
    // allocation bits. The pointers are 8 byte aligned.
    // There are three arenas where this data is held.
    // free: Dirty arenas that are no longer accessed
    //       and can be reused.
    // next: Holds information to be used in the next GC cycle.
    // current: Information being used during this GC cycle.
    // previous: Information being used during the last GC cycle.
    // A new GC cycle starts with the call to finishsweep_m.
    // finishsweep_m moves the previous arena to the free arena,
    // the current arena to the previous arena, and
    // the next arena to the current arena.
    // The next arena is populated as the spans request
    // memory to hold gcmarkBits for the next GC cycle as well
    // as allocBits for newly allocated spans.
    //
    // The pointer arithmetic is done "by hand" instead of using
    // arrays to avoid bounds checks along critical performance
    // paths.
    // The sweep will free the old allocBits and set allocBits to the
    // gcmarkBits. The gcmarkBits are replaced with a fresh zeroed
    // out memory.
    allocBits  *gcBits
    gcmarkBits *gcBits

    // sweep generation:
    // if sweepgen == h->sweepgen - 2, the span needs sweeping
    // if sweepgen == h->sweepgen - 1, the span is currently being swept
    // if sweepgen == h->sweepgen, the span is swept and ready to use
    // h->sweepgen is incremented by 2 after every GC

    sweepgen    uint32
    divMul      uint16     // for divide by elemsize - divMagic.mul
    baseMask    uint16     // if non-0, elemsize is a power of 2, & this will get object allocation base
    allocCount  uint16     // number of allocated objects
    spanclass   spanClass  // size class and noscan (uint8)
    incache     bool       // being used by an mcache
    state       mSpanState // mspaninuse etc
    needzero    uint8      // needs to be zeroed before allocation
    divShift    uint8      // for divide by elemsize - divMagic.shift
    divShift2   uint8      // for divide by elemsize - divMagic.shift2
    elemsize    uintptr    // computed from sizeclass or from npages
    unusedsince int64      // first time spotted by gc in mspanfree state
    npreleased  uintptr    // number of pages released to the os
    limit       uintptr    // end of data in span
    speciallock mutex      // guards specials list
    specials    *special   // linked list of special records sorted by offset.
}

From the structure above we can see:

  • next, prev: link pointers; mspans are generally used in linked lists.
  • npages: an mspan's size is an integer multiple of the page size.
  • spanclass (size class and noscan, uint8): derived from a value in the 0 ~ _NumSizeClasses range, which resolves our earlier question. For example, with size class 3 the mspan is carved into 32-byte blocks.
  • elemsize: computable from sizeclass or npages. For example sizeclass = 3 gives elemsize = 32 bytes. Allocations above 32 KB get whole pages, with elemsize = page_size * npages.
  • nelems: the total number of blocks in the span.
  • freeindex: in 0 ~ nelems-1, marking how far allocation has progressed in the span.
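The freeindex/allocCache mechanism described in the comments above can be sketched as follows. This is a simplification: a real mspan slides a 64-object window of allocBits through allocCache starting at freeindex, while here the whole span is treated as a single 64-bit word.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextFree shows how a span locates its next free object: allocCache
// holds the complement of allocBits, so a set bit means "free", and
// ctz (count trailing zeros) jumps straight to the first free slot.
func nextFree(allocBits uint64) int {
	allocCache := ^allocBits // complement, as the runtime comment describes
	if allocCache == 0 {
		return -1 // no free objects in this window
	}
	return bits.TrailingZeros64(allocCache)
}

func main() {
	fmt.Println(nextFree(0b1011))     // objects 0, 1, 3 allocated -> 2
	fmt.Println(nextFree(^uint64(0))) // every object allocated -> -1
}
```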

3.3 mcentral

As mentioned above, when mcache runs short it refills from mcentral, so let's introduce mcentral next.

//go version 1.10.8
//file runtime/mcentral.go

// Central list of free objects of a given size.
//
//go:notinheap
type mcentral struct {
    lock      mutex
    spanclass spanClass
    nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
    empty     mSpanList // list of spans with no free objects (or cached in an mcache)

    // nmalloc is the cumulative count of objects allocated from
    // this mcentral, assuming all spans in mcaches are
    // fully-allocated. Written atomically, read under STW.
    nmalloc uint64
}

// mSpanList heads a linked list of spans.
//
//go:notinheap
type mSpanList struct {
    first *mspan // first span in list, or nil if none
    last  *mspan // last span in list, or nil if none
}

Notes on mcentral:

  • spanclass: mcentral also carries a spanclass member, so is there one mcentral per span class? Yes (the heap holds an array of them).
  • lock: required because multiple Ps may compete for the same mcentral.
  • nonempty: a doubly linked list of mspans that still have free objects, i.e. the usable span list of this mcentral.
  • empty: spans with no free objects left (or currently cached in an mcache); effectively a way of tracking all spans.

So where does mcentral live? Although we presented mcentral and mheap as two separate parts above, as global structures they can be defined together, and in fact that is the case: mcentral is embedded inside mheap.

3.4 mheap

Golang's mheap structure is defined as follows.

//go version 1.10.8
//file runtime/mheap.go

// Main malloc heap.
// The heap itself is the "free[]" and "large" arrays,
// but all the other global data is here too.
//
// mheap must not be heap-allocated because it contains mSpanLists,
// which must not be heap-allocated.
//
//go:notinheap
type mheap struct {
    lock      mutex
    free      [_MaxMHeapList]mSpanList // free lists of given length up to _MaxMHeapList
    freelarge mTreap                   // free treap of length >= _MaxMHeapList
    busy      [_MaxMHeapList]mSpanList // busy lists of large spans of given length
    busylarge mSpanList                // busy lists of large spans length >= _MaxMHeapList
    sweepgen  uint32                   // sweep generation, see comment in mspan
    sweepdone uint32                   // all spans are swept
    sweepers  uint32                   // number of active sweepone calls

    // allspans is a slice of all mspans ever created. Each mspan
    // appears exactly once.
    //
    // The memory for allspans is manually managed and can be
    // reallocated and move as the heap grows.
    //
    // In general, allspans is protected by mheap_.lock, which
    // prevents concurrent access as well as freeing the backing
    // store. Accesses during STW might not hold the lock, but
    // must ensure that allocation cannot happen around the
    // access (since that may free the backing store).
    allspans []*mspan // all spans out there

    // spans is a lookup table to map virtual address page IDs to *mspan.
    // For allocated spans, their pages map to the span itself.
    // For free spans, only the lowest and highest pages map to the span itself.
    // Internal pages map to an arbitrary span.
    // For pages that have never been allocated, spans entries are nil.
    //
    // Modifications are protected by mheap.lock. Reads can be
    // performed without locking, but ONLY from indexes that are
    // known to contain in-use or stack spans. This means there
    // must not be a safe-point between establishing that an
    // address is live and looking it up in the spans array.
    //
    // This is backed by a reserved region of the address space so
    // it can grow without moving. The memory up to len(spans) is
    // mapped. cap(spans) indicates the total reserved memory.
    spans []*mspan

    // sweepSpans contains two mspan stacks: one of swept in-use
    // spans, and one of unswept in-use spans. These two trade
    // roles on each GC cycle. Since the sweepgen increases by 2
    // on each cycle, this means the swept spans are in
    // sweepSpans[sweepgen/2%2] and the unswept spans are in
    // sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
    // unswept stack and pushes spans that are still in-use on the
    // swept stack. Likewise, allocating an in-use span pushes it
    // on the swept stack.
    sweepSpans [2]gcSweepBuf

    _ uint32 // align uint64 fields on 32-bit for atomics

    // Proportional sweep
    //
    // These parameters represent a linear function from heap_live
    // to page sweep count. The proportional sweep system works to
    // stay in the black by keeping the current page sweep count
    // above this line at the current heap_live.
    //
    // The line has slope sweepPagesPerByte and passes through a
    // basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
    // any given time, the system is at (memstats.heap_live,
    // pagesSwept) in this space.
    //
    // It's important that the line pass through a point we
    // control rather than simply starting at a (0,0) origin
    // because that lets us adjust sweep pacing at any time while
    // accounting for current progress. If we could only adjust
    // the slope, it would create a discontinuity in debt if any
    // progress has already been made.
    pagesInUse         uint64  // pages of spans in stats _MSpanInUse; R/W with mheap.lock
    pagesSwept         uint64  // pages swept this cycle; updated atomically
    pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
    sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
    sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
    // TODO(austin): pagesInUse should be a uintptr, but the 386
    // compiler can't 8-byte align fields.

    // Malloc stats.
    largealloc  uint64                  // bytes allocated for large objects
    nlargealloc uint64                  // number of large object allocations
    largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
    nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
    nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)

    // range of addresses we might see in the heap
    bitmap        uintptr // Points to one byte past the end of the bitmap
    bitmap_mapped uintptr

    // The arena_* fields indicate the addresses of the Go heap.
    //
    // The maximum range of the Go heap is
    // [arena_start, arena_start+_MaxMem+1).
    //
    // The range of the current Go heap is
    // [arena_start, arena_used). Parts of this range may not be
    // mapped, but the metadata structures are always mapped for
    // the full range.
    arena_start uintptr
    arena_used  uintptr // Set with setArenaUsed.

    // The heap is grown using a linear allocator that allocates
    // from the block [arena_alloc, arena_end). arena_alloc is
    // often, but *not always* equal to arena_used.
    arena_alloc uintptr
    arena_end   uintptr

    // arena_reserved indicates that the memory [arena_alloc,
    // arena_end) is reserved (e.g., mapped PROT_NONE). If this is
    // false, we have to be careful not to clobber existing
    // mappings here. If this is true, then we own the mapping
    // here and *must* clobber it to use it.
    arena_reserved bool

    _ uint32 // ensure 64-bit alignment

    // central free lists for small size classes.
    // the padding makes sure that the MCentrals are
    // spaced CacheLineSize bytes apart, so that each MCentral.lock
    // gets its own cache line.
    // central is indexed by spanClass.
    central [numSpanClasses]struct {
        mcentral mcentral
        pad      [sys.CacheLineSize - unsafe.Sizeof(mcentral{})%sys.CacheLineSize]byte
    }

    spanalloc             fixalloc // allocator for span*
    cachealloc            fixalloc // allocator for mcache*
    treapalloc            fixalloc // allocator for treapNodes* used by large objects
    specialfinalizeralloc fixalloc // allocator for specialfinalizer*
    specialprofilealloc   fixalloc // allocator for specialprofile*
    speciallock           mutex    // lock for special record allocators.

    unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
}

var mheap_ mheap

mheap_ is a global variable initialized at system startup (in mallocinit()). Let's walk through the mheap structure.

  • allspans []*mspan: all spans are allocated through mheap_, and every mspan ever created is recorded in allspans. The lock field in the struct guarantees concurrent safety. The comment also mentions STW, which relates to Golang's GC and is not covered in detail here.
  • central [numSpanClasses]…: the mcentral introduced earlier, one per span class. pad is byte padding to avoid false sharing; see False Sharing - wikipedia for details.
  • sweepgen, sweepdone: GC related. (Golang's GC is Mark & Sweep; these fields track sweeping, which we won't go into here.)
  • free [_MaxMHeapList]mSpanList: an array of span lists; the mspans in list i consist of i pages, for 1 ~ 127 (_MaxMHeapList - 1) pages. For example, free[3] is a list of mspans of 3 pages each. free holds unallocated spans; busy is its allocated counterpart.
  • freelarge: spans of more than 127 pages; note that in this version it is a treap (mTreap) rather than a plain list. The busy counterpart is busylarge.
  • spans []*mspan: records the mapping from arena page numbers and mspans.
  • arena_start, arena_end, arena_used: to explain these, we first need the arena. The arena is the contiguous virtual address region Golang allocates heap memory from; it is managed by mheap, and all heap allocations come out of it. How is availability tracked? Operating systems commonly use one of two schemes: threading all free memory onto a linked list, or marking blocks free/used in a bitmap. Combined with spans above, the memory layout looks like the following.


    Golang memory layout
  • spanalloc, cachealloc (fixalloc): fixalloc is a free-list allocator for fixed-size objects.
  • The rest is statistics and GC-related state, which we set aside for now.
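A minimal free-list allocator in the spirit of fixalloc can be sketched like this. The real fixalloc carves raw, non-GC'd memory with unsafe pointers and a chunk size; this toy version just recycles byte slices to show the free-list idea.

```go
package main

import "fmt"

// fixAlloc hands out fixed-size blocks and recycles freed ones
// via a simple LIFO free list.
type fixAlloc struct {
	size int
	free [][]byte // stack of recycled blocks
}

func (f *fixAlloc) alloc() []byte {
	if n := len(f.free); n > 0 {
		b := f.free[n-1]
		f.free = f.free[:n-1]
		return b // reuse a previously freed block
	}
	return make([]byte, f.size) // free list empty: grab a fresh block
}

func (f *fixAlloc) freeBlock(b []byte) {
	f.free = append(f.free, b) // push back onto the free list
}

func main() {
	fa := &fixAlloc{size: 64}
	a := fa.alloc()
	fa.freeBlock(a)
	b := fa.alloc() // comes straight back off the free list
	fmt.Println(len(b), &a[0] == &b[0])
}
```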

4. Initialization

At system startup, the structures introduced above are initialized. Let's go straight to the initialization code: mallocinit().

//go version 1.10.8
//file runtime/malloc.go

func mallocinit() {
    if class_to_size[_TinySizeClass] != _TinySize {
        throw("bad TinySizeClass")
    }

    testdefersizes()

    // Copy class sizes out for statistics table.
    for i := range class_to_size {
        memstats.by_size[i].size = uint32(class_to_size[i])
    }

    // Check physPageSize.
    if physPageSize == 0 {
        // The OS init code failed to fetch the physical page size.
        throw("failed to get system page size")
    }
    if physPageSize < minPhysPageSize {
        print("system page size (", physPageSize, ") is smaller than minimum page size (", minPhysPageSize, ")\n")
        throw("bad system page size")
    }
    if physPageSize&(physPageSize-1) != 0 {
        print("system page size (", physPageSize, ") must be a power of 2\n")
        throw("bad system page size")
    }

    // The auxiliary regions start at p and are laid out in the
    // following order: spans, bitmap, arena.
    var p, pSize uintptr
    var reserved bool

    // The spans array holds one *mspan per _PageSize of arena.
    var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
    spansSize = round(spansSize, _PageSize)
    // The bitmap holds 2 bits per word of arena.
    var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
    bitmapSize = round(bitmapSize, _PageSize)

    // Set up the allocation arena, a contiguous area of memory where
    // allocated data will be found.
    if sys.PtrSize == 8 {
        // On a 64-bit machine, allocate from a single contiguous reservation.
        // 512 GB (MaxMem) should be big enough for now.
        //
        // The code will work with the reservation at any address, but ask
        // SysReserve to use 0x0000XXc000000000 if possible (XX=00...7f).
        // Allocating a 512 GB region takes away 39 bits, and the amd64
        // doesn't let us choose the top 17 bits, so that leaves the 9 bits
        // in the middle of 0x00c0 for us to choose. Choosing 0x00c0 means
        // that the valid memory addresses will begin 0x00c0, 0x00c1, ..., 0x00df.
        // In little-endian, that's c0 00, c1 00, ..., df 00. None of those are valid
        // UTF-8 sequences, and they are otherwise as far away from
        // ff (likely a common byte) as possible. If that fails, we try other 0xXXc0
        // addresses. An earlier attempt to use 0x11f8 caused out of memory errors
        // on OS X during thread allocations.  0x00c0 causes conflicts with
        // AddressSanitizer which reserves all memory up to 0x0100.
        // These choices are both for debuggability and to reduce the
        // odds of a conservative garbage collector (as is still used in gccgo)
        // not collecting memory because some non-pointer block of memory
        // had a bit pattern that matched a memory address.
        //
        // Actually we reserve 544 GB (because the bitmap ends up being 32 GB)
        // but it hardly matters: e0 00 is not valid UTF-8 either.
        //
        // If this fails we fall back to the 32 bit memory mechanism
        //
        // However, on arm64, we ignore all this advice above and slam the
        // allocation at 0x40 << 32 because when using 4k pages with 3-level
        // translation buffers, the user address space is limited to 39 bits
        // On darwin/arm64, the address space is even smaller.
        arenaSize := round(_MaxMem, _PageSize)
        pSize = bitmapSize + spansSize + arenaSize + _PageSize
        for i := 0; i <= 0x7f; i++ {
            switch {
            case GOARCH == "arm64" && GOOS == "darwin":
                p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
            case GOARCH == "arm64":
                p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
            default:
                p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
            }
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
    }

    if p == 0 {
        // On a 32-bit machine, we can't typically get away
        // with a giant virtual address space reservation.
        // Instead we map the memory information bitmap
        // immediately after the data segment, large enough
        // to handle the entire 4GB address space (256 MB),
        // along with a reservation for an initial arena.
        // When that gets used up, we'll start asking the kernel
        // for any memory anywhere.

        // We want to start the arena low, but if we're linked
        // against C code, it's possible global constructors
        // have called malloc and adjusted the process' brk.
        // Query the brk so we can avoid trying to map the
        // arena over it (which will cause the kernel to put
        // the arena somewhere else, likely at a high
        // address).
        procBrk := sbrk0()

        // If we fail to allocate, try again with a smaller arena.
        // This is necessary on Android L where we share a process
        // with ART, which reserves virtual memory aggressively.
        // In the worst case, fall back to a 0-sized initial arena,
        // in the hope that subsequent reservations will succeed.
        arenaSizes := []uintptr{
            512 << 20,
            256 << 20,
            128 << 20,
            0,
        }

        for _, arenaSize := range arenaSizes {
            // SysReserve treats the address we ask for, end, as a hint,
            // not as an absolute requirement. If we ask for the end
            // of the data segment but the operating system requires
            // a little more space before we can start allocating, it will
            // give out a slightly higher pointer. Except QEMU, which
            // is buggy, as usual: it won't adjust the pointer upward.
            // So adjust it upward a little bit ourselves: 1/4 MB to get
            // away from the running binary image and then round up
            // to a MB boundary.
            p = round(firstmoduledata.end+(1<<18), 1<<20)
            pSize = bitmapSize + spansSize + arenaSize + _PageSize
            if p <= procBrk && procBrk < p+pSize {
                // Move the start above the brk,
                // leaving some room for future brk
                // expansion.
                p = round(procBrk+(1<<20), 1<<20)
            }
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
        if p == 0 {
            throw("runtime: cannot reserve arena virtual address space")
        }
    }

    // PageSize can be larger than OS definition of page size,
    // so SysReserve can give us a PageSize-unaligned pointer.
    // To overcome this we ask for PageSize more and round up the pointer.
    p1 := round(p, _PageSize)
    pSize -= p1 - p

    spansStart := p1
    p1 += spansSize
    mheap_.bitmap = p1 + bitmapSize
    p1 += bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1
    mheap_.arena_alloc = p1
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()
}

4.1 The arena

// The spans array holds one *mspan per _PageSize of arena.
var spansSize uintptr = (_MaxMem + 1) / _PageSize * sys.PtrSize
spansSize = round(spansSize, _PageSize)
// The bitmap holds 2 bits per word of arena.
var bitmapSize uintptr = (_MaxMem + 1) / (sys.PtrSize * 8 / 2)
bitmapSize = round(bitmapSize, _PageSize)
arenaSize := round(_MaxMem, _PageSize)

// _MaxMem is the maximum heap arena size minus 1.
//
// On 32-bit, this is also the maximum heap pointer value,
// since the arena starts at address 0.
_MaxMem = 1<<_MHeapMap_TotalBits - 1        

First, the _MaxMem variable (a related constant inside it is not listed here). Simply put, _MaxMem is the size the system reserves for the arena: 512 GB on 64-bit systems, and 32 GB on 64-bit Windows. round is an alignment helper that rounds up to a multiple of _PageSize; its implementation is neat, shown below.

// round n up to a multiple of a.  a must be a power of 2.
func round(n, a uintptr) uintptr {
    return (n + a - 1) &^ (a - 1)
}

The bitmap marks the state of each word with two bits, which works out to 16 GB of bitmap for a 512 GB arena. spans records the mapping from arena page numbers to mspan pointers: for an arena address x, the page number is page_num = (x - arena_start) / page_size, and spans[page_num] points to the mspan containing x. With a 512 GB arena, the spans area takes 512 GB / 8 KB * 8 = 512 MB. Note that Golang's memory manager uses an 8 KB virtual page.

_PageSize = 1 << _PageShift
_PageMask = _PageSize - 1

So this contiguous stretch of virtual memory is laid out (on 64-bit) as follows:


Golang virtual memory layout
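The arithmetic above can be checked with a few lines, using the constants from the text: a 512 GB arena, 8 KB runtime pages, 8-byte pointers. The arena start address below is hypothetical, chosen only to demonstrate the page-number mapping.

```go
package main

import "fmt"

// round rounds n up to a multiple of a (a must be a power of 2),
// mirroring the runtime helper shown above.
func round(n, a uint64) uint64 { return (n + a - 1) &^ (a - 1) }

func main() {
	const (
		ptrSize  = uint64(8)
		pageSize = uint64(8 << 10)   // 8 KB runtime page
		maxMem   = uint64(512 << 30) // arena size (really _MaxMem+1)
	)
	spansSize := round(maxMem/pageSize*ptrSize, pageSize)
	bitmapSize := round(maxMem/(ptrSize*8/2), pageSize)
	fmt.Printf("spans area: %d MB\n", spansSize>>20)   // 512 MB
	fmt.Printf("bitmap area: %d GB\n", bitmapSize>>30) // 16 GB

	// spans lookup: map an arena address to its page number.
	arenaStart := uint64(0x00c000000000) // hypothetical arena start
	x := arenaStart + 3*pageSize + 100
	fmt.Println("page number:", (x-arenaStart)/pageSize) // 3
}
```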

4.2 Reserving the Virtual Address Space

The key code is the following.

pSize = bitmapSize + spansSize + arenaSize + _PageSize
for i := 0; i <= 0x7f; i++ {
    switch {
    case GOARCH == "arm64" && GOOS == "darwin":
        p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
    case GOARCH == "arm64":
        p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
    default:
        p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
    }
    p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
    if p != 0 {
        break
    }
}   

At initialization, Golang reserves a contiguous address range from the operating system: the spans + bitmap + arena above. p is the start address of this contiguous range, and its value differs per platform. The reservation uses a different system call depending on the OS: Unix-like systems call mmap (which asks the kernel for a new virtual address range, with a start-address hint and a length), while Windows calls VirtualAlloc (similar to mmap).

//sourced from the internet

//bsd
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    if sys.PtrSize == 8 && uint64(n) > 1<<32 || sys.GoosNacl != 0 {
        *reserved = false
        return v
    }

    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}

//darwin
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    return p
}

//linux
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    ...
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}
//windows
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    // v is just a hint.
    // First try at v.
    v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
    if v != nil {
        return v
    }

    // Next let the kernel choose the address.
    return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
}

4.3 mheap Initialization

As we saw in the mheap structure, spans, bitmap, and arena all live inside mheap; once the address space has been reserved from the OS, the next step is initializing mheap.

    // PageSize can be larger than OS definition of page size,
    // so SysReserve can give us a PageSize-unaligned pointer.
    // To overcome this we ask for PageSize more and round up the pointer.
    p1 := round(p, _PageSize)
    pSize -= p1 - p

    spansStart := p1
    p1 += spansSize
    mheap_.bitmap = p1 + bitmapSize
    p1 += bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1
    mheap_.arena_alloc = p1
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()

p is the start address of the contiguous virtual address range. It is aligned first, and then the arena, bitmap, and spans addresses are set up. mheap_.init() initializes the fixalloc allocators and related members, as well as each mcentral.
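The alignment step relies on the runtime's round helper. A standalone sketch (the page size matches runtime/malloc.go; the pointer value is invented for illustration):

```go
package main

import "fmt"

// round rounds n up to a multiple of a (a power of two), as the runtime's
// round helper does in mallocinit.
func round(n, a uintptr) uintptr { return (n + a - 1) &^ (a - 1) }

func main() {
	const pageSize = 8 << 10 // Go's _PageSize is 8 KB
	p := uintptr(0x1000_0123) // an unaligned pointer, as SysReserve may return
	fmt.Printf("%#x\n", round(p, pageSize)) // 0x10002000
}
```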

// Initialize the heap.
func (h *mheap) init(spansStart, spansBytes uintptr) {
    h.treapalloc.init(unsafe.Sizeof(treapNode{}), nil, nil, &memstats.other_sys)
    h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
    h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
    h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
    h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)

    // Don't zero mspan allocations. Background sweeping can
    // inspect a span concurrently with allocating it, so it's
    // important that the span's sweepgen survive across freeing
    // and re-allocating a span to prevent background sweeping
    // from improperly cas'ing it from 0.
    //
    // This is safe because mspan contains no heap pointers.
    h.spanalloc.zero = false

    // h->mapcache needs no init
    for i := range h.free {
        h.free[i].init()
        h.busy[i].init()
    }

    h.busylarge.init()
    for i := range h.central {
        h.central[i].mcentral.init(spanClass(i))
    }

    sp := (*slice)(unsafe.Pointer(&h.spans))
    sp.array = unsafe.Pointer(spansStart)
    sp.len = 0
    sp.cap = int(spansBytes / sys.PtrSize)

    // Map metadata structures. But don't map race detector memory
    // since we're not actually growing the arena here (and TSAN
    // gets mad if you map 0 bytes).
    h.setArenaUsed(h.arena_used, false)
}

After mheap is initialized, the current thread, i.e. the M, is initialized.

// get the current G
_g_ := getg()
// allocate an mcache for the M bound to this G
_g_.m.mcache = allocmcache()

4.4 per-P mcache initialization

Note that nothing above initializes the per-P mcache, because at this point the Ps have not been created yet. Let's look at the bootstrap code.

func schedinit() {
    ...
    mallocinit()
    ...
    
    if procs > _MaxGomaxprocs {
        procs = _MaxGomaxprocs
    }
    if procresize(procs) != nil {
        ...
    }
}

mallocinit() was covered above. The Ps are initialized in procresize(); below we look only at the memory-related parts.

//go version 1.10.8
//file runtime/proc.go

func procresize(nprocs int32) *p {
    ...
    // initialize new P's
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
            pp.id = i
            pp.status = _Pgcstop
            pp.sudogcache = pp.sudogbuf[:0]
            for i := range pp.deferpool {
                pp.deferpool[i] = pp.deferpoolbuf[i][:0]
            }
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }
        // initialize the P's mcache
        if pp.mcache == nil {
            if old == 0 && i == 0 {
                if getg().m.mcache == nil {
                    throw("missing mcache?")
                }
                // P[0] takes the bootstrap mcache (used by the main goroutine)
                pp.mcache = getg().m.mcache // bootstrap
            } else {
                // Ps other than P[0] allocate a fresh mcache
                pp.mcache = allocmcache()
            }
        }
        ...
    }
    ...
}

All Ps are stored in the global array allp. procresize() initializes the Ps in allp that will actually be used, and strips the resources of any surplus Ps.

5. Memory Allocation

First, the main flow of allocating memory for an object:

  1. object size > 32 KB: allocate directly from mheap.
  2. object size < 16 bytes and pointer-free: allocate with mcache's tiny allocator. (tiny is really just a pointer; loosely speaking.)
  3. 16 bytes <= object size <= 32 KB: allocate from the matching size class in mcache.
  4. If the mcache span for that size class has no free block, request one from mcentral.
  5. If mcentral has no free block either, request a span from mheap and split it.
  6. If mheap has no suitable span, request memory from the operating system.
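The size-based dispatch in these steps can be sketched as a small helper; the thresholds are the runtime's constants, but allocPath itself is a hypothetical illustration, not runtime code:

```go
package main

import "fmt"

// The runtime's thresholds: tiny objects are < 16 bytes, small objects
// are <= 32 KB; anything larger goes straight to mheap.
const (
	maxTinySize  = 16
	maxSmallSize = 32 << 10
)

// allocPath reports which allocator mallocgc would use for an object of
// the given size; noscan means the object contains no pointers.
func allocPath(size uintptr, noscan bool) string {
	switch {
	case size > maxSmallSize:
		return "mheap (large object)"
	case noscan && size < maxTinySize:
		return "mcache tiny allocator"
	default:
		return "mcache size-class span"
	}
}

func main() {
	fmt.Println(allocPath(8, true))       // mcache tiny allocator
	fmt.Println(allocPath(1024, false))   // mcache size-class span
	fmt.Println(allocPath(64<<10, false)) // mheap (large object)
}
```

Note that a pointer-containing 8-byte object still goes through a size-class span, since the tiny path requires noscan.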
Let's look at the functions involved in allocating on the heap, that is, in the arena region.
package main

func foo() *int {
    x := 1
    return &x
}

func main() {
    x := foo()
    println(*x)
}

According to the escape analysis introduced earlier, x in foo() escapes to the heap. Save the code above as main.go and look at the assembly.

$ go build -gcflags '-l' -o main main.go
$ go tool objdump -s "main\.foo" main
TEXT main.foo(SB) /Users/xxx/src/heap/main.go
  main.go:3             0x104aad0               65488b0c25a0080000      MOVQ GS:0x8a0, CX                       
  main.go:3             0x104aad9               483b6110                CMPQ 0x10(CX), SP                       
  main.go:3             0x104aadd               7639                    JBE 0x104ab18                           
  main.go:3             0x104aadf               4883ec18                SUBQ $0x18, SP                          
  main.go:3             0x104aae3               48896c2410              MOVQ BP, 0x10(SP)                       
  main.go:3             0x104aae8               488d6c2410              LEAQ 0x10(SP), BP                       
  main.go:4             0x104aaed               488d052c990000          LEAQ type.*+38976(SB), AX               
  main.go:4             0x104aaf4               48890424                MOVQ AX, 0(SP)                          
  main.go:4             0x104aaf8               e82304fcff              CALL runtime.newobject(SB)              
  main.go:4             0x104aafd               488b442408              MOVQ 0x8(SP), AX                        
  main.go:4             0x104ab02               48c70001000000          MOVQ $0x1, 0(AX)                        
  main.go:5             0x104ab09               4889442420              MOVQ AX, 0x20(SP)                       
  main.go:5             0x104ab0e               488b6c2410              MOVQ 0x10(SP), BP                       
  main.go:5             0x104ab13               4883c418                ADDQ $0x18, SP                          
  main.go:5             0x104ab17               c3                      RET                                     
  main.go:3             0x104ab18               e80389ffff              CALL runtime.morestack_noctxt(SB)       
  main.go:3             0x104ab1d               ebb1                    JMP main.foo(SB)                        
  :-1                   0x104ab1f               cc                      INT $0x3   

Heap allocation calls the runtime package's newobject function.

//go version 1.10.8
//file runtime/malloc.go

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }

    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }

    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }

    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)

        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }

    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1

    shouldhelpgc := false
    dataSize := size
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    if size <= maxSmallSize {
        // object size <= 32K
        if noscan && size < maxTinySize {
            // tiny allocation path: pointer-free objects smaller than 16 bytes
            // Tiny allocator.
            //
            // Tiny allocator combines several tiny allocation requests
            // into a single memory block. The resulting memory block
            // is freed when all subobjects are unreachable. The subobjects
            // must be noscan (don't have pointers), this ensures that
            // the amount of potentially wasted memory is bounded.
            //
            // Size of the memory block used for combining (maxTinySize) is tunable.
            // Current setting is 16 bytes, which relates to 2x worst case memory
            // wastage (when all but one subobjects are unreachable).
            // 8 bytes would result in no wastage at all, but provides less
            // opportunities for combining.
            // 32 bytes provides more opportunities for combining,
            // but can lead to 4x worst case wastage.
            // The best case winning is 8x regardless of block size.
            //
            // Objects obtained from tiny allocator must not be freed explicitly.
            // So when an object will be freed explicitly, we ensure that
            // its size >= maxTinySize.
            //
            // SetFinalizer has a special case for objects potentially coming
            // from tiny allocator, it such case it allows to set finalizers
            // for an inner byte of a memory block.
            //
            // The main targets of tiny allocator are small strings and
            // standalone escaping variables. On a json benchmark
            // the allocator reduces number of allocations by ~12% and
            // reduces heap size by ~20%.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        // object size > 32 KB
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }

    var scanSize uintptr
    if !noscan {
        // If allocating a defer+arg block, now that we've picked a malloc size
        // large enough to hold everything, cut the "asked for" size down to
        // just the defer header, so that the GC bitmap will record the arg block
        // as containing nothing at all (as if it were unused space at the end of
        // a malloc block caused by size rounding).
        // The defer arg areas are scanned as part of scanstack.
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                scanSize = dataSize - typ.size + typ.ptrdata
            }
        } else {
            scanSize = typ.ptrdata
        }
        c.local_scan += scanSize
    }

    // Ensure that the stores above that initialize x to
    // type-safe memory and set the heap bits occur before
    // the caller can make x observable to the garbage
    // collector. Otherwise, on weakly ordered machines,
    // the garbage collector could follow a pointer to x,
    // but see uninitialized memory or stale heap bits.
    publicationBarrier()

    // Allocate black during GC.
    // All slots hold nil so no scanning is needed.
    // This may be racing with GC so do it atomically if there can be
    // a race marking the bit.
    if gcphase != _GCoff {
        gcmarknewobject(uintptr(x), size, scanSize)
    }

    if raceenabled {
        racemalloc(x, size)
    }

    if msanenabled {
        msanmalloc(x, size)
    }

    mp.mallocing = 0
    releasem(mp)

    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }

    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }

    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }

    if shouldhelpgc {
        if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
            gcStart(gcBackgroundMode, t)
        }
    }

    return x
}

The whole allocation path can be broken into three cases by object size: size < 16 bytes, 16 bytes <= size <= 32 KB, and size > 32 KB.

5.1 size smaller than 16 bytes

For blocks smaller than 16 bytes, mcache has a dedicated region called tiny; tiny is a pointer to the start address of the current 16-byte block.

func mallocgc(...) {
    ...
            off := c.tinyoffset
            // align the offset to the object size
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            // serve from the current tiny block if the object fits
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // tiny is exhausted: allocate a fresh 16-byte block for it
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            }
            x = unsafe.Pointer(v)
            // zero the new block
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            // if the new block has more free space left than the old tiny block, keep it as tiny; tinyoffset records how much has been used
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
}

As shown above, tinyoffset records how far into tiny allocation has progressed, and subsequent allocations are addressed relative to it. The offset is first aligned according to the object size (for example, if size is a multiple of 8, tinyoffset is rounded up to a multiple of 8), and then the allocation is made. If the space remaining in tiny is insufficient, a fresh 16-byte block is requested and the object is allocated from it; any leftover space is recorded back into tiny.
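A runnable sketch of just the alignment rule (round reimplements the runtime's power-of-two round-up; tinyAlign is a hypothetical extraction of the switch above):

```go
package main

import "fmt"

// round rounds n up to a multiple of a, where a is a power of two;
// this mirrors the runtime's round helper.
func round(n, a uintptr) uintptr { return (n + a - 1) &^ (a - 1) }

// tinyAlign applies mallocgc's alignment rule for the tiny block: the
// offset is aligned to 8, 4, or 2 depending on the object size.
func tinyAlign(off, size uintptr) uintptr {
	switch {
	case size&7 == 0:
		return round(off, 8)
	case size&3 == 0:
		return round(off, 4)
	case size&1 == 0:
		return round(off, 2)
	}
	return off
}

func main() {
	fmt.Println(tinyAlign(5, 8)) // 8: an 8-byte object aligns the offset up to 8
	fmt.Println(tinyAlign(5, 4)) // 8: a 4-byte object aligns up to the next multiple of 4
	fmt.Println(tinyAlign(5, 3)) // 5: odd sizes need no alignment
}
```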

5.2 size greater than 32 KB

Allocations larger than 32 KB skip mcache and mcentral entirely and are served directly by mheap.

func mallocgc(...) {
    } else {
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
    ...
}

func largeAlloc(size uintptr, needzero bool, noscan bool) *mspan {
    ...
    npages := size >> _PageShift
    if size&_PageMask != 0 {
        npages++
    }
    ...
    s := mheap_.alloc(npages, makeSpanClass(0, noscan), true, needzero)
    if s == nil {
        throw("out of memory")
    }
    s.limit = s.base() + size
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

Allocations larger than 32 KB always get whole pages: shift right by _PageShift to get the page count, then AND with _PageMask to check the low bits and add one more page if there is a remainder.
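The page computation can be tried in isolation (constants copied from runtime/malloc.go for the versions discussed; pagesFor is an illustrative extraction, not a runtime function):

```go
package main

import "fmt"

// Go heap pages are 8 KB; _PageShift is 13 in runtime/malloc.go.
const (
	_PageShift = 13
	_PageSize  = 1 << _PageShift
	_PageMask  = _PageSize - 1
)

// pagesFor reproduces largeAlloc's page computation: divide by the page
// size with a right shift, then add one page if any low bits remain.
func pagesFor(size uintptr) uintptr {
	npages := size >> _PageShift
	if size&_PageMask != 0 {
		npages++
	}
	return npages
}

func main() {
	fmt.Println(pagesFor(8192))  // 1
	fmt.Println(pagesFor(8193))  // 2
	fmt.Println(pagesFor(40000)) // 5
}
```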

5.3 size between 16 bytes and 32 KB

For sizes between 16 bytes and 32 KB, the allocator first computes the appropriate sizeclass and then allocates from alloc[sizeclass] in mcache. If mcache.alloc[sizeclass] cannot satisfy the request, mcache refills from mcentral and then allocates. After serving mcache, mcentral checks whether it needs to replenish itself and, if so, requests spans from mheap.

func mallocgc(...) {
        ...
        } else {
            var sizeclass uint8
            // compute the sizeclass
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            // allocate one object from the corresponding span
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
          }
}

Let's first look at how the sizeclass is computed. Two arrays are predefined: size_to_class8 and size_to_class128. The i-th entry of size_to_class8 holds the sizeclass for sizes in the interval ((i-1)*8, i*8] (smallSizeDiv = 8); size_to_class128 works the same way with a 128-byte step. Sizes up to smallSizeMax - 8 = 1024 - 8 = 1016 bytes use size_to_class8; larger sizes use size_to_class128. The array begins 0, 1, 2, 3, 3, 4, 4, … For example, to allocate 17 bytes (anything under 16 bytes goes through mcache.tiny), sizeclass = size_to_class8[(17+7)/8] = size_to_class8[3] = 3. This space-for-time lookup noticeably speeds up allocation.
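The index computation for size_to_class8 can be checked directly (class8Index is an illustrative helper, not a runtime function; smallSizeDiv matches the runtime's value):

```go
package main

import "fmt"

// smallSizeDiv is 8 in the runtime; class8Index computes the index used
// to look up size_to_class8, i.e. ceil(size/8).
const smallSizeDiv = 8

func class8Index(size uintptr) uintptr {
	return (size + smallSizeDiv - 1) / smallSizeDiv
}

func main() {
	fmt.Println(class8Index(17)) // 3: the worked example, size_to_class8[3]
	fmt.Println(class8Index(24)) // 3: 24 bytes falls in the same (16, 24] bucket
	fmt.Println(class8Index(25)) // 4
}
```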
Once the sizeclass is known, allocation proceeds from mcache.alloc[sizeclass]; note that this is an *mspan. The actual allocation is done by nextFreeFast(), shown below.

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        if result < s.nelems {
            freeidx := result + 1
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            s.allocCache >>= uint(theBit + 1)
            s.freeindex = freeidx
            s.allocCount++
            return gclinkptr(result*s.elemsize + s.base())
        }
    }
    return 0
}

allocCache is a bitmap recording which slots are available (1 means free); the object address is then computed from the span's freeindex and elemsize.
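A toy model of this bitmap scan, assuming made-up span values and using math/bits in place of sys.Ctz64:

```go
package main

import (
	"fmt"
	"math/bits"
)

// A toy mirror of nextFreeFast's bookkeeping. allocCache is a bitmap of
// slots starting at freeindex (bit value 1 means free); TrailingZeros64
// plays the role of sys.Ctz64. All field values below are invented.
type span struct {
	allocCache uint64
	freeindex  uint64
	nelems     uint64
	elemsize   uint64
	base       uint64
}

func (s *span) nextFree() (addr uint64, ok bool) {
	theBit := uint64(bits.TrailingZeros64(s.allocCache))
	if theBit == 64 {
		return 0, false // cache exhausted
	}
	result := s.freeindex + theBit
	if result >= s.nelems {
		return 0, false
	}
	s.allocCache >>= theBit + 1 // consume bits up to and including this slot
	s.freeindex = result + 1
	return s.base + result*s.elemsize, true
}

func main() {
	s := &span{allocCache: 0b1101, nelems: 64, elemsize: 32, base: 0x1000}
	a, _ := s.nextFree()
	fmt.Printf("%#x\n", a) // 0x1000: slot 0 was free
	a, _ = s.nextFree()
	fmt.Printf("%#x\n", a) // 0x1040: slot 1 was taken, slot 2 is next
}
```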
If mcache.alloc[sizeclass] is exhausted, mcache requests memory from mcentral.

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(sizeclass uint8) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    s = c.alloc[sizeclass]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        systemstack(func() {
            // here mcache refills from mcentral
            c.refill(int32(sizeclass))
        })
        shouldhelpgc = true
        s = c.alloc[sizeclass]
        // after refilling from mcentral, retry the allocation from mcache
        freeIndex = s.nextFreeIndex()
    }

    ...
}

// nextFreeIndex returns the index of the next free object in s at
// or after s.freeindex.
// There are hardware instructions that can be used to make this
// faster if profiling warrants it.
// this function overlaps somewhat with nextFreeFast
func (s *mspan) nextFreeIndex() uintptr {
    ...
}

mcache refills from mcentral; if mcentral runs short, it in turn requests spans from mheap.

//go version 1.10.8

func (c *mcache) refill(sizeclass int32) *mspan {
    ...
    // request a span from mcentral
    s = mheap_.central[sizeclass].mcentral.cacheSpan()
    ...
    return s
}

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    ...
    // Replenish central list if empty.
    s = c.grow()
}

func (c *mcentral) grow() *mspan {
    npages := uintptr(class_to_allocnpages[c.sizeclass])
    size := uintptr(class_to_size[c.sizeclass])
    n := (npages << _PageShift) / size
    
    // request pages from mheap here
    s := mheap_.alloc(npages, c.sizeclass, false, true)
    ...
    return s
}

If mheap itself runs short, it asks the operating system. Continuing from mheap_.alloc() above:

func (h *mheap) alloc(npage uintptr, sizeclass int32, large bool, needzero bool) *mspan {
    ...
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, sizeclass, large)
    })
    ...
}

func (h *mheap) alloc_m(npage uintptr, sizeclass int32, large bool) *mspan {
    ... 
    s := h.allocSpanLocked(npage)
    ...
}

func (h *mheap) allocSpanLocked(npage uintptr) *mspan {
    ...
    s = h.allocLarge(npage)
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }
    ...
}

func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }

    v := h.sysAlloc(ask)
    ...
}

The full call chain is shown above; at the bottom, sysAlloc issues the system call (mmap or VirtualAlloc, much like the initialization path) to request memory from the operating system.
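The refill cascade can be modeled with a toy three-tier chain; all names, stock counts, and refill sizes here are invented for illustration and have nothing to do with real runtime quantities:

```go
package main

import "fmt"

// A toy chain mirroring mcache -> mcentral -> mheap -> OS: each tier
// serves from local stock and refills from the tier above when empty.
type tier struct {
	name   string
	stock  int
	refill int
	parent *tier
}

func (t *tier) get() string {
	if t.stock == 0 {
		if t.parent == nil {
			// bottom of the chain: ask the operating system
			return t.name + " -> OS"
		}
		from := t.parent.get() // refill from the tier above
		t.stock = t.refill
		t.stock--
		return t.name + " <- " + from
	}
	t.stock--
	return t.name
}

func main() {
	heap := &tier{name: "mheap"}
	central := &tier{name: "mcentral", refill: 4, parent: heap}
	cache := &tier{name: "mcache", stock: 1, refill: 2, parent: central}

	fmt.Println(cache.get()) // mcache
	fmt.Println(cache.get()) // mcache <- mcentral <- mheap -> OS
}
```

After the cascading refill, the next allocation is again served locally from mcache, which is exactly why the fast path stays cheap.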

6. Memory Reclamation

This section only covers some of the simpler reclamation paths.

6.1 mcache reclamation

Reclaiming an mcache has two parts: first, the unused memory in alloc is returned to the corresponding mcentral; then the mcache structure itself is freed back to mheap's cachealloc.

func freemcache(c *mcache) {
    systemstack(func() {
        c.releaseAll()
        ...

        lock(&mheap_.lock)
        purgecachedstats(c)
        mheap_.cachealloc.free(unsafe.Pointer(c))
        unlock(&mheap_.lock)
    })
}

func (c *mcache) releaseAll() {
    for i := 0; i < _NumSizeClasses; i++ {
        s := c.alloc[i]
        if s != &emptymspan {
            mheap_.central[i].mcentral.uncacheSpan(s)
            c.alloc[i] = &emptymspan
        }
    }
    // Clear tinyalloc pool.
    c.tiny = 0
    c.tinyoffset = 0
}

6.2 mcentral reclamation

When an mspan has no allocated objects left (allocCount == 0), it is returned to mheap.

func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
    ...
    lock(&c.lock)
    ...
    if s.allocCount != 0 {
        unlock(&c.lock)
        return false
    }

    c.nonempty.remove(s)
    unlock(&c.lock)
    mheap_.freeSpan(s, 0)
    return true
}

6.3 mheap

mheap does not return memory to the operating system on any schedule, but it does maintain its spans, for example by coalescing adjacent free spans.
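A minimal sketch of such coalescing, assuming spans are just (start page, page count) pairs; this is an illustration of the idea, not the runtime's actual free-span bookkeeping:

```go
package main

import (
	"fmt"
	"sort"
)

// span is a simplified free span: a run of npages pages starting at page
// index start.
type span struct{ start, npages int }

// coalesce sorts free spans by start page and merges each span into its
// predecessor when the two are adjacent, the way mheap merges neighbors
// when a span is freed.
func coalesce(spans []span) []span {
	sort.Slice(spans, func(i, j int) bool { return spans[i].start < spans[j].start })
	var out []span
	for _, s := range spans {
		if n := len(out); n > 0 && out[n-1].start+out[n-1].npages == s.start {
			out[n-1].npages += s.npages // adjacent neighbor: merge
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(coalesce([]span{{0, 2}, {2, 3}, {8, 1}})) // [{0 5} {8 1}]
}
```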

7. Summary

My study of Go's memory management is still ongoing; most of the above collates information available online (the references are listed). Some points I have not yet fully understood, so corrections and discussion are welcome.
