2018-05-13 (old notes, reorganized) A Survey of Memory Fragmentation

2016/01/25

The Memory Allocation Process

A user-space program calls malloc to request memory. malloc is a library function implemented by glibc, which runs its own memory-management machinery in user space to manage allocations within the process address space.
Depending on the situation, malloc issues the brk or mmap system call to ask the kernel for a memory region.
When an mmap request succeeds, the kernel creates a VMA node for the new region and stores it in the mmap field of the process memory descriptor, struct mm_struct.
As is well known, Linux uses the protected (virtual-address) mode of Intel CPUs and manages memory through page tables. Each process owns its own page-table root, also recorded in the process descriptor, in the pgd field.
For a newly allocated region, the kernel does not actually back it with physical memory right away; Linux follows an "allocate on first use" policy. When the process touches the new region, a page fault traps into the kernel, and only then does the kernel allocate the physical memory. Physical memory is allocated in units of pages. All pages are organized by the buddy algorithm (whose state is visible in /proc/buddyinfo) into multiple queues, each holding free blocks of a single size: the first queue holds blocks of 2^0 * PAGE_SIZE, i.e. 4 KB; the second, 2^1 * PAGE_SIZE, i.e. 8 KB; the last, 2^10 * PAGE_SIZE, i.e. 4 MB. The buddy allocator serves a request from the best-fitting queue. So, via the page fault, the process asks the kernel to request pages from the buddy allocator; once the allocation succeeds, the mapping between the new region and its pages is written into the process page table.
Note: because process address space is virtual, the buddy allocator usually serves processes from the ORDER=0 queue (2^0 * PAGE_SIZE); even when a process needs several pages, the allocator simply hands out the corresponding number of single pages from the ORDER=0 queue.
At this point, the virtual address range requested by the user-space process has been tied to physical memory, and subsequent code can access it.
User-space processes are not the only memory consumers: requests from inside the kernel are also served by the buddy allocator, through two interfaces, kmalloc and vmalloc. Both still return virtual addresses, but kmalloc requires a physically contiguous range while vmalloc does not. kmalloc is intended for small allocations and is built on the slab allocator; vmalloc has no such restriction.
The figure below shows the relationships among the terms above: (malloc, glibc, VMA, mm_struct, pgd, kmalloc, slab, vmalloc, buddy allocator)

< ---------------------------------- >

Memory Fragmentation

Although allocation happens at several layers, only two of them record how memory is used and do real management work: the glibc allocator and the buddy allocator. Because memory is handed out in blocks of varying sizes, and allocated regions can be freed and reallocated, fragmentation is unavoidable. Fragmentation at the buddy-allocator level degrades system performance and can even threaten normal operation.

How Fragmentation Hurts Performance

The Wikipedia definition of memory fragmentation includes an example of its performance impact:
A subtler problem is that fragmentation may prematurely exhaust a cache, causing thrashing, due to caches holding blocks, not individual data. For example, suppose a program has a working set of 256 KiB, and is running on a computer with a 256 KiB cache (say L2 instruction+data cache), so the entire working set fits in cache and thus executes quickly, at least in terms of cache hits. Suppose further that it has 64 translation lookaside buffer (TLB) entries, each for a 4 KiB page: each memory access requires a virtual-to-physical translation, which is fast if the page is in cache (here TLB). If the working set is unfragmented, then it will fit onto exactly 64 pages (the page working set will be 64 pages), and all memory lookups can be served from cache. However, if the working set is fragmented, then it will not fit into 64 pages, and execution will slow due to thrashing: pages will be repeatedly added and removed from the TLB during operation. Thus cache sizing in system design must include margin to account for fragmentation.
Understanding this example requires knowing how the MMU, the TLB and the page table relate.
As noted above, Linux uses the protected mode of Intel CPUs and manages memory through page tables. Each virtual (linear) address corresponds to some page, and that correspondence lives in the page table. Since almost every access to a page of virtual memory must first be translated to the matching physical address, page-table lookup performance is critical. The Intel MMU architecture therefore implements a TLB (translation lookaside buffer), a hardware cache of virtual-to-physical mappings: when a virtual address is accessed, the processor first checks whether the TLB holds the translation; on a hit it is returned directly, otherwise the page table must be walked to find the physical address.
The TLB is small, with only 64 entries here. Once memory is fragmented, a process's virtual address space maps to a large number of scattered small pages; the TLB cannot hold that many entries, which means that throughout the process's lifetime the MMU keeps missing in the TLB and reloading it, sharply reducing execution efficiency.
This kind of performance problem is subtle and cannot be spotted from any one or two obvious system metrics. The following program was used to verify it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define KB (1024)
#define MB (1024 * KB)
#define GB (1024 * MB)

int main(int argc, char *argv[])
{
    char *p;
    int i = 0;

    /* keep allocating and touching 70 KB blocks */
    while ((p = (char *)malloc(70 * KB)))
    {
        memset(p, 0, 70 * KB);

        if (i > 500000)
            break;

        i++;
    }

    sleep(1);

    return 0;
}

(figure a.png: perf stat output)

The effect is clear: on LAB 112, dTLB-load-misses is far higher than on InternalBeta 132, a benchmark machine of the same class: 142,125,471 vs. 3,799,475. The run also took 5 seconds longer. This program does no CPU-intensive work at all, yet under these conditions it ran 33% slower: that is the performance degradation on LAB 112.

How Fragmentation Affects kmalloc

Because kmalloc requires a physically contiguous address range, when the buddy allocator has no large blocks left, a kmalloc that cannot be satisfied fails, and a "kmalloc fail" message appears in the kernel log.

The OS Recovery Mechanisms: Reclaim and Compaction

Q: How do you tell whether the system has become fragmented?
A: cat /proc/buddyinfo
Q: How do you inspect dTLB-load-misses?
A: perf stat -e dTLB-load,dTLB-load-misses ./brush_mem; perf top shows the busiest functions on the current system
Q: How do you inspect page faults?
A: ps -o majflt,minflt -C prog
Q: How does a virtual (linear) address map to a physical address?
A: The page table stores each page's starting physical address; look up the page for a virtual address in the page table, and the final physical address follows

Appendix 1: MMU, TLB and Cache

Detecting memory fragmentation
Locating the cause of fragmentation
ftrace, systemtap,

Internal vs. external fragmentation: pooling cannot remove the performance cost of external fragmentation, because a pool may still be backed by scattered pages. What if a user-space process insists on physically contiguous memory?

overcommit: Link
The definition of memory fragmentation: as given on Wikipedia

On the performance impact of memory fragmentation

Memory fragmentation is one of the most severe problems faced by system managers.[citation needed] Over time, it leads to degradation of system performance. Eventually, memory fragmentation may lead to complete loss of (application-usable) free memory.
A case of a system entering thrashing: Link
Inspecting memory fragmentation: cat /proc/buddyinfo -> the buddy system: what the buddy system is
cat /proc/buddyinfo
External fragmentation is a problem under some workloads, and buddyinfo is a useful tool for helping diagnose these problems. Buddyinfo will give you a clue as to how big an area you can safely allocate, or why a previous allocation failed.
Each column represents the number of pages of a certain order which are available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in pagetypeinfo.
More detail on buddyinfo: Link Link
cat /proc/pagetypeinfo
Fragmentation avoidance in the kernel works by grouping pages of different migrate types into the same contiguous regions of memory called page blocks. A page block is typically the size of the default hugepage size e.g. 2MB on X86-64. By keeping pages grouped based on their ability to move, the kernel can reclaim pages within a page block to satisfy a high-order allocation. The pagetypeinfo begins with information on the size of a page block. It then gives the same type of information as buddyinfo except broken down by migrate-type and finishes with details on how many page blocks of each type exist.
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can make an estimate of the likely number of huge pages that can be allocated
at a given point in time. All the "Movable" blocks should be allocatable unless memory has been mlock()'d. Some of the Reclaimable blocks should also be allocatable although a lot of filesystem metadata may have to be reclaimed to achieve this.
The official Linux proc filesystem documentation

Linux tool to dump x86 CPUID information about the CPU: used to inspect CPU details, including the L1 and L2 caches.
The perf tool exposes a wealth of low-level data. Its wiki: Link; usage guide: Link. perf stat -e dTLB-load,dTLB-load-misses ./brush_mem; perf top shows the busiest functions on the current system.
The ps tool shows a process's page-fault counters, among many other kinds of information.
ps -o majflt,minflt -C mailto
The difference between majflt and minflt:
majflt means disk I/O is required: either the page's contents must be loaded from disk into physical memory, or physical memory is short and some pages must first be evicted to disk.
Note that page faults are unrelated to memory fragmentation: because Linux defers allocation, the first access to newly allocated space always raises a page fault.
The relationship between the TLB and the cache: Link. It also gives a command for inspecting the TLB: dmidecode -t cache
The relationships among the MMU, cache and TLB: Link, Link2


An illustrated article that lays out clearly how virtual memory, physical memory, the buddy system and the slab allocator relate: Link


How should the dTLB counter "DTLB Load Misses Retired" be interpreted? Link, Chinese Link;
Thread Specificity: TS
The number of retired load instructions that experienced data translation lookaside buffer (DTLB) misses. When a 32-bit linear data address is submitted by the processor, it is first submitted to the DTLB. The translation lookaside buffer (TLB) translates the 32-bit linear address produced by the load unit into a 36-bit physical memory address before the cache lookup is performed. DTLB size and organization are processor design-specific. A DTLB miss requires memory accesses to the OS page directory and tables in order translate the 32-bit linear address.
A DTLB miss does not necessarily indicate a cache miss.
CAUTION:
Extra memory accesses could impact CPI.
TIP
To minimize DTLB misses, minimize the size of the data and locality such that:
§ data spans a minimum number of pages
§ the number of pages the data spans is less than the number of DTLB entries
The differences among kmalloc, vmalloc and malloc: Link
- kmalloc and vmalloc allocate kernel memory; malloc allocates user memory
- kmalloc guarantees the allocation is physically contiguous; vmalloc guarantees only contiguity in the virtual address space; malloc guarantees neither (the author's own guess, not necessarily correct)
- kmalloc can only allocate limited sizes; vmalloc and malloc can allocate comparatively large areas
- memory needs to be physically contiguous only when it will be accessed by DMA
- vmalloc is slower than kmalloc
How malloc works underneath: Link. The M_TRIM_THRESHOLD option: the default value for this parameter is 128*1024; Link
Linux's memory overcommit mechanism --> reflected in two fields of atop's SWP line ->
This line contains the total amount of swap space on disk ('tot') and the amount of free swap space ('free').
Furthermore the committed virtual memory space ('vmcom') and the maximum limit of the committed space ('vmlim', which is by default swap size plus 50% of memory size) is shown. The committed space is the reserved virtual space for all allocations of private memory space for processes. The kernel only verifies whether the committed space exceeds the limit if strict overcommit handling is configured (vm.overcommit_memory is 2).
cat /proc/sys/vm/overcommit_memory
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit

Reclaim: Link1 Link2
Zone_reclaim_mode allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes in the system.
This is value ORed together of

1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages

zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality.
zone_reclaim may be enabled if it's known that the workload is partitioned such that each partition fits within a NUMA node and that accessing remote memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are currently not used) before allocating off node pages.
Allowing zone reclaim to write out pages stops processes that are writing large amounts of data from dirtying pages on other nodes. Zone reclaim will write out dirty pages if a zone fills up and so effectively throttle the process. This may decrease the performance of a single process since it cannot use all of system memory to buffer the outgoing writes anymore but it preserve the memory on other nodes so that the performance of other processes running on other nodes will not be affected.
Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations.

cgroup system resource management: Link

sar??
curl - memory poll ?

/proc/sys/vm/zone_reclaim_mode



TLB (Translation Lookaside Buffer)

This cache holds physical addresses to speed up the translation of linear addresses. The first time a linear address is used, the corresponding physical address is computed through the page directory and page tables and then cached in the TLB, so that later references to the same linear address get their physical address straight from the TLB. Note also that whenever the CR3 control register is updated, the hardware automatically invalidates every TLB entry: CR3 now holds a new page-directory base address, so stale TLB entries must not be consulted during translation.

Paging under PAE (Physical Address Extension)

PAE was introduced because 4 GB of physical memory had become a severe bottleneck for large servers running thousands of processes, which pushed Intel to extend x86 physical addressing. By growing the processor's address pins from 32 to 36, addressable memory reached 2^36 = 64 GB, satisfying the high-end market and producing another variant of paged addressing. The low-end PC market has since been moving from 32-bit to 64-bit anyway, so this mechanism is only sketched briefly here.
http://www.kerneltravel.net/journal/v/mem.htm
http://coolshell.cn/articles/7490.html
