The Buddy System
As a Linux system runs, pages are continually allocated and freed, and memory gradually fragments. When an application then requests a large allocation, it becomes hard to find suitable contiguous memory, allocation takes longer, and system performance suffers badly. The kernel therefore implements the buddy system to mitigate page fragmentation.
The basic idea of the buddy system is to manage contiguous free pages as blocks: the system groups 2^N contiguous pages into one block and links it onto the corresponding list. The zone page-management structure contains a free_area array of length MAX_ORDER; free_area[0] manages blocks of 2^0 pages, free_area[1] manages blocks of 2^1 pages, and so on up to free_area[MAX_ORDER-1], which manages blocks of 2^(MAX_ORDER-1) pages. On allocation, the allocator finds the smallest list whose block size is at least the requested number of pages. If that list has a free block, the allocator takes one, hands part of it to the requester, and splits the remainder into power-of-two pieces that are linked back into the lists for smaller blocks; this may take several splits until every leftover page is on a list, and splitting always prefers to place pieces on the largest possible list. Freeing reverses the process: the freed block is first inserted into the appropriate list, and if it can be merged with its adjacent buddy into a higher-order block, it is moved to the higher-order list; this repeats until no further merging is possible. This merging is what reassembles fragmented pages into large contiguous blocks.
#define MAX_ORDER 11
struct zone {
struct free_area free_area[MAX_ORDER]; // free-block lists of different sizes
};
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};

At boot, the system builds the physical pages into the buddy system according to the memory configuration, linking pages into the largest possible blocks on the order MAX_ORDER-1 list. This guarantees that fragmentation is minimal right after startup.
Memory migration types
The buddy system is one of the best-engineered parts of the kernel, but it has shortcomings. First, it can only merge blocks passively at free time to reduce fragmentation; it cannot proactively avoid creating fragments at allocation time. Second, in an extreme case only a small number of pages are allocated while most pages remain free, yet the allocated pages are scattered everywhere, so the system cannot assemble memory into large power-of-two blocks. A natural question is whether memory could be compacted at a suitable moment, gathering the scattered pages together by reallocating or freeing them. The answer is yes, and this evolved into the concept of memory migratability. By analogy with file systems, disks suffer the same storage fragmentation problem, and dedicated tools exist to defragment them. Of course, not every page can be reallocated or freed; each page's usage has to be considered individually.
Looking at struct free_area, the modern buddy system does not simply hang blocks on MAX_ORDER doubly linked lists: for each block size there are MIGRATE_TYPES lists. The kernel divides allocated memory into several types such as unmovable, movable, and reclaimable; this article only discusses the first three migration types.
enum {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES, /* the number of types on the pcp lists */
MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
/*
* MIGRATE_CMA migration type is designed to mimic the way
* ZONE_MOVABLE works. Only movable pages can be allocated
* from MIGRATE_CMA pageblocks and page allocator never
* implicitly change migration type of MIGRATE_CMA pageblock.
*
* The way to use it is to change migratetype of a range of
* pageblocks to MIGRATE_CMA which can be done by
* __free_pageblock_cma() function. What is important though
* is that a range of pageblocks must be aligned to
* MAX_ORDER_NR_PAGES should biggest page be bigger then
* a single pageblock.
*/
MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
MIGRATE_ISOLATE, /* can't allocate from here */
#endif
MIGRATE_TYPES
};
- Movable (MIGRATE_MOVABLE): most memory allocated by applications is of this type. User space accesses physical memory entirely through MMU address translation, so where a given linear address maps to is completely transparent to the application. The system can therefore, at a suitable time, allocate a new page for the application, copy the old page's contents over, and update the process page tables to remap the address, compacting memory without affecting the running application at all.
- Unmovable (MIGRATE_UNMOVABLE): memory occupied by core kernel data structures and code. Once allocated it cannot be moved, only freed explicitly.
- Reclaimable (MIGRATE_RECLAIMABLE): application text segments and page-cache pages of open files are mostly of this type. Their key property is that they can be restored from disk or other backing storage, so when memory is badly fragmented or scarce, the pages can simply be freed and re-read from the backing store when needed.
The caller indicates the intended use of the pages at allocation time, and pages are served from the corresponding regions, preventing long-lived UNMOVABLE pages from breaking up contiguous free physical memory. When allocation fails, memory compaction runs first, mainly moving page contents and updating the mappings, and reclaimable pages are freed; this active repositioning of pages keeps physical memory contiguous.

Movable memory allocation
Given that the system divides pages by usage into different types managed on different lists, when does it put memory onto those lists? From memmap_init_zone we can see that at boot all memory is marked movable. During kernel initialisation, pages are continually allocated, and most of these are unmovable and permanent; at that stage the system is very likely to allocate contiguous memory from the largest blocks, and the movable pages so allocated are re-marked as unmovable. Thus, after initialisation the system ends up with a batch of unmovable memory without producing much fragmentation. Reclaimable memory is mostly created later, while application processes run.
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
/* ...... */
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/* ...... */
not_early:
if (!(pfn & (pageblock_nr_pages - 1))) {
struct page *page = pfn_to_page(pfn);
__init_single_page(page, pfn, zone, nid);
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
} else {
__init_single_pfn(pfn, zone, nid);
}
}
}
Just as page allocation from a zone falls back, when the current node lacks enough pages, to other zones in the priority order given by the zonelists in pg_data_t, MIGRATE_TYPES works similarly: a fallbacks array defines, for each migrate type, which types to try in turn when that type's free pages are exhausted.
/*
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
*/
static int fallbacks[MIGRATE_TYPES][4] = {
// fallback list when allocating unmovable pages fails
[MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
// fallback list when allocating reclaimable pages fails
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES },
// fallback list when allocating movable pages fails
[MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
[MIGRATE_CMA] = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
[MIGRATE_ISOLATE] = { MIGRATE_TYPES }, /* Never used */
#endif
};
The global variables and helper functions for mobility grouping are always compiled into the kernel, but they are only meaningful when the system has enough memory to spread across the per-migrate-type lists. Since each migrate list should hold a reasonable amount of memory, the kernel needs a notion of "reasonable". This is provided by two globals, pageblock_order and pageblock_nr_pages: the first is the allocation order the kernel considers "large", and pageblock_nr_pages is the number of pages at that order. If no migrate list can hold a reasonably large contiguous block, page migration brings no benefit, so the kernel disables the feature when available memory is too small: build_all_zonelists checks the watermarks, and if the total number of physical pages above the high watermark across all zones is less than (pageblock_nr_pages * number of migrate types), grouping by mobility is disabled and the global page_group_by_mobility_disabled is set to 1; otherwise it is set to 0. Once disabled, all allocated pages are treated as unmovable.
A zone has a field pageblock_flags pointing to a bitmap whose size depends on the number of pageblocks the zone manages; every NR_PAGEBLOCK_BITS bits record which migrate type one pageblock belongs to, so that on free a page can conveniently be returned to its original buddy list.
#define PB_migratetype_bits 3
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
PB_migrate,
PB_migrate_end = PB_migrate + PB_migratetype_bits - 1,
/* 3 bits required for migrate types */
PB_migrate_skip,/* If set the block is skipped by compaction */
/*
* Assume the bits will always align on a word. If this assumption
* changes then get/set pageblock needs updating.
*/
NR_PAGEBLOCK_BITS
};
// mm/page_alloc.c
void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
{
if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
page_group_by_mobility_disabled = 1;
else
page_group_by_mobility_disabled = 0;
}
/* Convert GFP flags to their corresponding migrate type */
static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
{
WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
/* Group based on mobility */
return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
void set_pageblock_migratetype(struct page *page, int migratetype)
{
if (unlikely(page_group_by_mobility_disabled &&
migratetype < MIGRATE_PCPTYPES))
migratetype = MIGRATE_UNMOVABLE;
set_pageblock_flags_group(page, (unsigned long)migratetype,
PB_migrate, PB_migrate_end);
}
#define get_pageblock_migratetype(page) \
get_pfnblock_flags_mask(page, page_to_pfn(page), \
PB_migrate_end, MIGRATETYPE_MASK)
The kernel defines the two flags __GFP_MOVABLE and __GFP_RECLAIMABLE to distinguish migrate types; if neither is set, the allocation is unmovable.
Page compaction
Compaction runs in two modes, asynchronous and synchronous. When page reclaim fails, asynchronous compaction is started first; if that cannot produce the required memory, direct reclaim is tried next to recover enough memory; and if enough contiguous memory still cannot be obtained, synchronous compaction is attempted.
Compaction targets two migrate types, MIGRATE_RECLAIMABLE and MIGRATE_MOVABLE, and from the compaction point of view there are two kinds of pages, anonymously mapped and file mapped. Asynchronous compaction only migrates anonymous pages, while synchronous compaction also handles dirty file pages, and waits for any page under writeback to finish. Asynchronous compaction is comparatively conservative: migrating an anonymous page only requires copying its contents to a new page, establishing the new mapping, and tearing down the old one, all pure memory operations, whereas synchronous compaction also involves I/O writeback and takes much longer.
Page allocation
Kernel page allocation goes through struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order). This is the core entry point into the buddy allocator; all other memory-allocation functions eventually call down to it.
- nid: node ID; in a NUMA multi-processor system, each CPU with its local memory forms a node
- gfp_mask: page allocation attribute mask
- order: allocation order; for example, order 2 allocates 2^2 = 4 pages
The kernel also defines an alloc_pages macro that fills in the current node ID automatically when requesting pages.
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
The allocation mask matters a great deal: it determines the buddy allocator's page-allocation policy. GFP stands for get free page.
/* Plain integer GFP bitmasks. Do not use this directly. */
// Zone modifiers: choose which zone to allocate from first; if the three flags below are all clear, allocation starts from low memory
#define ___GFP_DMA 0x01u //allocate from ZONE_DMA
#define ___GFP_HIGHMEM 0x02u //allocate from high memory
#define ___GFP_DMA32 0x04u //allocate from ZONE_DMA32
// Behaviour modifiers
#define ___GFP_MOVABLE 0x08u /* page is movable */
#define ___GFP_RECLAIMABLE 0x10u /* page is reclaimable */
#define ___GFP_HIGH 0x20u /* may use emergency pools? */
#define ___GFP_IO 0x40u /* can start physical IO? */
#define ___GFP_FS 0x80u /* can call down to the low-level FS? */
#define ___GFP_COLD 0x100u /* cache-cold page required */
#define ___GFP_NOWARN 0x200u /* suppress allocation failure warnings */
#define ___GFP_REPEAT 0x400u /* retry the allocation; may still fail */
#define ___GFP_NOFAIL 0x800u /* retry forever; cannot fail */
#define ___GFP_NORETRY 0x1000u /* do not retry; may fail */
#define ___GFP_MEMALLOC 0x2000u /* use emergency reserves */
#define ___GFP_COMP 0x4000u /* add compound page metadata */
#define ___GFP_ZERO 0x8000u /* return a zeroed page on success */
// Type modifiers
#define ___GFP_NOMEMALLOC 0x10000u /* do not use emergency reserves */
#define ___GFP_HARDWALL 0x20000u /* only allocate on nodes associated with CPUs the process may run on */
#define ___GFP_THISNODE 0x40000u /* no fallback node, no placement policy */
#define ___GFP_ATOMIC 0x80000u /* atomic allocation; must not be interrupted under any circumstances */
#define ___GFP_ACCOUNT 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_DIRECT_RECLAIM 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
#define ___GFP_KSWAPD_RECLAIM 0x2000000u
/*
* Physical address zone modifiers (see linux/mmzone.h - low four bits)
*
* Do not put any conditional on these. If necessary modify the definitions
* without the underscores and use them consistently. The definitions here may
* be used in bit comparisons.
*/
#define __GFP_DMA ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* ZONE_MOVABLE allowed */
#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
/*
* Page mobility and placement hints
*
* These flags provide hints about how mobile the page is. Pages with similar
* mobility are placed within the same pageblocks to minimise problems due
* to external fragmentation.
*
* __GFP_MOVABLE (also a zone modifier) indicates that the page can be
* moved by page migration during memory compaction or can be reclaimed.
*
* __GFP_RECLAIMABLE is used for slab allocations that specify
* SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
*
* __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
* these pages will be spread between local zones to avoid all the dirty
* pages being in one zone (fair zone allocation policy).
*
* __GFP_HARDWALL enforces the cpuset memory allocation policy.
*
* __GFP_THISNODE forces the allocation to be satisified from the requested
* node with no fallbacks or placement policy enforcements.
*
* __GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
*/
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
/*
* Watermark modifiers -- controls access to emergency reserves
*
* __GFP_HIGH indicates that the caller is high-priority and that granting
* the request is necessary before the system can make forward progress.
* For example, creating an IO context to clean pages.
*
* __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
* high priority. Users are typically interrupt handlers. This may be
* used in conjunction with __GFP_HIGH
*
* __GFP_MEMALLOC allows access to all memory. This should only be used when
* the caller guarantees the allocation will allow more memory to be freed
* very shortly e.g. process exiting or swapping. Users either should
* be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
*
* __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
* This takes precedence over the __GFP_MEMALLOC flag if both are set.
*/
#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC)
#define __GFP_HIGH ((__force gfp_t)___GFP_HIGH)
#define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)
#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
/*
* Reclaim modifiers
*
* __GFP_IO can start physical IO.
*
* __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
* allocator recursing into the filesystem which might already be holding
* locks.
*
* __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
* This flag can be cleared to avoid unnecessary delays when a fallback
* option is available.
*
* __GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when
* the low watermark is reached and have it reclaim pages until the high
* watermark is reached. A caller may wish to clear this flag when fallback
* options are available and the reclaim is likely to disrupt the system. The
* canonical example is THP allocation where a fallback is cheap but
* reclaim/compaction may cause indirect stalls.
*
* __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
*
* The default allocator behavior depends on the request size. We have a concept
* of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
* !costly allocations are too essential to fail so they are implicitly
* non-failing by default (with some exceptions like OOM victims might fail so
* the caller still has to check for failures) while costly requests try to be
* not disruptive and back off even without invoking the OOM killer.
* The following three modifiers might be used to override some of these
* implicit rules
*
* __GFP_NORETRY: The VM implementation will try only very lightweight
* memory direct reclaim to get some memory under memory pressure (thus
* it can sleep). It will avoid disruptive actions like OOM killer. The
* caller must handle the failure which is quite likely to happen under
* heavy memory pressure. The flag is suitable when failure can easily be
* handled at small cost, such as reduced throughput
*
* __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
* procedures that have previously failed if there is some indication
* that progress has been made else where. It can wait for other
* tasks to attempt high level approaches to freeing memory such as
* compaction (which removes fragmentation) and page-out.
* There is still a definite limit to the number of retries, but it is
* a larger limit than with __GFP_NORETRY.
* Allocations with this flag may fail, but only when there is
* genuinely little unused memory. While these allocations do not
* directly trigger the OOM killer, their failure indicates that
* the system is likely to need to use the OOM killer soon. The
* caller must handle failure, but can reasonably do so by failing
* a higher-level request, or completing it only in a much less
* efficient manner.
* If the allocation does fail, and the caller is in a position to
* free some non-essential memory, doing so could benefit the system
* as a whole.
*
* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
* cannot handle allocation failures. The allocation could block
* indefinitely but will never return with failure. Testing for
* failure is pointless.
* New users should be evaluated carefully (and the flag should be
* used only when there is no reasonable failure policy) but it is
* definitely preferable to use the flag rather than opencode endless
* loop around allocator.
* Using this flag for costly allocations is _highly_ discouraged.
*/
#define __GFP_IO ((__force gfp_t)___GFP_IO)
#define __GFP_FS ((__force gfp_t)___GFP_FS)
#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL)
#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY)
/*
* Action modifiers
*
* __GFP_NOWARN suppresses allocation failure reports.
*
* __GFP_COMP address compound page metadata.
*
* __GFP_ZERO returns a zeroed page on success.
*/
#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/*
* Useful GFP flag combinations that are commonly used. It is recommended
* that subsystems start with one of these combinations and then set/clear
* __GFP_FOO flags as necessary.
*
* GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
* watermark is applied to allow access to "atomic reserves"
*
* GFP_KERNEL is typical for kernel-internal allocations. The caller requires
* ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
*
* GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation is
* accounted to kmemcg.
*
* GFP_NOWAIT is for kernel allocations that should not stall for direct
* reclaim, start physical IO or use any filesystem callback.
*
* GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
* Please try to avoid using this flag directly and instead use
* memalloc_noio_{save,restore} to mark the whole scope which cannot
* perform any IO with a short explanation why. All allocation requests
* will inherit GFP_NOIO implicitly.
*
* GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
* Please try to avoid using this flag directly and instead use
* memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
* recurse into the FS layer with a short explanation why. All allocation
* requests will inherit GFP_NOFS implicitly.
*
* GFP_USER is for userspace allocations that also need to be directly
* accessibly by the kernel or hardware. It is typically used by hardware
* for buffers that are mapped to userspace (e.g. graphics) that hardware
* still must DMA to. cpuset limits are enforced for these allocations.
*
* GFP_DMA exists for historical reasons and should be avoided where possible.
* The flags indicates that the caller requires that the lowest zone be
* used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
* it would require careful auditing as some users really require it and
* others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
* lowest zone as a type of emergency reserve.
*
* GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
* address.
*
* GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
* do not need to be directly accessible by the kernel but that cannot
* move once in use. An example may be a hardware allocation that maps
* data directly into userspace but has no addressing limitations.
*
* GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
* need direct access to but can use kmap() when access is required. They
* are expected to be movable via page reclaim or page migration. Typically,
* pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
*
* GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
* compound allocations that will generally fail quickly if memory is not
* available and will not wake kswapd/kcompactd on failure. The _LIGHT
* version does not attempt reclaim/compaction at all and is by default used
* in page fault path, while the non-light is used by khugepaged.
*/
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
__GFP_RECLAIMABLE)
#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA __GFP_DMA
#define GFP_DMA32 __GFP_DMA32
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
~__GFP_RECLAIM)
/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT 3
- GFP_ATOMIC: for atomic allocations that must not be interrupted in any circumstances; may dip into the emergency pools. Used in interrupt handlers, bottom halves, while holding spinlocks, and anywhere else that cannot sleep.
- GFP_KERNEL: the normal allocation mode; it may block and so must not be used in non-interruptible contexts. The kernel does its best to obtain the memory the caller needs. This flag should be the default choice.
- GFP_NOWAIT: like GFP_ATOMIC, except that the emergency pools are not used, which increases the chance of allocation failure.
- GFP_NOIO: may block but will not start disk I/O. During allocation, if memory runs short, the kernel may start disk I/O to swap some pages out to the swap partition; some contexts must not start I/O and therefore use this flag.
- GFP_NOFS: may block if necessary and may start disk I/O, but must not call into the VFS. File systems generally forbid VFS operations while allocating pages; otherwise, writing back dirty file pages to reclaim memory under severe pressure could deadlock.
- GFP_USER: a normal, blockable allocation. Typically used when hardware allocates a buffer that is then mapped into user space.
- GFP_HIGHUSER: an extension of GFP_USER, also for user space. It allows allocation of high memory that cannot be mapped directly. User processes can use high-memory pages without harm, since a process address space is always organised through non-linear page tables.
- GFP_HIGHUSER_MOVABLE: similar in use to GFP_HIGHUSER, but allocates preferentially from ZONE_MOVABLE.
GFP_KERNEL is by far the most commonly used allocation mask in the kernel. It allocates preferentially from low memory, and the pages it returns generally serve kernel management structures and various caches.
alloc_pages() eventually calls __alloc_pages_nodemask(), the core function of the buddy allocator.
[alloc_pages->alloc_pages_node->__alloc_pages->__alloc_pages_nodemask]
/*
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
struct alloc_context ac = { };
gfp_mask &= gfp_allowed_mask;
alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
finalise_ac(gfp_mask, order, &ac);
/* First allocation attempt */
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (likely(page))
goto out;
/*
* Apply scoped allocation constraints. This is mainly about GFP_NOFS
* resp. GFP_NOIO which has to be inherited for all allocation requests
* from a particular context which has been marked by
* memalloc_no{fs,io}_{save,restore}.
*/
alloc_mask = current_gfp_context(gfp_mask);
ac.spread_dirty_pages = false;
/*
* Restore the original nodemask if it was potentially replaced with
* &cpuset_current_mems_allowed to optimize the fast-path attempt.
*/
if (unlikely(ac.nodemask != nodemask))
ac.nodemask = nodemask;
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
out:
if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
__free_pages(page, order);
page = NULL;
}
trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
return page;
}
static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
int preferred_nid, nodemask_t *nodemask,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
{
ac->high_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
ac->migratetype = gfpflags_to_migratetype(gfp_mask);
if (cpusets_enabled()) {
*alloc_mask |= __GFP_HARDWALL;
if (!ac->nodemask)
ac->nodemask = &cpuset_current_mems_allowed;
else
*alloc_flags |= ALLOC_CPUSET;
}
fs_reclaim_acquire(gfp_mask);
fs_reclaim_release(gfp_mask);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
if (should_fail_alloc_page(gfp_mask, order))
return false;
if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
*alloc_flags |= ALLOC_CMA;
return true;
}
alloc_context is an intermediate structure for an allocation, holding the computed allocation-policy values. gfp_zone computes which zone to try first; the macros it uses are shown below. On an embedded ARM system, MAX_NR_ZONES is 3 (ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE), so ZONES_SHIFT is 2. gfp_zone(GFP_KERNEL) evaluates to 0, i.e. high_zoneidx is 0. pglist_data->zonelist->zoneref defines the priority order of zones to allocate from, earlier entries having higher priority. gfpflags_to_migratetype extracts the MIGRATE_TYPES type of the request, which determines which zone->free_area->free_list to allocate from first; gfpflags_to_migratetype(GFP_KERNEL) evaluates to MIGRATE_UNMOVABLE.
static inline enum zone_type gfp_zone(gfp_t flags)
{
enum zone_type z;
int bit = (__force int) (flags & GFP_ZONEMASK);
z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
((1 << ZONES_SHIFT) - 1);
return z;
}
#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
#define GFP_ZONE_TABLE ( \
(ZONE_NORMAL << 0 * ZONES_SHIFT) \
| (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT) \
| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT) \
| (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT) \
| (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT) \
| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT) \
| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT) \
| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT) \
)
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#endif
[__alloc_pages_nodemask->finalise_ac->first_zones_zonelist->next_zones_zonelist->__next_zones_zonelist]
first_zones_zonelist computes preferred_zoneref, the first zone from which to try allocating pages. The computation is driven by high_zoneidx and is simple: walk the zones in the order defined by pglist_data->zonelist->zoneref and take the first whose index is less than or equal to high_zoneidx, also honouring nodemask, since the zonerefs may come from different nodes. It is ultimately implemented by __next_zones_zonelist. gfp_zone(GFP_KERNEL) evaluates to 0, so GFP_KERNEL can only allocate from the zone whose zone_idx is 0. Zone types are defined in enum zone_type; an embedded ARM system defines the three types ZONE_NORMAL, ZONE_HIGHMEM and ZONE_MOVABLE, laid out in memory in enum order, so zone_idx(zone) yields 0 for ZONE_NORMAL, 1 for ZONE_HIGHMEM and 2 for ZONE_MOVABLE. Hence the GFP_KERNEL mask can only allocate pages from ZONE_NORMAL.
enum zone_type {
#ifdef CONFIG_ZONE_DMA
ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
ZONE_DMA32,
#endif
ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
__MAX_NR_ZONES
};
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
/* Returns the next zone at or below highest_zoneidx in a zonelist */
struct zoneref *__next_zones_zonelist(struct zoneref *z,
enum zone_type highest_zoneidx,
nodemask_t *nodes)
{
/*
* Find the next suitable zone to use for the allocation.
* Only filter based on nodemask if it's set
*/
if (unlikely(nodes == NULL))
while (zonelist_zone_idx(z) > highest_zoneidx)
z++;
else
while (zonelist_zone_idx(z) > highest_zoneidx ||
(z->zone && !zref_in_nodemask(z, nodes)))
z++;
return z;
}
[__alloc_pages_nodemask->get_page_from_freelist]
Once the best zone is known, pages can actually be allocated from it; get_page_from_freelist does this work. The code is one big loop that starts allocating from the preferred zone.
- If cpusets are enabled, check whether the current process is allowed to allocate from this zone; if not, skip it.
- Check the node's free-page watermark; if it is below the configured level, skip this zone. The watermark to test is computed from zone->watermark, lowmem_reserve and the alloc_flags derived from gfp_mask; if the free page count is below the computed value, move on, otherwise an order-0 request succeeds immediately, while for order > 0 the buddy lists must also contain a suitably sized contiguous block. Three watermarks are defined, WMARK_MIN, WMARK_LOW and WMARK_HIGH, with the concrete values computed at system initialisation. The normal kernel page-allocation path checks WMARK_LOW, while the kswapd reclaim thread checks WMARK_HIGH. The check is ultimately performed by __zone_watermark_ok.
- If the watermark check passes, allocate from the buddy lists; otherwise, first try to reclaim memory and then retry allocation from this zone.
- If this zone really cannot satisfy the request, continue through the zonelists to the next zone with zone_idx <= highest_zoneidx and repeat the steps above.
#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
for (zone = z->zone; \
zone; \
z = next_zones_zonelist(++z, highidx, nodemask), \
zone = zonelist_zone(z))
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac)
{
struct zoneref *z = ac->preferred_zoneref;
struct zone *zone;
struct pglist_data *last_pgdat_dirty_limit = NULL;
/*
* Scan zonelist, looking for a zone with enough free.
* See also __cpuset_node_allowed() comment in kernel/cpuset.c.
*/
for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
struct page *page;
unsigned long mark;
if (cpusets_enabled() &&
(alloc_flags & ALLOC_CPUSET) &&
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
/*
* When allocating a page cache page for writing, we
* want to get it from a node that is within its dirty
* limit, such that no single node holds more than its
* proportional share of globally allowed dirty pages.
* The dirty limits take into account the node's
* lowmem reserves and high watermark so that kswapd
* should be able to balance it without having to
* write pages from its LRU list.
*
* XXX: For now, allow allocations to potentially
* exceed the per-node dirty limit in the slowpath
* (spread_dirty_pages unset) before going into reclaim,
* which is important when on a NUMA setup the allowed
* nodes are together not big enough to reach the
* global limit. The proper fix for these situations
* will require awareness of nodes in the
* dirty-throttling and the flusher threads.
*/
if (ac->spread_dirty_pages) {
if (last_pgdat_dirty_limit == zone->zone_pgdat)
continue;
if (!node_dirty_ok(zone->zone_pgdat)) {
last_pgdat_dirty_limit = zone->zone_pgdat;
continue;
}
}
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_fast(zone, order, mark,
ac_classzone_idx(ac), alloc_flags)) {
int ret;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* Watermark failed for this zone, but see if we can
* grow this zone if it contains deferred pages.
*/
if (static_branch_unlikely(&deferred_pages)) {
if (_deferred_grow_zone(zone, order))
goto try_this_zone;
}
#endif
/* Checked here to keep the fast path fast */
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
if (node_reclaim_mode == 0 ||
!zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
continue;
ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
switch (ret) {
case NODE_RECLAIM_NOSCAN:
/* did not scan */
continue;
case NODE_RECLAIM_FULL:
/* scanned but unreclaimable */
continue;
default:
/* did we reclaim enough */
if (zone_watermark_ok(zone, order, mark,
ac_classzone_idx(ac), alloc_flags))
goto try_this_zone;
continue;
}
}
try_this_zone:
page = rmqueue(ac->preferred_zoneref->zone, zone, order,
gfp_mask, alloc_flags, ac->migratetype);
if (page) {
prep_new_page(page, order, gfp_mask, alloc_flags);
/*
* If this is a high-order atomic allocation then check
* if the pageblock should be reserved for the future
*/
if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
reserve_highatomic_pageblock(page, zone, order);
return page;
} else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/* Try again if zone has deferred pages */
if (static_branch_unlikely(&deferred_pages)) {
if (_deferred_grow_zone(zone, order))
goto try_this_zone;
}
#endif
}
}
return NULL;
}
/*
* Return true if free base pages are above 'mark'. For high-order checks it
* will return true of the order-0 watermark is reached and there is at least
* one free page of a suitable size. Checking now avoids taking the zone lock
* to check in the allocation paths if no pages are free.
*/
bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
int classzone_idx, unsigned int alloc_flags,
long free_pages)
{
long min = mark;
int o;
const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
/* free_pages may go negative - that's OK */
free_pages -= (1 << order) - 1;
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
/*
* If the caller does not have rights to ALLOC_HARDER then subtract
* the high-atomic reserves. This will over-estimate the size of the
* atomic reserve but it avoids a search.
*/
if (likely(!alloc_harder)) {
free_pages -= z->nr_reserved_highatomic;
} else {
/*
* OOM victims can try even harder than normal ALLOC_HARDER
* users on the grounds that it's definitely going to be in
* the exit path shortly and free memory. Any allocation it
* makes during the free path will be small and short-lived.
*/
if (alloc_flags & ALLOC_OOM)
min -= min / 2;
else
min -= min / 4;
}
#ifdef CONFIG_CMA
/* If allocation can't use CMA areas don't use free CMA pages */
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
/*
* Check watermarks for an order-0 allocation request. If these
* are not met, then a high-order request also cannot go ahead
* even if a suitable page happened to be free.
*/
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return false;
/* If this is an order-0 request then the watermark is fine */
if (!order)
return true;
/* For a high-order request, check at least one suitable page is free */
for (o = order; o < MAX_ORDER; o++) {
struct free_area *area = &z->free_area[o];
int mt;
if (!area->nr_free)
continue;
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
if (!list_empty(&area->free_list[mt]))
return true;
}
#ifdef CONFIG_CMA
if ((alloc_flags & ALLOC_CMA) &&
!list_empty(&area->free_list[MIGRATE_CMA])) {
return true;
}
#endif
if (alloc_harder &&
!list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
return true;
}
return false;
}
[__alloc_pages_nodemask->get_page_from_freelist->rmqueue]
/*
* Allocate a page from the given zone. Use pcplists for order-0 allocations.
*/
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
unsigned long flags;
struct page *page;
if (likely(order == 0)) {
page = rmqueue_pcplist(preferred_zone, zone, order,
gfp_flags, migratetype);
goto out;
}
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
spin_lock_irqsave(&zone->lock, flags);
do {
page = NULL;
if (alloc_flags & ALLOC_HARDER) {
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (page)
trace_mm_page_alloc_zone_locked(page, order, migratetype);
}
if (!page)
page = __rmqueue(zone, order, migratetype);
} while (page && check_new_pages(page, order));
spin_unlock(&zone->lock);
if (!page)
goto failed;
__mod_zone_freepage_state(zone, -(1 << order),
get_pcppage_migratetype(page));
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone);
local_irq_restore(flags);
out:
VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
return page;
failed:
local_irq_restore(flags);
return NULL;
}
/*
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
*/
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
{
unsigned int current_order;
struct free_area *area;
struct page *page;
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]);
page = list_first_entry_or_null(&area->free_list[migratetype],
struct page, lru);
if (!page)
continue;
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
}
return NULL;
}
rmqueue() is the function that actually allocates memory:
- It first calls rmqueue_pcplist() to allocate order-0 pages from the per-CPU page cache.
- When that path does not apply, __rmqueue_smallest() allocates directly from the buddy system. This function simply scans the buddy free lists for a suitable block and returns immediately if none is found, without any complex reclaim handling.
- If __rmqueue_smallest() fails, __rmqueue() is called. It does its best to find a suitable block in the buddy system for the requested migratetype; if the current migratetype has nothing available, it searches the other free lists in the priority order defined by the fallbacks table.
Once a suitable block has been found in the buddy system, it is taken off its free list and expand() carves out the requested size; the remainder is split into 2^N-page pieces and handed back to the buddy system, preferring the largest possible order for each returned piece.
Back in __alloc_pages_nodemask(), if get_page_from_freelist() fails, __alloc_pages_slowpath() is called to keep trying. Depending on whether __GFP_KSWAPD_RECLAIM is set in gfp_mask, it wakes the kswapd kernel thread to reclaim pages, then calls get_page_from_freelist() again. If that still fails, it frees up memory through direct reclaim and page migration before retrying the allocation. If even that fails and __GFP_NOFAIL is set, the current process keeps blocking and retrying until the allocation succeeds; otherwise the allocation fails and returns directly.
Freeing pages
The core function for releasing pages is free_page(), which ultimately calls __free_pages(). __free_pages() distinguishes two cases: order-0 pages receive special treatment, while pages of order greater than 0 go through the normal freeing path.
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
if (order == 0)
free_unref_page(page);
else
__free_pages_ok(page, order);
}
}
The core operation when freeing memory is handing the pages back to the buddy system: check whether the released block can merge with its neighboring buddy; if so, merge them and link the combined block into the free list one order higher, then repeat until no further merging is possible. This is precisely the process that gathers fragmented pages back into large contiguous blocks.
list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
This is the statement that finally links the (possibly merged) block into the free list.
/*
* Freeing function for a buddy system allocator.
*
* The concept of a buddy system is to maintain direct-mapped table
* (containing bit values) for memory blocks of various "orders".
* The bottom level table contains the map for the smallest allocatable
* units of memory (here, pages), and each level above it describes
* pairs of units from the levels below, hence, "buddies".
* At a high level, all that happens here is marking the table entry
* at the bottom level available, and propagating the changes upward
* as necessary, plus some accounting needed to play nicely with other
* parts of the VM system.
* At each level, we keep a list of pages, which are heads of continuous
* free pages of length of (1 << order) and marked with _mapcount
* PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
* field.
* So when we are allocating or freeing one, we can derive the state of the
* other. That is, if we allocate a small block, and both were
* free, the remainder of the region must be split into blocks.
* If a block is freed, and its buddy is also free, then this
* triggers coalescing into a block of larger size.
*
* -- nyc
*/
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
int migratetype)
{
unsigned long combined_pfn;
unsigned long uninitialized_var(buddy_pfn);
struct page *buddy;
unsigned int max_order;
max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
__mod_zone_freepage_state(zone, 1 << order, migratetype);
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
continue_merging:
while (order < max_order - 1) {
buddy_pfn = __find_buddy_pfn(pfn, order);
buddy = page + (buddy_pfn - pfn);
if (!pfn_valid_within(buddy_pfn))
goto done_merging;
if (!page_is_buddy(page, buddy, order))
goto done_merging;
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
*/
if (page_is_guard(buddy)) {
clear_page_guard(zone, buddy, order, migratetype);
} else {
list_del(&buddy->lru);
zone->free_area[order].nr_free--;
rmv_page_order(buddy);
}
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
order++;
}
if (max_order < MAX_ORDER) {
/* If we are here, it means order is >= pageblock_order.
* We want to prevent merge between freepages on isolate
* pageblock and normal pageblock. Without this, pageblock
* isolation could cause incorrect freepage or CMA accounting.
*
* We don't want to hit this code for the more frequent
* low-order merging.
*/
if (unlikely(has_isolate_pageblock(zone))) {
int buddy_mt;
buddy_pfn = __find_buddy_pfn(pfn, order);
buddy = page + (buddy_pfn - pfn);
buddy_mt = get_pageblock_migratetype(buddy);
if (migratetype != buddy_mt
&& (is_migrate_isolate(migratetype) ||
is_migrate_isolate(buddy_mt)))
goto done_merging;
}
max_order++;
goto continue_merging;
}
done_merging:
set_page_order(page, order);
/*
* If this is not the largest possible page, check if the buddy
* of the next-highest order is free. If it is, it's possible
* that pages are being freed that will coalesce soon. In case,
* that is happening, add the free page to the tail of the list
* so it's less likely to be used soon and more likely to be merged
* as a higher order page
*/
if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
struct page *higher_page, *higher_buddy;
combined_pfn = buddy_pfn & pfn;
higher_page = page + (combined_pfn - pfn);
buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
higher_buddy = higher_page + (buddy_pfn - combined_pfn);
if (pfn_valid_within(buddy_pfn) &&
page_is_buddy(higher_page, higher_buddy, order + 1)) {
list_add_tail(&page->lru,
&zone->free_area[order].free_list[migratetype]);
goto out;
}
}
list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
zone->free_area[order].nr_free++;
}
For the order-0 case, the kernel adds a special fast path. Each zone has a zone->pageset member that initializes a percpu variable, struct per_cpu_pageset, for every CPU. When an order-0 page is freed, it first goes onto the corresponding list in per_cpu_pages. Keeping a page on the CPU that freed it means the same CPU is likely to reuse it, avoiding frequent cache-line invalidation. This list cannot grow indefinitely, of course: the free path checks the number of pages on it, and once that exceeds a threshold, a batch of pages is genuinely returned to the buddy system, ultimately through __free_one_page().
[free_unref_page_commit->free_pcppages_bulk->__free_one_page]
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[MIGRATE_PCPTYPES];
};
- count is the number of pages currently held on this CPU's per_cpu_pages lists.
- high is the watermark: when the number of cached pages rises above it, pages are returned to the buddy system.
- batch is the number of pages returned to the buddy system in one go.
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
*
* If the zone was previously in an "all pages pinned" state then look to
* see if this freeing clears that state.
*
* And clear the zone's pages_scanned counter, to hold off the "all pages are
* pinned" detection logic.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
int migratetype = 0;
int batch_free = 0;
int prefetch_nr = 0;
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
while (count) {
struct list_head *list;
/*
* Remove pages from lists in a round-robin fashion. A
* batch_free count is maintained that is incremented when an
* empty list is encountered. This is so more pages are freed
* off fuller lists instead of spinning excessively around empty
* lists
*/
do {
batch_free++;
if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
list = &pcp->lists[migratetype];
} while (list_empty(list));
/* This is the only non-empty list. Free them all. */
if (batch_free == MIGRATE_PCPTYPES)
batch_free = count;
do {
page = list_last_entry(list, struct page, lru);
/* must delete to avoid corrupting pcp list */
list_del(&page->lru);
pcp->count--;
if (bulkfree_pcp_prepare(page))
continue;
list_add_tail(&page->lru, &head);
/*
* We are going to put the page back to the global
* pool, prefetch its buddy to speed up later access
* under zone->lock. It is believed the overhead of
* an additional test and calculating buddy_pfn here
* can be offset by reduced memory latency later. To
* avoid excessive prefetching due to large count, only
* prefetch buddy for the first pcp->batch nr of pages.
*/
if (prefetch_nr++ < pcp->batch)
prefetch_buddy(page);
} while (--count && --batch_free && !list_empty(list));
}
spin_lock(&zone->lock);
isolated_pageblocks = has_isolate_pageblock(zone);
/*
* Use safe version since after __free_one_page(),
* page->lru.next will not point to original list.
*/
list_for_each_entry_safe(page, tmp, &head, lru) {
int mt = get_pcppage_migratetype(page);
/* MIGRATE_ISOLATE page should not go to pcplists */
VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
/* Pageblock could have been isolated meanwhile */
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
}
spin_unlock(&zone->lock);
}
Some of the figures in this article are adapted from sources on the internet; thanks to their original authors.