Linux Page Allocation and the Buddy System

The buddy system

As a Linux system runs, pages are continually allocated and freed, and physical memory gradually fragments. When an application then requests a large allocation, it becomes hard to find a suitable run of contiguous memory, allocation takes longer, and system performance suffers badly. The kernel therefore uses a buddy system to mitigate page fragmentation.

The basic idea of the buddy system is to manage contiguous free pages as blocks: the system groups 2^N contiguous pages into one block and links it onto the corresponding list. The page-management structure struct zone contains a free_area array of length MAX_ORDER: free_area[0] manages blocks of 2^0 pages, free_area[1] blocks of 2^1 pages, and so on up to free_area[MAX_ORDER-1], which manages blocks of 2^(MAX_ORDER-1) pages. On allocation, the allocator finds the smallest list whose block size is at least the requested page count. If that list holds a free block, the allocator takes it, hands part of it to the requester, and splits the remainder into power-of-two pieces that are linked back onto the lower-order lists; this split may repeat until every leftover page hangs on some list, always preferring the largest possible block. Freeing reverses the process: the freed block is inserted into the appropriate list, and if it can be merged with its neighboring buddy into a higher-order block, it is moved up to the higher-order list, repeating until no further merge is possible. This is how scattered pages are coalesced back into large contiguous blocks.
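To make the buddy arithmetic concrete, here is a minimal standalone sketch (not kernel code): two order-N blocks are buddies when their starting page-frame numbers differ only in bit N, so the buddy and the start of the merged block fall out of simple bit operations.

#include <stdio.h>

/* Sketch: buddy arithmetic for a block of 2^order pages starting at `pfn`. */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);        /* flip bit `order` */
}

static unsigned long merged_start(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);       /* clear bit `order` */
}

int main(void)
{
    /* The order-2 block at pfn 8 pairs with the block at pfn 12;
     * merged, they form the order-3 block starting at pfn 8. */
    printf("buddy of pfn 8, order 2: %lu\n", buddy_of(8, 2));      /* 12 */
    printf("merged block starts at:  %lu\n", merged_start(12, 2)); /* 8  */
    return 0;
}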

#define MAX_ORDER 11

struct zone {
    struct free_area    free_area[MAX_ORDER]; /* free areas of different block sizes */
};

struct free_area {
    struct list_head        free_list[MIGRATE_TYPES];
    unsigned long           nr_free;
};
Buddy system memory blocks

At boot, the system builds the physical pages into the buddy system according to the memory configuration, preferring to group pages into the largest possible blocks and link them onto the highest-order (MAX_ORDER-1) lists. This keeps fragmentation minimal right after startup.

Memory migration types

The buddy system is one of the best-engineered parts of the kernel, but it has shortcomings. First, it can only reduce fragmentation passively, by merging blocks when memory is freed; it cannot proactively avoid creating fragmentation at allocation time. Second, in the extreme case the system may have allocated only a few pages while most memory is free, yet because the allocated pages are scattered everywhere, the free memory cannot be organized into large 2^N blocks. An obvious question is whether memory could be compacted at a suitable moment, gathering the scattered pages together by reallocating or freeing them. The answer is yes, and this is what evolved into the concept of page mobility. The analogy is a filesystem: disks also suffer from fragmented storage, and dedicated tools can defragment them. Of course, not every page can be reallocated or freed; it depends on how the page is being used.

Looking at struct free_area, the modern buddy system does not simply hang blocks off MAX_ORDER doubly linked lists; each block size has MIGRATE_TYPES lists. The kernel classifies allocated memory as unmovable, movable, reclaimable, and so on; this article discusses only the first three migration types.

enum {
        MIGRATE_UNMOVABLE,
        MIGRATE_MOVABLE,
        MIGRATE_RECLAIMABLE,
        MIGRATE_PCPTYPES,       /* the number of types on the pcp lists */
        MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
        /*
         * MIGRATE_CMA migration type is designed to mimic the way
         * ZONE_MOVABLE works.  Only movable pages can be allocated
         * from MIGRATE_CMA pageblocks and page allocator never
         * implicitly change migration type of MIGRATE_CMA pageblock.
         *
         * The way to use it is to change migratetype of a range of
         * pageblocks to MIGRATE_CMA which can be done by
         * __free_pageblock_cma() function.  What is important though
         * is that a range of pageblocks must be aligned to
         * MAX_ORDER_NR_PAGES should biggest page be bigger then
         * a single pageblock.
         */
        MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
        MIGRATE_ISOLATE,        /* can't allocate from here */
#endif
        MIGRATE_TYPES
};
  • Movable (MIGRATE_MOVABLE): most memory allocated by applications is of this type. User space reaches physical memory only through MMU address translation, so where a given linear address maps to is completely transparent to the application. The system can therefore, at a suitable moment, allocate new pages for the application, copy the old contents over, and update the process page tables to re-map the addresses, compacting memory without affecting the running application at all.
  • Unmovable (MIGRATE_UNMOVABLE): memory occupied by core kernel data structures and code. Once allocated, it cannot be relocated, only explicitly freed.
  • Reclaimable (MIGRATE_RECLAIMABLE): application text segments and page-cache pages of open files are mostly of this type. Their defining property is that they can be restored from disk or other backing storage, so when memory is badly fragmented or scarce they can simply be dropped, then re-read from the backing store when needed again.

At allocation time the caller states the intended mobility of the pages, and pages are drawn from different regions accordingly, preventing long-lived UNMOVABLE pages from breaking up contiguous free physical memory. When an allocation fails, the kernel first compacts memory, mainly by moving page contents and updating the mappings and by freeing reclaimable pages, actively repositioning pages to make physical memory contiguous again.

[Figure: buddy-system movable page management]

Movable memory allocation

Given that the system classifies pages by usage and manages them on separate lists, when does memory get placed on the different lists? memmap_init_zone shows that at boot all memory starts out movable. During kernel initialization, pages are continually handed out; most of these are unmovable and permanent, and at that stage allocations are very likely satisfied from the largest contiguous blocks. The movable pageblocks they come from are re-marked as unmovable, so by the time initialization finishes the system has a batch of unmovable memory without having created much fragmentation. Reclaimable memory, by contrast, is mostly created while application processes run.

/*
 * Initially all pages are reserved - free ones are freed
 * up by free_all_bootmem() once the early boot process is
 * done. Non-atomic initialization, single-pass.
 */
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
        unsigned long start_pfn, enum memmap_context context)
{
    /*  ......  */
    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        /*  ......  */
not_early:
        if (!(pfn & (pageblock_nr_pages - 1))) {
            struct page *page = pfn_to_page(pfn);

            __init_single_page(page, pfn, zone, nid);
            set_pageblock_migratetype(page, MIGRATE_MOVABLE);
        } else {
            __init_single_pfn(pfn, zone, nid);
        }
    }
}

This mirrors allocation across zones: when the current node has too few free pages, memory is allocated from other zones in the priority order given by the zonelists in pg_data_t. MIGRATE_TYPES works the same way; a fallbacks array defines, for each migrate type, the order in which the other types are tried when the desired type's free lists are exhausted.

/*
 * This array describes the order lists are fallen back to when
 * the free lists for the desirable migrate type are depleted
 */
static int fallbacks[MIGRATE_TYPES][4] = {
    /* fallback order when an unmovable-page allocation fails */
    [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
    /* fallback order when a reclaimable-page allocation fails */
    [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
    /* fallback order when a movable-page allocation fails */
    [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
    [MIGRATE_CMA]     = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    [MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
#endif
};

The global variables and helper functions for mobility-based grouping are always compiled into the kernel, but they are only meaningful when the system has enough memory to spread across the per-migrate-type lists. Since every migrate list needs a reasonable amount of memory, the kernel must define what "reasonable" means. It does so with two globals: pageblock_order, the allocation order the kernel considers "large", and pageblock_nr_pages, the number of pages at that order. If the migrate lists cannot each hold a decent chunk of contiguous memory, page mobility buys nothing, so the kernel disables the feature when available memory is too small: build_all_zonelists checks the watermarks, and if the total number of physical pages above the high watermark across all zones is less than pageblock_nr_pages * MIGRATE_TYPES, mobility grouping is disabled by setting the global page_group_by_mobility_disabled to 1; otherwise it is set to 0. Once grouping is disabled, all allocated pages are treated as unmovable.

struct zone has a field pageblock_flags pointing to a bitmap whose size depends on how many pageblocks the zone manages; every NR_PAGEBLOCK_BITS bits record which migrate type one pageblock belongs to, so that when pages are freed they can easily be returned to the right buddy list.
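As a rough sketch of the indexing (illustrative only; the kernel's get_pfnblock_flags_mask and set_pageblock_flags_group helpers do the real work and additionally keep the bits word-aligned), the flag bits for the pageblock containing a given pfn can be located like this:

/* Sketch: first flag bit for the pageblock containing `pfn`, assuming a
 * flat layout of NR_PAGEBLOCK_BITS consecutive bits per pageblock. */
static unsigned long pageblock_bitidx(unsigned long pfn,
                                      unsigned long zone_start_pfn,
                                      unsigned long pageblock_nr_pages)
{
    unsigned long block = (pfn - zone_start_pfn) / pageblock_nr_pages;

    return block * 4; /* NR_PAGEBLOCK_BITS: 3 migrate-type bits + 1 skip bit */
}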

#define PB_migratetype_bits 3
/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
    PB_migrate,
    PB_migrate_end = PB_migrate + PB_migratetype_bits - 1,
            /* 3 bits required for migrate types */
    PB_migrate_skip,/* If set the block is skipped by compaction */

    /*
     * Assume the bits will always align on a word. If this assumption
     * changes then get/set pageblock needs updating.
     */
    NR_PAGEBLOCK_BITS
};
// mm/page_alloc.c (simplified excerpt)
void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
{
    if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
        page_group_by_mobility_disabled = 1;
    else
        page_group_by_mobility_disabled = 0;
}

/* Convert GFP flags to their corresponding migrate type */
static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
{
    WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);

    if (unlikely(page_group_by_mobility_disabled))
        return MIGRATE_UNMOVABLE;

    /* Group based on mobility */
    return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}

void set_pageblock_migratetype(struct page *page, int migratetype)
{
    if (unlikely(page_group_by_mobility_disabled &&
             migratetype < MIGRATE_PCPTYPES))
        migratetype = MIGRATE_UNMOVABLE;

    set_pageblock_flags_group(page, (unsigned long)migratetype,
                    PB_migrate, PB_migrate_end);
}

#define get_pageblock_migratetype(page)                                 \
        get_pfnblock_flags_mask(page, page_to_pfn(page),                \
                        PB_migrate_end, MIGRATETYPE_MASK)

The kernel defines two flags, __GFP_MOVABLE and __GFP_RECLAIMABLE, to distinguish the migrate types; when neither is set, the allocation is unmovable.
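To make the mapping concrete, here is a compilable sketch using the flag values, shift, and enum order quoted in this article (and assuming mobility grouping is enabled):

#include <stdio.h>

#define __GFP_MOVABLE      0x08u
#define __GFP_RECLAIMABLE  0x10u
#define GFP_MOVABLE_MASK   (__GFP_RECLAIMABLE | __GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT  3

enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

static unsigned int to_migratetype(unsigned int gfp_flags)
{
    return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
}

int main(void)
{
    printf("%u\n", to_migratetype(0));                  /* 0 = MIGRATE_UNMOVABLE (e.g. GFP_KERNEL) */
    printf("%u\n", to_migratetype(__GFP_MOVABLE));      /* 1 = MIGRATE_MOVABLE                     */
    printf("%u\n", to_migratetype(__GFP_RECLAIMABLE));  /* 2 = MIGRATE_RECLAIMABLE                 */
    return 0;
}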

Page compaction

Compaction has two modes, asynchronous and synchronous. When page reclaim fails, asynchronous compaction is started first; if it cannot produce enough memory, direct reclaim is tried next to recover enough memory; and if that still yields too little contiguous memory, synchronous compaction is attempted.

Two migrate types are candidates for compaction, MIGRATE_RECLAIMABLE and MIGRATE_MOVABLE; by mapping kind there are two classes of pages, anonymous and file-mapped. Asynchronous compaction migrates only anonymous pages, while synchronous compaction also handles dirty file pages, and if a page is under writeback it waits for the writeback to finish. Asynchronous compaction is comparatively conservative: migrating an anonymous page only requires copying its contents to a new page, establishing the new mapping and tearing down the old one, all pure memory operations, whereas synchronous compaction can also involve I/O writeback and takes far longer.

Page allocation

Kernel page allocation goes through struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order). This is the core entry point of buddy-system page allocation; every other memory-allocation function ultimately ends up here.

  • nid: node ID; in a NUMA multi-processor system, memory is grouped into nodes
  • gfp_mask: page-allocation attribute mask
  • order: allocation order; e.g. 2 means allocating 2^2 = 4 pages

The kernel also defines an alloc_pages macro that obtains the node ID automatically and requests the pages:

#define alloc_pages(gfp_mask, order) \
    alloc_pages_node(numa_node_id(), gfp_mask, order)
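A typical use, sketched as kernel-module-style code (error handling trimmed; page_address() works here because GFP_KERNEL pages come from low memory):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static int demo_alloc(void)
{
    struct page *pages;
    void *vaddr;

    pages = alloc_pages(GFP_KERNEL, 2);   /* 2^2 = 4 contiguous pages; may sleep */
    if (!pages)
        return -ENOMEM;

    vaddr = page_address(pages);          /* lowmem pages have a permanent mapping */
    memset(vaddr, 0, 4 * PAGE_SIZE);

    __free_pages(pages, 2);               /* order must match the allocation */
    return 0;
}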

The allocation mask matters a great deal: it determines the buddy allocator's allocation policy. GFP is short for "get free page".

/* Plain integer GFP bitmasks. Do not use this directly. */
//  Zone modifiers: choose which zone to allocate from first; if none of the
//  three flags below is set, allocation starts from low memory
#define ___GFP_DMA              0x01u       /* allocate from ZONE_DMA */
#define ___GFP_HIGHMEM          0x02u       /* allocate from high memory */
#define ___GFP_DMA32            0x04u       /* allocate from ZONE_DMA32 */

//  Behavior modifiers
#define ___GFP_MOVABLE          0x08u       /* page is movable */
#define ___GFP_RECLAIMABLE      0x10u       /* page is reclaimable */
#define ___GFP_HIGH             0x20u       /* may use the emergency pools */
#define ___GFP_IO               0x40u       /* may start physical IO */
#define ___GFP_FS               0x80u       /* may call into the low-level FS */
#define ___GFP_COLD             0x100u      /* want a cache-cold page */
#define ___GFP_NOWARN           0x200u      /* suppress allocation-failure warnings */
#define ___GFP_REPEAT           0x400u      /* retry the allocation, may still fail */
#define ___GFP_NOFAIL           0x800u      /* retry forever, never fails */
#define ___GFP_NORETRY          0x1000u     /* do not retry, may fail */
#define ___GFP_MEMALLOC         0x2000u     /* use the emergency reserve lists */
#define ___GFP_COMP             0x4000u     /* add compound-page metadata */
#define ___GFP_ZERO             0x8000u     /* return a zero-filled page on success */
//  Type modifiers
#define ___GFP_NOMEMALLOC       0x10000u    /* never use the emergency reserve lists */
#define ___GFP_HARDWALL         0x20000u    /* allocate only on nodes of CPUs the process may run on */
#define ___GFP_THISNODE         0x40000u    /* no fallback node, no placement policy */
#define ___GFP_ATOMIC           0x80000u    /* atomic allocation, must never sleep */
#define ___GFP_ACCOUNT          0x100000u
#define ___GFP_NOTRACK          0x200000u
#define ___GFP_DIRECT_RECLAIM   0x400000u
#define ___GFP_OTHER_NODE       0x800000u
#define ___GFP_WRITE            0x1000000u
#define ___GFP_KSWAPD_RECLAIM   0x2000000u
/*
 * Physical address zone modifiers (see linux/mmzone.h - low four bits)
 *
 * Do not put any conditional on these. If necessary modify the definitions
 * without the underscores and use them consistently. The definitions here may
 * be used in bit comparisons.
 */
#define __GFP_DMA       ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM   ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32     ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE   ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
#define GFP_ZONEMASK    (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)

/*
 * Page mobility and placement hints
 *
 * These flags provide hints about how mobile the page is. Pages with similar
 * mobility are placed within the same pageblocks to minimise problems due
 * to external fragmentation.
 *
 * __GFP_MOVABLE (also a zone modifier) indicates that the page can be
 *   moved by page migration during memory compaction or can be reclaimed.
 *
 * __GFP_RECLAIMABLE is used for slab allocations that specify
 *   SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
 *
 * __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
 *   these pages will be spread between local zones to avoid all the dirty
 *   pages being in one zone (fair zone allocation policy).
 *
 * __GFP_HARDWALL enforces the cpuset memory allocation policy.
 *
 * __GFP_THISNODE forces the allocation to be satisified from the requested
 *   node with no fallbacks or placement policy enforcements.
 *
 * __GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
 */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE  ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT   ((__force gfp_t)___GFP_ACCOUNT)

/*
 * Watermark modifiers -- controls access to emergency reserves
 *
 * __GFP_HIGH indicates that the caller is high-priority and that granting
 *   the request is necessary before the system can make forward progress.
 *   For example, creating an IO context to clean pages.
 *
 * __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
 *   high priority. Users are typically interrupt handlers. This may be
 *   used in conjunction with __GFP_HIGH
 *
 * __GFP_MEMALLOC allows access to all memory. This should only be used when
 *   the caller guarantees the allocation will allow more memory to be freed
 *   very shortly e.g. process exiting or swapping. Users either should
 *   be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
 *
 * __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
 *   This takes precedence over the __GFP_MEMALLOC flag if both are set.
 */
#define __GFP_ATOMIC    ((__force gfp_t)___GFP_ATOMIC)
#define __GFP_HIGH  ((__force gfp_t)___GFP_HIGH)
#define __GFP_MEMALLOC  ((__force gfp_t)___GFP_MEMALLOC)
#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)

/*
 * Reclaim modifiers
 *
 * __GFP_IO can start physical IO.
 *
 * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 *   allocator recursing into the filesystem which might already be holding
 *   locks.
 *
 * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
 *   This flag can be cleared to avoid unnecessary delays when a fallback
 *   option is available.
 *
 * __GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when
 *   the low watermark is reached and have it reclaim pages until the high
 *   watermark is reached. A caller may wish to clear this flag when fallback
 *   options are available and the reclaim is likely to disrupt the system. The
 *   canonical example is THP allocation where a fallback is cheap but
 *   reclaim/compaction may cause indirect stalls.
 *
 * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
 *
 * The default allocator behavior depends on the request size. We have a concept
 * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
 * !costly allocations are too essential to fail so they are implicitly
 * non-failing by default (with some exceptions like OOM victims might fail so
 * the caller still has to check for failures) while costly requests try to be
 * not disruptive and back off even without invoking the OOM killer.
 * The following three modifiers might be used to override some of these
 * implicit rules
 *
 * __GFP_NORETRY: The VM implementation will try only very lightweight
 *   memory direct reclaim to get some memory under memory pressure (thus
 *   it can sleep). It will avoid disruptive actions like OOM killer. The
 *   caller must handle the failure which is quite likely to happen under
 *   heavy memory pressure. The flag is suitable when failure can easily be
 *   handled at small cost, such as reduced throughput
 *
 * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
 *   procedures that have previously failed if there is some indication
 *   that progress has been made else where.  It can wait for other
 *   tasks to attempt high level approaches to freeing memory such as
 *   compaction (which removes fragmentation) and page-out.
 *   There is still a definite limit to the number of retries, but it is
 *   a larger limit than with __GFP_NORETRY.
 *   Allocations with this flag may fail, but only when there is
 *   genuinely little unused memory. While these allocations do not
 *   directly trigger the OOM killer, their failure indicates that
 *   the system is likely to need to use the OOM killer soon.  The
 *   caller must handle failure, but can reasonably do so by failing
 *   a higher-level request, or completing it only in a much less
 *   efficient manner.
 *   If the allocation does fail, and the caller is in a position to
 *   free some non-essential memory, doing so could benefit the system
 *   as a whole.
 *
 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 *   cannot handle allocation failures. The allocation could block
 *   indefinitely but will never return with failure. Testing for
 *   failure is pointless.
 *   New users should be evaluated carefully (and the flag should be
 *   used only when there is no reasonable failure policy) but it is
 *   definitely preferable to use the flag rather than opencode endless
 *   loop around allocator.
 *   Using this flag for costly allocations is _highly_ discouraged.
 */
#define __GFP_IO    ((__force gfp_t)___GFP_IO)
#define __GFP_FS    ((__force gfp_t)___GFP_FS)
#define __GFP_DIRECT_RECLAIM    ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM    ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL)
#define __GFP_NOFAIL    ((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY   ((__force gfp_t)___GFP_NORETRY)

/*
 * Action modifiers
 *
 * __GFP_NOWARN suppresses allocation failure reports.
 *
 * __GFP_COMP address compound page metadata.
 *
 * __GFP_ZERO returns a zeroed page on success.
 */
#define __GFP_NOWARN    ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP  ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO  ((__force gfp_t)___GFP_ZERO)

/* Disable lockdep for GFP context tracking */
#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)

/* Room for N __GFP_FOO bits */
#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
/*
 * Useful GFP flag combinations that are commonly used. It is recommended
 * that subsystems start with one of these combinations and then set/clear
 * __GFP_FOO flags as necessary.
 *
 * GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
 *   watermark is applied to allow access to "atomic reserves"
 *
 * GFP_KERNEL is typical for kernel-internal allocations. The caller requires
 *   ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
 *
 * GFP_KERNEL_ACCOUNT is the same as GFP_KERNEL, except the allocation is
 *   accounted to kmemcg.
 *
 * GFP_NOWAIT is for kernel allocations that should not stall for direct
 *   reclaim, start physical IO or use any filesystem callback.
 *
 * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
 *   that do not require the starting of any physical IO.
 *   Please try to avoid using this flag directly and instead use
 *   memalloc_noio_{save,restore} to mark the whole scope which cannot
 *   perform any IO with a short explanation why. All allocation requests
 *   will inherit GFP_NOIO implicitly.
 *
 * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
 *   Please try to avoid using this flag directly and instead use
 *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
 *   recurse into the FS layer with a short explanation why. All allocation
 *   requests will inherit GFP_NOFS implicitly.
 *
 * GFP_USER is for userspace allocations that also need to be directly
 *   accessibly by the kernel or hardware. It is typically used by hardware
 *   for buffers that are mapped to userspace (e.g. graphics) that hardware
 *   still must DMA to. cpuset limits are enforced for these allocations.
 *
 * GFP_DMA exists for historical reasons and should be avoided where possible.
 *   The flags indicates that the caller requires that the lowest zone be
 *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
 *   it would require careful auditing as some users really require it and
 *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
 *   lowest zone as a type of emergency reserve.
 *
 * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
 *   address.
 *
 * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
 *   do not need to be directly accessible by the kernel but that cannot
 *   move once in use. An example may be a hardware allocation that maps
 *   data directly into userspace but has no addressing limitations.
 *
 * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
 *   need direct access to but can use kmap() when access is required. They
 *   are expected to be movable via page reclaim or page migration. Typically,
 *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
 *
 * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
 *   compound allocations that will generally fail quickly if memory is not
 *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
 *   version does not attempt reclaim/compaction at all and is by default used
 *   in page fault path, while the non-light is used by khugepaged.
 */
#define GFP_ATOMIC      (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT      (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO        (__GFP_RECLAIM)
#define GFP_NOFS        (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY   (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
                         __GFP_RECLAIMABLE)
#define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA         __GFP_DMA
#define GFP_DMA32       __GFP_DMA32
#define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE   ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                         __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
                         ~__GFP_RECLAIM)

/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT 3
  • GFP_ATOMIC: for atomic allocations that must not sleep under any circumstances; may dip into the emergency pools. Used in interrupt handlers, bottom halves, while holding spinlocks, and anywhere else sleeping is forbidden.
  • GFP_KERNEL: the normal allocation mode; it may block, so it cannot be used in atomic context. The kernel tries its best to get the caller the memory it needs. This should be the first-choice flag.
  • GFP_NOWAIT: similar to GFP_ATOMIC, except that the call will not use the emergency pools, which raises the chance of allocation failure.
  • GFP_NOIO: the allocation may block but must not start disk I/O. When memory runs short during allocation, the kernel can normally start disk I/O to swap pages out to the swap area; some contexts must not start I/O and therefore use this flag.
  • GFP_NOFS: may block when necessary and may start disk I/O, but must not call into the VFS. Filesystems generally forbid VFS operations while allocating pages; otherwise, writing back dirty file pages to reclaim memory under severe pressure could deadlock.
  • GFP_USER: a normal, blocking allocation; typically a chunk of memory allocated for hardware and then mapped into user space.
  • GFP_HIGHUSER: an extension of GFP_USER, also for user space; it allows allocating high memory that cannot be directly mapped by the kernel. There is no harm in user processes using highmem pages, because a user process's address space always goes through page tables.
  • GFP_HIGHUSER_MOVABLE: used like GFP_HIGHUSER, but preferring pages from ZONE_MOVABLE.

GFP_KERNEL is the single most common allocation mask in the kernel. It allocates from low memory first, and what it allocates is generally pages for kernel management structures and various caches.
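As a quick check, expanding the definitions above with the plain integer values quoted earlier shows that GFP_KERNEL sets no zone bits at all, which is why gfp_zone() (shown below) sends it to ZONE_NORMAL, i.e. low memory:

#include <stdio.h>

int main(void)
{
    /* values quoted earlier in this article */
    unsigned int gfp_kernel = 0x400000u   /* ___GFP_DIRECT_RECLAIM */
                            | 0x2000000u  /* ___GFP_KSWAPD_RECLAIM */
                            | 0x40u       /* ___GFP_IO */
                            | 0x80u;      /* ___GFP_FS */

    printf("GFP_KERNEL = %#x, zone bits = %#x\n",
           gfp_kernel, gfp_kernel & 0x0fu /* GFP_ZONEMASK low bits */);
    return 0;
}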

alloc_pages() ultimately calls __alloc_pages_nodemask(), the core function of the buddy allocator.

[alloc_pages->alloc_pages_node->__alloc_pages->__alloc_pages_nodemask] 
/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
                            nodemask_t *nodemask)
{
    struct page *page;
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
    struct alloc_context ac = { };

    gfp_mask &= gfp_allowed_mask;
    alloc_mask = gfp_mask;
    if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
        return NULL;

    finalise_ac(gfp_mask, order, &ac);

    /* First allocation attempt */
    page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    /*
     * Apply scoped allocation constraints. This is mainly about GFP_NOFS
     * resp. GFP_NOIO which has to be inherited for all allocation requests
     * from a particular context which has been marked by
     * memalloc_no{fs,io}_{save,restore}.
     */
    alloc_mask = current_gfp_context(gfp_mask);
    ac.spread_dirty_pages = false;

    /*
     * Restore the original nodemask if it was potentially replaced with
     * &cpuset_current_mems_allowed to optimize the fast-path attempt.
     */
    if (unlikely(ac.nodemask != nodemask))
        ac.nodemask = nodemask;

    page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
        unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
        __free_pages(page, order);
        page = NULL;
    }

    trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

    return page;
} 

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
        int preferred_nid, nodemask_t *nodemask,
        struct alloc_context *ac, gfp_t *alloc_mask,
        unsigned int *alloc_flags)
{
    ac->high_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfpflags_to_migratetype(gfp_mask);

    if (cpusets_enabled()) {
        *alloc_mask |= __GFP_HARDWALL;
        if (!ac->nodemask)
            ac->nodemask = &cpuset_current_mems_allowed;
        else
            *alloc_flags |= ALLOC_CPUSET;
    }

    fs_reclaim_acquire(gfp_mask);
    fs_reclaim_release(gfp_mask);

    might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

    if (should_fail_alloc_page(gfp_mask, order))
        return false;

    if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
        *alloc_flags |= ALLOC_CMA;

    return true;
}

alloc_context is an intermediate structure for an allocation; it holds the computed allocation-policy values. gfp_zone works out which zone allocation should start from; the macros it uses are shown below. On an embedded ARM system MAX_NR_ZONES is 3 (ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE), so ZONES_SHIFT is 2. gfp_zone(GFP_KERNEL) evaluates to 0, i.e. high_zoneidx is 0. pglist_data->zonelist->zoneref defines the priority order of the zones to allocate from, earlier entries having higher priority. gfpflags_to_migratetype derives the request's MIGRATE_TYPES type, which determines which zone->free_area->free_list is tried first; gfpflags_to_migratetype(GFP_KERNEL) evaluates to MIGRATE_UNMOVABLE.

static inline enum zone_type gfp_zone(gfp_t flags)
{
    enum zone_type z;
    int bit = (__force int) (flags & GFP_ZONEMASK);

    z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
        ((1 << ZONES_SHIFT) - 1);
    return z;
}

#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE) 
#define GFP_ZONE_TABLE ( \ 
 (ZONE_NORMAL << 0 * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA << ___GFP_DMA * ZONES_SHIFT) \ 
 | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA32 << ___GFP_DMA32 * ZONES_SHIFT) \ 
 | (ZONE_NORMAL << ___GFP_MOVABLE * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * ZONES_SHIFT) \ 
 | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT) \ 
 | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * ZONES_SHIFT) \ 
) 
#if MAX_NR_ZONES < 2 
#define ZONES_SHIFT 0 
#elif MAX_NR_ZONES <= 2 
#define ZONES_SHIFT 1 
#elif MAX_NR_ZONES <= 4 
#define ZONES_SHIFT 2
#endif
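A worked example under the ARM assumptions above (no DMA/DMA32 zones, so ZONE_NORMAL = 0, ZONE_HIGHMEM = 1, ZONE_MOVABLE = 2, and ZONES_SHIFT = 2): this compilable sketch rebuilds the relevant GFP_ZONE_TABLE entries and reproduces the lookups.

#include <stdio.h>

enum { ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };   /* assumed ARM layout */
#define ZONES_SHIFT     2
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_MOVABLE  0x08u
#define GFP_ZONEMASK    (___GFP_HIGHMEM | ___GFP_MOVABLE)  /* no DMA/DMA32 here */
#define GFP_ZONE_TABLE \
    (((unsigned long long)ZONE_NORMAL  << (0 * ZONES_SHIFT)) | \
     ((unsigned long long)ZONE_HIGHMEM << (___GFP_HIGHMEM * ZONES_SHIFT)) | \
     ((unsigned long long)ZONE_NORMAL  << (___GFP_MOVABLE * ZONES_SHIFT)) | \
     ((unsigned long long)ZONE_MOVABLE << ((___GFP_MOVABLE | ___GFP_HIGHMEM) * ZONES_SHIFT)))

static unsigned int gfp_zone_demo(unsigned int flags)
{
    unsigned int bit = flags & GFP_ZONEMASK;

    return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) & ((1u << ZONES_SHIFT) - 1);
}

int main(void)
{
    printf("GFP_KERNEL        -> %u (ZONE_NORMAL)\n",  gfp_zone_demo(0));
    printf("__GFP_HIGHMEM     -> %u (ZONE_HIGHMEM)\n", gfp_zone_demo(___GFP_HIGHMEM));
    printf("HIGHMEM | MOVABLE -> %u (ZONE_MOVABLE)\n",
           gfp_zone_demo(___GFP_HIGHMEM | ___GFP_MOVABLE));
    return 0;
}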

[__alloc_pages_nodemask->finalise_ac->first_zones_zonelist->next_zones_zonelist->__next_zones_zonelist]

first_zones_zonelist computes preferred_zoneref, the zone to try first. The computation is driven by high_zoneidx and is simple: walk the zones in the order given by pglist_data->zonelist->zoneref and take the first one whose index is less than or equal to high_zoneidx, also honoring nodemask, since the zoneref entries can come from different nodes. The work is ultimately done by __next_zones_zonelist. gfp_zone(GFP_KERNEL) evaluates to 0, so GFP_KERNEL can only allocate from a zone whose zone_idx is 0. The kernel's zone types are defined in enum zone_type; an ARM embedded system defines the three types ZONE_NORMAL, ZONE_HIGHMEM and ZONE_MOVABLE, laid out in memory in enum order, so the zone_idx(zone) macro yields 0 for ZONE_NORMAL, 1 for ZONE_HIGHMEM and 2 for ZONE_MOVABLE. Hence the GFP_KERNEL mask can only be satisfied from ZONE_NORMAL.

enum zone_type {
#ifdef CONFIG_ZONE_DMA
  ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
  ZONE_DMA32,
#endif
  ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
  ZONE_HIGHMEM,
#endif
  ZONE_MOVABLE,
  __MAX_NR_ZONES
};

/*
 * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
 */
#define zone_idx(zone)      ((zone) - (zone)->zone_pgdat->node_zones)

/* Returns the next zone at or below highest_zoneidx in a zonelist */
struct zoneref *__next_zones_zonelist(struct zoneref *z,
                    enum zone_type highest_zoneidx,
                    nodemask_t *nodes)
{
    /*
     * Find the next suitable zone to use for the allocation.
     * Only filter based on nodemask if it's set
     */
    if (unlikely(nodes == NULL))
        while (zonelist_zone_idx(z) > highest_zoneidx)
            z++;
    else
        while (zonelist_zone_idx(z) > highest_zoneidx ||
                (z->zone && !zref_in_nodemask(z, nodes)))
            z++;

    return z;
}

[__alloc_pages_nodemask->get_page_from_freelist]

With the preferred zone determined, pages can actually be allocated from it; get_page_from_freelist does this work. The logic is one big loop that starts with the preferred zone:

  • If cpusets are enabled, check whether the current process may allocate from this zone; if not, skip it.
  • Check the node's free-page watermark and skip the zone if the level is too low. The watermark value is computed from zone->watermark, lowmem_reserve, and the alloc_flags derived from gfp_mask. If the free-page count does not clear that value the check fails; otherwise an order-0 request passes immediately, while an order > 0 request additionally requires a suitably large contiguous block in the buddy lists. Three watermarks are defined, WMARK_MIN, WMARK_LOW and WMARK_HIGH, with their concrete values computed at system initialization. The usual kernel allocation paths check WMARK_LOW, whereas the kswapd reclaim thread works against WMARK_HIGH. The check is ultimately done by __zone_watermark_ok.
  • If the watermark is satisfied, allocate from the buddy system; if not, first try to reclaim memory and then retry allocating from this zone.
  • If this zone truly cannot satisfy the request, move on to the next zone in the zonelists with zone_idx <= highest_zoneidx and repeat the steps above.
#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
    for (zone = z->zone;    \
        zone;                           \
        z = next_zones_zonelist(++z, highidx, nodemask),    \
            zone = zonelist_zone(z))

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    struct zoneref *z = ac->preferred_zoneref;
    struct zone *zone;
    struct pglist_data *last_pgdat_dirty_limit = NULL;

    /*
     * Scan zonelist, looking for a zone with enough free.
     * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
     */
    for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
                                ac->nodemask) {
        struct page *page;
        unsigned long mark;

        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
        /*
         * When allocating a page cache page for writing, we
         * want to get it from a node that is within its dirty
         * limit, such that no single node holds more than its
         * proportional share of globally allowed dirty pages.
         * The dirty limits take into account the node's
         * lowmem reserves and high watermark so that kswapd
         * should be able to balance it without having to
         * write pages from its LRU list.
         *
         * XXX: For now, allow allocations to potentially
         * exceed the per-node dirty limit in the slowpath
         * (spread_dirty_pages unset) before going into reclaim,
         * which is important when on a NUMA setup the allowed
         * nodes are together not big enough to reach the
         * global limit.  The proper fix for these situations
         * will require awareness of nodes in the
         * dirty-throttling and the flusher threads.
         */
        if (ac->spread_dirty_pages) {
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                continue;

            if (!node_dirty_ok(zone->zone_pgdat)) {
                last_pgdat_dirty_limit = zone->zone_pgdat;
                continue;
            }
        }

        mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
        if (!zone_watermark_fast(zone, order, mark,
                       ac_classzone_idx(ac), alloc_flags)) {
            int ret;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /*
             * Watermark failed for this zone, but see if we can
             * grow this zone if it contains deferred pages.
             */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;

            if (node_reclaim_mode == 0 ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;

            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case NODE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                        ac_classzone_idx(ac), alloc_flags))
                    goto try_this_zone;

                continue;
            }
        }

try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);

            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);

            return page;
        } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /* Try again if zone has deferred pages */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
        }
    }

    return NULL;
}

/*
 * Return true if free base pages are above 'mark'. For high-order checks it
 * will return true of the order-0 watermark is reached and there is at least
 * one free page of a suitable size. Checking now avoids taking the zone lock
 * to check in the allocation paths if no pages are free.
 */
bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
             int classzone_idx, unsigned int alloc_flags,
             long free_pages)
{
    long min = mark;
    int o;
    const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));

    /* free_pages may go negative - that's OK */
    free_pages -= (1 << order) - 1;

    if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;

    /*
     * If the caller does not have rights to ALLOC_HARDER then subtract
     * the high-atomic reserves. This will over-estimate the size of the
     * atomic reserve but it avoids a search.
     */
    if (likely(!alloc_harder)) {
        free_pages -= z->nr_reserved_highatomic;
    } else {
        /*
         * OOM victims can try even harder than normal ALLOC_HARDER
         * users on the grounds that it's definitely going to be in
         * the exit path shortly and free memory. Any allocation it
         * makes during the free path will be small and short-lived.
         */
        if (alloc_flags & ALLOC_OOM)
            min -= min / 2;
        else
            min -= min / 4;
    }


#ifdef CONFIG_CMA
    /* If allocation can't use CMA areas don't use free CMA pages */
    if (!(alloc_flags & ALLOC_CMA))
        free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif

    /*
     * Check watermarks for an order-0 allocation request. If these
     * are not met, then a high-order request also cannot go ahead
     * even if a suitable page happened to be free.
     */
    if (free_pages <= min + z->lowmem_reserve[classzone_idx])
        return false;

    /* If this is an order-0 request then the watermark is fine */
    if (!order)
        return true;

    /* For a high-order request, check at least one suitable page is free */
    for (o = order; o < MAX_ORDER; o++) {
        struct free_area *area = &z->free_area[o];
        int mt;

        if (!area->nr_free)
            continue;

        for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
            if (!list_empty(&area->free_list[mt]))
                return true;
        }

#ifdef CONFIG_CMA
        if ((alloc_flags & ALLOC_CMA) &&
            !list_empty(&area->free_list[MIGRATE_CMA])) {
            return true;
        }
#endif
        if (alloc_harder &&
            !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
            return true;
    }
    return false;
}

[__alloc_pages_nodemask->get_page_from_freelist->rmqueue]

/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    unsigned long flags;
    struct page *page;

    if (likely(order == 0)) {
        page = rmqueue_pcplist(preferred_zone, zone, order,
                gfp_flags, migratetype);
        goto out;
    }

    /*
     * We most definitely don't want callers attempting to
     * allocate greater than order-1 page units with __GFP_NOFAIL.
     */
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
    spin_lock_irqsave(&zone->lock, flags);

    do {
        page = NULL;
        if (alloc_flags & ALLOC_HARDER) {
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
        if (!page)
            page = __rmqueue(zone, order, migratetype);
    } while (page && check_new_pages(page, order));
    spin_unlock(&zone->lock);
    if (!page)
        goto failed;
    __mod_zone_freepage_state(zone, -(1 << order),
                  get_pcppage_migratetype(page));

    __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
    zone_statistics(preferred_zone, zone);
    local_irq_restore(flags);

out:
    VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
    return page;

failed:
    local_irq_restore(flags);
    return NULL;
}

/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;

    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
        page = list_first_entry_or_null(&area->free_list[migratetype],
                            struct page, lru);
        if (!page)
            continue;
        list_del(&page->lru);
        rmv_page_order(page);
        area->nr_free--;
        expand(zone, page, order, current_order, area, migratetype);
        set_pcppage_migratetype(page, migratetype);
        return page;
    }

    return NULL;
}

rmqueue is the function that actually allocates the memory:

  • It first calls rmqueue_pcplist to allocate from the per-CPU cached pages (order-0 requests).
  • Failing that, __rmqueue_smallest allocates directly from the buddy system; it simply scans the buddy lists for a suitable block and, if none is found, bails out without any complex page-reclaim handling.
  • If __rmqueue_smallest fails, __rmqueue is called; it tries hard to find space in the buddy system matching migratetype, and when the current migratetype has nothing, it searches the other block lists in the priority order defined by the fallbacks table.

Once a suitable block is found in the buddy system, it is taken off its list, and expand() carves out the requested size; the remainder is split into 2^N-page pieces and returned to the buddy system, preferring the highest-order free lists possible.
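The splitting loop itself is short; a simplified version of the kernel's expand() (debug checks and guard-page handling omitted) shows the remainder being pushed back one order at a time:

/* Simplified expand(): `high` is the order of the block taken from the free
 * list, `low` is the order actually requested. Each pass returns the upper
 * half of the block to the next-lower free list. */
static inline void expand(struct zone *zone, struct page *page,
        int low, int high, struct free_area *area, int migratetype)
{
    unsigned long size = 1 << high;

    while (high > low) {
        area--;
        high--;
        size >>= 1;
        list_add(&page[size].lru, &area->free_list[migratetype]);
        area->nr_free++;
        set_page_order(&page[size], high);
    }
}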

Back in __alloc_pages_nodemask: if get_page_from_freelist fails, __alloc_pages_slowpath takes over. Depending on whether gfp_mask contains __GFP_KSWAPD_RECLAIM, it wakes the kswapd kernel thread to reclaim pages, then immediately calls get_page_from_freelist again. If that still fails, it frees memory through direct reclaim and page migration and retries the allocation. If even that fails and gfp_mask has __GFP_NOFAIL set, the current task keeps retrying (blocking) until the allocation succeeds; otherwise the allocation fails and returns.
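An outline of that slow path, as hedged pseudocode condensed from __alloc_pages_slowpath (retry accounting, OOM handling, and many corner cases omitted):

/* Pseudocode sketch of __alloc_pages_slowpath; not the complete logic. */
retry:
    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
        wake_all_kswapds(order, ac);              /* background reclaim */

    page = get_page_from_freelist(...);           /* retry with adjusted flags */
    if (page)
        return page;

    page = __alloc_pages_direct_compact(...);     /* async compaction */
    if (!page)
        page = __alloc_pages_direct_reclaim(...); /* direct reclaim */
    if (!page)
        page = __alloc_pages_direct_compact(...); /* sync compaction */
    if (page)
        return page;

    if (gfp_mask & __GFP_NOFAIL)
        goto retry;                               /* keep trying forever */
    return NULL;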

Page freeing

The core page-freeing function is free_page(), which ultimately calls __free_pages(). __free_pages() handles two cases: order equal to 0 receives special treatment, while order greater than 0 follows the normal path.

void __free_pages(struct page *page, unsigned int order)
{
    if (put_page_testzero(page)) {
        if (order == 0)
            free_unref_page(page);
        else
            __free_pages_ok(page, order);
    }
}

The core of freeing is returning the pages to the buddy system: check whether the freed block and its adjacent buddy can merge; if so, merge them and link the new block onto the next-higher-order list, then repeat until the block can merge no further.

list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);

This finally links the block into the free list.
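For reference, the buddy-locating helper used by __free_one_page below (from mm/internal.h in kernels of this vintage) is just the bit flip from the sketch at the start of this article:

static inline unsigned long __find_buddy_pfn(unsigned long page_pfn,
                                             unsigned int order)
{
    return page_pfn ^ (1 << order);
}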

/*
 * Freeing function for a buddy system allocator.
 *
 * The concept of a buddy system is to maintain direct-mapped table
 * (containing bit values) for memory blocks of various "orders".
 * The bottom level table contains the map for the smallest allocatable
 * units of memory (here, pages), and each level above it describes
 * pairs of units from the levels below, hence, "buddies".
 * At a high level, all that happens here is marking the table entry
 * at the bottom level available, and propagating the changes upward
 * as necessary, plus some accounting needed to play nicely with other
 * parts of the VM system.
 * At each level, we keep a list of pages, which are heads of continuous
 * free pages of length of (1 << order) and marked with _mapcount
 * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
 * field.
 * So when we are allocating or freeing one, we can derive the state of the
 * other.  That is, if we allocate a small block, and both were
 * free, the remainder of the region must be split into blocks.
 * If a block is freed, and its buddy is also free, then this
 * triggers coalescing into a block of larger size.
 *
 * -- nyc
 */

static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype)
{
    unsigned long combined_pfn;
    unsigned long uninitialized_var(buddy_pfn);
    struct page *buddy;
    unsigned int max_order;

    max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

    VM_BUG_ON(!zone_is_initialized(zone));
    VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

    VM_BUG_ON(migratetype == -1);
    if (likely(!is_migrate_isolate(migratetype)))
        __mod_zone_freepage_state(zone, 1 << order, migratetype);

    VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
    VM_BUG_ON_PAGE(bad_range(zone, page), page);

continue_merging:
    while (order < max_order - 1) {
        buddy_pfn = __find_buddy_pfn(pfn, order);
        buddy = page + (buddy_pfn - pfn);

        if (!pfn_valid_within(buddy_pfn))
            goto done_merging;
        if (!page_is_buddy(page, buddy, order))
            goto done_merging;
        /*
         * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
         * merge with it and move up one order.
         */
        if (page_is_guard(buddy)) {
            clear_page_guard(zone, buddy, order, migratetype);
        } else {
            list_del(&buddy->lru);
            zone->free_area[order].nr_free--;
            rmv_page_order(buddy);
        }
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    if (max_order < MAX_ORDER) {
        /* If we are here, it means order is >= pageblock_order.
         * We want to prevent merge between freepages on isolate
         * pageblock and normal pageblock. Without this, pageblock
         * isolation could cause incorrect freepage or CMA accounting.
         *
         * We don't want to hit this code for the more frequent
         * low-order merging.
         */
        if (unlikely(has_isolate_pageblock(zone))) {
            int buddy_mt;

            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);
            buddy_mt = get_pageblock_migratetype(buddy);

            if (migratetype != buddy_mt
                    && (is_migrate_isolate(migratetype) ||
                        is_migrate_isolate(buddy_mt)))
                goto done_merging;
        }
        max_order++;
        goto continue_merging;
    }

done_merging:
    set_page_order(page, order);

    /*
     * If this is not the largest possible page, check if the buddy
     * of the next-highest order is free. If it is, it's possible
     * that pages are being freed that will coalesce soon. In case,
     * that is happening, add the free page to the tail of the list
     * so it's less likely to be used soon and more likely to be merged
     * as a higher order page
     */
    if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)) {
        struct page *higher_page, *higher_buddy;
        combined_pfn = buddy_pfn & pfn;
        higher_page = page + (combined_pfn - pfn);
        buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
        higher_buddy = higher_page + (buddy_pfn - combined_pfn);
        if (pfn_valid_within(buddy_pfn) &&
            page_is_buddy(higher_page, higher_buddy, order + 1)) {
            list_add_tail(&page->lru,
                &zone->free_area[order].free_list[migratetype]);
            goto out;
        }
    }

    list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
out:
    zone->free_area[order].nr_free++;
}

For order 0 the kernel does something special: zone->pageset initializes a per-CPU struct per_cpu_pageset for each CPU. When an order-0 page is freed, it first goes onto the corresponding per_cpu_pages list, which keeps pages on the CPU that used them and avoids constantly thrashing the caches. The list cannot grow forever, of course: the function checks the number of pages on it, and once it exceeds a threshold, a batch of pages is genuinely returned to the buddy system, again ultimately via __free_one_page.

[free_unref_page_commit->free_pcppages_bulk->__free_one_page]

struct per_cpu_pages {
    int count;      /* number of pages in the list */
    int high;       /* high watermark, emptying needed */
    int batch;      /* chunk size for buddy add/remove */

    /* Lists of pages, one per migrate type stored on the pcp-lists */
    struct list_head lists[MIGRATE_PCPTYPES];
};
  • count is the number of per_cpu_pages pages currently cached for this zone on this CPU.
  • high is the watermark: once the cached pages exceed it, pages are returned to the buddy system.
  • batch is the number of pages returned to the buddy system in one go.
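A hedged sketch of the order-0 free fast path, modeled on free_unref_page_commit (pcp-migratetype fixups and isolation handling omitted; see the sketch below): the page goes onto the per-CPU list first, and only when count reaches high is a batch handed back to the buddy lists by free_pcppages_bulk().

/* Sketch of the order-0 free fast path (simplified). */
static void pcp_free_sketch(struct zone *zone, struct page *page)
{
    struct per_cpu_pages *pcp = &this_cpu_ptr(zone->pageset)->pcp;
    int migratetype = get_pcppage_migratetype(page);

    list_add(&page->lru, &pcp->lists[migratetype]);
    pcp->count++;
    if (pcp->count >= pcp->high) {
        unsigned long batch = READ_ONCE(pcp->batch);

        free_pcppages_bulk(zone, batch, pcp);   /* return one batch to buddy */
    }
}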
/*
 * Frees a number of pages from the PCP lists
 * Assumes all pages on list are in same zone, and of same order.
 * count is the number of pages to free.
 *
 * If the zone was previously in an "all pages pinned" state then look to
 * see if this freeing clears that state.
 *
 * And clear the zone's pages_scanned counter, to hold off the "all pages are
 * pinned" detection logic.
 */
static void free_pcppages_bulk(struct zone *zone, int count,
                    struct per_cpu_pages *pcp)
{
    int migratetype = 0;
    int batch_free = 0;
    int prefetch_nr = 0;
    bool isolated_pageblocks;
    struct page *page, *tmp;
    LIST_HEAD(head);

    while (count) {
        struct list_head *list;

        /*
         * Remove pages from lists in a round-robin fashion. A
         * batch_free count is maintained that is incremented when an
         * empty list is encountered.  This is so more pages are freed
         * off fuller lists instead of spinning excessively around empty
         * lists
         */
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
            list = &pcp->lists[migratetype];
        } while (list_empty(list));

        /* This is the only non-empty list. Free them all. */
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;

        do {
            page = list_last_entry(list, struct page, lru);
            /* must delete to avoid corrupting pcp list */
            list_del(&page->lru);
            pcp->count--;

            if (bulkfree_pcp_prepare(page))
                continue;

            list_add_tail(&page->lru, &head);

            /*
             * We are going to put the page back to the global
             * pool, prefetch its buddy to speed up later access
             * under zone->lock. It is believed the overhead of
             * an additional test and calculating buddy_pfn here
             * can be offset by reduced memory latency later. To
             * avoid excessive prefetching due to large count, only
             * prefetch buddy for the first pcp->batch nr of pages.
             */
            if (prefetch_nr++ < pcp->batch)
                prefetch_buddy(page);
        } while (--count && --batch_free && !list_empty(list));
    }

    spin_lock(&zone->lock);
    isolated_pageblocks = has_isolate_pageblock(zone);

    /*
     * Use safe version since after __free_one_page(),
     * page->lru.next will not point to original list.
     */
    list_for_each_entry_safe(page, tmp, &head, lru) {
        int mt = get_pcppage_migratetype(page);
        /* MIGRATE_ISOLATE page should not go to pcplists */
        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
        /* Pageblock could have been isolated meanwhile */
        if (unlikely(isolated_pageblocks))
            mt = get_pageblock_migratetype(page);

        __free_one_page(page, page_to_pfn(page), zone, 0, mt);
        trace_mm_page_pcpu_drain(page, 0, mt);
    }
    spin_unlock(&zone->lock);
}

Some of the figures in this article were sourced from the Internet; thanks to their authors.
