進(jìn)程調(diào)度

目標(biāo)

本章將討論Linux內(nèi)核是如何進(jìn)行進(jìn)程調(diào)度的,進(jìn)程調(diào)度程序(也稱(chēng)為調(diào)度器)的工作與實(shí)現(xiàn)原理。

進(jìn)程調(diào)度程序負(fù)責(zé)決定運(yùn)行哪個(gè)進(jìn)程、什么時(shí)候運(yùn)行、運(yùn)行多長(zhǎng)時(shí)間。進(jìn)程調(diào)度程序可以看作是一個(gè)負(fù)責(zé)給進(jìn)程分配有限處理器時(shí)間資源的子系統(tǒng)。一個(gè)合理的調(diào)度程序可以更充分的發(fā)揮系統(tǒng)資源的利用。

本文代碼是基于linux-2.6.34版本。

多任務(wù)與搶占

操作系統(tǒng)中可以同時(shí)并發(fā)地運(yùn)行多個(gè)進(jìn)程。當(dāng)進(jìn)程數(shù)大于處理器數(shù)量時(shí),在某一時(shí)刻,總有一些進(jìn)程沒(méi)有被cpu真正執(zhí)行。一部分進(jìn)程可能因?yàn)闆](méi)有分配到cpu,處于就緒狀態(tài);一部分進(jìn)程可能在等待I/O事件的發(fā)生被內(nèi)核阻塞著或者睡眠著,處于等待狀態(tài)。

在多任務(wù)模式下,為了避免某些進(jìn)程獨(dú)占資源導(dǎo)致其它進(jìn)程一直無(wú)法運(yùn)行或者有更緊急的進(jìn)程需要立即運(yùn)行,進(jìn)程調(diào)度程序需要強(qiáng)制將某些進(jìn)程掛起,讓其它等待的進(jìn)程運(yùn)行,這個(gè)過(guò)程被稱(chēng)為搶占

調(diào)度策略

I/O消耗型進(jìn)程和CPU消耗型進(jìn)程

進(jìn)程可分為I/O消耗型和CPU消耗型。

  • I/O消耗型:指大部分時(shí)間用來(lái)提供I/O請(qǐng)求或等待I/O請(qǐng)求
  • CPU消耗型:大部分時(shí)間用于執(zhí)行代碼,沒(méi)有太多的I/O需求

I/O消耗型的進(jìn)程往往不需要太多的CPU資源,但期望有快速的I/O響應(yīng)。而CPU消耗型的進(jìn)程則期望占用更長(zhǎng)CPU時(shí)間。

進(jìn)程優(yōu)先級(jí)

Linux采用兩種不同的優(yōu)先級(jí)范圍:

  • 用nice值:范圍從-20到+19,默認(rèn)值為0,nice值越大,優(yōu)先級(jí)越低。
  • 實(shí)時(shí)優(yōu)先級(jí):范圍從0到99,值越大,優(yōu)先級(jí)越高。

時(shí)間片

時(shí)間片表示進(jìn)程在被搶占前持續(xù)運(yùn)行的時(shí)間。

時(shí)間片過(guò)長(zhǎng)會(huì)導(dǎo)致系統(tǒng)對(duì)交互式響應(yīng)(I/O型進(jìn)程)表現(xiàn)欠佳,時(shí)間片過(guò)短會(huì)增大進(jìn)程切換帶來(lái)的cpu消耗。

調(diào)度算法

調(diào)度器類(lèi)

Linux內(nèi)核通過(guò)抽象調(diào)度器的公共數(shù)據(jù)結(jié)構(gòu),以模塊化的方式組織所有的調(diào)度算法,保證多個(gè)調(diào)度算法可以并存。這個(gè)公共數(shù)據(jù)結(jié)構(gòu)稱(chēng)為調(diào)度器類(lèi)。每一種調(diào)度器類(lèi)調(diào)度屬于自己范疇的進(jìn)程。

調(diào)度器類(lèi)數(shù)據(jù)結(jié)構(gòu)struct sched_class定義在<linux/sched.h>文件中。

struct sched_class {
    const struct sched_class *next; //所有調(diào)度器類(lèi)以鏈表連接,排在鏈表開(kāi)始的優(yōu)先級(jí)最高

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup,
                  bool head);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    void (*yield_task) (struct rq *rq);

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    struct task_struct * (*pick_next_task) (struct rq *rq);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

    void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
    void (*post_schedule) (struct rq *this_rq);
    void (*task_waking) (struct rq *this_rq, struct task_struct *task);
    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);

    void (*switched_from) (struct rq *this_rq, struct task_struct *task,
                   int running);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task,
                 int running);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio, int running);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*moved_group) (struct task_struct *p, int on_rq);
#endif
};

每個(gè)進(jìn)程描述符都包含了調(diào)度器類(lèi)的成員指針(const struct sched_class *sched_class),表明該進(jìn)程所屬的調(diào)度器類(lèi)。

調(diào)度器類(lèi)數(shù)據(jù)結(jié)構(gòu)struct sched_class聲明了一個(gè)具體調(diào)度器類(lèi)實(shí)例對(duì)象需要實(shí)現(xiàn)的通用的操作接口:

  • enqueue_task:將進(jìn)程添加到就緒隊(duì)列中,在進(jìn)程從睡眠狀態(tài)變?yōu)榭蛇\(yùn)行狀態(tài)時(shí),調(diào)用該操作。
  • dequeue_task:將進(jìn)程從就緒隊(duì)列中移除。如果進(jìn)程從可運(yùn)行狀態(tài)轉(zhuǎn)換到不可運(yùn)行狀態(tài),調(diào)用該操作。
  • yield_task:當(dāng)進(jìn)程使用系統(tǒng)調(diào)用sched_yield自愿放棄對(duì)cpu的控制時(shí),調(diào)用該操作。
  • check_preempt_curr:?jiǎn)拘研逻M(jìn)程來(lái)?yè)屨籍?dāng)前進(jìn)程。例如當(dāng)fork新的進(jìn)程時(shí),會(huì)調(diào)用wake_up_new_task()函數(shù)來(lái)喚醒新的子進(jìn)程,就會(huì)調(diào)用該操作。
  • pick_next_task:選擇下一個(gè)將要運(yùn)行的進(jìn)程。
  • put_prev_task:該函數(shù)是在另一個(gè)進(jìn)程代替當(dāng)前運(yùn)行的進(jìn)程之前調(diào)用。主要負(fù)責(zé)底層的上下文切換。
  • set_curr_task:在進(jìn)程的調(diào)度策略發(fā)生變化時(shí),調(diào)用此操作。
  • task_tick:每次激活周期性調(diào)度器時(shí),由周期性調(diào)度器調(diào)用。
  • task_fork:每次fork出新進(jìn)程后,調(diào)用此操作通知調(diào)度器。

調(diào)度實(shí)體

調(diào)度器類(lèi)調(diào)度的對(duì)象被抽象為調(diào)度實(shí)體,調(diào)度實(shí)體的數(shù)據(jù)結(jié)構(gòu)中維護(hù)了與調(diào)度相關(guān)的統(tǒng)計(jì)信息。
調(diào)度實(shí)體的數(shù)據(jù)結(jié)構(gòu)在<linux/sched.h>中定義。

/*
 * CFS stats for a schedulable entity (task, task-group etc)
 *
 * Current field usage histogram:
 *
 *     4 se->block_start
 *     4 se->run_node
 *     4 se->sleep_start
 *     6 se->load.weight
 */
struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq; //1-表示在就緒隊(duì)列中,0-未在就緒隊(duì)列中

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         last_wakeup;
    u64         avg_overlap;

    u64         nr_migrations;

    u64         start_runtime;
    u64         avg_wakeup;

#ifdef CONFIG_SCHEDSTATS
    u64         wait_start;
    u64         wait_max;
    u64         wait_count;
    u64         wait_sum;
    u64         iowait_count;
    u64         iowait_sum;

    u64         sleep_start;
    u64         sleep_max;
    s64         sum_sleep_runtime;

    u64         block_start;
    u64         block_max;
    u64         exec_max;
    u64         slice_max;

    u64         nr_migrations_cold;
    u64         nr_failed_migrations_affine;
    u64         nr_failed_migrations_running;
    u64         nr_failed_migrations_hot;
    u64         nr_forced_migrations;

    u64         nr_wakeups;
    u64         nr_wakeups_sync;
    u64         nr_wakeups_migrate;
    u64         nr_wakeups_local;
    u64         nr_wakeups_remote;
    u64         nr_wakeups_affine;
    u64         nr_wakeups_affine_attempts;
    u64         nr_wakeups_passive;
    u64         nr_wakeups_idle;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};

進(jìn)程描述符結(jié)構(gòu)中包含了一個(gè)struct sched_entity的成員變量se,使進(jìn)程成為可調(diào)度實(shí)體。

就緒隊(duì)列

就緒隊(duì)列主要保存了所有活動(dòng)進(jìn)程。其數(shù)據(jù)結(jié)構(gòu)struct rq定義在kernel/sched.c文件中。

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned long nr_running; //隊(duì)列上可運(yùn)行的進(jìn)程總數(shù)
    #define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX]; //用于跟蹤負(fù)載狀態(tài)
#ifdef CONFIG_NO_HZ
    unsigned char in_nohz_recently;
#endif
    /* capture load from *all* tasks on this cpu: */
    struct load_weight load; //負(fù)載信息
    unsigned long nr_load_updates;
    u64 nr_switches;

    struct cfs_rq cfs; //CFS調(diào)度類(lèi)就緒隊(duì)列
    struct rt_rq rt; //實(shí)時(shí)調(diào)度類(lèi)就緒隊(duì)列

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* list of leaf cfs_rq on this cpu: */
    struct list_head leaf_cfs_rq_list;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
    struct list_head leaf_rt_rq_list;
#endif

    /*
     * This is part of a global counter where only the total sum
     * over all CPUs matters. A task can increase this counter on
     * one CPU and if it got migrated afterwards it may decrease
     * it on another CPU. Always updated under the runqueue lock:
     */
    unsigned long nr_uninterruptible;

    struct task_struct *curr, *idle; //curr-當(dāng)前運(yùn)行的進(jìn)程,idle-空閑進(jìn)程
    unsigned long next_balance;
    struct mm_struct *prev_mm;

    u64 clock; //時(shí)鐘,周期性調(diào)度器時(shí)會(huì)定時(shí)更新該值

    atomic_t nr_iowait;

#ifdef CONFIG_SMP
    struct root_domain *rd;
    struct sched_domain *sd;

    unsigned char idle_at_tick;
    /* For active balancing */
    int post_schedule;
    int active_balance;
    int push_cpu;
    /* cpu of this runqueue: */
    int cpu;
    int online;

    unsigned long avg_load_per_task;

    struct task_struct *migration_thread;
    struct list_head migration_queue;

    u64 rt_avg;
    u64 age_stamp;
    u64 idle_stamp;
    u64 avg_idle;
#endif

    /* calc_load related fields */
    unsigned long calc_load_update;
    long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
    int hrtick_csd_pending;
    struct call_single_data hrtick_csd;
#endif
    struct hrtimer hrtick_timer;
#endif

#ifdef CONFIG_SCHEDSTATS
    /* latency stats */
    struct sched_info rq_sched_info;
    unsigned long long rq_cpu_time;
    /* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

    /* sys_sched_yield() stats */
    unsigned int yld_count;

    /* schedule() stats */
    unsigned int sched_switch;
    unsigned int sched_count;
    unsigned int sched_goidle;

    /* try_to_wake_up() stats */
    unsigned int ttwu_count;
    unsigned int ttwu_local;

    /* BKL stats */
    unsigned int bkl_count;
#endif
};

每個(gè)CPU都有各自的就緒隊(duì)列。每一個(gè)活動(dòng)進(jìn)程只出現(xiàn)一個(gè)就緒隊(duì)列中。就緒隊(duì)列并沒(méi)有直接管理進(jìn)程,而是由嵌入的特定調(diào)度類(lèi)的就緒隊(duì)列來(lái)負(fù)責(zé)的,比如struct cfs_rq cfsstruct rt_rq rt。

內(nèi)核中的所有就緒隊(duì)列都在runqueues[]數(shù)組中,該數(shù)組中的每一個(gè)元素分別對(duì)應(yīng)于系統(tǒng)中一個(gè)cpu。該數(shù)組定義在kernel/sched.c文件中。

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

//針對(duì)runqueues的一些操作宏
#define cpu_rq(cpu)     (&per_cpu(runqueues, (cpu)))
#define this_rq()       (&__get_cpu_var(runqueues))
#define task_rq(p)      cpu_rq(task_cpu(p))
#define cpu_curr(cpu)       (cpu_rq(cpu)->curr)
#define raw_rq()        (&__raw_get_cpu_var(runqueues))

完全公平調(diào)度類(lèi)CFS

CFS調(diào)度類(lèi)在kernel/sched_fair.c文件中定義。

/*
 * All the scheduling class methods:
 */
static const struct sched_class fair_sched_class = {
    .next           = &idle_sched_class,
    .enqueue_task       = enqueue_task_fair,
    .dequeue_task       = dequeue_task_fair,
    .yield_task     = yield_task_fair,

    .check_preempt_curr = check_preempt_wakeup,

    .pick_next_task     = pick_next_task_fair,
    .put_prev_task      = put_prev_task_fair,

#ifdef CONFIG_SMP
    .select_task_rq     = select_task_rq_fair,

    .rq_online      = rq_online_fair,
    .rq_offline     = rq_offline_fair,

    .task_waking        = task_waking_fair,
#endif

    .set_curr_task          = set_curr_task_fair,
    .task_tick      = task_tick_fair,
    .task_fork      = task_fork_fair,

    .prio_changed       = prio_changed_fair,
    .switched_to        = switched_to_fair,

    .get_rr_interval    = get_rr_interval_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
    .moved_group        = moved_group_fair,
#endif
};

CFS就緒隊(duì)列

CFS就緒隊(duì)列數(shù)據(jù)結(jié)構(gòu)struct cfs_rq嵌入在就緒隊(duì)列數(shù)據(jù)結(jié)構(gòu)struct rq中,提供給CFS調(diào)度算法來(lái)使用,在kernel/sched.c文件定義。

/* CFS-related fields in a runqueue */
struct cfs_rq {
    struct load_weight load; //負(fù)載信息
    unsigned long nr_running; //隊(duì)列上可運(yùn)行的進(jìn)程數(shù)目

    u64 exec_clock;
    u64 min_vruntime; //所有進(jìn)程中最小的虛擬運(yùn)行時(shí)間,該值是逐漸遞增的

    struct rb_root tasks_timeline; //按時(shí)間排序的紅黑樹(shù)
    struct rb_node *rb_leftmost; //指向樹(shù)上最左邊的節(jié)點(diǎn)(即最需要被調(diào)度的節(jié)點(diǎn))

    struct list_head tasks;
    struct list_head *balance_iterator;

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last;//curr-表示當(dāng)前正在執(zhí)行的調(diào)度實(shí)體

    unsigned int nr_spread_over;

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq;  /* cpu runqueue to which this cfs_rq is attached */

    /*
     * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
     * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
     * (like users, containers etc.)
     *
     * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
     * list is used during load balance.
     */
    struct list_head leaf_cfs_rq_list;
    struct task_group *tg;  /* group that "owns" this runqueue */

#ifdef CONFIG_SMP
    /*
     * the part of load.weight contributed by tasks
     */
    unsigned long task_weight;

    /*
     *   h_load = weight * f(tg)
     *
     * Where f(tg) is the recursive weight fraction assigned to
     * this group.
     */
    unsigned long h_load;

    /*
     * this cpu's part of tg->shares
     */
    unsigned long shares;

    /*
     * load.weight at the time we set shares
     */
    unsigned long rq_weight;
#endif
#endif
};

虛擬時(shí)間

CFS調(diào)度依賴(lài)虛擬時(shí)間,虛擬時(shí)間用于度量進(jìn)程能夠得到的CPU時(shí)間。與虛擬時(shí)間相關(guān)的計(jì)算都是在update_curr()函數(shù)中執(zhí)行,該函數(shù)在系統(tǒng)中各個(gè)不同地方調(diào)用,包含周期性調(diào)度器之內(nèi)。

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock;
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    //計(jì)算當(dāng)前進(jìn)程從上次更新exec_start到現(xiàn)在的執(zhí)行時(shí)間差
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);//統(tǒng)計(jì)當(dāng)前進(jìn)程的運(yùn)行時(shí)間和vruntime
    curr->exec_start = now;

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
 */
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec;//累積總的運(yùn)行時(shí)間
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);//計(jì)算虛擬時(shí)間權(quán)重增量

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);//更新cfs_rq的min_vruntime
}

update_curr()函數(shù)主要作用是更新當(dāng)前進(jìn)程的累積物理運(yùn)行時(shí)間和虛擬時(shí)間,以及cfs_rq的min_vruntime。

虛擬時(shí)間權(quán)重增量delta_exec_weighted的計(jì)算公式:

delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)

系統(tǒng)內(nèi)部定義了一個(gè)優(yōu)化級(jí)與權(quán)重之間的換算關(guān)系,優(yōu)化級(jí)越高(nice越低),權(quán)重就越大,每次計(jì)算虛擬時(shí)間權(quán)重時(shí),累積增量也就越小,進(jìn)程的vruntime就增長(zhǎng)的越緩慢。

vruntime增長(zhǎng)越快的進(jìn)程將會(huì)向紅黑樹(shù)右邊移動(dòng),則增長(zhǎng)越緩慢則向左邊移動(dòng)。這樣就可以保證優(yōu)化級(jí)越高的進(jìn)程會(huì)優(yōu)先被調(diào)度。

入隊(duì)

當(dāng)進(jìn)程變成可運(yùn)行狀態(tài)(被喚醒)或者通過(guò)fork()創(chuàng)建的新進(jìn)程時(shí),CFS通過(guò)入隊(duì)操作將進(jìn)程添加到紅黑樹(shù)中,并緩存最左葉子節(jié)點(diǎn)。

enqueue_task_fair()函數(shù)是CFS調(diào)度器類(lèi)入隊(duì)操作的入口,它調(diào)用enqueue_entity()函數(shù)來(lái)執(zhí)行入隊(duì)真正的入隊(duì)操作。

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, bool head)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;
    int flags = 0;

    if (wakeup)
        flags |= ENQUEUE_WAKEUP;
    if (p->state == TASK_WAKING)
        flags |= ENQUEUE_MIGRATE;

    for_each_sched_entity(se) {
        if (se->on_rq) //如果已經(jīng)在就緒隊(duì)列中,則不用再入隊(duì)
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, flags);
        flags = ENQUEUE_WAKEUP;
    }

    hrtick_update(rq);
}

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through callig update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);//更新當(dāng)前進(jìn)程的運(yùn)行時(shí)間和vruntime
    //更新實(shí)體入隊(duì)的一些統(tǒng)計(jì)信息和狀態(tài)
    //如累加cfs_rq->load,cfs_rq->nr_running++,se->on_rq = 1
    account_entity_enqueue(cfs_rq, se);

    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)//如果需要入隊(duì)的實(shí)體并非當(dāng)前正在執(zhí)行的實(shí)體
        __enqueue_entity(cfs_rq, se);//將實(shí)體添加到rbtree中
}

enqueue_entity()函數(shù)更新入隊(duì)實(shí)體的vruntime和其它一些統(tǒng)計(jì)數(shù)據(jù),然后再調(diào)用__enqueue_entity()函數(shù)將實(shí)體插入到rbtree中。

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);//key = se->vruntime - cfs_rq->min_vruntime
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     * 查找插入新節(jié)點(diǎn)的合適位置
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    //不存在key值比新添加的節(jié)點(diǎn)更小的節(jié)點(diǎn)
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;//如果新添加的節(jié)點(diǎn)的key最小,則緩存起來(lái)

    rb_link_node(&se->run_node, parent, link);//將實(shí)體節(jié)點(diǎn)添加到紅黑樹(shù)中
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);//對(duì)添加的新節(jié)點(diǎn)著色平衡
}

從上面的算法可以看出,rbtime節(jié)點(diǎn)組織方式是左節(jié)點(diǎn)的key比右節(jié)點(diǎn)的key小。而key的值相當(dāng)于實(shí)體的vruntime值。vruntime值越小,則越向rbtree左邊移動(dòng)。rb_leftmost緩存了vruntime值最小的葉子節(jié)點(diǎn)。

出隊(duì)

當(dāng)進(jìn)程變成不可運(yùn)行狀態(tài)(被阻塞)或者終止時(shí),CFS將進(jìn)程從紅黑樹(shù)中移除。

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        dequeue_entity(cfs_rq, se, sleep);//將實(shí)體從就緒隊(duì)列中移除
        /* Don't dequeue parent if it has other entities besides us */
        if (cfs_rq->load.weight)
            break;
        sleep = 1;
    }

    hrtick_update(rq);
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    update_stats_dequeue(cfs_rq, se);
    if (sleep) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);

            if (tsk->state & TASK_INTERRUPTIBLE)
                se->sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }

    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    //更新實(shí)體入隊(duì)的一些統(tǒng)計(jì)信息和狀態(tài)
    //如累減cfs_rq->load,cfs_rq->nr_running--,se->on_rq = 0
    account_entity_dequeue(cfs_rq, se);
    update_min_vruntime(cfs_rq);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!sleep)
        se->vruntime -= cfs_rq->min_vruntime;
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    //如果要移除的節(jié)點(diǎn)恰好也是緩存的最左節(jié)點(diǎn),則需要更新新的最左節(jié)點(diǎn)
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node; //更新rb_leftmost
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);//將節(jié)點(diǎn)從rbtree中移除
}

選擇下一個(gè)進(jìn)程

CFS選取下一個(gè)待運(yùn)行的進(jìn)程,是所有進(jìn)程中vruntime最小的那個(gè),也就是CFS緩存起來(lái)的最左的葉子節(jié)點(diǎn)。

當(dāng)前執(zhí)行的進(jìn)程不能存放在rbtree中,所以需要將進(jìn)程從rbtree中移除,并通過(guò)cfs_rq->curr成員來(lái)引用跟蹤。

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    //如果隊(duì)列中沒(méi)有可運(yùn)行的進(jìn)程,則立即返回
    if (!cfs_rq->nr_running)
        return NULL;

    do {
        se = pick_next_entity(cfs_rq);//獲取一下待運(yùn)行的進(jìn)程
        set_next_entity(cfs_rq, se);//將實(shí)體從rbtree中移除,并更新實(shí)體的開(kāi)始執(zhí)行時(shí)間
        cfs_rq = group_cfs_rq(se);
    } while (cfs_rq);

    p = task_of(se);
    hrtick_start_fair(rq, p);

    return p;
}

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct sched_entity *se = __pick_next_entity(cfs_rq);
    struct sched_entity *left = se;

    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    clear_buddies(cfs_rq, se);

    return se;
}

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;//獲取紅黑樹(shù)上緩存的最左節(jié)點(diǎn)

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);//獲取節(jié)點(diǎn)對(duì)應(yīng)的調(diào)度實(shí)體
}

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* 'current' is not kept within the tree. */
    if (se->on_rq) {
        /*
         * Any task has to be enqueued before it get to execute on
         * a CPU. So account for the time it spent waiting on the
         * runqueue.
         */
        update_stats_wait_end(cfs_rq, se);
        __dequeue_entity(cfs_rq, se);//將調(diào)度實(shí)體從就緒隊(duì)列中移出
    }

    update_stats_curr_start(cfs_rq, se);//更新進(jìn)程當(dāng)前開(kāi)始執(zhí)行的時(shí)間
    cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
    /*
     * Track our maximum slice length, if the CPU's load is at
     * least twice that of our own weight (i.e. dont track it
     * when there are only lesser-weight tasks around):
     */
    if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
        se->slice_max = max(se->slice_max,
            se->sum_exec_runtime - se->prev_sum_exec_runtime);
    }
#endif
    se->prev_sum_exec_runtime = se->sum_exec_runtime;//保存總的運(yùn)行時(shí)間,后面用于計(jì)算進(jìn)程在cpu上的運(yùn)行時(shí)間來(lái)判斷進(jìn)程是否已經(jīng)超過(guò)了期望時(shí)長(zhǎng),是否需要重新調(diào)度
}

周期性調(diào)度

周期性調(diào)度器定時(shí)調(diào)用特定調(diào)度器的task_tick函數(shù),用于定時(shí)更新當(dāng)前執(zhí)行進(jìn)程的執(zhí)行時(shí)間,同時(shí)跟蹤當(dāng)前進(jìn)程執(zhí)行時(shí)間是否超過(guò)期望的時(shí)長(zhǎng),如果超過(guò)則發(fā)起進(jìn)程調(diào)度請(qǐng)求(設(shè)置重調(diào)標(biāo)志TIF_NEED_RESCHED),核心調(diào)度器會(huì)在下一個(gè)適當(dāng)時(shí)機(jī)發(fā)起調(diào)度。

CFS的task_tick函數(shù)為task_tick_fair():

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }
}

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);//檢查當(dāng)前進(jìn)程運(yùn)行時(shí)長(zhǎng),判斷是否需要搶占
}

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;

    ideal_runtime = sched_slice(cfs_rq, curr);//計(jì)算當(dāng)前進(jìn)程可運(yùn)行的期望時(shí)長(zhǎng)
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;//進(jìn)程的運(yùn)行時(shí)長(zhǎng)
    //如果進(jìn)程運(yùn)行時(shí)長(zhǎng)已經(jīng)超過(guò)了期望時(shí)長(zhǎng),則發(fā)起調(diào)度請(qǐng)求
    if (delta_exec > ideal_runtime) {
        resched_task(rq_of(cfs_rq)->curr);//發(fā)起調(diào)度請(qǐng)求
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (!sched_feat(WAKEUP_PREEMPT))
        return;

    if (delta_exec < sysctl_sched_min_granularity)
        return;

    if (cfs_rq->nr_running > 1) {
        struct sched_entity *se = __pick_next_entity(cfs_rq);
        s64 delta = curr->vruntime - se->vruntime;

        if (delta > ideal_runtime)
            resched_task(rq_of(cfs_rq)->curr);
    }
}

static void resched_task(struct task_struct *p)
{
    int cpu;

    assert_raw_spin_locked(&task_rq(p)->lock);

    if (test_tsk_need_resched(p))
        return;

    set_tsk_need_resched(p);//設(shè)置重新調(diào)度標(biāo)志TIF_NEED_RESCHED

    cpu = task_cpu(p);
    if (cpu == smp_processor_id())
        return;

    /* NEED_RESCHED must be visible before we test polling */
    smp_mb();
    if (!tsk_is_polling(p))
        smp_send_reschedule(cpu);
}

喚醒搶占

當(dāng)調(diào)用try_to_wake_up()wake_up_new_task()函數(shù)來(lái)喚醒新進(jìn)程時(shí),內(nèi)核通過(guò)調(diào)用調(diào)度器的check_preempt_curr函數(shù)來(lái)檢查新的進(jìn)程是否可以搶占當(dāng)前運(yùn)行的進(jìn)程。

CFS通過(guò)check_preempt_wakeup()函數(shù)來(lái)完成搶占檢測(cè)的操作。

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se;
    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
    int sync = wake_flags & WF_SYNC;
    int scale = cfs_rq->nr_running >= sched_nr_latency;

    //如果新進(jìn)程是實(shí)時(shí)進(jìn)程,則立即請(qǐng)求重新調(diào)度
    if (unlikely(rt_prio(p->prio)))
        goto preempt;

    if (unlikely(p->sched_class != &fair_sched_class))
        return;

    if (unlikely(se == pse))
        return;

    if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
        set_next_buddy(pse);

    /*
     * We can come here with TIF_NEED_RESCHED already set from new task
     * wake up path.
     */
    //已經(jīng)設(shè)置了TIF_NEED_RESCHED標(biāo)志,則不用再設(shè)置了
    if (test_tsk_need_resched(curr))
        return;

    /*
     * Batch and idle tasks do not preempt (their preemption is driven by
     * the tick):
     */
    if (unlikely(p->policy != SCHED_NORMAL))
        return;

    /* Idle tasks are by definition preempted by everybody. */
    //如果當(dāng)前執(zhí)行的進(jìn)程是空閑進(jìn)程,則直接請(qǐng)求重新調(diào)度
    if (unlikely(curr->policy == SCHED_IDLE))
        goto preempt;

    if (sched_feat(WAKEUP_SYNC) && sync)
        goto preempt;

    if (sched_feat(WAKEUP_OVERLAP) &&
            se->avg_overlap < sysctl_sched_migration_cost &&
            pse->avg_overlap < sysctl_sched_migration_cost)
        goto preempt;

    if (!sched_feat(WAKEUP_PREEMPT))
        return;

    update_curr(cfs_rq);
    find_matching_se(&se, &pse);
    BUG_ON(!pse);
    //如果se->vruntime > pse->vruntime + gran,則請(qǐng)求重新調(diào)度
    //gran表示最小時(shí)間限額,用來(lái)確保進(jìn)程不至于切換得太頻繁而影響性能
    if (wakeup_preempt_entity(se, pse) == 1)
        goto preempt;

    return;

preempt:
    resched_task(curr);
    /*
     * Only set the backward buddy when the current task is still
     * on the rq. This can happen when a wakeup gets interleaved
     * with schedule on the ->pre_schedule() or idle_balance()
     * point, either of which can * drop the rq lock.
     *
     * Also, during early boot the idle thread is in the fair class,
     * for obvious reasons its a bad idea to schedule back to it.
     */
    if (unlikely(!se->on_rq || curr == rq->idle))
        return;

    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
        set_last_buddy(se);
}

進(jìn)程調(diào)度入口

schedule()函數(shù)

進(jìn)程調(diào)度向外界提供的功能接口就是schedule()函數(shù),它定義在kernel/sched.c文件中。
schedule()作為通用的進(jìn)程調(diào)度入口,隱藏了具體特定調(diào)度器類(lèi)調(diào)度細(xì)節(jié)。該函數(shù)通過(guò)調(diào)用pick_next_task()函數(shù)委托給特定調(diào)度類(lèi)來(lái)獲取優(yōu)化級(jí)最高的進(jìn)程,再調(diào)用context_switch()函數(shù)完成上下文切換。

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_sched_qs(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;

    release_kernel_lock(prev);
need_resched_nonpreemptible:

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock);
    update_rq_clock(rq);
    clear_tsk_need_resched(prev); //清除TIF_NEED_RESCHED標(biāo)志
    //如果當(dāng)前進(jìn)程處于睡眠狀態(tài)(TASK_RUNNING值為0)
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        //如果接收到信號(hào)(有未處理的信號(hào)),則將進(jìn)程設(shè)置為運(yùn)行狀態(tài)
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);//停止進(jìn)程并將其從就緒隊(duì)列移除
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    put_prev_task(rq, prev); //通過(guò)調(diào)度器當(dāng)前運(yùn)行進(jìn)程將要被另一個(gè)進(jìn)程代替
    next = pick_next_task(rq);//獲取優(yōu)化級(jí)最高的進(jìn)程(任務(wù))

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        //執(zhí)行上下文切換
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    if (unlikely(reacquire_kernel_lock(current) < 0)) {
        prev = rq->curr;
        switch_count = &prev->nivcsw;
        goto need_resched_nonpreemptible;
    }

    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}
EXPORT_SYMBOL(schedule);

pick_next_task()函數(shù)依次遍歷調(diào)度器類(lèi)鏈表,直到找到優(yōu)先級(jí)最高的進(jìn)程為止。

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    //如果隊(duì)列中所有的任務(wù)都屬于cfs調(diào)度器類(lèi)型,則直接調(diào)用相關(guān)的函數(shù)
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }
    //遍歷調(diào)度器類(lèi)鏈表,直到找到優(yōu)先級(jí)最高的進(jìn)程為止
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}

schedule()函數(shù)一般在如下情況被調(diào)用:

  • 從系統(tǒng)調(diào)用返回到用戶空間之前,檢查是否設(shè)置了重調(diào)標(biāo)志TIF_NEED_RESCHED
  • 從中斷處理程序返回到用戶空間之前
  • 從中斷處理程序退出,返回到內(nèi)核空間之前
  • 內(nèi)核代碼再次變?yōu)榭蓳屨紩r(shí)
  • 進(jìn)程在內(nèi)核中顯式調(diào)用了schedule()函數(shù)
  • 進(jìn)程在內(nèi)核中阻塞

上下文切換

上下文切換是指將一個(gè)可運(yùn)行的進(jìn)程切換到另一個(gè)進(jìn)程,切換過(guò)程包含虛擬內(nèi)存和寄存器狀態(tài)等信息的切換。該操作是由定義在kernel/sched.c文件中的context_switch()函數(shù)來(lái)完成。

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next)
{
    //......
    switch_mm(oldmm, mm, next);

    //......
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);

    //......
}

switch_mm()函數(shù)在<asm/mmu_context.h>文件中聲明,該函數(shù)負(fù)責(zé)將上一個(gè)進(jìn)程的虛擬內(nèi)存映射切換到新的進(jìn)程的虛擬內(nèi)存映射。

switch_to()函數(shù)在<asm/system.h>文件中聲明,該函數(shù)負(fù)責(zé)處理器狀態(tài)的切換。包括保存和恢復(fù)棧信息、處理器寄存器以及與特定體系架構(gòu)相關(guān)的狀態(tài)信息。

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容