目標(biāo)
本章將討論Linux內(nèi)核是如何進(jìn)行進(jìn)程調(diào)度的,進(jìn)程調(diào)度程序(也稱(chēng)為調(diào)度器)的工作與實(shí)現(xiàn)原理。
進(jìn)程調(diào)度程序負(fù)責(zé)決定運(yùn)行哪個(gè)進(jìn)程、什么時(shí)候運(yùn)行、運(yùn)行多長(zhǎng)時(shí)間。進(jìn)程調(diào)度程序可以看作是一個(gè)負(fù)責(zé)給進(jìn)程分配有限處理器時(shí)間資源的子系統(tǒng)。一個(gè)合理的調(diào)度程序可以更充分的發(fā)揮系統(tǒng)資源的利用。
本文代碼是基于linux-2.6.34版本。
多任務(wù)與搶占
操作系統(tǒng)中可以同時(shí)并發(fā)地運(yùn)行多個(gè)進(jìn)程。當(dāng)進(jìn)程數(shù)大于處理器數(shù)量時(shí),在某一時(shí)刻,總有一些進(jìn)程沒(méi)有被cpu真正執(zhí)行。一部分進(jìn)程可能因?yàn)闆](méi)有分配到cpu,處于就緒狀態(tài);一部分進(jìn)程可能在等待I/O事件的發(fā)生被內(nèi)核阻塞著或者睡眠著,處于等待狀態(tài)。
在多任務(wù)模式下,為了避免某些進(jìn)程獨(dú)占資源導(dǎo)致其它進(jìn)程一直無(wú)法運(yùn)行或者有更緊急的進(jìn)程需要立即運(yùn)行,進(jìn)程調(diào)度程序需要強(qiáng)制將某些進(jìn)程掛起,讓其它等待的進(jìn)程運(yùn)行,這個(gè)過(guò)程被稱(chēng)為搶占。
調(diào)度策略
I/O消耗型進(jìn)程和CPU消耗型進(jìn)程
進(jìn)程可分為I/O消耗型和CPU消耗型。
- I/O消耗型:指大部分時(shí)間用來(lái)提供I/O請(qǐng)求或等待I/O請(qǐng)求
- CPU消耗型:大部分時(shí)間用于執(zhí)行代碼,沒(méi)有太多的I/O需求
I/O消耗型的進(jìn)程往往不需要太多的CPU資源,但期望有快速的I/O響應(yīng)。而CPU消耗型的進(jìn)程則期望占用更長(zhǎng)CPU時(shí)間。
進(jìn)程優(yōu)先級(jí)
Linux采用兩種不同的優(yōu)先級(jí)范圍:
- 用nice值:范圍從-20到+19,默認(rèn)值為0,nice值越大,優(yōu)先級(jí)越低。
- 實(shí)時(shí)優(yōu)先級(jí):范圍從0到99,值越大,優(yōu)先級(jí)越高。
時(shí)間片
時(shí)間片表示進(jìn)程在被搶占前持續(xù)運(yùn)行的時(shí)間。
時(shí)間片過(guò)長(zhǎng)會(huì)導(dǎo)致系統(tǒng)對(duì)交互式響應(yīng)(I/O型進(jìn)程)表現(xiàn)欠佳,時(shí)間片過(guò)短會(huì)增大進(jìn)程切換帶來(lái)的cpu消耗。
調(diào)度算法
調(diào)度器類(lèi)
Linux內(nèi)核通過(guò)抽象調(diào)度器的公共數(shù)據(jù)結(jié)構(gòu),以模塊化的方式組織所有的調(diào)度算法,保證多個(gè)調(diào)度算法可以并存。這個(gè)公共數(shù)據(jù)結(jié)構(gòu)稱(chēng)為調(diào)度器類(lèi)。每一種調(diào)度器類(lèi)調(diào)度屬于自己范疇的進(jìn)程。
調(diào)度器類(lèi)數(shù)據(jù)結(jié)構(gòu)struct sched_class定義在<linux/sched.h>文件中。
struct sched_class {
const struct sched_class *next; //所有調(diào)度器類(lèi)以鏈表連接,排在鏈表開(kāi)始的優(yōu)先級(jí)最高
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup,
bool head);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
void (*task_waking) (struct rq *this_rq, struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
const struct cpumask *newmask);
void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
#endif
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
void (*task_fork) (struct task_struct *p);
void (*switched_from) (struct rq *this_rq, struct task_struct *task,
int running);
void (*switched_to) (struct rq *this_rq, struct task_struct *task,
int running);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio, int running);
unsigned int (*get_rr_interval) (struct rq *rq,
struct task_struct *task);
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*moved_group) (struct task_struct *p, int on_rq);
#endif
};
每個(gè)進(jìn)程描述符都包含了調(diào)度器類(lèi)的成員指針(const struct sched_class *sched_class),表明該進(jìn)程所屬的調(diào)度器類(lèi)。
調(diào)度器類(lèi)數(shù)據(jù)結(jié)構(gòu)struct sched_class聲明了一個(gè)具體調(diào)度器類(lèi)實(shí)例對(duì)象需要實(shí)現(xiàn)的通用的操作接口:
-
enqueue_task:將進(jìn)程添加到就緒隊(duì)列中,在進(jìn)程從睡眠狀態(tài)變?yōu)榭蛇\(yùn)行狀態(tài)時(shí),調(diào)用該操作。 -
dequeue_task:將進(jìn)程從就緒隊(duì)列中移除。如果進(jìn)程從可運(yùn)行狀態(tài)轉(zhuǎn)換到不可運(yùn)行狀態(tài),調(diào)用該操作。 -
yield_task:當(dāng)進(jìn)程使用系統(tǒng)調(diào)用sched_yield自愿放棄對(duì)cpu的控制時(shí),調(diào)用該操作。 -
check_preempt_curr:?jiǎn)拘研逻M(jìn)程來(lái)?yè)屨籍?dāng)前進(jìn)程。例如當(dāng)fork新的進(jìn)程時(shí),會(huì)調(diào)用wake_up_new_task()函數(shù)來(lái)喚醒新的子進(jìn)程,就會(huì)調(diào)用該操作。 -
pick_next_task:選擇下一個(gè)將要運(yùn)行的進(jìn)程。 -
put_prev_task:該函數(shù)是在另一個(gè)進(jìn)程代替當(dāng)前運(yùn)行的進(jìn)程之前調(diào)用。主要負(fù)責(zé)底層的上下文切換。 -
set_curr_task:在進(jìn)程的調(diào)度策略發(fā)生變化時(shí),調(diào)用此操作。 -
task_tick:每次激活周期性調(diào)度器時(shí),由周期性調(diào)度器調(diào)用。 -
task_fork:每次fork出新進(jìn)程后,調(diào)用此操作通知調(diào)度器。
調(diào)度實(shí)體
調(diào)度器類(lèi)調(diào)度的對(duì)象被抽象為調(diào)度實(shí)體,調(diào)度實(shí)體的數(shù)據(jù)結(jié)構(gòu)中維護(hù)了與調(diào)度相關(guān)的統(tǒng)計(jì)信息。
調(diào)度實(shí)體的數(shù)據(jù)結(jié)構(gòu)在<linux/sched.h>中定義。
/*
* CFS stats for a schedulable entity (task, task-group etc)
*
* Current field usage histogram:
*
* 4 se->block_start
* 4 se->run_node
* 4 se->sleep_start
* 6 se->load.weight
*/
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq; //1-表示在就緒隊(duì)列中,0-未在就緒隊(duì)列中
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 last_wakeup;
u64 avg_overlap;
u64 nr_migrations;
u64 start_runtime;
u64 avg_wakeup;
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
u64 wait_count;
u64 wait_sum;
u64 iowait_count;
u64 iowait_sum;
u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;
u64 block_start;
u64 block_max;
u64 exec_max;
u64 slice_max;
u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
u64 nr_wakeups;
u64 nr_wakeups_sync;
u64 nr_wakeups_migrate;
u64 nr_wakeups_local;
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
};
進(jìn)程描述符結(jié)構(gòu)中包含了一個(gè)struct sched_entity的成員變量se,使進(jìn)程成為可調(diào)度實(shí)體。
就緒隊(duì)列
就緒隊(duì)列主要保存了所有活動(dòng)進(jìn)程。其數(shù)據(jù)結(jié)構(gòu)struct rq定義在kernel/sched.c文件中。
/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
struct rq {
/* runqueue lock: */
raw_spinlock_t lock;
/*
* nr_running and cpu_load should be in the same cacheline because
* remote CPUs use both these fields when doing load calculation.
*/
unsigned long nr_running; //隊(duì)列上可運(yùn)行的進(jìn)程總數(shù)
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX]; //用于跟蹤負(fù)載狀態(tài)
#ifdef CONFIG_NO_HZ
unsigned char in_nohz_recently;
#endif
/* capture load from *all* tasks on this cpu: */
struct load_weight load; //負(fù)載信息
unsigned long nr_load_updates;
u64 nr_switches;
struct cfs_rq cfs; //CFS調(diào)度類(lèi)就緒隊(duì)列
struct rt_rq rt; //實(shí)時(shí)調(diào)度類(lèi)就緒隊(duì)列
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this cpu: */
struct list_head leaf_cfs_rq_list;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
struct list_head leaf_rt_rq_list;
#endif
/*
* This is part of a global counter where only the total sum
* over all CPUs matters. A task can increase this counter on
* one CPU and if it got migrated afterwards it may decrease
* it on another CPU. Always updated under the runqueue lock:
*/
unsigned long nr_uninterruptible;
struct task_struct *curr, *idle; //curr-當(dāng)前運(yùn)行的進(jìn)程,idle-空閑進(jìn)程
unsigned long next_balance;
struct mm_struct *prev_mm;
u64 clock; //時(shí)鐘,周期性調(diào)度器時(shí)會(huì)定時(shí)更新該值
atomic_t nr_iowait;
#ifdef CONFIG_SMP
struct root_domain *rd;
struct sched_domain *sd;
unsigned char idle_at_tick;
/* For active balancing */
int post_schedule;
int active_balance;
int push_cpu;
/* cpu of this runqueue: */
int cpu;
int online;
unsigned long avg_load_per_task;
struct task_struct *migration_thread;
struct list_head migration_queue;
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
#endif
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
int hrtick_csd_pending;
struct call_single_data hrtick_csd;
#endif
struct hrtimer hrtick_timer;
#endif
#ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
unsigned long long rq_cpu_time;
/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
/* sys_sched_yield() stats */
unsigned int yld_count;
/* schedule() stats */
unsigned int sched_switch;
unsigned int sched_count;
unsigned int sched_goidle;
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
/* BKL stats */
unsigned int bkl_count;
#endif
};
每個(gè)CPU都有各自的就緒隊(duì)列。每一個(gè)活動(dòng)進(jìn)程只出現(xiàn)一個(gè)就緒隊(duì)列中。就緒隊(duì)列并沒(méi)有直接管理進(jìn)程,而是由嵌入的特定調(diào)度類(lèi)的就緒隊(duì)列來(lái)負(fù)責(zé)的,比如struct cfs_rq cfs和struct rt_rq rt。
內(nèi)核中的所有就緒隊(duì)列都在runqueues[]數(shù)組中,該數(shù)組中的每一個(gè)元素分別對(duì)應(yīng)于系統(tǒng)中一個(gè)cpu。該數(shù)組定義在kernel/sched.c文件中。
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
//針對(duì)runqueues的一些操作宏
#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))
#define this_rq() (&__get_cpu_var(runqueues))
#define task_rq(p) cpu_rq(task_cpu(p))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
完全公平調(diào)度類(lèi)CFS
CFS調(diào)度類(lèi)在kernel/sched_fair.c文件中定義。
/*
* All the scheduling class methods:
*/
static const struct sched_class fair_sched_class = {
.next = &idle_sched_class,
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.check_preempt_curr = check_preempt_wakeup,
.pick_next_task = pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,
.task_waking = task_waking_fair,
#endif
.set_curr_task = set_curr_task_fair,
.task_tick = task_tick_fair,
.task_fork = task_fork_fair,
.prio_changed = prio_changed_fair,
.switched_to = switched_to_fair,
.get_rr_interval = get_rr_interval_fair,
#ifdef CONFIG_FAIR_GROUP_SCHED
.moved_group = moved_group_fair,
#endif
};
CFS就緒隊(duì)列
CFS就緒隊(duì)列數(shù)據(jù)結(jié)構(gòu)struct cfs_rq嵌入在就緒隊(duì)列數(shù)據(jù)結(jié)構(gòu)struct rq中,提供給CFS調(diào)度算法來(lái)使用,在kernel/sched.c文件定義。
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load; //負(fù)載信息
unsigned long nr_running; //隊(duì)列上可運(yùn)行的進(jìn)程數(shù)目
u64 exec_clock;
u64 min_vruntime; //所有進(jìn)程中最小的虛擬運(yùn)行時(shí)間,該值是逐漸遞增的
struct rb_root tasks_timeline; //按時(shí)間排序的紅黑樹(shù)
struct rb_node *rb_leftmost; //指向樹(shù)上最左邊的節(jié)點(diǎn)(即最需要被調(diào)度的節(jié)點(diǎn))
struct list_head tasks;
struct list_head *balance_iterator;
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr, *next, *last;//curr-表示當(dāng)前正在執(zhí)行的調(diào)度實(shí)體
unsigned int nr_spread_over;
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */
/*
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
*
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
* list is used during load balance.
*/
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
#ifdef CONFIG_SMP
/*
* the part of load.weight contributed by tasks
*/
unsigned long task_weight;
/*
* h_load = weight * f(tg)
*
* Where f(tg) is the recursive weight fraction assigned to
* this group.
*/
unsigned long h_load;
/*
* this cpu's part of tg->shares
*/
unsigned long shares;
/*
* load.weight at the time we set shares
*/
unsigned long rq_weight;
#endif
#endif
};
虛擬時(shí)間
CFS調(diào)度依賴(lài)虛擬時(shí)間,虛擬時(shí)間用于度量進(jìn)程能夠得到的CPU時(shí)間。與虛擬時(shí)間相關(guān)的計(jì)算都是在update_curr()函數(shù)中執(zhí)行,該函數(shù)在系統(tǒng)中各個(gè)不同地方調(diào)用,包含周期性調(diào)度器之內(nèi)。
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_of(cfs_rq)->clock;
unsigned long delta_exec;
if (unlikely(!curr))
return;
/*
* Get the amount of time the current task was running
* since the last time we changed load (this cannot
* overflow on 32 bits):
*/
//計(jì)算當(dāng)前進(jìn)程從上次更新exec_start到現(xiàn)在的執(zhí)行時(shí)間差
delta_exec = (unsigned long)(now - curr->exec_start);
if (!delta_exec)
return;
__update_curr(cfs_rq, curr, delta_exec);//統(tǒng)計(jì)當(dāng)前進(jìn)程的運(yùn)行時(shí)間和vruntime
curr->exec_start = now;
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
}
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
unsigned long delta_exec)
{
unsigned long delta_exec_weighted;
schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));
curr->sum_exec_runtime += delta_exec;//累積總的運(yùn)行時(shí)間
schedstat_add(cfs_rq, exec_clock, delta_exec);
delta_exec_weighted = calc_delta_fair(delta_exec, curr);//計(jì)算虛擬時(shí)間權(quán)重增量
curr->vruntime += delta_exec_weighted;
update_min_vruntime(cfs_rq);//更新cfs_rq的min_vruntime
}
update_curr()函數(shù)主要作用是更新當(dāng)前進(jìn)程的累積物理運(yùn)行時(shí)間和虛擬時(shí)間,以及cfs_rq的min_vruntime。
虛擬時(shí)間權(quán)重增量delta_exec_weighted的計(jì)算公式:
delta_exec_weighted = delta_exec * (NICE_0_LOAD / curr->load.weight)
系統(tǒng)內(nèi)部定義了一個(gè)優(yōu)化級(jí)與權(quán)重之間的換算關(guān)系,優(yōu)化級(jí)越高(nice越低),權(quán)重就越大,每次計(jì)算虛擬時(shí)間權(quán)重時(shí),累積增量也就越小,進(jìn)程的vruntime就增長(zhǎng)的越緩慢。
vruntime增長(zhǎng)越快的進(jìn)程將會(huì)向紅黑樹(shù)右邊移動(dòng),則增長(zhǎng)越緩慢則向左邊移動(dòng)。這樣就可以保證優(yōu)化級(jí)越高的進(jìn)程會(huì)優(yōu)先被調(diào)度。
入隊(duì)
當(dāng)進(jìn)程變成可運(yùn)行狀態(tài)(被喚醒)或者通過(guò)fork()創(chuàng)建的新進(jìn)程時(shí),CFS通過(guò)入隊(duì)操作將進(jìn)程添加到紅黑樹(shù)中,并緩存最左葉子節(jié)點(diǎn)。
enqueue_task_fair()函數(shù)是CFS調(diào)度器類(lèi)入隊(duì)操作的入口,它調(diào)用enqueue_entity()函數(shù)來(lái)執(zhí)行入隊(duì)真正的入隊(duì)操作。
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, bool head)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int flags = 0;
if (wakeup)
flags |= ENQUEUE_WAKEUP;
if (p->state == TASK_WAKING)
flags |= ENQUEUE_MIGRATE;
for_each_sched_entity(se) {
if (se->on_rq) //如果已經(jīng)在就緒隊(duì)列中,則不用再入隊(duì)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
flags = ENQUEUE_WAKEUP;
}
hrtick_update(rq);
}
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update the normalized vruntime before updating min_vruntime
* through callig update_curr().
*/
if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
se->vruntime += cfs_rq->min_vruntime;
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);//更新當(dāng)前進(jìn)程的運(yùn)行時(shí)間和vruntime
//更新實(shí)體入隊(duì)的一些統(tǒng)計(jì)信息和狀態(tài)
//如累加cfs_rq->load,cfs_rq->nr_running++,se->on_rq = 1
account_entity_enqueue(cfs_rq, se);
if (flags & ENQUEUE_WAKEUP) {
place_entity(cfs_rq, se, 0);
enqueue_sleeper(cfs_rq, se);
}
update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)//如果需要入隊(duì)的實(shí)體并非當(dāng)前正在執(zhí)行的實(shí)體
__enqueue_entity(cfs_rq, se);//將實(shí)體添加到rbtree中
}
enqueue_entity()函數(shù)更新入隊(duì)實(shí)體的vruntime和其它一些統(tǒng)計(jì)數(shù)據(jù),然后再調(diào)用__enqueue_entity()函數(shù)將實(shí)體插入到rbtree中。
/*
* Enqueue an entity into the rb-tree:
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
s64 key = entity_key(cfs_rq, se);//key = se->vruntime - cfs_rq->min_vruntime
int leftmost = 1;
/*
* Find the right place in the rbtree:
* 查找插入新節(jié)點(diǎn)的合適位置
*/
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
if (key < entity_key(cfs_rq, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = 0;
}
}
/*
* Maintain a cache of leftmost tree entries (it is frequently
* used):
*/
//不存在key值比新添加的節(jié)點(diǎn)更小的節(jié)點(diǎn)
if (leftmost)
cfs_rq->rb_leftmost = &se->run_node;//如果新添加的節(jié)點(diǎn)的key最小,則緩存起來(lái)
rb_link_node(&se->run_node, parent, link);//將實(shí)體節(jié)點(diǎn)添加到紅黑樹(shù)中
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);//對(duì)添加的新節(jié)點(diǎn)著色平衡
}
從上面的算法可以看出,rbtime節(jié)點(diǎn)組織方式是左節(jié)點(diǎn)的key比右節(jié)點(diǎn)的key小。而key的值相當(dāng)于實(shí)體的vruntime值。vruntime值越小,則越向rbtree左邊移動(dòng)。rb_leftmost緩存了vruntime值最小的葉子節(jié)點(diǎn)。
出隊(duì)
當(dāng)進(jìn)程變成不可運(yùn)行狀態(tài)(被阻塞)或者終止時(shí),CFS將進(jìn)程從紅黑樹(shù)中移除。
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);//將實(shí)體從就緒隊(duì)列中移除
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
sleep = 1;
}
hrtick_update(rq);
}
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
update_stats_dequeue(cfs_rq, se);
if (sleep) {
#ifdef CONFIG_SCHEDSTATS
if (entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
if (tsk->state & TASK_INTERRUPTIBLE)
se->sleep_start = rq_of(cfs_rq)->clock;
if (tsk->state & TASK_UNINTERRUPTIBLE)
se->block_start = rq_of(cfs_rq)->clock;
}
#endif
}
clear_buddies(cfs_rq, se);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
//更新實(shí)體入隊(duì)的一些統(tǒng)計(jì)信息和狀態(tài)
//如累減cfs_rq->load,cfs_rq->nr_running--,se->on_rq = 0
account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);
/*
* Normalize the entity after updating the min_vruntime because the
* update can refer to the ->curr item and we need to reflect this
* movement in our normalized position.
*/
if (!sleep)
se->vruntime -= cfs_rq->min_vruntime;
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
//如果要移除的節(jié)點(diǎn)恰好也是緩存的最左節(jié)點(diǎn),則需要更新新的最左節(jié)點(diǎn)
if (cfs_rq->rb_leftmost == &se->run_node) {
struct rb_node *next_node;
next_node = rb_next(&se->run_node);
cfs_rq->rb_leftmost = next_node; //更新rb_leftmost
}
rb_erase(&se->run_node, &cfs_rq->tasks_timeline);//將節(jié)點(diǎn)從rbtree中移除
}
選擇下一個(gè)進(jìn)程
CFS選取下一個(gè)待運(yùn)行的進(jìn)程,是所有進(jìn)程中vruntime最小的那個(gè),也就是CFS緩存起來(lái)的最左的葉子節(jié)點(diǎn)。
當(dāng)前執(zhí)行的進(jìn)程不能存放在rbtree中,所以需要將進(jìn)程從rbtree中移除,并通過(guò)cfs_rq->curr成員來(lái)引用跟蹤。
static struct task_struct *pick_next_task_fair(struct rq *rq)
{
struct task_struct *p;
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
//如果隊(duì)列中沒(méi)有可運(yùn)行的進(jìn)程,則立即返回
if (!cfs_rq->nr_running)
return NULL;
do {
se = pick_next_entity(cfs_rq);//獲取一下待運(yùn)行的進(jìn)程
set_next_entity(cfs_rq, se);//將實(shí)體從rbtree中移除,并更新實(shí)體的開(kāi)始執(zhí)行時(shí)間
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
p = task_of(se);
hrtick_start_fair(rq, p);
return p;
}
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
struct sched_entity *se = __pick_next_entity(cfs_rq);
struct sched_entity *left = se;
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;
clear_buddies(cfs_rq, se);
return se;
}
static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *left = cfs_rq->rb_leftmost;//獲取紅黑樹(shù)上緩存的最左節(jié)點(diǎn)
if (!left)
return NULL;
return rb_entry(left, struct sched_entity, run_node);//獲取節(jié)點(diǎn)對(duì)應(yīng)的調(diào)度實(shí)體
}
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
* Any task has to be enqueued before it get to execute on
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);//將調(diào)度實(shí)體從就緒隊(duì)列中移出
}
update_stats_curr_start(cfs_rq, se);//更新進(jìn)程當(dāng)前開(kāi)始執(zhí)行的時(shí)間
cfs_rq->curr = se;
#ifdef CONFIG_SCHEDSTATS
/*
* Track our maximum slice length, if the CPU's load is at
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
se->slice_max = max(se->slice_max,
se->sum_exec_runtime - se->prev_sum_exec_runtime);
}
#endif
se->prev_sum_exec_runtime = se->sum_exec_runtime;//保存總的運(yùn)行時(shí)間,后面用于計(jì)算進(jìn)程在cpu上的運(yùn)行時(shí)間來(lái)判斷進(jìn)程是否已經(jīng)超過(guò)了期望時(shí)長(zhǎng),是否需要重新調(diào)度
}
周期性調(diào)度
周期性調(diào)度器定時(shí)調(diào)用特定調(diào)度器的task_tick函數(shù),用于定時(shí)更新當(dāng)前執(zhí)行進(jìn)程的執(zhí)行時(shí)間,同時(shí)跟蹤當(dāng)前進(jìn)程執(zhí)行時(shí)間是否超過(guò)期望的時(shí)長(zhǎng),如果超過(guò)則發(fā)起進(jìn)程調(diào)度請(qǐng)求(設(shè)置重調(diào)標(biāo)志TIF_NEED_RESCHED),核心調(diào)度器會(huì)在下一個(gè)適當(dāng)時(shí)機(jī)發(fā)起調(diào)度。
CFS的task_tick函數(shù)為task_tick_fair():
/*
* scheduler tick hitting a task of our scheduling class:
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
}
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued) {
resched_task(rq_of(cfs_rq)->curr);
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
check_preempt_tick(cfs_rq, curr);//檢查當(dāng)前進(jìn)程運(yùn)行時(shí)長(zhǎng),判斷是否需要搶占
}
/*
* Preempt the current task with a newly woken task if needed:
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
ideal_runtime = sched_slice(cfs_rq, curr);//計(jì)算當(dāng)前進(jìn)程可運(yùn)行的期望時(shí)長(zhǎng)
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;//進(jìn)程的運(yùn)行時(shí)長(zhǎng)
//如果進(jìn)程運(yùn)行時(shí)長(zhǎng)已經(jīng)超過(guò)了期望時(shí)長(zhǎng),則發(fā)起調(diào)度請(qǐng)求
if (delta_exec > ideal_runtime) {
resched_task(rq_of(cfs_rq)->curr);//發(fā)起調(diào)度請(qǐng)求
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
*/
clear_buddies(cfs_rq, curr);
return;
}
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
if (!sched_feat(WAKEUP_PREEMPT))
return;
if (delta_exec < sysctl_sched_min_granularity)
return;
if (cfs_rq->nr_running > 1) {
struct sched_entity *se = __pick_next_entity(cfs_rq);
s64 delta = curr->vruntime - se->vruntime;
if (delta > ideal_runtime)
resched_task(rq_of(cfs_rq)->curr);
}
}
static void resched_task(struct task_struct *p)
{
int cpu;
assert_raw_spin_locked(&task_rq(p)->lock);
if (test_tsk_need_resched(p))
return;
set_tsk_need_resched(p);//設(shè)置重新調(diào)度標(biāo)志TIF_NEED_RESCHED
cpu = task_cpu(p);
if (cpu == smp_processor_id())
return;
/* NEED_RESCHED must be visible before we test polling */
smp_mb();
if (!tsk_is_polling(p))
smp_send_reschedule(cpu);
}
喚醒搶占
當(dāng)調(diào)用try_to_wake_up()和wake_up_new_task()函數(shù)來(lái)喚醒新進(jìn)程時(shí),內(nèi)核通過(guò)調(diào)用調(diào)度器的check_preempt_curr函數(shù)來(lái)檢查新的進(jìn)程是否可以搶占當(dāng)前運(yùn)行的進(jìn)程。
CFS通過(guò)check_preempt_wakeup()函數(shù)來(lái)完成搶占檢測(cè)的操作。
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int sync = wake_flags & WF_SYNC;
int scale = cfs_rq->nr_running >= sched_nr_latency;
//如果新進(jìn)程是實(shí)時(shí)進(jìn)程,則立即請(qǐng)求重新調(diào)度
if (unlikely(rt_prio(p->prio)))
goto preempt;
if (unlikely(p->sched_class != &fair_sched_class))
return;
if (unlikely(se == pse))
return;
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
set_next_buddy(pse);
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*/
//已經(jīng)設(shè)置了TIF_NEED_RESCHED標(biāo)志,則不用再設(shè)置了
if (test_tsk_need_resched(curr))
return;
/*
* Batch and idle tasks do not preempt (their preemption is driven by
* the tick):
*/
if (unlikely(p->policy != SCHED_NORMAL))
return;
/* Idle tasks are by definition preempted by everybody. */
//如果當(dāng)前執(zhí)行的進(jìn)程是空閑進(jìn)程,則直接請(qǐng)求重新調(diào)度
if (unlikely(curr->policy == SCHED_IDLE))
goto preempt;
if (sched_feat(WAKEUP_SYNC) && sync)
goto preempt;
if (sched_feat(WAKEUP_OVERLAP) &&
se->avg_overlap < sysctl_sched_migration_cost &&
pse->avg_overlap < sysctl_sched_migration_cost)
goto preempt;
if (!sched_feat(WAKEUP_PREEMPT))
return;
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
//如果se->vruntime > pse->vruntime + gran,則請(qǐng)求重新調(diào)度
//gran表示最小時(shí)間限額,用來(lái)確保進(jìn)程不至于切換得太頻繁而影響性能
if (wakeup_preempt_entity(se, pse) == 1)
goto preempt;
return;
preempt:
resched_task(curr);
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);
}
進(jìn)程調(diào)度入口
schedule()函數(shù)
進(jìn)程調(diào)度向外界提供的功能接口就是schedule()函數(shù),它定義在kernel/sched.c文件中。
schedule()作為通用的進(jìn)程調(diào)度入口,隱藏了具體特定調(diào)度器類(lèi)調(diào)度細(xì)節(jié)。該函數(shù)通過(guò)調(diào)用pick_next_task()函數(shù)委托給特定調(diào)度類(lèi)來(lái)獲取優(yōu)化級(jí)最高的進(jìn)程,再調(diào)用context_switch()函數(shù)完成上下文切換。
/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq *rq;
int cpu;
need_resched:
preempt_disable();
cpu = smp_processor_id();
rq = cpu_rq(cpu);
rcu_sched_qs(cpu);
prev = rq->curr;
switch_count = &prev->nivcsw;
release_kernel_lock(prev);
need_resched_nonpreemptible:
schedule_debug(prev);
if (sched_feat(HRTICK))
hrtick_clear(rq);
raw_spin_lock_irq(&rq->lock);
update_rq_clock(rq);
clear_tsk_need_resched(prev); //清除TIF_NEED_RESCHED標(biāo)志
//如果當(dāng)前進(jìn)程處于睡眠狀態(tài)(TASK_RUNNING值為0)
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
//如果接收到信號(hào)(有未處理的信號(hào)),則將進(jìn)程設(shè)置為運(yùn)行狀態(tài)
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
else
deactivate_task(rq, prev, 1);//停止進(jìn)程并將其從就緒隊(duì)列移除
switch_count = &prev->nvcsw;
}
pre_schedule(rq, prev);
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
put_prev_task(rq, prev); //通過(guò)調(diào)度器當(dāng)前運(yùn)行進(jìn)程將要被另一個(gè)進(jìn)程代替
next = pick_next_task(rq);//獲取優(yōu)化級(jí)最高的進(jìn)程(任務(wù))
if (likely(prev != next)) {
sched_info_switch(prev, next);
perf_event_task_sched_out(prev, next);
rq->nr_switches++;
rq->curr = next;
++*switch_count;
//執(zhí)行上下文切換
context_switch(rq, prev, next); /* unlocks the rq */
/*
* the context switch might have flipped the stack from under
* us, hence refresh the local variables.
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
} else
raw_spin_unlock_irq(&rq->lock);
post_schedule(rq);
if (unlikely(reacquire_kernel_lock(current) < 0)) {
prev = rq->curr;
switch_count = &prev->nivcsw;
goto need_resched_nonpreemptible;
}
preempt_enable_no_resched();
if (need_resched())
goto need_resched;
}
EXPORT_SYMBOL(schedule);
pick_next_task()函數(shù)依次遍歷調(diào)度器類(lèi)鏈表,直到找到優(yōu)先級(jí)最高的進(jìn)程為止。
/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
const struct sched_class *class;
struct task_struct *p;
/*
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
*/
//如果隊(duì)列中所有的任務(wù)都屬于cfs調(diào)度器類(lèi)型,則直接調(diào)用相關(guān)的函數(shù)
if (likely(rq->nr_running == rq->cfs.nr_running)) {
p = fair_sched_class.pick_next_task(rq);
if (likely(p))
return p;
}
//遍歷調(diào)度器類(lèi)鏈表,直到找到優(yōu)先級(jí)最高的進(jìn)程為止
class = sched_class_highest;
for ( ; ; ) {
p = class->pick_next_task(rq);
if (p)
return p;
/*
* Will never be NULL as the idle class always
* returns a non-NULL p:
*/
class = class->next;
}
}
schedule()函數(shù)一般在如下情況被調(diào)用:
- 從系統(tǒng)調(diào)用返回到用戶空間之前,檢查是否設(shè)置了重調(diào)標(biāo)志
TIF_NEED_RESCHED - 從中斷處理程序返回到用戶空間之前
- 從中斷處理程序退出,返回到內(nèi)核空間之前
- 內(nèi)核代碼再次變?yōu)榭蓳屨紩r(shí)
- 進(jìn)程在內(nèi)核中顯式調(diào)用了
schedule()函數(shù) - 進(jìn)程在內(nèi)核中阻塞
上下文切換
上下文切換是指將一個(gè)可運(yùn)行的進(jìn)程切換到另一個(gè)進(jìn)程,切換過(guò)程包含虛擬內(nèi)存和寄存器狀態(tài)等信息的切換。該操作是由定義在kernel/sched.c文件中的context_switch()函數(shù)來(lái)完成。
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
//......
switch_mm(oldmm, mm, next);
//......
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
//......
}
switch_mm()函數(shù)在<asm/mmu_context.h>文件中聲明,該函數(shù)負(fù)責(zé)將上一個(gè)進(jìn)程的虛擬內(nèi)存映射切換到新的進(jìn)程的虛擬內(nèi)存映射。
switch_to()函數(shù)在<asm/system.h>文件中聲明,該函數(shù)負(fù)責(zé)處理器狀態(tài)的切換。包括保存和恢復(fù)棧信息、處理器寄存器以及與特定體系架構(gòu)相關(guān)的狀態(tài)信息。