TL;DR

CFS picks the task with the smallest runtime to balance tasks’ runtime. Implemented by red-black tree with the leftmost node cached, picking the next task is !

Linux Implementation

Time Accounting

Task’s schedule info is defined in struct sched_entity. There is a vruntime field recording the virtual runtime of the task, which is the actual runtime normalized by the number of runnable tasks.

When update_curr(), a weighted delta will be added to the vruntime. The calculation defined in calc_delta_fair() and __calc_delta() is as follow:

where is the weight decided by nice value in sched_prio_to_weight[40] and is the weight of nice 0 (i.e. sched_prio_to_weight[20]).

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

Task Selection

Runnable tasks are placed in a red-black tree, with the vruntime as the key. Adding a runnable task is the same as inserting a node into that red-black tree, while removing a runnable task is the same as deleting. And these two operations are , where is the number of nodes.

Picking the next task to run is the same as picking the leftmost node in the tree, which has the smallest vruntime. This operation is so frequent that Linux makes an optimization which is caching the leftmost node and updating the cache while inserting and removing. By this optimization, picking the next task is !

Timeslice

Timeslice is how long a task runs. CFS calculates how long a task should run as a function of the total number of runnable tasks. Each task runs for a timeslice proportional to its weight divided by the total weight of all runnable threads.

The actual timeslice is called the targeted latency. Smaller latency yield better interactivity and a closer approximation to perfect multitasking, at the expense of higher switching costs and thus worse overall throughput.

CFS imposes a floor on the timeslice assigned to each task. This floor is called the minimum granularity which is 1 millisecond by default. Thus, even as the number of runnable tasks approaches infinity, each will run for at least 1 millisecond, to ensure there is a ceiling on the incurred switching costs.

Overall, a task will get timeslice as follow:

Scheduler Classes

In fact, CFS is not the only scheduler in Linux. There are other algorithms for scheduling different types of tasks (e.g. real-time tasks). To enable different, pluggable algorithms to coexist, Linux uses the modularity called scheduler classes.

The base schedule iterates over each scheduler class in order of priority. The highest priority scheduler class that has a runnable process wins, selecting who runs next. The CFS is registered as SCHED_NORMAL, which has the lowest priority.

Comparing with Traditional Scheduler

Modern process schedulers have two common concepts: priority and timeslice. Timeslice is how long a process runs. Processes with a higher priority run more frequently and/or (on many systems) receive a higher timeslice.

Priority in Unix is in the form of nice values, but in practice it leads to several pathological problem.

Mapping nice values onto absolute timeslices leads to frequent context switching of tasks with lower priority.

Let’s say we map nice 0 to timeslice of 100 ms, while nice 20 to 5 ms. When there are two tasks both has nice 0, each task has a timeslice of 100 ms and context-switching happens every 200 ms. But for two tasks with nice 20, even though each has a half of the CPU time, context-switching happens every 10 ms! Considering tasks with high nice value are usually CPU-bounded, this leads to a low overall throughput.
Mapping nice values onto timeslices in a arithmetic sequence leads to wildly different effects depending on the nice value.

Let’s say we map nice 0 to timeslice of 100 ms, nice 1 to 95 ms, etc. A task with nice 0 and another with nice 1 are almost the same, while a task with nice 18 (10 ms) has twice timeslices as another with nice 19 (5 ms).
Timeslices are limited by the timer tick.

The timeslice must be some integer multiple of the timer tick. The minimum timeslice has a floor of the period of the timer tick. The system timer limits the difference between two timeslices. Timeslices change with different timer ticks.
Optimition of running a freshly woken-up tasks immediately for interactice tasks might be abused.

The approach taken by CFS is a radical (for process schedulers) rethinking of timeslice allotment: Do away with timeslices completely and assign each process a proportion of the processor. CFS thus yields constant fairness but a variable switching rate.

The Completely Fair Scheduler