| 		   Goals, Design and Implementation of the | 
 | 		      new ultra-scalable O(1) scheduler | 
 |  | 
 |  | 
 |   This is an edited version of an email Ingo Molnar sent to | 
 |   lkml on 4 Jan 2002.  It describes the goals, design, and | 
 |   implementation of Ingo's new ultra-scalable O(1) scheduler. | 
 |   Last Updated: 18 April 2002. | 
 |  | 
 |  | 
 | Goal | 
 | ==== | 
 |  | 
 | The main goal of the new scheduler is to keep all the good things we know | 
 | and love about the current Linux scheduler: | 
 |  | 
 |  - good interactive performance even during high load: if the user | 
 |    types or clicks then the system must react instantly and must execute | 
 |    the user tasks smoothly, even during considerable background load. | 
 |  | 
 |  - good scheduling/wakeup performance with 1-2 runnable processes. | 
 |  | 
 |  - fairness: no process should stay without any timeslice for any | 
 |    unreasonable amount of time. No process should get an unjustly high | 
 |    amount of CPU time. | 
 |  | 
 |  - priorities: less important tasks can be started with lower priority, | 
 |    more important tasks with higher priority. | 
 |  | 
 |  - SMP efficiency: no CPU should stay idle if there is work to do. | 
 |  | 
 |  - SMP affinity: processes which run on one CPU should stay affine to | 
 |    that CPU. Processes should not bounce between CPUs too frequently. | 
 |  | 
 |  - plus additional scheduler features: RT scheduling, CPU binding. | 
 |  | 
 | and the goal is also to add a few new things: | 
 |  | 
 |  - fully O(1) scheduling. Are you tired of the recalculation loop | 
 |    blowing the L1 cache away every now and then? Do you think the goodness | 
 |    loop is taking a bit too long to finish if there are lots of runnable | 
 |    processes? This new scheduler takes no prisoners: wakeup(), schedule(), | 
 |    the timer interrupt are all O(1) algorithms. There is no recalculation | 
 |    loop. There is no goodness loop either. | 
 |  | 
 |  - 'perfect' SMP scalability. With the new scheduler there is no 'big' | 
 |    runqueue_lock anymore - it's all per-CPU runqueues and locks - two | 
 |    tasks on two separate CPUs can wake up, schedule and context-switch | 
 |    completely in parallel, without any interlocking. All | 
 |    scheduling-relevant data is structured for maximum scalability. | 
 |  | 
 |  - better SMP affinity. The old scheduler has a particular weakness that | 
 |    causes the random bouncing of tasks between CPUs if/when higher | 
 |    priority/interactive tasks, this was observed and reported by many | 
 |    people. The reason is that the timeslice recalculation loop first needs | 
 |    every currently running task to consume its timeslice. But when this | 
 |    happens on eg. an 8-way system, then this property starves an | 
 |    increasing number of CPUs from executing any process. Once the last | 
 |    task that has a timeslice left has finished using up that timeslice, | 
 |    the recalculation loop is triggered and other CPUs can start executing | 
 |    tasks again - after having idled around for a number of timer ticks. | 
 |    The more CPUs, the worse this effect. | 
 |  | 
 |    Furthermore, this same effect causes the bouncing effect as well: | 
 |    whenever there is such a 'timeslice squeeze' of the global runqueue, | 
 |    idle processors start executing tasks which are not affine to that CPU. | 
 |    (because the affine tasks have finished off their timeslices already.) | 
 |  | 
 |    The new scheduler solves this problem by distributing timeslices on a | 
 |    per-CPU basis, without having any global synchronization or | 
 |    recalculation. | 
 |  | 
 |  - batch scheduling. A significant proportion of computing-intensive tasks | 
 |    benefit from batch-scheduling, where timeslices are long and processes | 
 |    are roundrobin scheduled. The new scheduler does such batch-scheduling | 
 |    of the lowest priority tasks - so nice +19 jobs will get | 
 |    'batch-scheduled' automatically. With this scheduler, nice +19 jobs are | 
 |    in essence SCHED_IDLE, from an interactiveness point of view. | 
 |  | 
 |  - handle extreme loads more smoothly, without breakdown and scheduling | 
 |    storms. | 
 |  | 
 |  - O(1) RT scheduling. For those RT folks who are paranoid about the | 
 |    O(nr_running) property of the goodness loop and the recalculation loop. | 
 |  | 
 |  - run fork()ed children before the parent. Andrea has pointed out the | 
 |    advantages of this a few months ago, but patches for this feature | 
 |    do not work with the old scheduler as well as they should, | 
 |    because idle processes often steal the new child before the fork()ing | 
 |    CPU gets to execute it. | 
 |  | 
 |  | 
 | Design | 
 | ====== | 
 |  | 
 | The core of the new scheduler contains the following mechanisms: | 
 |  | 
 |  - *two* priority-ordered 'priority arrays' per CPU. There is an 'active' | 
 |    array and an 'expired' array. The active array contains all tasks that | 
 |    are affine to this CPU and have timeslices left. The expired array | 
 |    contains all tasks which have used up their timeslices - but this array | 
 |    is kept sorted as well. The active and expired array is not accessed | 
 |    directly, it's accessed through two pointers in the per-CPU runqueue | 
 |    structure. If all active tasks are used up then we 'switch' the two | 
 |    pointers and from now on the ready-to-go (former-) expired array is the | 
 |    active array - and the empty active array serves as the new collector | 
 |    for expired tasks. | 
 |  | 
 |  - there is a 64-bit bitmap cache for array indices. Finding the highest | 
 |    priority task is thus a matter of two x86 BSFL bit-search instructions. | 
 |  | 
 | the split-array solution enables us to have an arbitrary number of active | 
 | and expired tasks, and the recalculation of timeslices can be done | 
 | immediately when the timeslice expires. Because the arrays are always | 
 | access through the pointers in the runqueue, switching the two arrays can | 
 | be done very quickly. | 
 |  | 
 | this is a hybride priority-list approach coupled with roundrobin | 
 | scheduling and the array-switch method of distributing timeslices. | 
 |  | 
 |  - there is a per-task 'load estimator'. | 
 |  | 
 | one of the toughest things to get right is good interactive feel during | 
 | heavy system load. While playing with various scheduler variants i found | 
 | that the best interactive feel is achieved not by 'boosting' interactive | 
 | tasks, but by 'punishing' tasks that want to use more CPU time than there | 
 | is available. This method is also much easier to do in an O(1) fashion. | 
 |  | 
 | to establish the actual 'load' the task contributes to the system, a | 
 | complex-looking but pretty accurate method is used: there is a 4-entry | 
 | 'history' ringbuffer of the task's activities during the last 4 seconds. | 
 | This ringbuffer is operated without much overhead. The entries tell the | 
 | scheduler a pretty accurate load-history of the task: has it used up more | 
 | CPU time or less during the past N seconds. [the size '4' and the interval | 
 | of 4x 1 seconds was found by lots of experimentation - this part is | 
 | flexible and can be changed in both directions.] | 
 |  | 
 | the penalty a task gets for generating more load than the CPU can handle | 
 | is a priority decrease - there is a maximum amount to this penalty | 
 | relative to their static priority, so even fully CPU-bound tasks will | 
 | observe each other's priorities, and will share the CPU accordingly. | 
 |  | 
 | the SMP load-balancer can be extended/switched with additional parallel | 
 | computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs | 
 | can be supported easily by changing the load-balancer. Right now it's | 
 | tuned for my SMP systems. | 
 |  | 
 | i skipped the prev->mm == next->mm advantage - no workload i know of shows | 
 | any sensitivity to this. It can be added back by sacrificing O(1) | 
 | schedule() [the current and one-lower priority list can be searched for a | 
 | that->mm == current->mm condition], but costs a fair number of cycles | 
 | during a number of important workloads, so i wanted to avoid this as much | 
 | as possible. | 
 |  | 
 | - the SMP idle-task startup code was still racy and the new scheduler | 
 | triggered this. So i streamlined the idle-setup code a bit. We do not call | 
 | into schedule() before all processors have started up fully and all idle | 
 | threads are in place. | 
 |  | 
 | - the patch also cleans up a number of aspects of sched.c - moves code | 
 | into other areas of the kernel where it's appropriate, and simplifies | 
 | certain code paths and data constructs. As a result, the new scheduler's | 
 | code is smaller than the old one. | 
 |  | 
 | 	Ingo |