==============================
RT-mutex implementation design
==============================

Copyright (c) 2006 Steven Rostedt

Licensed under the GNU Free Documentation License, Version 1.2


This document tries to describe the design of the rtmutex.c implementation.
It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/locking/rt-mutex.rst. This document does explain problems
that happen without this code, but only to provide the context needed to
understand what the code is actually doing.

The goal of this document is to help others understand the priority
inheritance (PI) algorithm that is used, as well as reasons for the
decisions that were made to implement PI in the manner that was done.


Unbounded Priority Inversion
----------------------------

Priority inversion is when a lower priority process executes while a higher
priority process wants to run. This happens for several reasons, and
most of the time it can't be helped. Anytime a high priority process wants
to use a resource that a lower priority process has (a mutex for example),
the high priority process must wait until the lower priority process is done
with the resource. This is a priority inversion. What we want to prevent
is something called unbounded priority inversion. That is when the high
priority process is prevented from running by a lower priority process for
an undetermined amount of time.

The classic example of unbounded priority inversion is where you have three
processes, let's call them processes A, B, and C, where A is the highest
priority process, C is the lowest, and B is in between. A tries to grab a lock
that C owns and must wait, letting C run to release the lock. But in the
meantime, B executes, and since B is of a higher priority than C, it preempts C,
but by doing so, it is in fact preempting A which is a higher priority process.
Now there's no way of knowing how long A will be sleeping waiting for C
to release the lock, because for all we know, B is a CPU hog and will
never give C a chance to release the lock. This is called unbounded priority
inversion.

Here's a little ASCII art to show the problem::

     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.


Priority Inheritance (PI)
-------------------------

There are several ways to solve this issue, but other ways are out of scope
for this document. Here we only discuss PI.

PI is where a process inherits the priority of another process if the other
process blocks on a lock owned by the current process. To make this easier
to understand, let's use the previous example, with processes A, B, and C again.

This time, when A blocks on the lock owned by C, C would inherit the priority
of A. So now if B becomes runnable, it would not preempt C, since C now has
the high priority of A. As soon as C releases the lock, it loses its
inherited priority, and A then can continue with the resource that C had.
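
The A, B, C scenario can be modelled in a few lines of C. This is only a toy
sketch, not code from rtmutex.c: the names toy_task, toy_mutex, toy_block_on
and toy_release are made up for illustration, and a lower prio value is
assumed to mean a higher priority.

```c
#include <stddef.h>

/* Hypothetical toy model; a lower prio value means a higher priority. */
struct toy_task {
	const char *name;
	int normal_prio;	/* priority the task was given */
	int prio;		/* effective priority, possibly inherited */
};

struct toy_mutex {
	struct toy_task *owner;	/* NULL when the mutex is not held */
};

/* 'waiter' blocks on 'm': the owner inherits the waiter's priority
 * if the waiter is more important than the owner currently is. */
static void toy_block_on(struct toy_mutex *m, struct toy_task *waiter)
{
	struct toy_task *owner = m->owner;

	if (owner && waiter->prio < owner->prio)
		owner->prio = waiter->prio;
}

/* The owner releases 'm' and loses its inherited priority.
 * Assumes the owner holds no other contended mutex. */
static void toy_release(struct toy_mutex *m)
{
	if (m->owner) {
		m->owner->prio = m->owner->normal_prio;
		m->owner = NULL;
	}
}
```

With A at prio 1, B at prio 5 and C at prio 9, calling toy_block_on() for A
on a mutex owned by C raises C to prio 1, so B can no longer preempt it.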

Terminology
-----------

Here I explain some terminology that is used in this document to help describe
the design that is used to implement PI.

PI chain
         - The PI chain is an ordered series of locks and processes that cause
           processes to inherit priorities from a previous process that is
           blocked on one of its locks. This is described in more detail
           later in this document.

mutex
         - In this document, to differentiate between locks that implement
           PI and spin locks that are used in the PI code, from now on
           the PI locks will be called mutexes.

lock
         - In this document from now on, I will use the term lock when
           referring to spin locks that are used to protect parts of the PI
           algorithm. These locks disable preemption on UP (when
           CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
           entering critical sections simultaneously.

spin lock
         - Same as lock above.

waiter
         - A waiter is a struct that is stored on the stack of a blocked
           process. Since the scope of the waiter is within the code for
           a process being blocked on the mutex, it is fine to allocate
           the waiter on the process's stack (local variable). This
           structure holds a pointer to the task, as well as the mutex that
           the task is blocked on. It also has rbtree node structures to
           place the task in the waiters rbtree of a mutex as well as the
           pi_waiters rbtree of a mutex owner task (described below).

           waiter is sometimes used in reference to the task that is waiting
           on a mutex. This is the same as waiter->task.

waiters
         - A list of processes that are blocked on a mutex.

top waiter
         - The highest priority process waiting on a specific mutex.

top pi waiter
              - The highest priority process waiting on one of the mutexes
                that a specific process owns.

Note:
       task and process are used interchangeably in this document, mostly to
       differentiate between two processes that are being described together.
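
To make the waiter definition above more concrete, here is a rough sketch of
what such a structure could look like. It is an illustration only: toy_node
is a stand-in for a real rbtree node, and the type and field names are
assumptions rather than the definitions used by rtmutex.c.

```c
#include <stddef.h>

struct toy_task { int prio; };
struct toy_mutex { struct toy_task *owner; };

/* Stand-in for an rbtree node; it only records the sort key. */
struct toy_node {
	struct toy_node *left, *right;
	int prio;
};

/* Hypothetical waiter, meant to live on the blocked task's stack. */
struct toy_waiter {
	struct toy_node tree_entry;	/* node in the mutex's waiters tree */
	struct toy_node pi_tree_entry;	/* node in the owner's pi_waiters tree */
	struct toy_task *task;		/* the task that is blocked */
	struct toy_mutex *lock;		/* the mutex it is blocked on */
};

/* Fill in a waiter for 'task' blocking on 'lock'. */
static void toy_waiter_init(struct toy_waiter *w, struct toy_task *task,
			    struct toy_mutex *lock)
{
	w->tree_entry = (struct toy_node){ NULL, NULL, task->prio };
	w->pi_tree_entry = (struct toy_node){ NULL, NULL, task->prio };
	w->task = task;
	w->lock = lock;
}
```

Because the waiter is a local variable of the blocking code path, it needs no
allocation and disappears as soon as the task stops waiting.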


PI chain
--------

The PI chain is a list of processes and mutexes that may cause priority
inheritance to take place. Multiple chains may converge, but a chain
would never diverge, since a process can't be blocked on more than one
mutex at a time.

Example::

   Process:  A, B, C, D, E
   Mutexes:  L1, L2, L3, L4

   A owns: L1
           B blocked on L1
           B owns L2
                  C blocked on L2
                  C owns L3
                         D blocked on L3
                         D owns L4
                                E blocked on L4

The chain would be::

   E->L4->D->L3->C->L2->B->L1->A

To show where two chains merge, we could add another process F and
another mutex L5 where B owns L5 and F is blocked on mutex L5.

The chain for F would be::

   F->L5->B->L1->A

Since a process may own more than one mutex, but never be blocked on more than
one, the chains merge.

Here we show both chains::

   E->L4->D->L3->C->L2-+
                       |
                       +->B->L1->A
                       |
                 F->L5-+

For PI to work, the processes at the right end of these chains (or we may
also call it the Top of the chain) must be equal to or higher in priority
than the processes to the left or below in the chain.

Also since a mutex may have more than one process blocked on it, we can
have multiple chains merge at mutexes. If we add another process G that is
blocked on mutex L2::

  G->L2->B->L1->A

And once again, to show how this can grow I will show the merging chains
again::

   E->L4->D->L3->C-+
                   +->L2-+
                   |     |
                 G-+     +->B->L1->A
                         |
                   F->L5-+

If process G has the highest priority in the chain, then all the tasks up
the chain (A and B in this example) must have their priorities increased
to that of G.
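
The boosting of A and B by G can be sketched as a simple walk towards the
top of the chain. This is a simplified model, not the chain walk used by
rtmutex.c: toy_propagate and the structures are hypothetical, lower prio
values mean higher priority, no locking is done, and the chain is assumed
to contain no cycle (no deadlock).

```c
#include <stddef.h>

struct toy_mutex;

struct toy_task {
	int prio;			/* lower value = higher priority */
	struct toy_mutex *blocked_on;	/* at most one mutex, or NULL */
};

struct toy_mutex {
	struct toy_task *owner;		/* NULL when not held */
};

/* Walk from 'waiter' towards the top of its PI chain and boost every
 * owner that is less important than the task blocked behind it.
 * Returns the number of tasks that were boosted. */
static int toy_propagate(struct toy_task *waiter)
{
	struct toy_task *task = waiter;
	int boosted = 0;

	while (task->blocked_on && task->blocked_on->owner) {
		struct toy_task *owner = task->blocked_on->owner;

		if (owner->prio > task->prio) {
			owner->prio = task->prio;
			boosted++;
		}
		task = owner;
	}
	return boosted;
}
```

For the chain G->L2->B->L1->A with G at prio 1, B at prio 5 and A at prio 7,
the walk boosts both B and A to prio 1.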

Mutex Waiters Tree
------------------

Every mutex keeps track of all the waiters that are blocked on it. The
mutex has an rbtree to store these waiters by priority. This tree is protected
by a spin lock that is located in the struct of the mutex. This lock is called
wait_lock.


Task PI Tree
------------

To keep track of the PI chains, each process has its own PI rbtree. This is
a tree of all top waiters of the mutexes that are owned by the process.
Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.

The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
at the top of this tree.

This tree is stored in the task structure of a process as an rbtree called
pi_waiters. It is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.


Depth of the PI Chain
---------------------

The maximum depth of the PI chain is not dynamic, and could actually be
defined. But it is very complex to figure out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
The following shows a locking order of L1->L2->L3, but may not actually
be directly nested that way::

  void func1(void)
  {
	mutex_lock(L1);

	/* do anything */

	mutex_unlock(L1);
  }

  void func2(void)
  {
	mutex_lock(L1);
	mutex_lock(L2);

	/* do something */

	mutex_unlock(L2);
	mutex_unlock(L1);
  }

  void func3(void)
  {
	mutex_lock(L2);
	mutex_lock(L3);

	/* do something else */

	mutex_unlock(L3);
	mutex_unlock(L2);
  }

  void func4(void)
  {
	mutex_lock(L3);

	/* do something again */

	mutex_unlock(L3);
  }

Now we add 4 processes that run each of these functions separately.
Processes A, B, C, and D which run functions func1, func2, func3 and func4
respectively, and such that D runs first and A last. With D being preempted
in func4 in the "do something again" area, we have a locking that follows::

  D owns L3
         C blocked on L3
         C owns L2
                B blocked on L2
                B owns L1
                       A blocked on L1

  And thus we have the chain A->L1->B->L2->C->L3->D.

This gives us a PI depth of 4 (four processes), but looking at any of the
functions individually, it seems as though they only have at most a locking
depth of two. So, although the locking depth is defined at compile time,
it is still very difficult to find all the possibilities of that depth.

Now since mutexes can be defined by user-land applications, we don't want a
DoS (denial of service) type of application that nests large amounts of
mutexes to create a large PI chain, and have the code holding spin locks
while looking at a large amount of data. So to prevent this, the
implementation not only implements a maximum lock depth, but also only holds
at most two different locks at a time, as it walks the PI chain. More about
this below.
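
The idea of a maximum lock depth can be illustrated with a bounded walk over
the chain from the example above. The sketch below is an assumption-laden
model: toy_chain_depth is a hypothetical helper, the limit is passed in by
the caller, and none of the spin locking of the real chain walk is shown.

```c
#include <stddef.h>

struct toy_mutex;

struct toy_task {
	struct toy_mutex *blocked_on;	/* at most one mutex, or NULL */
};

struct toy_mutex {
	struct toy_task *owner;		/* NULL when not held */
};

/* Count the tasks in the chain that starts at 'task'.
 * Returns -1 as soon as the chain is longer than 'max_depth', so the
 * walk is bounded even if user space builds a very long (or cyclic)
 * chain. */
static int toy_chain_depth(const struct toy_task *task, int max_depth)
{
	int depth = 1;	/* the starting task itself */

	while (task->blocked_on && task->blocked_on->owner) {
		if (++depth > max_depth)
			return -1;
		task = task->blocked_on->owner;
	}
	return depth;
}
```

For the chain A->L1->B->L2->C->L3->D this reports a depth of 4, and it gives
up with -1 if the caller only allows a depth of 3.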


Mutex owner and flags
---------------------

The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
have the task structure on at least a two byte alignment (and if this is
not true, the rtmutex.c code will be broken!), this allows for the least
significant bit to be used as a flag. Bit 0 is used as the "Has Waiters"
flag. It's set whenever there are waiters on a mutex.

See Documentation/locking/rt-mutex.rst for further details.
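
Packing a flag into bit 0 of an aligned pointer can be sketched as follows.
The helper names and the TOY_HAS_WAITERS constant are invented for this
example and are not the ones defined by the kernel; the sketch only assumes
that struct toy_task is at least two byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_HAS_WAITERS ((uintptr_t)1)	/* bit 0 of the owner word */

struct toy_task { int prio; };

struct toy_mutex {
	uintptr_t owner;	/* task pointer with the flag in bit 0 */
};

/* Return the owner with the flag bit masked off (NULL if unowned). */
static struct toy_task *toy_owner(const struct toy_mutex *m)
{
	return (struct toy_task *)(m->owner & ~TOY_HAS_WAITERS);
}

/* Return nonzero if the "Has Waiters" bit is set. */
static int toy_has_waiters(const struct toy_mutex *m)
{
	return (m->owner & TOY_HAS_WAITERS) != 0;
}

/* Store the owner pointer and the flag in a single word. */
static void toy_set_owner(struct toy_mutex *m, struct toy_task *task,
			  int has_waiters)
{
	m->owner = (uintptr_t)task | (has_waiters ? TOY_HAS_WAITERS : 0);
}
```

Keeping both pieces of state in one word is what lets a single atomic
operation on the owner field observe the owner and the flag together.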

cmpxchg Tricks
--------------

Some architectures implement an atomic cmpxchg (Compare and Exchange). This
is used (when applicable) to keep the fast path of grabbing and releasing
mutexes short.

cmpxchg is basically the following function performed atomically::

  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
	unsigned long T = *A;
	if (*A == *B) {
		*A = *C;
	}
	return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)

This is really nice to have, since it allows you to only update a variable
if the variable is what you expect it to be. You know if it succeeded if
the return value (the old value of A) is equal to B.

The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then this helps
considerably to keep the fast path short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it. This will also be explained
later in this document.
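
As a user-space illustration of this fast path, the sketch below uses the C11
atomic_compare_exchange_strong() in place of the kernel's cmpxchg. The
toy_fast_lock and toy_fast_unlock helpers are hypothetical, and the owner is
represented as a plain integer value whose bit 0 is assumed to be the
"Has Waiters" flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_mutex {
	_Atomic uintptr_t owner;	/* 0 = unowned; bit 0 = "Has Waiters" */
};

/* Fast lock: succeeds only if the owner word is exactly 0. */
static bool toy_fast_lock(struct toy_mutex *m, uintptr_t me)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong(&m->owner, &expected, me);
}

/* Fast unlock: succeeds only if the owner word is exactly 'me'. If the
 * "Has Waiters" bit is set the comparison fails, and the caller would
 * have to fall back to a slow path. */
static bool toy_fast_unlock(struct toy_mutex *m, uintptr_t me)
{
	uintptr_t expected = me;

	return atomic_compare_exchange_strong(&m->owner, &expected, 0);
}
```

Because the flag lives in the same word as the owner, a waiter that sets
bit 0 makes the owner's fast unlock fail without any extra check.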


Priority adjustments
--------------------

The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority. With the help of the pi_waiters of a
process it is rather easy to know what needs to be adjusted.

The functions implementing the task adjustments are rt_mutex_adjust_prio
and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.

rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task.
Since the pi_waiters of a task holds an order by priority of all the top
waiters of all the mutexes that the task owns, we simply need to compare the
top pi waiter to the task's own normal/deadline priority and take the higher
one. Then rt_mutex_setprio is called to adjust the priority of the task to the
new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
to implement the actual change in priority.

Note:
	For the "prio" field in task_struct, the lower the number, the
	higher the priority. A "prio" of 5 is of higher priority than a
	"prio" of 10.

It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
would decrease/unboost the priority of the task. That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
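
The comparison described above boils down to taking the more important of
two values. The following sketch shows that rule in isolation; it is not the
body of rt_mutex_adjust_prio, the structure and function names are made up,
and the top pi waiter is reduced to a pointer to its priority (NULL meaning
the pi_waiters tree is empty).

```c
#include <stddef.h>

struct toy_task {
	int normal_prio;		/* priority without any inheritance */
	int prio;			/* effective priority */
	const int *top_pi_waiter_prio;	/* NULL if there are no pi waiters */
};

/* Lower value = higher priority, so "take the higher priority" means
 * "take the smaller number". */
static int toy_effective_prio(const struct toy_task *t)
{
	if (t->top_pi_waiter_prio && *t->top_pi_waiter_prio < t->normal_prio)
		return *t->top_pi_waiter_prio;
	return t->normal_prio;
}

/* Boost or unboost 't' depending on its current top pi waiter. */
static void toy_adjust_prio(struct toy_task *t)
{
	int prio = toy_effective_prio(t);

	if (t->prio != prio)
		t->prio = prio;
}
```

The same call boosts the task when a more important waiter appears and
unboosts it when that waiter goes away, which is the symmetry the text
points out.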


High level overview of the PI chain walk
----------------------------------------

The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.

The implementation has gone through several iterations, and has ended up
with what we believe is the best. It walks the PI chain by only grabbing
at most two locks at a time, and is very efficient.

rt_mutex_adjust_prio_chain can be used either to boost or lower process
priorities.

rt_mutex_adjust_prio_chain is called with:

- a task to be checked for PI (de)boosting (the owner of a mutex that a
  process is blocking on),
- a flag to check for deadlocking,
- the mutex that the task owns,
- a pointer to a waiter that is the process's waiter struct that is blocked
  on the mutex (although this parameter may be NULL for deboosting),
- a pointer to the mutex on which the task is blocked, and
- a top_task as the top waiter of the mutex.

For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.

When this function is called, there are no locks held. That also means
that the state of the owner and lock can change once this function is entered.

Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
should be at, but the rbtree nodes of the task's waiter have not been updated
with the new priorities, and this task may not be in the proper locations
in the pi_waiters and waiters trees that the task is blocked on. This function
solves all that.

The main operation of this function is summarized by Thomas Gleixner in
rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
details.

Taking of a mutex (The walk through)
------------------------------------

OK, now let's take a look at the detailed walk through of what happens when
taking a mutex.

The first thing that is tried is the fast taking of the mutex. This is
done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.

If there is contention on the lock, we go about the slow path
(rt_mutex_slowlock).

The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
the task on the waiters tree of the mutex, and if need be, the pi_waiters
tree of the owner.

The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.

We then call try_to_take_rt_mutex. On architectures that do not implement
CMPXCHG, this is where the lock is always grabbed (if there's no
contention).

try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
the "Has Waiters" flag of the mutex's owner field. By setting this flag
now, the current owner of the mutex being contended for can't release the mutex
without going into the slow unlock path, and it would then need to grab the
wait_lock, which this code currently holds. So setting the "Has Waiters" flag
forces the current owner to synchronize with this code.

The lock is taken if the following are true:

   1) The lock has no owner
   2) The current task is the highest priority against all other
      waiters of the lock

If the task succeeds in acquiring the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
(highest priority task waiting on the lock) is added to this task's
pi_waiters tree.

If the lock is not taken by try_to_take_rt_mutex(), then the
task_blocks_on_rt_mutex() function is called. This will add the task to
the lock's waiter tree and propagate the pi chain of the lock as well
as the lock's owner's pi_waiters tree. This is described in the next
section.
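
The two conditions under which the lock is taken can be written down as a
small predicate. This is a loose sketch and not the logic of
try_to_take_rt_mutex itself: the names are hypothetical, the waiters tree is
reduced to a pointer to its top waiter, ties are assumed to go to the already
queued waiter, and the "Has Waiters" flag and pi_waiters bookkeeping are left
out.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_task { int prio; };	/* lower value = higher priority */

struct toy_mutex {
	struct toy_task *owner;			/* NULL when unowned */
	const struct toy_task *top_waiter;	/* NULL if nobody is queued */
};

/* Try to make 'task' the owner of 'm'. The caller is assumed to hold
 * whatever lock protects 'm' (wait_lock in the text above). */
static bool toy_try_to_take(struct toy_mutex *m, struct toy_task *task)
{
	/* 1) The lock must have no owner. */
	if (m->owner)
		return false;

	/* 2) The task must beat every other waiter; comparing against
	 *    the top waiter is enough. */
	if (m->top_waiter && m->top_waiter != task &&
	    m->top_waiter->prio <= task->prio)
		return false;

	m->owner = task;
	return true;
}
```

A low priority task that finds the mutex free but with a more important task
already queued therefore fails here and has to block like any other waiter.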
Steven Rostedt | a6537be | 2006-06-27 02:54:54 -0700 | [diff] [blame] | 468 | |
| 469 | Task blocks on mutex |
| 470 | -------------------- |
| 471 | |
| 472 | The accounting of a mutex and process is done with the waiter structure of |
| 473 | the process. The "task" field is set to the process, and the "lock" field |
Alex Shi | f1824df | 2017-07-31 09:50:53 +0800 | [diff] [blame] | 474 | to the mutex. The rbtree node of waiter are initialized to the processes |
| 475 | current priority. |
Steven Rostedt | a6537be | 2006-06-27 02:54:54 -0700 | [diff] [blame] | 476 | |
| 477 | Since the wait_lock was taken at the entry of the slow lock, we can safely |
Alex Shi | f1824df | 2017-07-31 09:50:53 +0800 | [diff] [blame] | 478 | add the waiter to the task waiter tree. If the current process is the |
| 479 | highest priority process currently waiting on this mutex, then we remove the |
| 480 | previous top waiter process (if it exists) from the pi_waiters of the owner, |
| 481 | and add the current process to that tree. Since the pi_waiter of the owner |
Steven Rostedt | a6537be | 2006-06-27 02:54:54 -0700 | [diff] [blame] | 482 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner |
| 483 | should adjust its priority accordingly. |
| 484 | |
If the owner is also blocked on a lock and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain() on the owner, as described earlier.

Now all locks are released, and if the current process is still blocked on a
mutex (waiter "task" field is not NULL), then we go to sleep (call schedule()).

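As an illustration only, the bookkeeping described in this section can be
sketched in user-space C. This is not the kernel code: the toy_* names and
helpers are made up for the example, priority-sorted linked lists stand in
for the rbtrees, the wait_lock and pi_lock are left out, and a lower prio
value is assumed to mean a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an rbtree node: a priority-sorted list link. */
struct toy_node {
	int prio;                      /* lower value = higher priority (assumed) */
	struct toy_node *next;
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_task {
	int normal_prio;               /* priority without any boosting */
	int prio;                      /* effective (possibly boosted) priority */
	struct toy_node *pi_waiters;   /* top waiters of the locks this task owns */
};

struct toy_lock {
	struct toy_task *owner;
	struct toy_node *waiters;      /* all waiters; head is the top waiter */
};

struct toy_waiter {
	struct toy_node tree_entry;    /* link in lock->waiters */
	struct toy_node pi_tree_entry; /* link in owner->pi_waiters */
	struct toy_task *task;
	struct toy_lock *lock;
};

/* Insert keeping the list sorted; equal priorities queue behind older ones. */
static void toy_enqueue(struct toy_node **head, struct toy_node *n)
{
	while (*head && (*head)->prio <= n->prio)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Unlink n from the list if it is present. */
static void toy_dequeue(struct toy_node **head, struct toy_node *n)
{
	while (*head && *head != n)
		head = &(*head)->next;
	if (*head)
		*head = n->next;
}

/* Boost (or unboost) a task to the priority of its top pi waiter. */
static void toy_adjust_prio(struct toy_task *t)
{
	int top = t->pi_waiters ? t->pi_waiters->prio : t->normal_prio;

	t->prio = top < t->normal_prio ? top : t->normal_prio;
}

static void toy_task_blocks_on(struct toy_lock *lock, struct toy_waiter *w,
			       struct toy_task *task)
{
	struct toy_node *old_top = lock->waiters;
	struct toy_task *owner = lock->owner;

	assert(owner != NULL);         /* we only block on an owned lock */

	/* Fill in the waiter: task, lock, and both nodes at the task's prio. */
	w->task = task;
	w->lock = lock;
	w->tree_entry.prio = task->prio;
	w->pi_tree_entry.prio = task->prio;
	toy_enqueue(&lock->waiters, &w->tree_entry);

	if (lock->waiters != &w->tree_entry)
		return;                /* not the new top waiter: owner unchanged */

	/* Replace the previous top waiter in the owner's pi_waiters. */
	if (old_top) {
		struct toy_waiter *old =
			toy_container_of(old_top, struct toy_waiter, tree_entry);
		toy_dequeue(&owner->pi_waiters, &old->pi_tree_entry);
	}
	toy_enqueue(&owner->pi_waiters, &w->pi_tree_entry);
	toy_adjust_prio(owner);
}
```

With an owner at priority 50, blocking a waiter at priority 30 and then one
at priority 10 leaves only the second waiter in the owner's pi_waiters list,
and the owner's effective priority moves to 30 and then to 10. Walking the
chain further, as rt_mutex_adjust_prio_chain() does, is not modelled here.
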
Waking up in the loop
---------------------

The task can then wake up for a couple of reasons:

  1) The previous lock owner released the lock, and the task is now the
     top waiter.
  2) The task received a signal or a timeout.

In both cases, the task will try again to acquire the lock. If it
succeeds, it will take itself off the waiters tree and set itself back
to the TASK_RUNNING state.

In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.

The second case only applies to tasks that take the mutex with a call
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, the task will try
to take the lock again. If it succeeds, the task will return with the
lock held; otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.

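The retry logic above can be sketched as a small user-space simulation.
Everything here is hypothetical: the toy_* names, the scripted list of
wakeups that replaces schedule(), and the TOY_STILL_BLOCKED value returned
when the script runs out. Only the order of the checks and the -EINTR and
-ETIMEDOUT results follow the description in this section.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_wake_reason { TOY_WAKE_UNLOCK, TOY_WAKE_SIGNAL, TOY_WAKE_TIMEOUT };

/* One scripted wakeup: why the task woke, and whether the lock is free. */
struct toy_wakeup {
	enum toy_wake_reason reason;
	bool lock_free;
};

/* Not a kernel return value: the script ended with the task still asleep. */
#define TOY_STILL_BLOCKED 1

/*
 * Returns 0 when the lock was taken, -EINTR or -ETIMEDOUT when the task
 * gave up, and counts in *sleeps how often the task went to sleep.
 */
static int toy_wait_loop(const struct toy_wakeup *script, size_t n,
			 size_t *sleeps)
{
	assert(sleeps != NULL);
	*sleeps = 0;

	for (size_t i = 0; i < n; i++) {
		(*sleeps)++;                   /* stands in for schedule() */

		/* Whatever woke us, the first thing we do is retry the lock. */
		if (script[i].lock_free)
			return 0;

		/* The retry failed: a signal or timeout ends the wait. */
		if (script[i].reason == TOY_WAKE_SIGNAL)
			return -EINTR;
		if (script[i].reason == TOY_WAKE_TIMEOUT)
			return -ETIMEDOUT;

		/* Woken by an unlock but another task got the lock: sleep again. */
	}
	return TOY_STILL_BLOCKED;
}
```

Note that a task woken by a timeout still returns 0 if the retry happens to
succeed, which is why the lock attempt comes before the reason is examined.
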
Unlocking the Mutex
-------------------

The unlocking of a mutex also has a fast path for those architectures with
CMPXCHG. Since the taking of a mutex on contention always sets the
"Has Waiters" flag in the mutex's owner field, we use this to know if we need
to take the slow path when unlocking the mutex. If the mutex doesn't have any
waiters, the owner field of the mutex would equal the current process and
the mutex can be unlocked by just replacing the owner field with NULL.

If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
the slow unlock path is taken.

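A minimal user-space sketch of this fast path, using C11 atomics in place
of the kernel's CMPXCHG helpers. The toy_* names and TOY_HAS_WAITERS are
assumptions made for the example; the only idea taken from the text is that
the flag lives in the low bit of the owner word, so the compare-and-exchange
fails whenever a waiter has set it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: low bit of the owner word means "Has Waiters". */
#define TOY_HAS_WAITERS ((uintptr_t)1)

struct toy_task {
	int prio;                      /* int alignment keeps the low bit free */
};

struct toy_mutex {
	_Atomic uintptr_t owner;       /* task pointer, possibly with the flag */
};

/*
 * Fast unlock: swap the owner word to 0 (NULL), but only if it is exactly
 * the current task with no flag set. Returns false when the caller would
 * have to fall back to the slow path.
 */
static bool toy_fast_unlock(struct toy_mutex *m, struct toy_task *me)
{
	uintptr_t expected = (uintptr_t)me;

	assert((expected & TOY_HAS_WAITERS) == 0);
	return atomic_compare_exchange_strong(&m->owner, &expected,
					      (uintptr_t)0);
}

/* What a contending task does before it blocks: flag the owner word. */
static void toy_mark_waiters(struct toy_mutex *m)
{
	atomic_fetch_or(&m->owner, TOY_HAS_WAITERS);
}
```

Once toy_mark_waiters() has run, the owner word no longer compares equal to
the bare task pointer, so toy_fast_unlock() fails without modifying the word
and the owner knows it has to go through the slow path.
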
The first thing done in the slow unlock path is to take the wait_lock of the
mutex. This synchronizes the locking and unlocking of the mutex.

A check is made to see if the mutex has waiters or not. On architectures that
do not have CMPXCHG, this is the location where the owner of the mutex will
determine if a waiter needs to be awoken or not. On architectures that
do have CMPXCHG, that check is done in the fast path, but it is still needed
in the slow path too. If a waiter of a mutex woke up because of a signal
or timeout between the time the owner failed the fast path CMPXCHG check and
the grabbing of the wait_lock, the mutex may not have any waiters, thus the
owner still needs to make this check. If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.

If there are waiters, then we need to wake one up.

In the wake up code, the pi_lock of the current owner is taken. The top
waiter of the lock is found and removed from the waiters tree of the mutex
as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
marked to prevent lower priority tasks from stealing the lock.

Finally we unlock the pi_lock of the current owner and wake up the top waiter.

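The slow unlock steps can be sketched in user-space C as follows. This is a
simplified model, not the kernel implementation: the toy_* names are
hypothetical, boolean fields stand in for the wait_lock and the
"Has Waiters" bit, a sorted list stands in for the waiters tree, and the
pi_lock handling and the removal from the owner's pi_waiters tree are
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	bool woken;                    /* set when the task has been woken up */
};

struct toy_waiter {
	struct toy_task *task;
	struct toy_waiter *next;       /* next waiter in priority order */
};

struct toy_lock {
	struct toy_task *owner;        /* NULL when there is no owner */
	bool has_waiters;              /* stand-in for the "Has Waiters" bit */
	struct toy_waiter *waiters;    /* priority-sorted; head is top waiter */
	bool wait_lock_held;           /* stand-in for the wait_lock */
};

/* Slow unlock: returns the task that was woken, or NULL if nobody waited. */
static struct toy_task *toy_slow_unlock(struct toy_lock *lock)
{
	struct toy_waiter *top;

	assert(!lock->wait_lock_held);
	lock->wait_lock_held = true;           /* take the wait_lock */

	top = lock->waiters;
	if (!top) {
		/* The waiter left (signal/timeout) after the fast path failed. */
		lock->owner = NULL;
		lock->has_waiters = false;
		lock->wait_lock_held = false;  /* release the wait_lock */
		return NULL;
	}

	/* Remove the top waiter from the lock's waiters. */
	lock->waiters = top->next;
	top->next = NULL;

	/*
	 * Drop ownership but keep the flag set, so another task cannot take
	 * the lock through its fast path before the woken task retries.
	 */
	lock->owner = NULL;
	lock->has_waiters = true;

	lock->wait_lock_held = false;          /* release the wait_lock */
	top->task->woken = true;               /* stands in for the wakeup */
	return top->task;
}
```

The woken task does not own the mutex yet in this model. It still has to
retry the lock itself, as described in "Waking up in the loop".
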
Contact
-------

For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>


Credits
-------

Author: Steven Rostedt <rostedt@goodmis.org>

Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017

Original Reviewers:
    Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
    Randy Dunlap

Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior

Updates
-------

This document was originally written for 2.6.17-rc3-mm1 and
was updated for 4.12.