Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	1	==============================
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	2	RT-mutex implementation design
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	3	==============================
				4
				5	Copyright (c) 2006 Steven Rostedt
				6
				7	Licensed under the GNU Free Documentation License, Version 1.2
				8
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	9
				10	This document tries to describe the design of the rtmutex.c implementation.
				11	It doesn't describe the reasons why rtmutex.c exists. For that please see
Documentation/locking/rt-mutex.rst. This document does explain problems
that happen without this code, but only to give the context needed to
understand what the code is actually doing.
				15
				16	The goal of this document is to help others understand the priority
				17	inheritance (PI) algorithm that is used, as well as reasons for the
				18	decisions that were made to implement PI in the manner that was done.
				19
				20
				21	Unbounded Priority Inversion
				22	----------------------------
				23
				24	Priority inversion is when a lower priority process executes while a higher
				25	priority process wants to run. This happens for several reasons, and
				26	most of the time it can't be helped. Anytime a high priority process wants
				27	to use a resource that a lower priority process has (a mutex for example),
				28	the high priority process must wait until the lower priority process is done
				29	with the resource. This is a priority inversion. What we want to prevent
				30	is something called unbounded priority inversion. That is when the high
				31	priority process is prevented from running by a lower priority process for
				32	an undetermined amount of time.
				33
Xishi Qiu	c79a8d8	2013-11-06 13:18:21 -0800	[diff] [blame]	34	The classic example of unbounded priority inversion is where you have three
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	35	processes, let's call them processes A, B, and C, where A is the highest
				36	priority process, C is the lowest, and B is in between. A tries to grab a lock
				37	that C owns and must wait and lets C run to release the lock. But in the
				38	meantime, B executes, and since B is of a higher priority than C, it preempts C,
				39	but by doing so, it is in fact preempting A which is a higher priority process.
				40	Now there's no way of knowing how long A will be sleeping waiting for C
				41	to release the lock, because for all we know, B is a CPU hog and will
				42	never give C a chance to release the lock. This is called unbounded priority
				43	inversion.
				44
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	45	Here's a little ASCII art to show the problem::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	46
     grab lock L1 (owned by C)
       |
  A ---+
          C preempted by B
            |
  C    +----+

  B         +-------->
                  B now keeps A from running.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	56
				57
				58	Priority Inheritance (PI)
				59	-------------------------
				60
				61	There are several ways to solve this issue, but other ways are out of scope
				62	for this document. Here we only discuss PI.
				63
				64	PI is where a process inherits the priority of another process if the other
				65	process blocks on a lock owned by the current process. To make this easier
				66	to understand, let's use the previous example, with processes A, B, and C again.
				67
				68	This time, when A blocks on the lock owned by C, C would inherit the priority
				69	of A. So now if B becomes runnable, it would not preempt C, since C now has
				70	the high priority of A. As soon as C releases the lock, it loses its
				71	inherited priority, and A then can continue with the resource that C had.
				72
				73	Terminology
				74	-----------
				75
				76	Here I explain some terminology that is used in this document to help describe
				77	the design that is used to implement PI.
				78
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	79	PI chain
				80	- The PI chain is an ordered series of locks and processes that cause
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	81	processes to inherit priorities from a previous process that is
				82	blocked on one of its locks. This is described in more detail
				83	later in this document.
				84
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	85	mutex
- In this document, to differentiate between locks that implement
  PI and spin locks that are used in the PI code, from now on
  the PI locks will be called mutexes.
				89
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	90	lock
				91	- In this document from now on, I will use the term lock when
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	92	referring to spin locks that are used to protect parts of the PI
algorithm. These locks disable preemption on UP (when
CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
entering critical sections simultaneously.
				96
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	97	spin lock
				98	- Same as lock above.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	99
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	100	waiter
				101	- A waiter is a struct that is stored on the stack of a blocked
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	102	process. Since the scope of the waiter is within the code for
				103	a process being blocked on the mutex, it is fine to allocate
				104	the waiter on the process's stack (local variable). This
				105	structure holds a pointer to the task, as well as the mutex that
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	106	the task is blocked on. It also has rbtree node structures to
				107	place the task in the waiters rbtree of a mutex as well as the
				108	pi_waiters rbtree of a mutex owner task (described below).
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	109
				110	waiter is sometimes used in reference to the task that is waiting
				111	on a mutex. This is the same as waiter->task.
				112
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	113	waiters
				114	- A list of processes that are blocked on a mutex.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	115
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	116	top waiter
				117	- The highest priority process waiting on a specific mutex.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	118
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	119	top pi waiter
				120	- The highest priority process waiting on one of the mutexes
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	121	that a specific process owns.
				122
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	123	Note:
				124	task and process are used interchangeably in this document, mostly to
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	125	differentiate between two processes that are being described together.
				126
				127
				128	PI chain
				129	--------
				130
				131	The PI chain is a list of processes and mutexes that may cause priority
				132	inheritance to take place. Multiple chains may converge, but a chain
				133	would never diverge, since a process can't be blocked on more than one
				134	mutex at a time.
				135
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	136	Example::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	137
				138	Process: A, B, C, D, E
				139	Mutexes: L1, L2, L3, L4
				140
				141	A owns: L1
				142	B blocked on L1
				143	B owns L2
				144	C blocked on L2
				145	C owns L3
				146	D blocked on L3
				147	D owns L4
				148	E blocked on L4
				149
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	150	The chain would be::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	151
				152	E->L4->D->L3->C->L2->B->L1->A
				153
				154	To show where two chains merge, we could add another process F and
				155	another mutex L5 where B owns L5 and F is blocked on mutex L5.
				156
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	157	The chain for F would be::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	158
				159	F->L5->B->L1->A
				160
				161	Since a process may own more than one mutex, but never be blocked on more than
				162	one, the chains merge.
				163
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	164	Here we show both chains::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	165
  E->L4->D->L3->C->L2-+
                      |
                      +->B->L1->A
                      |
                F->L5-+
				171
				172	For PI to work, the processes at the right end of these chains (or we may
				173	also call it the Top of the chain) must be equal to or higher in priority
				174	than the processes to the left or below in the chain.
				175
				176	Also since a mutex may have more than one process blocked on it, we can
				177	have multiple chains merge at mutexes. If we add another process G that is
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	178	blocked on mutex L2::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	179
				180	G->L2->B->L1->A
				181
				182	And once again, to show how this can grow I will show the merging chains
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	183	again::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	184
  E->L4->D->L3->C-+
                  +->L2-+
                  |     |
                G-+     +->B->L1->A
                        |
                  F->L5-+
				191
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	192	If process G has the highest priority in the chain, then all the tasks up
				193	the chain (A and B in this example), must have their priorities increased
				194	to that of G.
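
The boosting described above can be sketched in a few lines of C. This is
only an illustration under simplifying assumptions, not the rtmutex.c code:
the types pi_task and pi_mutex and the function pi_boost_chain are
hypothetical names, locking is ignored, and only boosting (not unboosting)
is shown.

```c
#include <stddef.h>

struct pi_mutex;

/* Hypothetical, simplified task: a lower prio value means higher priority. */
struct pi_task {
	int prio;                    /* effective priority */
	struct pi_mutex *blocked_on; /* at most one mutex, so chains never diverge */
};

struct pi_mutex {
	struct pi_task *owner;       /* NULL when the mutex is free */
};

/*
 * Walk task->mutex->owner->mutex->... starting at a newly blocked task
 * and boost every owner that currently has a lower priority.
 * Returns the number of tasks that were boosted.
 */
static int pi_boost_chain(struct pi_task *waiter)
{
	int boosted = 0;
	struct pi_mutex *m = waiter->blocked_on;

	while (m && m->owner) {
		struct pi_task *owner = m->owner;

		if (owner->prio <= waiter->prio)
			break;          /* owner already runs at least this high */
		owner->prio = waiter->prio;
		boosted++;
		m = owner->blocked_on;  /* continue up the chain */
	}
	return boosted;
}
```

With the chain G->L2->B->L1->A and G at the highest priority, the walk
boosts B and then A, and stops because A is not blocked on anything.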
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	195
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	196	Mutex Waiters Tree
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	197	------------------
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	198
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	199	Every mutex keeps track of all the waiters that are blocked on itself. The
				200	mutex has a rbtree to store these waiters by priority. This tree is protected
				201	by a spin lock that is located in the struct of the mutex. This lock is called
				202	wait_lock.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	203
				204
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	205	Task PI Tree
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	206	------------
				207
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	208	To keep track of the PI chains, each process has its own PI rbtree. This is
				209	a tree of all top waiters of the mutexes that are owned by the process.
				210	Note that this tree only holds the top waiters and not all waiters that are
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	211	blocked on mutexes owned by the process.
				212
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	213	The top of the task's PI tree is always the highest priority task that
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	214	is waiting on a mutex that is owned by the task. So if the task has
				215	inherited a priority, it will always be the priority of the task that is
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	216	at the top of this tree.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	217
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	218	This tree is stored in the task structure of a process as a rbtree called
				219	pi_waiters. It is protected by a spin lock also in the task structure,
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	220	called pi_lock. This lock may also be taken in interrupt context, so when
				221	locking the pi_lock, interrupts must be disabled.
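
To make the difference between a top waiter and the top pi waiter concrete,
here is a small sketch. It is an assumption-laden illustration: it uses
plain arrays instead of rbtrees, takes no locks, assumes non-negative
priority values, and the names sk_mutex, sk_task, top_waiter_prio and
top_pi_waiter_prio are hypothetical.

```c
#include <stddef.h>

#define MAX_WAITERS 8
#define MAX_OWNED   8

/* Hypothetical stand-ins; the kernel uses rbtrees, not arrays. */
struct sk_mutex {
	int waiter_prio[MAX_WAITERS]; /* prio of each task blocked on the mutex */
	size_t nr_waiters;
};

struct sk_task {
	const struct sk_mutex *owned[MAX_OWNED]; /* mutexes this task owns */
	size_t nr_owned;
};

/* Top waiter: best (numerically lowest) prio on one mutex, or -1 if none. */
static int top_waiter_prio(const struct sk_mutex *m)
{
	int best = -1;

	for (size_t i = 0; i < m->nr_waiters; i++)
		if (best < 0 || m->waiter_prio[i] < best)
			best = m->waiter_prio[i];
	return best;
}

/* Top pi waiter: best among the top waiters of every mutex the task owns. */
static int top_pi_waiter_prio(const struct sk_task *t)
{
	int best = -1;

	for (size_t i = 0; i < t->nr_owned; i++) {
		int p = top_waiter_prio(t->owned[i]);

		if (p >= 0 && (best < 0 || p < best))
			best = p;
	}
	return best;
}
```

Only one entry per owned mutex takes part in the second function, which
mirrors the point that pi_waiters holds top waiters rather than all waiters.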
				222
				223
				224	Depth of the PI Chain
				225	---------------------
				226
The maximum depth of the PI chain is not dynamic, and could actually be
defined. But it is very complex to figure out, since it depends on all
the nesting of mutexes. Let's look at the example where we have 3 mutexes,
				230	L1, L2, and L3, and four separate functions func1, func2, func3 and func4.
				231	The following shows a locking order of L1->L2->L3, but may not actually
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	232	be directly nested that way::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	233
  void func1(void)
  {
          mutex_lock(L1);

          /* do anything */

          mutex_unlock(L1);
  }

  void func2(void)
  {
          mutex_lock(L1);
          mutex_lock(L2);

          /* do something */

          mutex_unlock(L2);
          mutex_unlock(L1);
  }

  void func3(void)
  {
          mutex_lock(L2);
          mutex_lock(L3);

          /* do something else */

          mutex_unlock(L3);
          mutex_unlock(L2);
  }

  void func4(void)
  {
          mutex_lock(L3);

          /* do something again */

          mutex_unlock(L3);
  }
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	273
				274	Now we add 4 processes that run each of these functions separately.
				275	Processes A, B, C, and D which run functions func1, func2, func3 and func4
				276	respectively, and such that D runs first and A last. With D being preempted
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	277	in func4 in the "do something again" area, we have a locking that follows::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	278
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	279	D owns L3
				280	C blocked on L3
				281	C owns L2
				282	B blocked on L2
				283	B owns L1
				284	A blocked on L1
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	285
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	286	And thus we have the chain A->L1->B->L2->C->L3->D.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	287
				288	This gives us a PI depth of 4 (four processes), but looking at any of the
				289	functions individually, it seems as though they only have at most a locking
				290	depth of two. So, although the locking depth is defined at compile time,
				291	it still is very difficult to find the possibilities of that depth.
				292
Now since mutexes can be defined by user-land applications, we don't want a
denial-of-service (DoS) type of application that nests a large number of
mutexes to create a large PI chain, forcing the code to hold spin locks
while looking at a large amount of data. So to prevent this, the
implementation not only enforces a maximum lock depth, but also holds at most
two different locks at a time as it walks the PI chain. More about this below.
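
A depth limit on the chain walk might look like the following sketch. The
names dm_task, dm_mutex and chain_depth are hypothetical, and both the
max_depth parameter and the -EDEADLK return value are assumptions made for
illustration. The sketch does not model the "at most two locks at a time"
rule.

```c
#include <errno.h>
#include <stddef.h>

struct dm_task;

struct dm_mutex {
	struct dm_task *owner;       /* NULL when the mutex is free */
};

struct dm_task {
	struct dm_mutex *blocked_on; /* NULL when the task is not blocked */
};

/*
 * Follow task->blocked_on->owner links. Returns the number of tasks in
 * the chain, or -EDEADLK if more than max_depth tasks would be visited.
 */
static int chain_depth(const struct dm_task *t, int max_depth)
{
	int depth = 0;

	while (t) {
		if (++depth > max_depth)
			return -EDEADLK; /* refuse to walk an overly long chain */
		t = t->blocked_on ? t->blocked_on->owner : NULL;
	}
	return depth;
}
```

For the chain A->L1->B->L2->C->L3->D above, the walk visits four tasks, so
any limit below four makes it give up early.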
				299
				300
				301	Mutex owner and flags
				302	---------------------
				303
				304	The mutex structure contains a pointer to the owner of the mutex. If the
				305	mutex is not owned, this owner is set to NULL. Since all architectures
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	306	have the task structure on at least a two byte alignment (and if this is
				307	not true, the rtmutex.c code will be broken!), this allows for the least
				308	significant bit to be used as a flag. Bit 0 is used as the "Has Waiters"
				309	flag. It's set whenever there are waiters on a mutex.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	310
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	311	See Documentation/locking/rt-mutex.rst for further details.
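
The idea of keeping a flag in the least significant bit of the owner pointer
can be sketched as follows. The helper names (owner_encode, owner_task,
owner_has_waiters) and the fl_task type are hypothetical, and the sketch
assumes that converting a pointer to uintptr_t and back preserves it, which
holds on common platforms.

```c
#include <stdbool.h>
#include <stdint.h>

#define HAS_WAITERS ((uintptr_t)1)  /* bit 0 of the owner word */

struct fl_task {
	long pad;                   /* any type aligned to 2 bytes or more */
};

/* Pack an owner pointer and the "Has Waiters" flag into one word. */
static uintptr_t owner_encode(struct fl_task *owner, bool has_waiters)
{
	return (uintptr_t)owner | (has_waiters ? HAS_WAITERS : 0);
}

/* Recover the owner pointer by masking off the flag bit. */
static struct fl_task *owner_task(uintptr_t word)
{
	return (struct fl_task *)(word & ~HAS_WAITERS);
}

/* Test the flag bit. */
static bool owner_has_waiters(uintptr_t word)
{
	return (word & HAS_WAITERS) != 0;
}
```

Because the pointer is at least two-byte aligned, bit 0 of a valid owner is
always zero, so the flag never collides with address bits.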
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	312
				313	cmpxchg Tricks
				314	--------------
				315
				316	Some architectures implement an atomic cmpxchg (Compare and Exchange). This
				317	is used (when applicable) to keep the fast path of grabbing and releasing
				318	mutexes short.
				319
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	320	cmpxchg is basically the following function performed atomically::
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	321
  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
  {
          unsigned long T = *A;   /* remember the old value of A */
          if (*A == *B) {
                  *A = *C;        /* update only if A is what we expect */
          }
          return T;
  }
  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	331
				332	This is really nice to have, since it allows you to only update a variable
				333	if the variable is what you expect it to be. You know if it succeeded if
				334	the return value (the old value of A) is equal to B.
				335
The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
the architecture does not support CMPXCHG, then this macro is simply set
to fail every time. But if CMPXCHG is supported, then it keeps the fast
path very short.

The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
the system for architectures that support it. This will also be explained
later in this document.
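
As a rough user-space analogy, the fast paths can be sketched with C11
atomics. This is not the kernel's rt_mutex_cmpxchg macro: fp_mutex,
fast_trylock and fast_unlock are hypothetical names, and
atomic_compare_exchange_strong from <stdatomic.h> stands in for the
architecture's cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct fp_mutex {
	_Atomic uintptr_t owner; /* 0 means unowned; bit 0 is "Has Waiters" */
};

/* Fast lock: succeeds only if the owner word is exactly 0. */
static bool fast_trylock(struct fp_mutex *m, uintptr_t self)
{
	uintptr_t expected = 0;

	return atomic_compare_exchange_strong(&m->owner, &expected, self);
}

/*
 * Fast unlock: succeeds only if the owner word is exactly self. If a
 * waiter has set bit 0, the comparison fails and the caller would have
 * to fall back to a slow path.
 */
static bool fast_unlock(struct fp_mutex *m, uintptr_t self)
{
	uintptr_t expected = self;

	return atomic_compare_exchange_strong(&m->owner, &expected,
					      (uintptr_t)0);
}
```

Since the flag lives in the same word as the owner, a single compare and
exchange checks both "am I the owner" and "are there no waiters" at once.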
				344
				345
				346	Priority adjustments
				347	--------------------
				348
The implementation of the PI code in rtmutex.c has several places where a
process must adjust its priority. With the help of the pi_waiters of a
process, it is rather easy to know what needs to be adjusted.
				352
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	353	The functions implementing the task adjustments are rt_mutex_adjust_prio
				354	and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	355
rt_mutex_adjust_prio examines the priority of the task, and the highest
priority process that is waiting on any of the mutexes owned by the task. Since
				358	the pi_waiters of a task holds an order by priority of all the top waiters
				359	of all the mutexes that the task owns, we simply need to compare the top
				360	pi waiter to its own normal/deadline priority and take the higher one.
				361	Then rt_mutex_setprio is called to adjust the priority of the task to the
				362	new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
				363	to implement the actual change in priority.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	364
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	365	Note:
				366	For the "prio" field in task_struct, the lower the number, the
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	367	higher the priority. A "prio" of 5 is of higher priority than a
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	368	"prio" of 10.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	369
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	370	It is interesting to note that rt_mutex_adjust_prio can either increase
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	371	or decrease the priority of the task. In the case that a higher priority
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	372	process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	373	would increase/boost the task's priority. But if a higher priority task
				374	were for some reason to leave the mutex (timeout or signal), this same function
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	375	would decrease/unboost the priority of the task. That is because the pi_waiters
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	376	always contains the highest priority task that is waiting on a mutex owned
				377	by the task, so we only need to compare the priority of that top pi waiter
				378	to the normal priority of the given task.
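
The comparison described above reduces to taking the better of two values.
The following sketch is illustrative only: ap_task and adjust_prio are
hypothetical names, the top pi waiter is represented by a single integer
(-1 meaning that pi_waiters is empty), and deadline priorities are ignored.

```c
/* A lower value means a higher priority, as with "prio" in task_struct. */
struct ap_task {
	int normal_prio;        /* priority without any inheritance */
	int prio;               /* effective priority */
	int top_pi_waiter_prio; /* -1 when no task waits on an owned mutex */
};

/*
 * Recompute the effective priority of a task.
 * Returns 1 if the task was boosted, -1 if it was unboosted, 0 otherwise.
 */
static int adjust_prio(struct ap_task *t)
{
	int old_prio = t->prio;
	int new_prio = t->normal_prio;

	/* Inherit only if the top pi waiter outranks the normal priority. */
	if (t->top_pi_waiter_prio >= 0 && t->top_pi_waiter_prio < new_prio)
		new_prio = t->top_pi_waiter_prio;

	t->prio = new_prio;
	if (new_prio < old_prio)
		return 1;
	if (new_prio > old_prio)
		return -1;
	return 0;
}
```

The same function handles both directions: when a high priority waiter
arrives the result is a boost, and when that waiter leaves the task falls
back to its normal priority.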
				379
				380
				381	High level overview of the PI chain walk
				382	----------------------------------------
				383
				384	The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
				385
				386	The implementation has gone through several iterations, and has ended up
				387	with what we believe is the best. It walks the PI chain by only grabbing
				388	at most two locks at a time, and is very efficient.
				389
				390	The rt_mutex_adjust_prio_chain can be used either to boost or lower process
				391	priorities.
				392
				393	rt_mutex_adjust_prio_chain is called with a task to be checked for PI
				394	(de)boosting (the owner of a mutex that a process is blocking on), a flag to
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	395	check for deadlocking, the mutex that the task owns, a pointer to a waiter
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	396	that is the process's waiter struct that is blocked on the mutex (although this
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	397	parameter may be NULL for deboosting), a pointer to the mutex on which the task
				398	is blocked, and a top_task as the top waiter of the mutex.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	399
				400	For this explanation, I will not mention deadlock detection. This explanation
				401	will try to stay at a high level.
				402
When this function is called, there are no locks held. That also means
that the state of the owner and lock can change once this function is entered.
				405
				406	Before this function is called, the task has already had rt_mutex_adjust_prio
				407	performed on it. This means that the task is set to the priority that it
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	408	should be at, but the rbtree nodes of the task's waiter have not been updated
				409	with the new priorities, and this task may not be in the proper locations
				410	in the pi_waiters and waiters trees that the task is blocked on. This function
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	411	solves all that.
				412
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	413	The main operation of this function is summarized by Thomas Gleixner in
				414	rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
				415	details.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	416
				417	Taking of a mutex (The walk through)
				418	------------------------------------
				419
				420	OK, now let's take a look at the detailed walk through of what happens when
				421	taking a mutex.
				422
				423	The first thing that is tried is the fast taking of the mutex. This is
				424	done when we have CMPXCHG enabled (otherwise the fast taking automatically
				425	fails). Only when the owner field of the mutex is NULL can the lock be
				426	taken with the CMPXCHG and nothing else needs to be done.
				427
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	428	If there is contention on the lock, we go about the slow path
				429	(rt_mutex_slowlock).
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	430
				431	The slow path function is where the task's waiter structure is created on
				432	the stack. This is because the waiter structure is only needed for the
				433	scope of this function. The waiter structure holds the nodes to store
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	434	the task on the waiters tree of the mutex, and if need be, the pi_waiters
				435	tree of the owner.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	436
				437	The wait_lock of the mutex is taken since the slow path of unlocking the
				438	mutex also takes this lock.
				439
				440	We then call try_to_take_rt_mutex. This is where the architecture that
				441	does not implement CMPXCHG would always grab the lock (if there's no
				442	contention).
				443
				444	try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
				445	slow path. The first thing that is done here is an atomic setting of
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	446	the "Has Waiters" flag of the mutex's owner field. By setting this flag
				447	now, the current owner of the mutex being contended for can't release the mutex
				448	without going into the slow unlock path, and it would then need to grab the
				449	wait_lock, which this code currently holds. So setting the "Has Waiters" flag
				450	forces the current owner to synchronize with this code.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	451
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	452	The lock is taken if the following are true:
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	453
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	454	1) The lock has no owner
				455	2) The current task is the highest priority against all other
				456	waiters of the lock
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	457
If the task succeeds in acquiring the lock, then the task is set as the
owner of the lock, and if the lock still has waiters, the top_waiter
				460	(highest priority task waiting on the lock) is added to this task's
				461	pi_waiters tree.
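
The two conditions can be sketched as a small decision function. This is a
simplification with hypothetical names (tt_task, tt_mutex, try_to_take): the
top_other field stands for the highest priority waiter other than the
calling task, the wait_lock is assumed to be held by the caller, and the
treatment of equal priorities (the other waiter wins) is an assumption of
this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct tt_task {
	int prio;                  /* a lower value means a higher priority */
};

struct tt_mutex {
	struct tt_task *owner;     /* NULL when the mutex has no owner */
	struct tt_task *top_other; /* best waiter other than the caller, or NULL */
};

/* Returns true and records the new owner if the caller may take the mutex. */
static bool try_to_take(struct tt_mutex *m, struct tt_task *task)
{
	if (m->owner)
		return false;  /* 1) the lock must have no owner */

	if (m->top_other && m->top_other->prio <= task->prio)
		return false;  /* 2) another waiter ranks at least as high */

	m->owner = task;
	return true;
}
```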
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	462
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	463	If the lock is not taken by try_to_take_rt_mutex(), then the
				464	task_blocks_on_rt_mutex() function is called. This will add the task to
				465	the lock's waiter tree and propagate the pi chain of the lock as well
				466	as the lock's owner's pi_waiters tree. This is described in the next
				467	section.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	468
				469	Task blocks on mutex
				470	--------------------
				471
The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
to the mutex. The rbtree nodes of the waiter are initialized to the process's
current priority.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	476
Since the wait_lock was taken at the entry of the slow lock, we can safely
add the waiter to the waiters tree of the mutex. If the current process is the
				479	highest priority process currently waiting on this mutex, then we remove the
				480	previous top waiter process (if it exists) from the pi_waiters of the owner,
				481	and add the current process to that tree. Since the pi_waiter of the owner
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	482	has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
				483	should adjust its priority accordingly.
				484
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	485	If the owner is also blocked on a lock, and had its pi_waiters changed
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	486	(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
				487	and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
				488
				489	Now all locks are released, and if the current process is still blocked on a
				490	mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
				491
				492	Waking up in the loop
				493	---------------------
				494
The task can then wake up for a couple of reasons:

1) The previous lock owner released the lock, and the task is now the top_waiter
2) The task received a signal or timeout
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	498
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	499	In both cases, the task will try again to acquire the lock. If it
				500	does, then it will take itself off the waiters tree and set itself back
				501	to the TASK_RUNNING state.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	502
In the first case, if the lock was acquired by another task before this task
could get the lock, then it will go back to sleep and wait to be woken again.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	505
The second case is only applicable for tasks that are grabbing a mutex
that can wake up before getting the lock, either due to a signal or
a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, the task will try to
take the lock again. If it succeeds, the task will return with the
lock held; otherwise it will return with -EINTR if the task was woken
by a signal, or -ETIMEDOUT if it timed out.
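
The outcomes after a wakeup can be summarized in a small sketch. The enum
wake_reason, the SLEEP_AGAIN value and the function after_wakeup are
hypothetical and exist only to illustrate the decision; the real loop lives
in the slow path of rtmutex.c.

```c
#include <errno.h>
#include <stdbool.h>

enum wake_reason { WOKEN_BY_UNLOCK, WOKEN_BY_SIGNAL, WOKEN_BY_TIMEOUT };

#define SLEEP_AGAIN 1 /* hypothetical marker: go back to sleep and retry */

/*
 * Decide what the blocked task does after waking up and retrying the
 * lock: 0 means it now holds the mutex, a negative errno means it gives
 * up, and SLEEP_AGAIN means another task got the lock first.
 */
static int after_wakeup(bool got_lock, enum wake_reason why)
{
	if (got_lock)
		return 0;

	switch (why) {
	case WOKEN_BY_SIGNAL:
		return -EINTR;
	case WOKEN_BY_TIMEOUT:
		return -ETIMEDOUT;
	default:
		return SLEEP_AGAIN;
	}
}
```

Note that a successful retry wins over the wake reason: a task that was
signalled but still acquires the mutex returns with the lock held.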
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	512
				513
				514	Unlocking the Mutex
				515	-------------------
				516
				517	The unlocking of a mutex also has a fast path for those architectures with
				518	CMPXCHG. Since the taking of a mutex on contention always sets the
				519	"Has Waiters" flag of the mutex's owner, we use this to know if we need to
				520	take the slow path when unlocking the mutex. If the mutex doesn't have any
				521	waiters, the owner field of the mutex would equal the current process and
				522	the mutex can be unlocked by just replacing the owner field with NULL.
				523
				524	If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
				525	the slow unlock path is taken.
				526
				527	The first thing done in the slow unlock path is to take the wait_lock of the
				528	mutex. This synchronizes the locking and unlocking of the mutex.
				529
				530	A check is made to see if the mutex has waiters or not. On architectures that
				531	do not have CMPXCHG, this is the location that the owner of the mutex will
				532	determine if a waiter needs to be awoken or not. On architectures that
				533	do have CMPXCHG, that check is done in the fast path, but it is still needed
				534	in the slow path too. If a waiter of a mutex woke up because of a signal
				535	or timeout between the time the owner failed the fast path CMPXCHG check and
				536	the grabbing of the wait_lock, the mutex may not have any waiters, thus the
Jan Altenberg	9ba0bdf	2006-09-30 23:28:08 -0700	[diff] [blame]	537	owner still needs to make this check. If there are no waiters then the mutex
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	538	owner field is set to NULL, the wait_lock is released and nothing more is
				539	needed.
				540
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	541	If there are waiters, then we need to wake one up.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	542
In the wake-up code, the pi_lock of the current owner is taken. The top
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	544	waiter of the lock is found and removed from the waiters tree of the mutex
				545	as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
				546	marked to prevent lower priority tasks from stealing the lock.
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	547
				548	Finally we unlock the pi_lock of the pending owner and wake it up.
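
The slow unlock path described in this section can be sketched as follows.
It is a simplified model under several assumptions: su_task, su_mutex and
slow_unlock are hypothetical names, the waiters are kept in a small array
sorted by priority instead of an rbtree, the wait_lock and pi_lock are not
modeled, and the pi_waiters bookkeeping of the owner is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define HAS_WAITERS ((uintptr_t)1)
#define MAX_W 4

struct su_task {
	int prio;
};

struct su_mutex {
	uintptr_t owner;                /* owner pointer, bit 0 is "Has Waiters" */
	struct su_task *waiters[MAX_W]; /* sorted, highest priority first */
	size_t nr_waiters;
};

/* Returns the task to wake up, or NULL when nobody is waiting. */
static struct su_task *slow_unlock(struct su_mutex *m)
{
	struct su_task *top;

	/* The wait_lock of the mutex would be taken here. */
	if (m->nr_waiters == 0) {
		m->owner = 0;           /* no waiters: simply release */
		return NULL;
	}

	/* Remove the top waiter from the waiters collection. */
	top = m->waiters[0];
	for (size_t i = 1; i < m->nr_waiters; i++)
		m->waiters[i - 1] = m->waiters[i];
	m->nr_waiters--;

	/* No owner, but the flag keeps a fast-path lock attempt from succeeding. */
	m->owner = HAS_WAITERS;
	return top;
}
```

Leaving the flag set with no owner means that a task trying the fast path
sees a non-NULL owner word and has to go through the slow path, where
priorities are compared.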
				549
				550
				551	Contact
				552	-------
				553
				554	For updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
				555
				556
				557	Credits
				558	-------
				559
				560	Author: Steven Rostedt <rostedt@goodmis.org>
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	561
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	562	Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	563
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	564	Original Reviewers:
				565	Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	566	Randy Dunlap
Mauro Carvalho Chehab	387b146	2019-04-10 08:32:41 -0300	[diff] [blame]	567
Alex Shi	f1824df	2017-07-31 09:50:53 +0800	[diff] [blame]	568	Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
Steven Rostedt	a6537be	2006-06-27 02:54:54 -0700	[diff] [blame]	569
				570	Updates
				571	-------
				572
This document was originally written for 2.6.17-rc3-mm1 and
was updated for 4.12.