Blame - Documentation/memory-barriers.txt - linux

David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1	============================
				2	LINUX KERNEL MEMORY BARRIERS
				3	============================
				4
				5	By: David Howells <dhowells@redhat.com>
David Howells	90fddab	2010-03-24 09:43:00 +0000	[diff] [blame]	6	Paul E. McKenney <paulmck@linux.vnet.ibm.com>
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	7
				8	Contents:
				9
				10	(*) Abstract memory access model.
				11
				12	- Device operations.
				13	- Guarantees.
				14
				15	(*) What are memory barriers?
				16
				17	- Varieties of memory barrier.
				18	- What may not be assumed about memory barriers?
				19	- Data dependency barriers.
				20	- Control dependencies.
				21	- SMP barrier pairing.
				22	- Examples of memory barrier sequences.
David Howells	670bd95	2006-06-10 09:54:12 -0700	[diff] [blame]	23	- Read memory barriers vs load speculation.
Paul E. McKenney	241e666	2011-02-10 16:54:50 -0800	[diff] [blame]	24	- Transitivity
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	25
				26	(*) Explicit kernel barriers.
				27
				28	- Compiler barrier.
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	29	- CPU memory barriers.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	30	- MMIO write barrier.
				31
				32	(*) Implicit kernel memory barriers.
				33
				34	- Locking functions.
				35	- Interrupt disabling functions.
David Howells	50fa610	2009-04-28 15:01:38 +0100	[diff] [blame]	36	- Sleep and wake-up functions.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	37	- Miscellaneous functions.
				38
				39	(*) Inter-CPU locking barrier effects.
				40
				41	- Locks vs memory accesses.
				42	- Locks vs I/O accesses.
				43
				44	(*) Where are memory barriers needed?
				45
				46	- Interprocessor interaction.
				47	- Atomic operations.
				48	- Accessing devices.
				49	- Interrupts.
				50
				51	(*) Kernel I/O barrier effects.
				52
				53	(*) Assumed minimum execution ordering model.
				54
				55	(*) The effects of the cpu cache.
				56
				57	- Cache coherency.
				58	- Cache coherency vs DMA.
				59	- Cache coherency vs MMIO.
				60
				61	(*) The things CPUs get up to.
				62
				63	- And then there's the Alpha.
				64
David Howells	90fddab	2010-03-24 09:43:00 +0000	[diff] [blame]	65	(*) Example uses.
				66
				67	- Circular buffers.
				68
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	69	(*) References.
				70
				71
				72	============================
				73	ABSTRACT MEMORY ACCESS MODEL
				74	============================
				75
				76	Consider the following abstract model of the system:
				77
		            :                :
		            :                :
		            :                :
		+-------+   :   +--------+   :   +-------+
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		| CPU 1 |<----->| Memory |<----->| CPU 2 |
		|       |   :   |        |   :   |       |
		|       |   :   |        |   :   |       |
		+-------+   :   +--------+   :   +-------+
		    ^       :       ^        :       ^
		    |       :       |        :       |
		    |       :       |        :       |
		    |       :       v        :       |
		    |       :   +--------+   :       |
		    |       :   |        |   :       |
		    |       :   |        |   :       |
		    +---------->| Device |<----------+
		            :   |        |   :
		            :   |        |   :
		            :   +--------+   :
		            :                :
				100
				101	Each CPU executes a program that generates memory access operations. In the
				102	abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
				103	perform the memory operations in any order it likes, provided program causality
				104	appears to be maintained. Similarly, the compiler may also arrange the
				105	instructions it emits in any order it likes, provided it doesn't affect the
				106	apparent operation of the program.
				107
				108	So in the above diagram, the effects of the memory operations performed by a
				109	CPU are perceived by the rest of the system as the operations cross the
				110	interface between the CPU and rest of the system (the dotted lines).
				111
				112
				113	For example, consider the following sequence of events:
				114
	CPU 1		CPU 2
	===============	===============
	{ A == 1; B == 2 }
	A = 3;		x = A;
	B = 4;		y = B;
				120
				121	The set of accesses as seen by the memory system in the middle can be arranged
				122	in 24 different combinations:
				123
				124	STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4
				125	STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3
				126	STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4
				127	STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4
				128	STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3
				129	STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4
				130	STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4
				131	STORE B=4, ...
				132	...
				133
				134	and can thus result in four different combinations of values:
				135
				136	x == 1, y == 2
				137	x == 1, y == 4
				138	x == 3, y == 2
				139	x == 3, y == 4
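
The count of orderings and outcomes above can be checked mechanically. The
following is a minimal sketch in plain user-space C, not kernel code. It
assumes, as the text does, that all four operations may reach the memory
system in any order. The helper names (replay, visit_orderings,
count_orderings, count_outcomes) are invented for this illustration.

```c
#include <assert.h>

/* The four memory operations of the example. */
enum { STORE_A, STORE_B, LOAD_A, LOAD_B, NR_OPS };

/* Replay one ordering against the memory system; report x and y. */
static void replay(const int order[NR_OPS], int *x, int *y)
{
	int A = 1, B = 2;		/* initial state: { A == 1; B == 2 } */

	for (int i = 0; i < NR_OPS; i++) {
		switch (order[i]) {
		case STORE_A: A = 3;  break;
		case STORE_B: B = 4;  break;
		case LOAD_A:  *x = A; break;
		case LOAD_B:  *y = B; break;
		}
	}
}

/*
 * Visit every ordering of the four operations.  Returns the number of
 * orderings visited and sets one bit in *outcomes per distinct (x, y).
 */
static int visit_orderings(unsigned int *outcomes)
{
	int visited = 0;

	*outcomes = 0;
	for (int a = 0; a < NR_OPS; a++)
	for (int b = 0; b < NR_OPS; b++)
	for (int c = 0; c < NR_OPS; c++)
	for (int d = 0; d < NR_OPS; d++) {
		int order[NR_OPS] = { a, b, c, d };
		int x = 0, y = 0;

		if (a == b || a == c || a == d || b == c || b == d || c == d)
			continue;	/* not a permutation */
		replay(order, &x, &y);
		assert((x == 1 || x == 3) && (y == 2 || y == 4));
		*outcomes |= 1u << ((x == 3) * 2 + (y == 4));
		visited++;
	}
	return visited;
}

/* Number of distinct orderings of the four operations. */
int count_orderings(void)
{
	unsigned int mask;

	return visit_orderings(&mask);
}

/* Number of distinct (x, y) value combinations over all orderings. */
int count_outcomes(void)
{
	unsigned int mask;
	int n = 0;

	visit_orderings(&mask);
	for (; mask; mask >>= 1)
		n += mask & 1;
	return n;
}
```

Under these assumptions count_orderings() visits the 24 arrangements and
count_outcomes() finds the four value combinations listed above.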
				140
				141
				142	Furthermore, the stores committed by a CPU to the memory system may not be
				143	perceived by the loads made by another CPU in the same order as the stores were
				144	committed.
				145
				146
				147	As a further example, consider this sequence of events:
				148
	CPU 1		CPU 2
	===============	===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;		Q = P;
	P = &B;		D = *Q;
				154
				155	There is an obvious data dependency here, as the value loaded into D depends on
				156	the address retrieved from P by CPU 2. At the end of the sequence, any of the
				157	following results are possible:
				158
				159	(Q == &A) and (D == 1)
				160	(Q == &B) and (D == 2)
				161	(Q == &B) and (D == 4)
				162
Note that CPU 2 will never try to load C into D because the CPU will load P
into Q before issuing the load of *Q.
				165
				166
				167	DEVICE OPERATIONS
				168	-----------------
				169
				170	Some devices present their control interfaces as collections of memory
				171	locations, but the order in which the control registers are accessed is very
				172	important. For instance, imagine an ethernet card with a set of internal
				173	registers that are accessed through an address port register (A) and a data
				174	port register (D). To read internal register 5, the following code might then
				175	be used:
				176
				177	*A = 5;
				178	x = *D;
				179
				180	but this might show up as either of the following two sequences:
				181
				182	STORE A = 5, x = LOAD D
				183	x = LOAD D, STORE A = 5
				184
the second of which will almost certainly result in a malfunction, since it sets
the address _after_ attempting to read the register.
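
The effect of the two sequences can be illustrated with a toy model. The
sketch below is plain user-space C, not a driver: struct nic_model and all of
its helpers are hypothetical, and the "card" is just a structure in ordinary
memory. It only shows why the second sequence returns the wrong register.

```c
#include <assert.h>

/* Hypothetical model of the card: eight internal registers reached
 * through an address port (A) and a data port (D). */
struct nic_model {
	unsigned int addr;	/* last value stored to the address port */
	unsigned int regs[8];	/* internal registers */
};

/* STORE A = idx */
static void store_addr_port(struct nic_model *nic, unsigned int idx)
{
	assert(idx < 8);
	nic->addr = idx;
}

/* x = LOAD D: returns whichever register the address port selects now. */
static unsigned int load_data_port(const struct nic_model *nic)
{
	return nic->regs[nic->addr];
}

/* Intended sequence: STORE A = idx, x = LOAD D */
unsigned int read_in_order(struct nic_model *nic, unsigned int idx)
{
	store_addr_port(nic, idx);
	return load_data_port(nic);
}

/* Reordered sequence: x = LOAD D, STORE A = idx */
unsigned int read_reordered(struct nic_model *nic, unsigned int idx)
{
	unsigned int x = load_data_port(nic);

	store_addr_port(nic, idx);
	return x;
}

/* Build a card whose register n holds 100 + n, address port left at 0. */
struct nic_model make_nic(void)
{
	struct nic_model nic = { .addr = 0 };

	for (unsigned int n = 0; n < 8; n++)
		nic.regs[n] = 100 + n;
	return nic;
}
```

With a fresh model, read_in_order(&nic, 5) yields the contents of register 5,
whereas read_reordered(&nic, 5) yields the contents of register 0, the
register that the address port happened to select beforehand.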
				187
				188
				189	GUARANTEES
				190	----------
				191
				192	There are some minimal guarantees that may be expected of a CPU:
				193
				194	(*) On any given CPU, dependent memory accesses will be issued in order, with
				195	respect to itself. This means that for:
				196
				197	Q = P; D = *Q;
				198
				199	the CPU will issue the following memory operations:
				200
				201	Q = LOAD P, D = LOAD *Q
				202
				203	and always in that order.
				204
				205	(*) Overlapping loads and stores within a particular CPU will appear to be
				206	ordered within that CPU. This means that for:
				207
				208	a = X; X = b;
				209
				210	the CPU will only issue the following sequence of memory operations:
				211
				212	a = LOAD X, STORE X = b
				213
				214	And for:
				215
				216	X = c; d = X;
				217
				218	the CPU will only issue:
				219
				220	STORE X = c, d = LOAD X
				221
Matt LaPlante	fa00e7e	2006-11-30 04:55:36 +0100	[diff] [blame]	222	(Loads and stores overlap if they are targeted at overlapping pieces of
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	223	memory).
				224
				225	And there are a number of things that _must_ or _must_not_ be assumed:
				226
				227	(*) It _must_not_ be assumed that independent loads and stores will be issued
				228	in the order given. This means that for:
				229
				230	X = A; Y = B; *D = Z;
				231
				232	we may get any of the following sequences:
				233
	X = LOAD A,  Y = LOAD B,  STORE *D = Z
	X = LOAD A,  STORE *D = Z, Y = LOAD B
	Y = LOAD B,  X = LOAD A,  STORE *D = Z
	Y = LOAD B,  STORE *D = Z, X = LOAD A
	STORE *D = Z, X = LOAD A,  Y = LOAD B
	STORE *D = Z, Y = LOAD B,  X = LOAD A
				240
				241	(*) It _must_ be assumed that overlapping memory accesses may be merged or
				242	discarded. This means that for:
				243
	X = *A; Y = *(A + 4);

we may get any one of the following sequences:

	X = LOAD *A; Y = LOAD *(A + 4);
	Y = LOAD *(A + 4); X = LOAD *A;
	{X, Y} = LOAD {*A, *(A + 4) };

And for:

	*A = X; *(A + 4) = Y;

we may get any of:

	STORE *A = X; STORE *(A + 4) = Y;
	STORE *(A + 4) = Y; STORE *A = X;
	STORE {*A, *(A + 4) } = {X, Y};
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	261
				262
				263	=========================
				264	WHAT ARE MEMORY BARRIERS?
				265	=========================
				266
				267	As can be seen above, independent memory operations are effectively performed
				268	in random order, but this can be a problem for CPU-CPU interaction and for I/O.
				269	What is required is some way of intervening to instruct the compiler and the
				270	CPU to restrict the order.
				271
				272	Memory barriers are such interventions. They impose a perceived partial
David Howells	2b94895	2006-06-25 05:48:49 -0700	[diff] [blame]	273	ordering over the memory operations on either side of the barrier.
				274
				275	Such enforcement is important because the CPUs and other devices in a system
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	276	can use a variety of tricks to improve performance, including reordering,
David Howells	2b94895	2006-06-25 05:48:49 -0700	[diff] [blame]	277	deferral and combination of memory operations; speculative loads; speculative
				278	branch prediction and various types of caching. Memory barriers are used to
				279	override or suppress these tricks, allowing the code to sanely control the
				280	interaction of multiple CPUs and/or devices.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	281
				282
				283	VARIETIES OF MEMORY BARRIER
				284	---------------------------
				285
				286	Memory barriers come in four basic varieties:
				287
				288	(1) Write (or store) memory barriers.
				289
				290	A write memory barrier gives a guarantee that all the STORE operations
				291	specified before the barrier will appear to happen before all the STORE
				292	operations specified after the barrier with respect to the other
				293	components of the system.
				294
				295	A write barrier is a partial ordering on stores only; it is not required
				296	to have any effect on loads.
				297
David Howells	6bc3927	2006-06-25 05:49:22 -0700	[diff] [blame]	298	A CPU can be viewed as committing a sequence of store operations to the
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	299	memory system as time progresses. All stores before a write barrier will
				300	occur in the sequence _before_ all the stores after the write barrier.
				301
				302	[!] Note that write barriers should normally be paired with read or data
				303	dependency barriers; see the "SMP barrier pairing" subsection.
				304
				305
				306	(2) Data dependency barriers.
				307
				308	A data dependency barrier is a weaker form of read barrier. In the case
				309	where two loads are performed such that the second depends on the result
				310	of the first (eg: the first load retrieves the address to which the second
				311	load will be directed), a data dependency barrier would be required to
				312	make sure that the target of the second load is updated before the address
				313	obtained by the first load is accessed.
				314
				315	A data dependency barrier is a partial ordering on interdependent loads
				316	only; it is not required to have any effect on stores, independent loads
				317	or overlapping loads.
				318
				319	As mentioned in (1), the other CPUs in the system can be viewed as
				320	committing sequences of stores to the memory system that the CPU being
				321	considered can then perceive. A data dependency barrier issued by the CPU
				322	under consideration guarantees that for any load preceding it, if that
				323	load touches one of a sequence of stores from another CPU, then by the
				324	time the barrier completes, the effects of all the stores prior to that
				325	touched by the load will be perceptible to any loads issued after the data
				326	dependency barrier.
				327
				328	See the "Examples of memory barrier sequences" subsection for diagrams
				329	showing the ordering constraints.
				330
				331	[!] Note that the first load really has to have a _data_ dependency and
				332	not a control dependency. If the address for the second load is dependent
				333	on the first load, but the dependency is through a conditional rather than
				334	actually loading the address itself, then it's a _control_ dependency and
				335	a full read barrier or better is required. See the "Control dependencies"
				336	subsection for more information.
				337
				338	[!] Note that data dependency barriers should normally be paired with
				339	write barriers; see the "SMP barrier pairing" subsection.
				340
				341
				342	(3) Read (or load) memory barriers.
				343
				344	A read barrier is a data dependency barrier plus a guarantee that all the
				345	LOAD operations specified before the barrier will appear to happen before
				346	all the LOAD operations specified after the barrier with respect to the
				347	other components of the system.
				348
				349	A read barrier is a partial ordering on loads only; it is not required to
				350	have any effect on stores.
				351
				352	Read memory barriers imply data dependency barriers, and so can substitute
				353	for them.
				354
				355	[!] Note that read barriers should normally be paired with write barriers;
				356	see the "SMP barrier pairing" subsection.
				357
				358
				359	(4) General memory barriers.
				360
David Howells	670bd95	2006-06-10 09:54:12 -0700	[diff] [blame]	361	A general memory barrier gives a guarantee that all the LOAD and STORE
				362	operations specified before the barrier will appear to happen before all
				363	the LOAD and STORE operations specified after the barrier with respect to
				364	the other components of the system.
				365
				366	A general memory barrier is a partial ordering over both loads and stores.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	367
				368	General memory barriers imply both read and write memory barriers, and so
				369	can substitute for either.
				370
				371
				372	And a couple of implicit varieties:
				373
				374	(5) LOCK operations.
				375
				376	This acts as a one-way permeable barrier. It guarantees that all memory
				377	operations after the LOCK operation will appear to happen after the LOCK
				378	operation with respect to the other components of the system.
				379
				380	Memory operations that occur before a LOCK operation may appear to happen
				381	after it completes.
				382
				383	A LOCK operation should almost always be paired with an UNLOCK operation.
				384
				385
				386	(6) UNLOCK operations.
				387
				388	This also acts as a one-way permeable barrier. It guarantees that all
				389	memory operations before the UNLOCK operation will appear to happen before
				390	the UNLOCK operation with respect to the other components of the system.
				391
				392	Memory operations that occur after an UNLOCK operation may appear to
				393	happen before it completes.
				394
				395	LOCK and UNLOCK operations are guaranteed to appear with respect to each
				396	other strictly in the order specified.
				397
				398	The use of LOCK and UNLOCK operations generally precludes the need for
				399	other sorts of memory barrier (but note the exceptions mentioned in the
				400	subsection "MMIO write barrier").
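
As a rough user-space analogy for varieties (5) and (6), C11 acquire and
release orderings behave like the one-way permeable barriers described above.
The sketch below is not the kernel's spinlock implementation; the names
sketch_lock(), sketch_unlock() and locked_increment() are invented for this
illustration.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
static int counter;		/* protected by counter_lock */

/* LOCK: accesses after this point may not appear to happen before it. */
static void sketch_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;		/* spin until the flag was previously clear */
}

/* UNLOCK: accesses before this point may not appear to happen after it. */
static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Increment the shared counter inside the critical section. */
int locked_increment(void)
{
	int value;

	sketch_lock(&counter_lock);
	value = ++counter;
	assert(value > 0);
	sketch_unlock(&counter_lock);
	return value;
}
```

Accesses outside the critical section may still move into it from either
side, which is the one-way permeability noted above.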
				401
				402
				403	Memory barriers are only required where there's a possibility of interaction
				404	between two CPUs or between a CPU and a device. If it can be guaranteed that
				405	there won't be any such interaction in any particular piece of code, then
				406	memory barriers are unnecessary in that piece of code.
				407
				408
				409	Note that these are the _minimum_ guarantees. Different architectures may give
				410	more substantial guarantees, but they may _not_ be relied upon outside of arch
				411	specific code.
				412
				413
				414	WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
				415	----------------------------------------------
				416
				417	There are certain things that the Linux kernel memory barriers do not guarantee:
				418
				419	(*) There is no guarantee that any of the memory accesses specified before a
				420	memory barrier will be _complete_ by the completion of a memory barrier
				421	instruction; the barrier can be considered to draw a line in that CPU's
				422	access queue that accesses of the appropriate type may not cross.
				423
				424	(*) There is no guarantee that issuing a memory barrier on one CPU will have
				425	any direct effect on another CPU or any other hardware in the system. The
indirect effect will be the order in which the second CPU sees the effects
of the first CPU's accesses, but see the next point:
				428
David Howells	6bc3927	2006-06-25 05:49:22 -0700	[diff] [blame]	429	(*) There is no guarantee that a CPU will see the correct order of effects
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	430	from a second CPU's accesses, even _if_ the second CPU uses a memory
				431	barrier, unless the first CPU _also_ uses a matching memory barrier (see
				432	the subsection on "SMP Barrier Pairing").
				433
				434	(*) There is no guarantee that some intervening piece of off-the-CPU
				435	hardware[*] will not reorder the memory accesses. CPU cache coherency
				436	mechanisms should propagate the indirect effects of a memory barrier
				437	between CPUs, but might not do so in order.
				438
				439	[*] For information on bus mastering DMA and coherency please read:
				440
Randy Dunlap	4b5ff469	2008-03-10 17:16:32 -0700	[diff] [blame]	441	Documentation/PCI/pci.txt
Paul Bolle	395cf96	2011-08-15 02:02:26 +0200	[diff] [blame]	442	Documentation/DMA-API-HOWTO.txt
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	443	Documentation/DMA-API.txt
				444
				445
				446	DATA DEPENDENCY BARRIERS
				447	------------------------
				448
				449	The usage requirements of data dependency barriers are a little subtle, and
				450	it's not always obvious that they're needed. To illustrate, consider the
				451	following sequence of events:
				452
	CPU 1		CPU 2
	===============	===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	P = &B;
			Q = P;
			D = *Q;
				461
				462	There's a clear data dependency here, and it would seem that by the end of the
				463	sequence, Q must be either &A or &B, and that:
				464
				465	(Q == &A) implies (D == 1)
				466	(Q == &B) implies (D == 4)
				467
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	468	But! CPU 2's perception of P may be updated _before_ its perception of B, thus
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	469	leading to the following situation:
				470
				471	(Q == &B) and (D == 2) ????
				472
				473	Whilst this may seem like a failure of coherency or causality maintenance, it
				474	isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
				475	Alpha).
				476
David Howells	2b94895	2006-06-25 05:48:49 -0700	[diff] [blame]	477	To deal with this, a data dependency barrier or better must be inserted
				478	between the address load and the data load:
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	479
	CPU 1		CPU 2
	===============	===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	P = &B;
			Q = P;
			<data dependency barrier>
			D = *Q;
				489
				490	This enforces the occurrence of one of the two implications, and prevents the
				491	third possibility from arising.
				492
				493	[!] Note that this extremely counterintuitive situation arises most easily on
				494	machines with split caches, so that, for example, one cache bank processes
				495	even-numbered cache lines and the other bank processes odd-numbered cache
				496	lines. The pointer P might be stored in an odd-numbered cache line, and the
				497	variable B might be stored in an even-numbered cache line. Then, if the
				498	even-numbered bank of the reading CPU's cache is extremely busy while the
				499	odd-numbered bank is idle, one can see the new value of the pointer P (&B),
David Howells	6bc3927	2006-06-25 05:49:22 -0700	[diff] [blame]	500	but the old value of the variable B (2).
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	501
				502
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	503	Another example of where data dependency barriers might be required is where a
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	504	number is read from memory and then used to calculate the index for an array
				505	access:
				506
	CPU 1		CPU 2
	===============	===============
	{ M[0] == 1, M[1] == 2, M[3] == 3, P == 0, Q == 3 }
	M[1] = 4;
	<write barrier>
	P = 1;
			Q = P;
			<data dependency barrier>
			D = M[Q];
				516
				517
				518	The data dependency barrier is very important to the RCU system, for example.
				519	See rcu_dereference() in include/linux/rcupdate.h. This permits the current
				520	target of an RCU'd pointer to be replaced with a new modified target, without
				521	the replacement target appearing to be incompletely initialised.
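
The publish/dereference pattern can be sketched in user-space C11. This is
only an analogy and not how rcu_dereference() is implemented: a release store
stands in for the write barrier plus pointer store, and an acquire load stands
in for the pointer load plus data dependency barrier. Acquire is stronger
than a data dependency barrier requires; C11 also defines
memory_order_consume for this case, but compilers generally treat it as
acquire. The names struct item, publish() and read_published() are invented
for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct item {
	int payload;
};

static struct item new_target;			/* the replacement target */
static _Atomic(struct item *) published = NULL;	/* the shared pointer */

/*
 * Writer (CPU 1): initialise the target, then publish its address.
 * The release store orders the payload store before the pointer store.
 */
struct item *publish(int payload)
{
	new_target.payload = payload;
	atomic_store_explicit(&published, &new_target, memory_order_release);
	return &new_target;
}

/*
 * Reader (CPU 2): load the pointer, then load through it.
 * Returns -1 if nothing has been published yet.
 */
int read_published(void)
{
	struct item *q = atomic_load_explicit(&published,
					      memory_order_acquire);

	if (q == NULL)
		return -1;
	return q->payload;	/* sees the initialised payload */
}
```

A reader that observes the new pointer is thereby guaranteed to observe the
initialised payload as well, which is the property the text describes for
RCU'd pointers.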
				522
				523	See also the subsection on "Cache Coherency" for a more thorough example.
				524
				525
				526	CONTROL DEPENDENCIES
				527	--------------------
				528
				529	A control dependency requires a full read memory barrier, not simply a data
				530	dependency barrier to make it work correctly. Consider the following bit of
				531	code:
				532
	q = &a;
	if (p) {
		<data dependency barrier>
		q = &b;
	}
	x = *q;
				539
				540	This will not have the desired effect because there is no actual data
				541	dependency, but rather a control dependency that the CPU may short-circuit by
				542	attempting to predict the outcome in advance. In such a case what's actually
				543	required is:
				544
	q = &a;
	if (p) {
		<read barrier>
		q = &b;
	}
	x = *q;
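
A user-space C11 rendering of the corrected code might look like the sketch
below. It is an approximation under stated assumptions: an acquire fence
stands in for <read barrier>, and the writer side (writer_set_b()), which the
example above does not show, is invented here so that the reader has
something to pair with.

```c
#include <assert.h>
#include <stdatomic.h>

static int a = 10;	/* default target of q */
static int b;		/* alternative target, filled in by the writer */
static atomic_int p;	/* non-zero once b is ready */

/* Hypothetical writer: store b, then set p with release ordering. */
int writer_set_b(int value)
{
	b = value;
	atomic_store_explicit(&p, 1, memory_order_release);
	return value;
}

/* Reader: the load of *q depends on p only through the branch. */
int reader(void)
{
	int *q = &a;

	if (atomic_load_explicit(&p, memory_order_relaxed)) {
		/* <read barrier>: order the load of p before the load of b */
		atomic_thread_fence(memory_order_acquire);
		q = &b;
	}
	return *q;
}
```

Without the fence, nothing would order the load of b after the load of p,
because the branch creates only a control dependency.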
				551
				552
				553	SMP BARRIER PAIRING
				554	-------------------
				555
				556	When dealing with CPU-CPU interactions, certain types of memory barrier should
				557	always be paired. A lack of appropriate pairing is almost certainly an error.
				558
				559	A write barrier should always be paired with a data dependency barrier or read
				560	barrier, though a general barrier would also be viable. Similarly a read
barrier or a data dependency barrier should always be paired with at least a
write barrier, though, again, a general barrier is viable:
				563
	CPU 1		CPU 2
	===============	===============
	a = 1;
	<write barrier>
	b = 2;		x = b;
			<read barrier>
			y = a;
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	571
				572	Or:
				573
	CPU 1		CPU 2
	===============	===============================
	a = 1;
	<write barrier>
	b = &a;		x = b;
			<data dependency barrier>
			y = *x;
				581
				582	Basically, the read barrier always has to be there, even though it can be of
				583	the "weaker" type.
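
The first pairing can be approximated in user-space C11 as follows. This is a
sketch, not kernel code: a release fence stands in for <write barrier> and an
acquire fence for <read barrier>, both of which are somewhat stronger than
the barriers they imitate. The function names are invented for this
illustration.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int a, b;		/* both start at 0 */

/* CPU 1: a = 1; <write barrier>; b = 2; */
void cpu1_store(void)
{
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* <write barrier> */
	atomic_store_explicit(&b, 2, memory_order_relaxed);
}

/*
 * CPU 2: x = b; <read barrier>; y = a;
 * Returns non-zero if the observation respects the pairing, that is,
 * if seeing the new value of b implies seeing the new value of a.
 */
int cpu2_load_is_ordered(void)
{
	int x, y;

	x = atomic_load_explicit(&b, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* <read barrier> */
	y = atomic_load_explicit(&a, memory_order_relaxed);
	return x != 2 || y == 1;
}
```

With both fences in place, cpu2_load_is_ordered() holds whenever it runs
relative to cpu1_store(). Removing either fence removes that guarantee.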
				584
David Howells	670bd95	2006-06-10 09:54:12 -0700	[diff] [blame]	585	[!] Note that the stores before the write barrier would normally be expected to
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	586	match the loads after the read barrier or the data dependency barrier, and vice
David Howells	670bd95	2006-06-10 09:54:12 -0700	[diff] [blame]	587	versa:
				588
	CPU 1                           CPU 2
	===============                 ===============
	a = 1;  }----   --->{  v = c;
	b = 2;  }    \ /    {  w = d;
	<write barrier>  \        <read barrier>
	c = 3;  }    / \    {  x = a;
	d = 4;  }----   --->{  y = b;
				596
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	597
				598	EXAMPLES OF MEMORY BARRIER SEQUENCES
				599	------------------------------------
				600
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	601	Firstly, write barriers act as partial orderings on store operations.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	602	Consider the following sequence of events:
				603
				604	CPU 1
				605	=======================
				606	STORE A = 1
				607	STORE B = 2
				608	STORE C = 3
				609	<write barrier>
				610	STORE D = 4
				611	STORE E = 5
				612
				613	This sequence of events is committed to the memory coherence system in an order
				614	that the rest of the system might perceive as the unordered set of { STORE A,
Adrian Bunk	80f7228	2006-06-30 18:27:16 +0200	[diff] [blame]	615	STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	616	}:
				617
	+-------+       :      :
	|       |       +------+
	|       |------>| C=3  |     }     /\
	|       |  :    +------+     }-----  \  -----> Events perceptible to
	|       |  :    | A=1  |     }        \/       the rest of the system
	|       |  :    +------+     }
	| CPU 1 |  :    | B=2  |     }
	|       |       +------+     }
	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
	|       |       +------+     }        requires all stores prior to the
	|       |  :    | E=5  |     }        barrier to be committed before
	|       |  :    +------+     }        further stores may take place
	|       |------>| D=4  |     }
	|       |       +------+
	+-------+       :      :
	                   |
	                   | Sequence in which stores are committed to the
	                   | memory system by CPU 1
	                   V
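
To make the partial ordering concrete, the sketch below counts the commit
orders that the write barrier leaves open. It is plain user-space C written
for this illustration. It treats the five stores as indices 0 to 4 (A, B, C
before the barrier and D, E after it) and accepts an order only if every
pre-barrier store is committed before every post-barrier store.

```c
#include <assert.h>

#define NSTORES 5	/* A, B, C, D, E */
#define NBEFORE 3	/* A, B, C precede the write barrier */

/* order[i] is the store committed at position i. */
static int permitted(const int order[NSTORES])
{
	for (int i = 0; i < NBEFORE; i++)
		if (order[i] >= NBEFORE)
			return 0;	/* D or E overtook the barrier */
	return 1;
}

/* Recursively build every permutation of the stores. */
static void permute(int order[NSTORES], int pos, unsigned int used,
		    int *total, int *allowed)
{
	if (pos == NSTORES) {
		(*total)++;
		*allowed += permitted(order);
		return;
	}
	for (int s = 0; s < NSTORES; s++) {
		if (used & (1u << s))
			continue;
		order[pos] = s;
		permute(order, pos + 1, used | (1u << s), total, allowed);
	}
}

/* Returns the number of all orders; *allowed gets the permitted ones. */
int count_commit_orders(int *allowed)
{
	int order[NSTORES];
	int total = 0;

	*allowed = 0;
	permute(order, 0, 0, &total, allowed);
	assert(*allowed <= total);
	return total;
}
```

Of the 120 conceivable orders, 12 remain: the 3! arrangements of { A, B, C }
combined with the 2! arrangements of { D, E }.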
				637
				638
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	639	Secondly, data dependency barriers act as partial orderings on data-dependent
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	640	loads. Consider the following sequence of events:
				641
	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				LOAD *C (reads B)
				651
				652	Without intervention, CPU 2 may perceive the events on CPU 1 in some
				653	effectively random order, despite the write barrier issued by CPU 1:
				654
	+-------+       :      :                :       :
	|       |       +------+                +-------+  | Sequence of update
	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
	|       |  :    +------+     \          +-------+  | CPU 2
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	    Apparently incorrect --->  |        | B->7  |------>|       |
	    perception of B (!)        |        +-------+       |       |
	                               |        :       :       |       |
	                               |        +-------+       |       |
	    The load of X holds --->    \       | X->9  |------>|       |
	    up the maintenance           \      +-------+       |       |
	    of coherence of B             ----->| B->2  |       +-------+
	                                        +-------+
	                                        :       :
				680
				681
				682	In the above example, CPU 2 perceives that B is 7, despite the load of *C
Paolo Ornati	670e9f3	2006-10-03 22:57:56 +0200	[diff] [blame]	683	(which would be B) coming after the LOAD of C.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	684
				685	If, however, a data dependency barrier were to be placed between the load of C
David Howells	c14038c	2006-04-10 22:54:24 -0700	[diff] [blame]	686	and the load of *C (ie: B) on CPU 2:
				687
	CPU 1			CPU 2
	=======================	=======================
		{ B = 7; X = 9; Y = 8; C = &Y }
	STORE A = 1
	STORE B = 2
	<write barrier>
	STORE C = &B		LOAD X
	STORE D = 4		LOAD C (gets &B)
				<data dependency barrier>
				LOAD *C (reads B)
				698
				699	then the following will occur:
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	700
	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| B=2  |-----       --->| Y->8  |
	|       |  :    +------+     \          +-------+
	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
	|       |       +------+       |        +-------+
	|       |   wwwwwwwwwwwwwwww   |        :       :
	|       |       +------+       |        :       :
	|       |  :    | C=&B |---    |        :       :       +-------+
	|       |  :    +------+   \   |        +-------+       |       |
	|       |------>| D=4  |    ----------->| C->&B |------>|       |
	|       |       +------+       |        +-------+       |       |
	+-------+       :      :       |        :       :       |       |
	                               |        :       :       |       |
	                               |        :       :       | CPU 2 |
	                               |        +-------+       |       |
	                               |        | X->9  |------>|       |
	                               |        +-------+       |       |
	  Makes sure all effects --->   \   ddddddddddddddddd   |       |
	  prior to the store of C        \      +-------+       |       |
	  are perceptible to              ----->| B->2  |------>|       |
	  subsequent loads                      +-------+       |       |
	                                        :       :       +-------+
				724
				725
				726	And thirdly, a read barrier acts as a partial order on loads. Consider the
				727	following sequence of events:
				728
	CPU 1			CPU 2
	=======================	=======================
		{ A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
				LOAD B
				LOAD A
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	737
				738	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
				739	some effectively random order, despite the write barrier issued by CPU 1:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       | A->0  |------>|       |
	                                |       +-------+       |       |
	                                |       :       :       +-------+
	                                 \      :       :
	                                  \     +-------+
	                                   ---->| A->1  |
	                                        +-------+
	                                        :       :


If, however, a read barrier were to be placed between the load of B and the
load of A on CPU 2:

	CPU 1                   CPU 2
	======================= =======================
	        { A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
	                        LOAD B
	                        <read barrier>
	                        LOAD A

then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
2:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>|       |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+


To illustrate this more completely, consider what could happen if the code
contained a load of A either side of the read barrier:

	CPU 1                   CPU 2
	======================= =======================
	        { A = 0, B = 9 }
	STORE A=1
	<write barrier>
	STORE B=2
	                        LOAD B
	                        LOAD A [first load of A]
	                        <read barrier>
	                        LOAD A [second load of A]

Even though the two loads of A both occur after the load of B, they may come
up with different values:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                |       :       :       |       |
	                                |       +-------+       |       |
	                                |       | A->0  |------>| 1st   |
	                                |       +-------+       |       |
	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
	  barrier causes all effects      \     +-------+       |       |
	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
	  to be perceptible to CPU 2            +-------+       |       |
	                                        :       :       +-------+


But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
before the read barrier completes anyway:

	+-------+       :      :                :       :
	|       |       +------+                +-------+
	|       |------>| A=1  |------      --->| A->0  |
	|       |       +------+      \         +-------+
	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
	|       |       +------+        |       +-------+
	|       |------>| B=2  |---     |       :       :
	|       |       +------+   \    |       :       :       +-------+
	+-------+       :      :    \   |       +-------+       |       |
	                             ---------->| B->2  |------>|       |
	                                |       +-------+       | CPU 2 |
	                                |       :       :       |       |
	                                 \      :       :       |       |
	                                  \     +-------+       |       |
	                                   ---->| A->1  |------>| 1st   |
	                                        +-------+       |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	                                        | A->1  |------>| 2nd   |
	                                        +-------+       |       |
	                                        :       :       +-------+


The guarantee is that the second load will always come up with A == 1 if the
load of B came up with B == 2.  No such guarantee exists for the first load of
A; that may come up with either A == 0 or A == 1.
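
The write barrier / read barrier pairing can be sketched with C11 fences. This
is a user-space analogue under stated assumptions: a release fence stands in
for the write barrier and an acquire fence for the read barrier. The names
cpu1_stores(), cpu2_loads() and ordering_holds() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Initial state from the example: { A = 0, B = 9 } */
atomic_int A = 0;
atomic_int B = 9;

struct loads {
	int b;
	int a;
};

/* CPU 1: STORE A=1, <write barrier>, STORE B=2 */
void cpu1_stores(void)
{
	atomic_store_explicit(&A, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* <write barrier> */
	atomic_store_explicit(&B, 2, memory_order_relaxed);
}

/* CPU 2: LOAD B, <read barrier>, LOAD A */
struct loads cpu2_loads(void)
{
	struct loads r;

	r.b = atomic_load_explicit(&B, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* <read barrier> */
	r.a = atomic_load_explicit(&A, memory_order_relaxed);
	return r;
}

/* The guarantee from the text: B == 2 implies A == 1. */
int ordering_holds(struct loads r)
{
	return r.b != 2 || r.a == 1;
}
```

In real use the two functions would run on different threads; ordering_holds()
is then expected to return nonzero for every result of cpu2_loads().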


READ MEMORY BARRIERS VS LOAD SPECULATION
----------------------------------------

Many CPUs speculate with loads: that is, they see that they will need to load
an item from memory, and they find a time where they're not using the bus for
any other loads, and so do the load in advance - even though they haven't
actually got to that point in the instruction execution flow yet.  This permits
the actual load instruction to potentially complete immediately because the CPU
already has the value to hand.

It may turn out that the CPU didn't actually need the value - perhaps because a
branch circumvented the load - in which case it can discard the value or just
cache it for later use.

Consider:

	CPU 1                   CPU 2
	======================= =======================
	                        LOAD B
	                        DIVIDE          } Divide instructions generally
	                        DIVIDE          } take a long time to perform
	                        LOAD A

Which might appear as this:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	Once the divisions are complete -->     :       :   ~-->|       |
	the CPU can then perform the            :       :       |       |
	LOAD with immediate effect              :       :       +-------+


Placing a read barrier or a data dependency barrier just before the second
load:

	CPU 1                   CPU 2
	======================= =======================
	                        LOAD B
	                        DIVIDE
	                        DIVIDE
	                        <read barrier>
	                        LOAD A

will force any value speculatively obtained to be reconsidered to an extent
dependent on the type of barrier used.  If there was no change made to the
speculated memory location, then the speculated value will just be used:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	                                        :       :   ~   |       |
	                                    rrrrrrrrrrrrrrrr~   |       |
	                                        :       :   ~   |       |
	                                        :       :   ~-->|       |
	                                        :       :       |       |
	                                        :       :       +-------+


but if there was an update or an invalidation from another CPU pending, then
the speculation will be cancelled and the value reloaded:

	                                        :       :       +-------+
	                                        +-------+       |       |
	                                    --->| B->2  |------>|       |
	                                        +-------+       | CPU 2 |
	                                        :       :DIVIDE |       |
	                                        +-------+       |       |
	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
	division speculates on the              +-------+   ~   |       |
	LOAD of A                               :       :   ~   |       |
	                                        :       :DIVIDE |       |
	                                        :       :   ~   |       |
	                                        :       :   ~   |       |
	                                    rrrrrrrrrrrrrrrrr   |       |
	                                        +-------+       |       |
	The speculation is discarded --->   --->| A->1  |------>|       |
	and an updated value is                 +-------+       |       |
	retrieved                               :       :       +-------+


TRANSITIVITY
------------

Transitivity is a deeply intuitive notion about ordering that is not
always provided by real computer systems.  The following example
demonstrates transitivity (also called "cumulativity"):

	CPU 1                   CPU 2                   CPU 3
	======================= ======================= =======================
	        { X = 0, Y = 0 }
	STORE X=1               LOAD X                  STORE Y=1
	                        <general barrier>       <general barrier>
	                        LOAD Y                  LOAD X

Suppose that CPU 2's load from X returns 1 and its load from Y returns 0.
This indicates that CPU 2's load from X in some sense follows CPU 1's
store to X and that CPU 2's load from Y in some sense preceded CPU 3's
store to Y.  The question is then "Can CPU 3's load from X return 0?"

Because CPU 2's load from X in some sense came after CPU 1's store, it
is natural to expect that CPU 3's load from X must therefore return 1.
This expectation is an example of transitivity: if a load executing on
CPU A follows a load from the same variable executing on CPU B, then
CPU A's load must either return the same value that CPU B's load did,
or must return some later value.

In the Linux kernel, use of general memory barriers guarantees
transitivity.  Therefore, in the above example, if CPU 2's load from X
returns 1 and its load from Y returns 0, then CPU 3's load from X must
also return 1.

However, transitivity is -not- guaranteed for read or write barriers.
For example, suppose that CPU 2's general barrier in the above example
is changed to a read barrier as shown below:

	CPU 1                   CPU 2                   CPU 3
	======================= ======================= =======================
	        { X = 0, Y = 0 }
	STORE X=1               LOAD X                  STORE Y=1
	                        <read barrier>          <general barrier>
	                        LOAD Y                  LOAD X

This substitution destroys transitivity: in this example, it is perfectly
legal for CPU 2's load from X to return 1, its load from Y to return 0,
and CPU 3's load from X to return 0.

The key point is that although CPU 2's read barrier orders its pair
of loads, it does not guarantee to order CPU 1's store.  Therefore, if
this example runs on a system where CPUs 1 and 2 share a store buffer
or a level of cache, CPU 2 might have early access to CPU 1's writes.
General barriers are therefore required to ensure that all CPUs agree
on the combined order of CPU 1's and CPU 2's accesses.

To reiterate, if your code requires transitivity, use general barriers
throughout.
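
As a rough user-space illustration of the first (all general barrier) example,
the sketch below uses C11 sequentially consistent atomics. The assumption here
is that making every access memory_order_seq_cst (the default for
atomic_load() and atomic_store()) plays the role of separating the accesses
with general barriers, since all such accesses fall into one total order. The
function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Initial state from the example: { X = 0, Y = 0 } */
atomic_int X = 0;
atomic_int Y = 0;

/* CPU 1: STORE X=1 */
void cpu1(void)
{
	atomic_store(&X, 1);
}

/* CPU 2: LOAD X, then LOAD Y (both sequentially consistent) */
void cpu2(int *x, int *y)
{
	*x = atomic_load(&X);
	*y = atomic_load(&Y);
}

/* CPU 3: STORE Y=1, then LOAD X (both sequentially consistent) */
int cpu3(void)
{
	atomic_store(&Y, 1);
	return atomic_load(&X);
}

/* The outcome the text rules out: CPU 2 sees X=1, Y=0 and CPU 3 sees X=0. */
int outcome_allowed(int cpu2_x, int cpu2_y, int cpu3_x)
{
	return !(cpu2_x == 1 && cpu2_y == 0 && cpu3_x == 0);
}
```

Run on three threads, every combination of results is expected to satisfy
outcome_allowed(). Weakening CPU 2's accesses removes that expectation, which
mirrors the read barrier variant of the example.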


========================
EXPLICIT KERNEL BARRIERS
========================

The Linux kernel has a variety of different barriers that act at different
levels:

  (*) Compiler barrier.

  (*) CPU memory barriers.

  (*) MMIO write barrier.


COMPILER BARRIER
----------------

The Linux kernel has an explicit compiler barrier function that prevents the
compiler from moving the memory accesses either side of it to the other side:

	barrier();

This is a general barrier - lesser varieties of compiler barrier do not exist.

The compiler barrier has no direct effect on the CPU, which may then reorder
things however it wishes.
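
A minimal user-space sketch of a compiler-only barrier follows. It assumes GCC
or Clang extended asm; the macro below is a stand-in written for this example,
and publish(), data and ready are hypothetical names.

```c
#include <assert.h>

/*
 * Empty asm statement with a "memory" clobber: it emits no instructions,
 * but the compiler must assume memory changed, so it cannot move or cache
 * memory accesses across this point.
 */
#define barrier() __asm__ __volatile__("" : : : "memory")

int data;
int ready;

void publish(int value)
{
	data = value;
	barrier();	/* compiler emits the store to data before the one below */
	ready = 1;
}
```

This only constrains the compiler. On a weakly ordered CPU another processor
could still observe ready == 1 before the new value of data.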


CPU MEMORY BARRIERS
-------------------

The Linux kernel has eight basic CPU memory barriers:

	TYPE            MANDATORY               SMP CONDITIONAL
	=============== ======================= ===========================
	GENERAL         mb()                    smp_mb()
	WRITE           wmb()                   smp_wmb()
	READ            rmb()                   smp_rmb()
	DATA DEPENDENCY read_barrier_depends()  smp_read_barrier_depends()


All memory barriers except the data dependency barriers imply a compiler
barrier.  Data dependencies do not impose any additional compiler ordering.

Aside: In the case of data dependencies, the compiler would be expected to
issue the loads in the correct order (eg. `a[b]` would have to load the value
of b before loading a[b]), however there is no guarantee in the C specification
that the compiler may not speculate the value of b (eg. is equal to 1) and load
a before b (eg. tmp = a[1]; if (b != 1) tmp = a[b]; ).  There is also the
problem of a compiler reloading b after having loaded a[b], thus having a newer
copy of b than a[b].  A consensus has not yet been reached about these problems,
however the ACCESS_ONCE macro is a good place to start looking.

SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
systems because it is assumed that a CPU will appear to be self-consistent,
and will order overlapping accesses correctly with respect to itself.

[!] Note that SMP memory barriers _must_ be used to control the ordering of
references to shared memory on SMP systems, though the use of locking instead
is sufficient.

Mandatory barriers should not be used to control SMP effects, since mandatory
barriers unnecessarily impose overhead on UP systems.  They may, however, be
used to control MMIO effects on accesses through relaxed memory I/O windows.
These are required even on non-SMP systems as they affect the order in which
memory operations appear to a device by prohibiting both the compiler and the
CPU from reordering them.
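
The difference between mandatory and SMP conditional barriers can be sketched
in user space as below. The macros are stand-ins written for this example and
are not the kernel's definitions: __sync_synchronize() is a GCC/Clang builtin
assumed here as the full barrier, and shared_flag, device_reg, set_flag() and
poke_device() are hypothetical.

```c
#include <assert.h>

#define barrier()	__asm__ __volatile__("" : : : "memory")
#define mb()		__sync_synchronize()	/* full barrier builtin */

#ifdef CONFIG_SMP
#define smp_mb()	mb()		/* real CPU barrier on SMP builds */
#else
#define smp_mb()	barrier()	/* compiler barrier only on UP builds */
#endif

int shared_flag;	/* data shared between CPUs */
int device_reg;		/* stand-in for a device register */

void set_flag(void)
{
	shared_flag = 1;
	smp_mb();	/* orders against other CPUs; costs little on UP */
}

void poke_device(int value)
{
	device_reg = value;
	mb();		/* mandatory: a device needs ordering even on UP */
}
```

Building with -DCONFIG_SMP selects the CPU barrier for smp_mb(); without it,
only the compiler is constrained.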


There are some more advanced barrier functions:

 (*) set_mb(var, value)

     This assigns the value to the variable and then inserts a full memory
     barrier after it, depending on the function.  It isn't guaranteed to
     insert anything more than a compiler barrier in a UP compilation.


 (*) smp_mb__before_atomic_dec();
 (*) smp_mb__after_atomic_dec();
 (*) smp_mb__before_atomic_inc();
 (*) smp_mb__after_atomic_inc();

     These are for use with atomic add, subtract, increment and decrement
     functions that don't return a value, especially when used for reference
     counting.  These functions do not imply memory barriers.

     As an example, consider a piece of code that marks an object as being dead
     and then decrements the object's reference count:

	obj->dead = 1;
	smp_mb__before_atomic_dec();
	atomic_dec(&obj->ref_count);

     This makes sure that the death mark on the object is perceived to be set
     before the reference counter is decremented.

     See Documentation/atomic_ops.txt for more information.  See the "Atomic
     operations" subsection for information on where to use these.
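
A user-space analogue of the dead-mark example, sketched with C11 atomics. The
sequentially consistent fence is assumed as a stand-in for
smp_mb__before_atomic_dec(), and the relaxed fetch-and-subtract for
atomic_dec(). struct object and object_kill() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

struct object {
	atomic_int dead;
	atomic_int ref_count;
};

/* Mark the object dead, then drop one reference. */
void object_kill(struct object *obj)
{
	atomic_store_explicit(&obj->dead, 1, memory_order_relaxed);

	/* stand-in for smp_mb__before_atomic_dec() */
	atomic_thread_fence(memory_order_seq_cst);

	/* like atomic_dec(): result unused, no ordering of its own */
	atomic_fetch_sub_explicit(&obj->ref_count, 1, memory_order_relaxed);
}
```

Without the fence, the relaxed decrement could become visible to another
thread before the store to dead.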


 (*) smp_mb__before_clear_bit(void);
 (*) smp_mb__after_clear_bit(void);

     These are for use similar to the atomic inc/dec barriers.  These are
     typically used for bitwise unlocking operations, so care must be taken as
     there are no implicit memory barriers here either.

     Consider implementing an unlock operation of some nature by clearing a
     locking bit.  The clear_bit() would then need to be barriered like this:

	smp_mb__before_clear_bit();
	clear_bit( ... );

     This prevents memory operations before the clear leaking to after it.  See
     the subsection on "Locking Functions" with reference to UNLOCK operation
     implications.

     See Documentation/atomic_ops.txt for more information.  See the "Atomic
     operations" subsection for information on where to use these.
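
The bit-unlock pattern can be sketched in user space with C11 atomics. This is
an analogue under stated assumptions: the fence stands in for
smp_mb__before_clear_bit() and the relaxed fetch-and-and for clear_bit().
LOCK_BIT, lock_word, bit_trylock() and bit_unlock() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_BIT 0x1u

atomic_uint lock_word;	/* zero-initialised: lock bit clear */

/* Try to take the lock by setting the bit; true if we set it. */
bool bit_trylock(void)
{
	unsigned int old = atomic_fetch_or_explicit(&lock_word, LOCK_BIT,
						    memory_order_acquire);
	return (old & LOCK_BIT) == 0;
}

/* Release the lock by clearing the bit. */
void bit_unlock(void)
{
	/* stand-in for smp_mb__before_clear_bit() */
	atomic_thread_fence(memory_order_seq_cst);

	/* like clear_bit(): no ordering of its own */
	atomic_fetch_and_explicit(&lock_word, ~LOCK_BIT, memory_order_relaxed);
}
```

The fence keeps accesses made inside the critical section from being observed
after the bit is seen clear.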


MMIO WRITE BARRIER
------------------

The Linux kernel also has a special barrier for use with memory-mapped I/O
writes:

	mmiowb();

This is a variation on the mandatory write barrier that causes writes to weakly
ordered I/O regions to be partially ordered.  Its effects may go beyond the
CPU->Hardware interface and actually affect the hardware at some level.

See the subsection "Locks vs I/O accesses" for more information.


===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================

Some of the other functions in the Linux kernel imply memory barriers, amongst
which are locking and scheduling functions.

This specification is a _minimum_ guarantee; any particular architecture may
provide more substantial guarantees, but these may not be relied upon outside
of arch specific code.


LOCKING FUNCTIONS
-----------------

The Linux kernel has a number of locking constructs:

 (*) spin locks
 (*) R/W spin locks
 (*) mutexes
 (*) semaphores
 (*) R/W semaphores
 (*) RCU

In all cases there are variants on "LOCK" operations and "UNLOCK" operations
for each construct.  These operations all imply certain barriers:

 (1) LOCK operation implication:

     Memory operations issued after the LOCK will be completed after the LOCK
     operation has completed.

     Memory operations issued before the LOCK may be completed after the LOCK
     operation has completed.

 (2) UNLOCK operation implication:

     Memory operations issued before the UNLOCK will be completed before the
     UNLOCK operation has completed.

     Memory operations issued after the UNLOCK may be completed before the
     UNLOCK operation has completed.

 (3) LOCK vs LOCK implication:

     All LOCK operations issued before another LOCK operation will be completed
     before that LOCK operation.

 (4) LOCK vs UNLOCK implication:

     All LOCK operations issued before an UNLOCK operation will be completed
     before the UNLOCK operation.

     All UNLOCK operations issued before a LOCK operation will be completed
     before the LOCK operation.

 (5) Failed conditional LOCK implication:

     Certain variants of the LOCK operation may fail, either due to being
     unable to get the lock immediately, or due to receiving an unblocked
     signal whilst asleep waiting for the lock to become available.  Failed
     locks do not imply any sort of barrier.

Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.

[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
    barriers is that the effects of instructions outside of a critical section
    may seep into the inside of the critical section.

A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier
because it is possible for an access preceding the LOCK to happen after the
LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
two accesses can themselves then cross:

	*A = a;
	LOCK
	UNLOCK
	*B = b;

may occur as:

	LOCK, STORE *B, STORE *A, UNLOCK

Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so cannot be counted on in such a situation to actually achieve
anything at all - especially with respect to I/O accesses - unless combined
with interrupt disabling operations.

See also the section on "Inter-CPU locking barrier effects".
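
The one-way nature of LOCK and UNLOCK maps naturally onto acquire and release
ordering. The sketch below is a user-space analogue built on C11 atomic_flag
and is not how the kernel implements its locks. demo_lock, my_trylock(),
my_lock(), my_unlock() and increment() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_flag demo_lock = ATOMIC_FLAG_INIT;
int protected_counter;

/* LOCK attempt: acquire ordering, so later accesses stay after it. */
bool my_trylock(void)
{
	return !atomic_flag_test_and_set_explicit(&demo_lock,
						  memory_order_acquire);
}

/* Unconditional LOCK: spin until the flag is ours. */
void my_lock(void)
{
	while (!my_trylock())
		;
}

/* UNLOCK: release ordering, so earlier accesses stay before it. */
void my_unlock(void)
{
	atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

void increment(void)
{
	my_lock();
	protected_counter++;	/* cannot leak out of the critical section */
	my_unlock();
}
```

Acquire and release each constrain movement in one direction only, so accesses
outside the critical section may still move into it, as the note above
describes.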


As an example, consider the following:

	*A = a;
	*B = b;
	LOCK
	*C = c;
	*D = d;
	UNLOCK
	*E = e;
	*F = f;

The following sequence of events is acceptable:

	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK

	[+] Note that {*F,*A} indicates a combined access.

But none of the following are:

	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E



INTERRUPT DISABLING FUNCTIONS
-----------------------------

Functions that disable interrupts (LOCK equivalent) and enable interrupts
(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
barriers are required in such a situation, they must be provided from some
other means.


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event.  To make sure that
these appear to happen in the right order, the primitives to begin the process
of going to sleep, and the primitives to initiate a wake up imply certain
barriers.

Firstly, the sleeper normally follows something like this sequence of events:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (event_indicated)
			break;
		schedule();
	}

A general memory barrier is interpolated automatically by set_current_state()
after it has altered the task state:

	CPU 1
	===============================
	set_current_state();
	  set_mb();
	    STORE current->state
	    <general barrier>
	  LOAD event_indicated

set_current_state() may be wrapped by:

	prepare_to_wait();
	prepare_to_wait_exclusive();

which therefore also imply a general memory barrier after setting the state.
The whole sequence above is available in various canned forms, all of which
interpolate the memory barrier in the right place:

	wait_event();
	wait_event_interruptible();
	wait_event_interruptible_exclusive();
	wait_event_interruptible_timeout();
	wait_event_killable();
	wait_event_timeout();
	wait_on_bit();
	wait_on_bit_lock();


Secondly, code that performs a wake up normally follows something like this:

	event_indicated = 1;
	wake_up(&event_wait_queue);

or:

	event_indicated = 1;
	wake_up_process(event_daemon);

A write memory barrier is implied by wake_up() and co. if and only if they wake
something up.  The barrier occurs before the task state is cleared, and so sits
between the STORE to indicate the event and the STORE to set TASK_RUNNING:

	CPU 1                           CPU 2
	=============================== ===============================
	set_current_state();            STORE event_indicated
	  set_mb();                     wake_up();
	    STORE current->state          <write barrier>
	    <general barrier>             STORE current->state
	  LOAD event_indicated

The available waker functions include:

	complete();
	wake_up();
	wake_up_all();
	wake_up_bit();
	wake_up_interruptible();
	wake_up_interruptible_all();
	wake_up_interruptible_nr();
	wake_up_interruptible_poll();
	wake_up_interruptible_sync();
	wake_up_interruptible_sync_poll();
	wake_up_locked();
	wake_up_locked_poll();
	wake_up_nr();
	wake_up_poll();
	wake_up_process();


[!] Note that the memory barriers implied by the sleeper and the waker do _not_
order multiple stores before the wake-up with respect to loads of those stored
values after the sleeper has called set_current_state().  For instance, if the
sleeper does:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated)
		break;
	__set_current_state(TASK_RUNNING);
	do_something(my_data);

and the waker does:

	my_data = value;
	event_indicated = 1;
	wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived by
the sleeper as coming after the change to my_data.  In such a circumstance, the
code on both sides must interpolate its own memory barriers between the
separate data accesses.  Thus the above sleeper ought to do:

	set_current_state(TASK_INTERRUPTIBLE);
	if (event_indicated) {
		smp_rmb();
		do_something(my_data);
	}

and the waker should do:

	my_data = value;
	smp_wmb();
	event_indicated = 1;
	wake_up(&event_wait_queue);


MISCELLANEOUS FUNCTIONS
-----------------------

Other functions that imply barriers:

 (*) schedule() and similar imply full memory barriers.


=================================
INTER-CPU LOCKING BARRIER EFFECTS
=================================

On SMP systems locking primitives give a more substantial form of barrier: one
that does affect memory access ordering on other CPUs, within the context of
conflict on any particular lock.


LOCKS VS MEMORY ACCESSES
------------------------

Consider the following: the system has a pair of spinlocks (M) and (Q), and
three CPUs; then should the following sequence of events occur:

	CPU 1                           CPU 2
	=============================== ===============================
	*A = a;                         *E = e;
	LOCK M                          LOCK Q
	*B = b;                         *F = f;
	*C = c;                         *G = g;
	UNLOCK M                        UNLOCK Q
	*D = d;                         *H = h;

Then there is no guarantee as to what order CPU 3 will see the accesses to *A
through *H occur in, other than the constraints imposed by the separate locks
on the separate CPUs.  It might, for example, see:

	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M

But it won't see any of:

	*B, *C or *D preceding LOCK M
	*A, *B or *C following UNLOCK M
	*F, *G or *H preceding LOCK Q
	*E, *F or *G following UNLOCK Q


However, if the following occurs:

	CPU 1                           CPU 2
	=============================== ===============================
	*A = a;
	LOCK M          [1]
	*B = b;
	*C = c;
	UNLOCK M        [1]
	*D = d;                         *E = e;
	                                LOCK M          [2]
	                                *F = f;
	                                *G = g;
	                                UNLOCK M        [2]
	                                *H = h;

CPU 3 might see:

	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D

But assuming CPU 1 gets the lock first, CPU 3 won't see any of:

	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
	*A, *B or *C following UNLOCK M [1]
	*F, *G or *H preceding LOCK M [2]
	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
				1486
				1487
				1488	LOCKS VS I/O ACCESSES
				1489	---------------------
				1490
				1491	Under certain circumstances (especially involving NUMA), I/O accesses within
				1492	two spinlocked sections on two different CPUs may be seen as interleaved by the
				1493	PCI bridge, because the PCI bridge does not necessarily participate in the
				1494	cache-coherence protocol, and is therefore incapable of issuing the required
				1495	read memory barriers.
				1496
				1497	For example:
				1498
				1499	CPU 1                           CPU 2
				1500	=============================== ===============================
				1501	spin_lock(Q);
				1502	writel(0, ADDR);
				1503	writel(1, DATA);
				1504	spin_unlock(Q);
				1505	                                spin_lock(Q);
				1506	                                writel(4, ADDR);
				1507	                                writel(5, DATA);
				1508	                                spin_unlock(Q);
				1509
				1510	may be seen by the PCI bridge as follows:
				1511
				1512	STORE ADDR = 0, STORE ADDR = 4, STORE DATA = 1, STORE DATA = 5
				1513
				1514	which would probably cause the hardware to malfunction.
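
To see why that interleaving is harmful, here is a small user-space model, not
kernel code. It assumes a hypothetical indexed device in which a store to DATA
lands in whichever internal register the last store to ADDR selected; the names
struct fake_dev, dev_store() and dev_replay(), and the register count of 8, are
invented for illustration:

```c
#include <stddef.h>

/* Hypothetical indexed device: a DATA store goes to regs[addr]. */
enum reg { ADDR, DATA };

struct store {
	enum reg reg;
	unsigned int val;
};

struct fake_dev {
	unsigned int addr;	/* register selected by the last ADDR store */
	unsigned int regs[8];	/* internal registers reached through DATA */
};

/* Apply one store in the order the bridge delivers it. */
void dev_store(struct fake_dev *dev, struct store s)
{
	if (s.reg == ADDR)
		dev->addr = s.val % 8;
	else
		dev->regs[dev->addr] = s.val;
}

/* Replay a whole sequence of stores against a zeroed device. */
struct fake_dev dev_replay(const struct store *seq, size_t n)
{
	struct fake_dev dev = { 0 };
	size_t i;

	for (i = 0; i < n; i++)
		dev_store(&dev, seq[i]);
	return dev;
}

/* The order the two critical sections were meant to produce. */
const struct store intended[] = {
	{ ADDR, 0 }, { DATA, 1 }, { ADDR, 4 }, { DATA, 5 },
};

/* The order the PCI bridge might see without mmiowb(). */
const struct store interleaved[] = {
	{ ADDR, 0 }, { ADDR, 4 }, { DATA, 1 }, { DATA, 5 },
};
```

Replaying the interleaved sequence in this model leaves register 0 unwritten
and writes register 4 twice, so the value 1 intended for register 0 is lost.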
				1515
				1516
				1517	What is necessary here is to intervene with an mmiowb() before dropping the
				1518	spinlock, for example:
				1519
				1520	CPU 1                           CPU 2
				1521	=============================== ===============================
				1522	spin_lock(Q);
				1523	writel(0, ADDR);
				1524	writel(1, DATA);
				1525	mmiowb();
				1526	spin_unlock(Q);
				1527	                                spin_lock(Q);
				1528	                                writel(4, ADDR);
				1529	                                writel(5, DATA);
				1530	                                mmiowb();
				1531	                                spin_unlock(Q);
				1532
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1533	this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
				1534	before either of the stores issued on CPU 2.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1535
				1536
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1537	Furthermore, following a store by a load from the same device obviates the need
				1538	for the mmiowb(), because the load forces the store to complete before the load
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1539	is performed:
				1540
				1541	CPU 1                           CPU 2
				1542	=============================== ===============================
				1543	spin_lock(Q);
				1544	writel(0, ADDR);
				1545	a = readl(DATA);
				1546	spin_unlock(Q);
				1547	                                spin_lock(Q);
				1548	                                writel(4, ADDR);
				1549	                                b = readl(DATA);
				1550	                                spin_unlock(Q);
				1551
				1552
				1553	See Documentation/DocBook/deviceiobook.tmpl for more information.
				1554
				1555
				1556	=================================
				1557	WHERE ARE MEMORY BARRIERS NEEDED?
				1558	=================================
				1559
				1560	Under normal operation, memory operation reordering is generally not going to
				1561	be a problem as a single-threaded linear piece of code will still appear to
David Howells	50fa610	2009-04-28 15:01:38 +0100	[diff] [blame]	1562	work correctly, even if it's in an SMP kernel. There are, however, four
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1563	circumstances in which reordering definitely _could_ be a problem:
				1564
				1565	(*) Interprocessor interaction.
				1566
				1567	(*) Atomic operations.
				1568
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1569	(*) Accessing devices.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1570
				1571	(*) Interrupts.
				1572
				1573
				1574	INTERPROCESSOR INTERACTION
				1575	--------------------------
				1576
				1577	When there's a system with more than one processor, more than one CPU in the
				1578	system may be working on the same data set at the same time. This can cause
				1579	synchronisation problems, and the usual way of dealing with them is to use
				1580	locks. Locks, however, are quite expensive, and so it may be preferable to
				1581	operate without the use of a lock if at all possible. In such a case
				1582	operations that affect both CPUs may have to be carefully ordered to prevent
				1583	a malfunction.
				1584
				1585	Consider, for example, the R/W semaphore slow path. Here a waiting process is
				1586	queued on the semaphore, by virtue of it having a piece of its stack linked to
				1587	the semaphore's list of waiting processes:
				1588
				1589	struct rw_semaphore {
				1590	...
				1591	spinlock_t lock;
				1592	struct list_head waiters;
				1593	};
				1594
				1595	struct rwsem_waiter {
				1596	struct list_head list;
				1597	struct task_struct *task;
				1598	};
				1599
				1600	To wake up a particular waiter, the up_read() or up_write() functions have to:
				1601
				1602	(1) read the next pointer from this waiter's record to find out where the
				1603	next waiter record is;
				1604
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1605	(2) read the pointer to the waiter's task structure;
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1606
				1607	(3) clear the task pointer to tell the waiter it has been given the semaphore;
				1608
				1609	(4) call wake_up_process() on the task; and
				1610
				1611	(5) release the reference held on the waiter's task struct.
				1612
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1613	In other words, it has to perform this sequence of events:
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1614
				1615	LOAD waiter->list.next;
				1616	LOAD waiter->task;
				1617	STORE waiter->task;
				1618	CALL wakeup
				1619	RELEASE task
				1620
				1621	and if any of these steps occur out of order, then the whole thing may
				1622	malfunction.
				1623
				1624	Once it has queued itself and dropped the semaphore lock, the waiter does not
				1625	get the lock again; it instead just waits for its task pointer to be cleared
				1626	before proceeding. Since the record is on the waiter's stack, this means that
				1627	if the task pointer is cleared _before_ the next pointer in the list is read,
				1628	another CPU might start processing the waiter and might clobber the waiter's
				1629	stack before the up*() function has a chance to read the next pointer.
				1630
				1631	Consider then what might happen to the above sequence of events:
				1632
				1633	CPU 1                           CPU 2
				1634	=============================== ===============================
				1635	                                down_xxx()
				1636	                                Queue waiter
				1637	                                Sleep
				1638	up_yyy()
				1639	LOAD waiter->task;
				1640	STORE waiter->task;
				1641	                                Woken up by other event
				1642	<preempt>
				1643	                                Resume processing
				1644	                                down_xxx() returns
				1645	                                call foo()
				1646	                                foo() clobbers *waiter
				1647	</preempt>
				1648	LOAD waiter->list.next;
				1649	--- OOPS ---
				1650
				1651	This could be dealt with using the semaphore lock, but then the down_xxx()
				1652	function has to needlessly get the spinlock again after being woken up.
				1653
				1654	The way to deal with this is to insert a general SMP memory barrier:
				1655
				1656	LOAD waiter->list.next;
				1657	LOAD waiter->task;
				1658	smp_mb();
				1659	STORE waiter->task;
				1660	CALL wakeup
				1661	RELEASE task
				1662
				1663	In this case, the barrier makes a guarantee that all memory accesses before the
				1664	barrier will appear to happen before all the memory accesses after the barrier
				1665	with respect to the other CPUs on the system. It does _not_ guarantee that all
				1666	the memory accesses before the barrier will be complete by the time the barrier
				1667	instruction itself is complete.
				1668
				1669	On a UP system - where this wouldn't be a problem - the smp_mb() is just a
				1670	compiler barrier, thus making sure the compiler emits the instructions in the
David Howells	6bc3927	2006-06-25 05:49:22 -0700	[diff] [blame]	1671	right order without actually intervening in the CPU. Since there's only one
				1672	CPU, that CPU's dependency ordering logic will take care of everything else.
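
The barrier-protected wake-up sequence can be sketched in user-space C11. This
is an analogy rather than the kernel's rwsem code: struct task, struct waiter
and grant_one() are hypothetical names, and atomic_thread_fence() with
memory_order_seq_cst stands in for smp_mb():

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct task {
	int id;
};

struct waiter {
	struct waiter *next;		/* stands in for list.next */
	_Atomic(struct task *) task;	/* cleared to grant the semaphore */
};

/*
 * Grant the semaphore to one waiter. Returns the next waiter and hands the
 * task to wake back through *taskp. Once w->task has been cleared, *w lives
 * on a stack that may be reused, so it must not be touched again.
 */
struct waiter *grant_one(struct waiter *w, struct task **taskp)
{
	/* LOAD waiter->list.next */
	struct waiter *next = w->next;

	/* LOAD waiter->task */
	*taskp = atomic_load_explicit(&w->task, memory_order_relaxed);

	/* Stand-in for smp_mb(): order the loads above before the store. */
	atomic_thread_fence(memory_order_seq_cst);

	/* STORE waiter->task */
	atomic_store_explicit(&w->task, NULL, memory_order_relaxed);

	return next;
}
```

The caller would then wake the returned task and drop its reference, using
only the values returned here and never going back to the waiter record.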
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1673
				1674
				1675	ATOMIC OPERATIONS
				1676	-----------------
				1677
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1678	Whilst they are technically interprocessor interaction considerations, atomic
				1679	operations are noted specially as some of them imply full memory barriers and
				1680	some don't, but they're very heavily relied on as a group throughout the
				1681	kernel.
				1682
				1683	Any atomic operation that modifies some state in memory and returns information
				1684	about the state (old or new) implies an SMP-conditional general memory barrier
Nick Piggin	2633357	2007-10-18 03:06:39 -0700	[diff] [blame]	1685	(smp_mb()) on each side of the actual operation (with the exception of
				1686	explicit lock operations, described later). These include:
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1687
				1688	xchg();
				1689	cmpxchg();
Richard Braun	7e8b1e7	2012-12-13 11:07:32 +0100	[diff] [blame]	1690	atomic_xchg();
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1691	atomic_cmpxchg();
				1692	atomic_inc_return();
				1693	atomic_dec_return();
				1694	atomic_add_return();
				1695	atomic_sub_return();
				1696	atomic_inc_and_test();
				1697	atomic_dec_and_test();
				1698	atomic_sub_and_test();
				1699	atomic_add_negative();
Oleg Nesterov	02c608c	2008-02-24 00:03:29 +0300	[diff] [blame]	1700	atomic_add_unless(); /* when succeeds (returns 1) */
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1701	test_and_set_bit();
				1702	test_and_clear_bit();
				1703	test_and_change_bit();
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1704
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1705	These are used for such things as implementing LOCK-class and UNLOCK-class
				1706	operations and adjusting reference counters towards object destruction, and as
				1707	such the implicit memory barrier effects are necessary.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1708
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1709
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1710	The following operations are potential problems as they do _not_ imply memory
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1711	barriers, but might be used for implementing such things as UNLOCK-class
				1712	operations:
				1713
				1714	atomic_set();
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1715	set_bit();
				1716	clear_bit();
				1717	change_bit();
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1718
				1719	With these the appropriate explicit memory barrier should be used if necessary
				1720	(smp_mb__before_clear_bit() for instance).
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1721
				1722
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1723	The following also do _not_ imply memory barriers, and so may require explicit
				1724	memory barriers under some circumstances (smp_mb__before_atomic_dec() for
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1725	instance):
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1726
				1727	atomic_add();
				1728	atomic_sub();
				1729	atomic_inc();
				1730	atomic_dec();
				1731
				1732	If they're used for statistics generation, then they probably don't need memory
				1733	barriers, unless there's a coupling between statistical data.
				1734
				1735	If they're used for reference counting on an object to control its lifetime,
				1736	they probably don't need memory barriers because either the reference count
				1737	will be adjusted inside a locked section, or the caller will already hold
				1738	sufficient references to make the lock, and thus a memory barrier, unnecessary.
				1739
				1740	If they're used for constructing a lock of some description, then they probably
				1741	do need memory barriers as a lock primitive generally has to do things in a
				1742	specific order.
				1743
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1744	Basically, each usage case has to be carefully considered as to whether memory
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1745	barriers are needed or not.
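
As a rough user-space illustration of the reference counting case, the sketch
below uses C11 atomics. It is only an analogy: C11 memory orders do not map
exactly onto the kernel's barrier semantics, and struct obj, obj_init(),
obj_get() and obj_put() are hypothetical names. The increment asks for no
ordering, like atomic_inc(), whilst the final decrement is fully ordered, like
atomic_dec_and_test():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object; not a kernel type. */
struct obj {
	atomic_int refs;
	int payload;
};

void obj_init(struct obj *o, int payload)
{
	o->payload = payload;
	atomic_init(&o->refs, 1);
}

/*
 * Like atomic_inc(): no ordering is requested. The caller already holds a
 * reference, so the object cannot go away underneath it.
 */
void obj_get(struct obj *o)
{
	atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/*
 * Like atomic_dec_and_test(): the result decides whether the object is
 * destroyed, so earlier accesses to the object must be ordered before the
 * decrement and the teardown after it. Returns true on the final put.
 */
bool obj_put(struct obj *o)
{
	return atomic_fetch_sub_explicit(&o->refs, 1,
					 memory_order_seq_cst) == 1;
}
```

Only the caller that sees obj_put() return true may free the object.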
				1746
Nick Piggin	2633357	2007-10-18 03:06:39 -0700	[diff] [blame]	1747	The following operations are special locking primitives:
				1748
				1749	test_and_set_bit_lock();
				1750	clear_bit_unlock();
				1751	__clear_bit_unlock();
				1752
				1753	These implement LOCK-class and UNLOCK-class operations. These should be used in
				1754	preference to other operations when implementing locking primitives, because
				1755	their implementations can be optimised on many architectures.
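
A user-space C11 analogue of such a bit lock might look as follows. This is a
sketch under assumptions, not the kernel implementation: LOCK_BIT,
bit_trylock() and bit_unlock() are hypothetical names, and acquire and release
ordering stand in for the LOCK-class and UNLOCK-class semantics. Note that
bit_trylock() returns true on success, whereas test_and_set_bit_lock() returns
the old value of the bit:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_BIT 0x1u	/* hypothetical: bit 0 of the word is the lock */

/*
 * Analogue of test_and_set_bit_lock() used as a trylock: returns true if
 * the lock was taken. Acquire ordering keeps the critical section from
 * moving up above the lock.
 */
bool bit_trylock(atomic_uint *word)
{
	unsigned int old = atomic_fetch_or_explicit(word, LOCK_BIT,
						    memory_order_acquire);
	return !(old & LOCK_BIT);
}

/*
 * Analogue of clear_bit_unlock(). Release ordering keeps the critical
 * section from moving down below the unlock.
 */
void bit_unlock(atomic_uint *word)
{
	atomic_fetch_and_explicit(word, ~LOCK_BIT, memory_order_release);
}
```

The other bits of the word are left untouched, so they can carry flags that
the lock protects.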
				1756
David Howells	dbc8700	2006-04-10 22:54:23 -0700	[diff] [blame]	1757	[!] Note that special memory barrier primitives are available for these
				1758	situations because on some CPUs the atomic instructions used imply full memory
				1759	barriers, and so barrier instructions are superfluous in conjunction with them,
				1760	and in such cases the special barrier primitives will be no-ops.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1761
				1762	See Documentation/atomic_ops.txt for more information.
				1763
				1764
				1765	ACCESSING DEVICES
				1766	-----------------
				1767
				1768	Many devices can be memory mapped, and so appear to the CPU as if they're just
				1769	a set of memory locations. To control such a device, the driver usually has to
				1770	make the right memory accesses in exactly the right order.
				1771
				1772	However, having a clever CPU or a clever compiler creates a potential problem
				1773	in that the carefully sequenced accesses in the driver code won't reach the
				1774	device in the requisite order if the CPU or the compiler thinks it is more
				1775	efficient to reorder, combine or merge accesses - something that would cause
				1776	the device to malfunction.
				1777
				1778	Inside of the Linux kernel, I/O should be done through the appropriate accessor
				1779	routines - such as inb() or writel() - which know how to make such accesses
				1780	appropriately sequential. Whilst this, for the most part, renders the explicit
				1781	use of memory barriers unnecessary, there are a couple of situations where they
				1782	might be needed:
				1783
				1784	(1) On some systems, I/O stores are not strongly ordered across all CPUs, and
				1785	so for _all_ general drivers locks should be used and mmiowb() must be
				1786	issued prior to unlocking the critical section.
				1787
				1788	(2) If the accessor functions are used to refer to an I/O memory window with
				1789	relaxed memory access properties, then _mandatory_ memory barriers are
				1790	required to enforce ordering.
				1791
				1792	See Documentation/DocBook/deviceiobook.tmpl for more information.
				1793
				1794
				1795	INTERRUPTS
				1796	----------
				1797
				1798	A driver may be interrupted by its own interrupt service routine, and thus the
				1799	two parts of the driver may interfere with each other's attempts to control or
				1800	access the device.
				1801
				1802	This may be alleviated - at least in part - by disabling local interrupts (a
				1803	form of locking), such that the critical operations are all contained within
				1804	the interrupt-disabled section in the driver. Whilst the driver's interrupt
				1805	routine is executing, the driver's core may not run on the same CPU, and its
				1806	interrupt is not permitted to happen again until the current interrupt has been
				1807	handled, thus the interrupt handler does not need to lock against that.
				1808
				1809	However, consider a driver that was talking to an ethernet card that sports an
				1810	address register and a data register. If that driver's core talks to the card
				1811	under interrupt-disablement and then the driver's interrupt handler is invoked:
				1812
				1813	LOCAL IRQ DISABLE
				1814	writew(3, ADDR);
				1815	writew(y, DATA);
				1816	LOCAL IRQ ENABLE
				1817	<interrupt>
				1818	writew(4, ADDR);
				1819	q = readw(DATA);
				1820	</interrupt>
				1821
				1822	The store to the data register might happen after the second store to the
				1823	address register if ordering rules are sufficiently relaxed:
				1824
				1825	STORE ADDR = 3, STORE ADDR = 4, STORE DATA = y, q = LOAD DATA
				1826
				1827
				1828	If ordering rules are relaxed, it must be assumed that accesses done inside an
				1829	interrupt-disabled section may leak outside of it and may interleave with
				1830	accesses performed in an interrupt - and vice versa - unless implicit or
				1831	explicit barriers are used.
				1832
				1833	Normally this won't be a problem because the I/O accesses done inside such
				1834	sections will include synchronous load operations on strictly ordered I/O
				1835	registers that form implicit I/O barriers. If this isn't sufficient then an
				1836	mmiowb() may need to be used explicitly.
				1837
				1838
				1839	A similar situation may occur between an interrupt routine and two routines
				1840	running on separate CPUs that communicate with each other. If such a case is
				1841	likely, then interrupt-disabling locks should be used to guarantee ordering.
				1842
				1843
				1844	==========================
				1845	KERNEL I/O BARRIER EFFECTS
				1846	==========================
				1847
				1848	When accessing I/O memory, drivers should use the appropriate accessor
				1849	functions:
				1850
				1851	(*) inX(), outX():
				1852
				1853	These are intended to talk to I/O space rather than memory space, but
				1854	that's primarily a CPU-specific concept. The i386 and x86_64 processors do
				1855	indeed have special I/O space access cycles and instructions, but many
				1856	CPUs don't have such a concept.
				1857
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1858	The PCI bus, amongst others, defines an I/O space concept which - on such
				1859	CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
David Howells	6bc3927	2006-06-25 05:49:22 -0700	[diff] [blame]	1860	space. However, it may also be mapped as a virtual I/O space in the CPU's
				1861	memory map, particularly on those CPUs that don't support alternate I/O
				1862	spaces.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1863
				1864	Accesses to this space may be fully synchronous (as on i386), but
				1865	intermediary bridges (such as the PCI host bridge) may not fully honour
				1866	that.
				1867
				1868	They are guaranteed to be fully ordered with respect to each other.
				1869
				1870	They are not guaranteed to be fully ordered with respect to other types of
				1871	memory and I/O operation.
				1872
				1873	(*) readX(), writeX():
				1874
				1875	Whether these are guaranteed to be fully ordered and uncombined with
				1876	respect to each other on the issuing CPU depends on the characteristics
				1877	defined for the memory window through which they're accessing. On later
				1878	i386 architecture machines, for example, this is controlled by way of the
				1879	MTRR registers.
				1880
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1881	Ordinarily, these will be guaranteed to be fully ordered and uncombined,
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1882	provided they're not accessing a prefetchable device.
				1883
				1884	However, intermediary hardware (such as a PCI bridge) may indulge in
				1885	deferral if it so wishes; to flush a store, a load from the same location
				1886	is preferred[*], but a load from the same device or from configuration
				1887	space should suffice for PCI.
				1888
				1889	[*] NOTE! attempting to load from the same location as was written to may
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	1890	cause a malfunction - consider the 16550 Rx/Tx serial registers for
				1891	example.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1892
				1893	Used with prefetchable I/O memory, an mmiowb() barrier may be required to
				1894	force stores to be ordered.
				1895
				1896	Please refer to the PCI specification for more information on interactions
				1897	between PCI transactions.
				1898
				1899	(*) readX_relaxed()
				1900
				1901	These are similar to readX(), but are not guaranteed to be ordered in any
				1902	way. Be aware that there is no I/O read barrier available.
				1903
				1904	(*) ioreadX(), iowriteX()
				1905
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1906	These will perform appropriately for the type of access they're actually
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1907	doing, be it inX()/outX() or readX()/writeX().
				1908
				1909
				1910	========================================
				1911	ASSUMED MINIMUM EXECUTION ORDERING MODEL
				1912	========================================
				1913
				1914	It has to be assumed that the conceptual CPU is weakly-ordered but that it will
				1915	maintain the appearance of program causality with respect to itself. Some CPUs
				1916	(such as i386 or x86_64) are more constrained than others (such as powerpc or
				1917	frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
				1918	of arch-specific code.
				1919
				1920	This means that it must be considered that the CPU will execute its instruction
				1921	stream in any order it feels like - or even in parallel - provided that if an
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	1922	instruction in the stream depends on an earlier instruction, then that
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1923	earlier instruction must be sufficiently complete[*] before the later
				1924	instruction may proceed; in other words: provided that the appearance of
				1925	causality is maintained.
				1926
				1927	[*] Some instructions have more than one effect - such as changing the
				1928	condition codes, changing registers or changing memory - and different
				1929	instructions may depend on different effects.
				1930
				1931	A CPU may also discard any instruction sequence that winds up having no
				1932	ultimate effect. For example, if two adjacent instructions both load an
				1933	immediate value into the same register, the first may be discarded.
				1934
				1935
				1936	Similarly, it has to be assumed that the compiler might reorder the instruction
				1937	stream in any way it sees fit, again provided the appearance of causality is
				1938	maintained.
				1939
				1940
				1941	============================
				1942	THE EFFECTS OF THE CPU CACHE
				1943	============================
				1944
				1945	The way cached memory operations are perceived across the system is affected to
				1946	a certain extent by the caches that lie between CPUs and memory, and by the
				1947	memory coherence system that maintains the consistency of state in the system.
				1948
				1949	As far as the way a CPU interacts with another part of the system through the
				1950	caches goes, the memory system has to include the CPU's caches, and memory
				1951	barriers for the most part act at the interface between the CPU and its cache
				1952	(memory barriers logically act on the dotted line in the following diagram):
				1953
				1954	    <--- CPU --->         :       <----------- Memory ----------->
				1955	                          :
				1956	+--------+    +--------+  :   +--------+    +-----------+
				1957	|        |    |        |  :   |        |    |           |    +--------+
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	1958	|  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
				1959	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1960	|        |    | Queue  |  :   |        |    |           |--->| Memory |
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	1961	|        |    |        |  :   |        |    |           |    |        |
				1962	+--------+    +--------+  :   +--------+    |           |    |        |
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1963	                          :                 | Cache     |    +--------+
				1964	                          :                 | Coherency |
				1965	                          :                 | Mechanism |    +--------+
				1966	+--------+    +--------+  :   +--------+    |           |    |        |
				1967	|        |    |        |  :   |        |    |           |    |        |
				1968	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	1969	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
				1970	|        |    | Queue  |  :   |        |    |           |    |        |
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	1971	|        |    |        |  :   |        |    |           |    +--------+
				1972	+--------+    +--------+  :   +--------+    +-----------+
				1973	                          :
				1974	                          :
				1975
				1976	Although any particular load or store may not actually appear outside of the
				1977	CPU that issued it since it may have been satisfied within the CPU's own cache,
				1978	it will still appear as if the full memory access had taken place as far as the
				1979	other CPUs are concerned since the cache coherency mechanisms will migrate the
				1980	cacheline over to the accessing CPU and propagate the effects upon conflict.
				1981
				1982	The CPU core may execute instructions in any order it deems fit, provided the
				1983	expected program causality appears to be maintained. Some of the instructions
				1984	generate load and store operations which then go into the queue of memory
				1985	accesses to be performed. The core may place these in the queue in any order
				1986	it wishes, and continue execution until it is forced to wait for an instruction
				1987	to complete.
				1988
				1989	What memory barriers are concerned with is controlling the order in which
				1990	accesses cross from the CPU side of things to the memory side of things, and
				1991	the order in which the effects are perceived to happen by the other observers
				1992	in the system.
				1993
				1994	[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
				1995	their own loads and stores as if they had happened in program order.
				1996
				1997	[!] MMIO or other device accesses may bypass the cache system. This depends on
				1998	the properties of the memory window through which devices are accessed and/or
				1999	the use of any special device communication instructions the CPU may have.
				2000
				2001
				2002	CACHE COHERENCY
				2003	---------------
				2004
				2005	Life isn't quite as simple as it may appear above, however: for while the
				2006	caches are expected to be coherent, there's no guarantee that that coherency
				2007	will be ordered. This means that whilst changes made on one CPU will
				2008	eventually become visible on all CPUs, there's no guarantee that they will
				2009	become apparent in the same order on those other CPUs.
				2010
				2011
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	2012	Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
				2013	has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2014
				2015	            :
				2016	            :                          +--------+
				2017	            :      +---------+         |        |
				2018	+--------+  : +--->| Cache A |<------->|        |
				2019	|        |  : |    +---------+         |        |
				2020	|  CPU 1 |<---+                        |        |
				2021	|        |  : |    +---------+         |        |
				2022	+--------+  : +--->| Cache B |<------->|        |
				2023	            :      +---------+         |        |
				2024	            :                          | Memory |
				2025	            :      +---------+         | System |
				2026	+--------+  : +--->| Cache C |<------->|        |
				2027	|        |  : |    +---------+         |        |
				2028	|  CPU 2 |<---+                        |        |
				2029	|        |  : |    +---------+         |        |
				2030	+--------+  : +--->| Cache D |<------->|        |
				2031	            :      +---------+         |        |
				2032	            :                          +--------+
				2033	            :
				2034
				2035	Imagine the system has the following properties:
				2036
				2037	(*) an odd-numbered cache line may be in cache A, cache C or it may still be
				2038	resident in memory;
				2039
				2040	(*) an even-numbered cache line may be in cache B, cache D or it may still be
				2041	resident in memory;
				2042
				2043	(*) whilst the CPU core is interrogating one cache, the other cache may be
				2044	making use of the bus to access the rest of the system - perhaps to
				2045	displace a dirty cacheline or to do a speculative load;
				2046
				2047	(*) each cache has a queue of operations that need to be applied to that cache
				2048	to maintain coherency with the rest of the system;
				2049
				2050	(*) the coherency queue is not flushed by normal loads to lines already
				2051	present in the cache, even though the contents of the queue may
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	2052	potentially affect those loads.
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2053
				2054	Imagine, then, that two writes are made on the first CPU, with a write barrier
				2055	between them to guarantee that they will appear to reach that CPU's caches in
				2056	the requisite order:
				2057
				2058	CPU 1           CPU 2           COMMENT
				2059	=============== =============== =======================================
				2060	                                u == 0, v == 1 and p == &u, q == &u
				2061	v = 2;
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	2062	smp_wmb();                      Make sure change to v is visible before
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2063	                                 change to p
				2064	<A:modify v=2>                  v is now in cache A exclusively
				2065	p = &v;
				2066	<B:modify p=&v>                 p is now in cache B exclusively
				2067
				2068	The write memory barrier forces the other CPUs in the system to perceive that
				2069	the local CPU's caches have apparently been updated in the correct order. But
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	2070	now imagine that the second CPU wants to read those values:
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2071
				2072	CPU 1           CPU 2           COMMENT
				2073	=============== =============== =======================================
				2074	...
				2075	                q = p;
				2076	                x = *q;
				2077
Jarek Poplawski	81fc632	2007-05-23 13:58:20 -0700	[diff] [blame]	2078	The above pair of reads may then fail to happen in the expected order, as the
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2079	cacheline holding p may get updated in one of the second CPU's caches whilst
				2080	the update to the cacheline holding v is delayed in the other of the second
				2081	CPU's caches by some other cache event:
				2082
				2083	CPU 1           CPU 2           COMMENT
				2084	=============== =============== =======================================
				2085	                                u == 0, v == 1 and p == &u, q == &u
				2086	v = 2;
				2087	smp_wmb();
				2088	<A:modify v=2>  <C:busy>
				2089	                <C:queue v=2>
Aneesh Kumar	79afecf	2006-05-15 09:44:36 -0700	[diff] [blame]	2090	p = &v;         q = p;
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2091	                <D:request p>
				2092	<B:modify p=&v> <D:commit p=&v>
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	2093	                <D:read p>
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2094	                x = *q;
				2095	                <C:read *q>     Reads from v before v updated in cache
				2096	                <C:unbusy>
				2097	                <C:commit v=2>
				2098
				2099	Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
				2100	no guarantee that, without intervention, the order of update will be the same
				2101	as that committed on CPU 1.
				2102
				2103
				2104	To intervene, we need to interpolate a data dependency barrier or a read
				2105	barrier between the loads. This will force the cache to commit its coherency
				2106	queue before processing any further requests:
				2107
				2108	CPU 1           CPU 2           COMMENT
				2109	=============== =============== =======================================
				2110	                                u == 0, v == 1 and p == &u, q == &u
				2111	v = 2;
				2112	smp_wmb();
				2113	<A:modify v=2>  <C:busy>
				2114	                <C:queue v=2>
Paolo 'Blaisorblade' Giarrusso	3fda982	2006-10-19 23:28:19 -0700	[diff] [blame]	2115	p = &v;         q = p;
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2116	                <D:request p>
				2117	<B:modify p=&v> <D:commit p=&v>
Ingo Molnar	e0edc78	2013-11-22 11:24:53 +0100	[diff] [blame^]	2118	                <D:read p>
David Howells	108b42b	2006-03-31 16:00:29 +0100	[diff] [blame]	2119	                smp_read_barrier_depends()
				2120	                <C:unbusy>
				2121	                <C:commit v=2>
				2122	                x = *q;
				2123	                <C:read *q>     Reads from v after v updated in cache


This sort of problem can be encountered on DEC Alpha processors as they have a
split cache that improves performance by making better use of the data bus.
Whilst most CPUs do imply a data dependency barrier on the read when a memory
access depends on a read, not all do, so it may not be relied on.

Other CPUs may also have split caches, but must coordinate between the various
cachelets for normal memory accesses.  The semantics of the Alpha removes the
need for coordination in the absence of memory barriers.
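The barrier placement in the second sequence above can be approximated in
portable C11.  What follows is a minimal userspace sketch, not kernel code: a
release fence stands in for smp_wmb() and an acquire fence stands in for
smp_read_barrier_depends().  The acquire fence is an assumption made for
portability; it is stronger than a data dependency barrier.  The function
names publish() and consume() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int u = 0, v = 1;
static _Atomic(int *) p = &u;

/* CPU 1 side: update v, then publish its address through p. */
void publish(void)
{
	v = 2;
	/* Stands in for smp_wmb(): order the store to v before the store to p. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p, &v, memory_order_relaxed);
}

/* CPU 2 side: read the pointer, then read the data it points to. */
int consume(void)
{
	int *q = atomic_load_explicit(&p, memory_order_relaxed);

	assert(q != NULL);	/* p only ever points at u or v */
	/* Stands in for smp_read_barrier_depends(): order the load of p
	 * before the dependent load of *q. */
	atomic_thread_fence(memory_order_acquire);
	return *q;
}
```

With both fences in place, a consumer running concurrently with the producer
sees either the old pointer (and reads u == 0) or the new pointer together
with the new data (v == 2).  It should not see the new pointer with the stale
value v == 1.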


CACHE COHERENCY VS DMA
----------------------

Not all systems maintain cache coherency with respect to devices doing DMA.  In
such cases, a device attempting DMA may obtain stale data from RAM because
dirty cache lines may be resident in the caches of various CPUs, and may not
have been written back to RAM yet.  To deal with this, the appropriate part of
the kernel must flush the overlapping bits of cache on each CPU (and maybe
invalidate them as well).

In addition, the data DMA'd to RAM by a device may be overwritten by dirty
cache lines being written back to RAM from a CPU's cache after the device has
installed its own data, or cache lines present in the CPU's cache may simply
obscure the fact that RAM has been updated, until such time as the cacheline
is discarded from the CPU's cache and reloaded.  To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.

See Documentation/cachetlb.txt for more information on cache management.
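The two failure modes described above can be reproduced with a toy model.  The
sketch below is purely illustrative: it models one write-back cache line in
front of a small block of RAM, and a device that reads and writes RAM
directly.  Every name in it (struct toy_system, cpu_write(), cache_flush() and
so on) is hypothetical and none of them is a kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 8	/* hypothetical cache line size for the model */

struct toy_system {
	unsigned char ram[LINE_SIZE];	/* backing memory, as the device sees it */
	unsigned char line[LINE_SIZE];	/* the CPU's cached copy of that memory */
	bool valid;			/* the cache holds a copy */
	bool dirty;			/* the copy is newer than RAM */
};

/* Fetch the line from RAM if the cache does not hold it yet. */
static void cache_fill(struct toy_system *s)
{
	if (!s->valid) {
		memcpy(s->line, s->ram, LINE_SIZE);
		s->valid = true;
	}
}

/* CPU store: lands in the cache only (write-back policy). */
void cpu_write(struct toy_system *s, size_t off, unsigned char val)
{
	assert(off < LINE_SIZE);
	cache_fill(s);
	s->line[off] = val;
	s->dirty = true;
}

/* CPU load: served from the cache whenever a copy is present. */
unsigned char cpu_read(struct toy_system *s, size_t off)
{
	assert(off < LINE_SIZE);
	cache_fill(s);
	return s->line[off];
}

/* Write a dirty line back to RAM so that a device can see it. */
void cache_flush(struct toy_system *s)
{
	if (s->valid && s->dirty) {
		memcpy(s->ram, s->line, LINE_SIZE);
		s->dirty = false;
	}
}

/* Discard the cached copy so that the next load refetches from RAM. */
void cache_invalidate(struct toy_system *s)
{
	s->valid = false;
	s->dirty = false;
}

/* A non-coherent device bypasses the cache and touches RAM directly. */
unsigned char dma_read(const struct toy_system *s, size_t off)
{
	return s->ram[off];
}

void dma_write(struct toy_system *s, size_t off, unsigned char val)
{
	s->ram[off] = val;
}
```

In this model, a dma_read() issued after cpu_write() but before cache_flush()
returns the old contents of RAM, and a cpu_read() issued after dma_write() but
before cache_invalidate() returns the old cached value.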


CACHE COHERENCY VS MMIO
-----------------------

Memory mapped I/O usually takes place through memory locations that are part of
a window in the CPU's memory space that has different properties assigned to it
than the usual RAM-directed window.

Amongst these properties is usually the fact that such accesses bypass the
caching entirely and go directly to the device buses.  This means MMIO accesses
may, in effect, overtake accesses to cached memory that were emitted earlier.
A memory barrier isn't sufficient in such a case, but rather the cache must be
flushed between the cached memory write and the MMIO access if the two are in
any way dependent.


=========================
THE THINGS CPUS GET UP TO
=========================

A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if the CPU is, for example,
given the following piece of code to execute:

	a = *A;
	*B = b;
	c = *C;
	d = *D;
	*E = e;

they would then expect that the CPU will complete the memory operation for each
instruction before moving on to the next one, leading to a definite sequence of
operations as seen by external observers in the system:

	LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.

Reality is, of course, much messier.  With many CPUs and compilers, the above
assumption doesn't hold because:

 (*) loads are more likely to need to be completed immediately to permit
     execution progress, whereas stores can often be deferred without a
     problem;

 (*) loads may be done speculatively, and the result discarded should it prove
     to have been unnecessary;

 (*) loads may be done speculatively, leading to the result having been fetched
     at the wrong time in the expected sequence of events;

 (*) the order of the memory accesses may be rearranged to promote better use
     of the CPU buses and caches;

 (*) loads and stores may be combined to improve performance when talking to
     memory or I/O hardware that can do batched accesses of adjacent locations,
     thus cutting down on transaction setup costs (memory and PCI devices may
     both be able to do this); and

 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
     mechanisms may alleviate this - once the store has actually hit the cache
     - there's no guarantee that the coherency management will be propagated in
     order to other CPUs.

So what another CPU, say, might actually observe from the above piece of code
is:

	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B

	(Where "LOAD {*C,*D}" is a combined load)


However, it is guaranteed that a CPU will be self-consistent: it will see its
_own_ accesses appear to be correctly ordered, without the need for a memory
barrier.  For instance, with the following code:

	U = *A;
	*A = V;
	*A = W;
	X = *A;
	*A = Y;
	Z = *A;

and assuming no intervention by an external influence, it can be assumed that
the final result will appear to be:

	U == the original value of *A
	X == W
	Z == Y
	*A == Y

The code above may cause the CPU to generate the full sequence of memory
accesses:

	U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A

in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view of
the world remains consistent.
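The self-consistency guarantee can be checked directly.  The sketch below is
an illustrative userspace rendering of the six-access sequence above; struct
result and run_sequence() are hypothetical names.  The pointer is declared
volatile so that the compiler emits every access, leaving only the CPU free to
combine or discard them.

```c
#include <assert.h>

struct result {
	int u, x, z;	/* the values loaded into U, X and Z */
	int final;	/* the value left in *A */
};

/* Perform the sequence U=*A, *A=V, *A=W, X=*A, *A=Y, Z=*A on one CPU. */
struct result run_sequence(volatile int *a, int v, int w, int y)
{
	struct result r;

	assert(a);
	r.u = *a;	/* U = *A */
	*a = v;		/* *A = V */
	*a = w;		/* *A = W */
	r.x = *a;	/* X = *A */
	*a = y;		/* *A = Y */
	r.z = *a;	/* Z = *A */
	r.final = *a;	/* observe what was left behind */
	return r;
}
```

Whatever the CPU does internally, the executing thread observes U equal to the
original value, X == W, Z == Y and a final value of Y, provided nothing else
writes to the location in the meantime.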

The compiler may also combine, discard or defer elements of the sequence before
the CPU even sees them.

For instance:

	*A = V;
	*A = W;

may be reduced to:

	*A = W;

since, without a write barrier, it can be assumed that the effect of the
store of V to *A is lost.  Similarly:

	*A = Y;
	Z = *A;

may, without a memory barrier, be reduced to:

	*A = Y;
	Z = Y;

and the LOAD operation may never appear outside of the CPU.
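The compiler side of this can be illustrated in plain C.  The sketch below
contrasts ordinary accesses, which the compiler is free to reduce as described
above, with accesses made through a volatile-qualified pointer, which
compilers such as GCC emit one for one and in program order.  The function
names are hypothetical.  Note that volatile constrains only the compiler: it
does not stop the CPU from reordering the accesses, and it is not a memory
barrier.

```c
#include <assert.h>

/* Plain accesses: the compiler may merge the stores and skip the load. */
int plain_sequence(int *a, int v, int w, int y)
{
	*a = v;		/* may be discarded, as it is overwritten at once */
	*a = w;		/* may be discarded for the same reason */
	*a = y;
	return *a;	/* may be reduced to "return y" with no load */
}

/* Volatile accesses: every store and the load are emitted, in order. */
int volatile_sequence(int *a, int v, int w, int y)
{
	volatile int *va = a;

	assert(a);
	*va = v;
	*va = w;
	*va = y;
	return *va;
}
```

Both functions return the same value and leave the same value in memory.  The
difference is only visible to an outside observer of the memory accesses, such
as a device or another CPU.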


AND THEN THERE'S THE ALPHA
--------------------------

The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
some versions of the Alpha CPU have a split data cache, permitting them to have
two semantically-related cache lines updated at separate times.  This is where
the data dependency barrier really becomes necessary as this synchronises both
caches with the memory coherence system, thus making it seem like pointer
changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory barrier model.

See the subsection on "Cache Coherency" above.


============
EXAMPLE USES
============

CIRCULAR BUFFERS
----------------

Memory barriers can be used to implement circular buffering without the need
of a lock to serialise the producer with the consumer.  See:

	Documentation/circular-buffers.txt

for details.
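As a rough illustration of the idea, here is a minimal single-producer,
single-consumer ring written with C11 atomics in userspace.  It is a sketch
under stated assumptions, not the scheme described in the document referenced
above: the names struct ring, ring_push() and ring_pop() are hypothetical, the
indices are free-running counters, and release/acquire operations take the
place of the kernel's barrier primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical capacity; must be a power of two */

struct ring {
	int slots[RING_SIZE];
	atomic_size_t head;	/* next slot to fill; written by the producer only */
	atomic_size_t tail;	/* next slot to drain; written by the consumer only */
};

/* Producer: returns false if the ring is full. */
bool ring_push(struct ring *r, int item)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;

	r->slots[head & (RING_SIZE - 1)] = item;
	/* Release: the slot contents become visible before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer: returns false if the ring is empty. */
bool ring_pop(struct ring *r, int *item)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	assert(item);
	if (head == tail)
		return false;

	*item = r->slots[tail & (RING_SIZE - 1)];
	/* Release: the slot has been read before it is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

Each index has exactly one writer, so no lock is needed.  The release store to
head pairs with the acquire load of head in the consumer, and the release
store to tail pairs with the acquire load of tail in the producer.  A ring
must be zero-initialised before first use.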


==========
REFERENCES
==========

Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
Digital Press)
	Chapter 5.2: Physical Address Space Characteristics
	Chapter 5.4: Caches and Write Buffers
	Chapter 5.5: Data Sharing
	Chapter 5.6: Read/Write Ordering

AMD64 Architecture Programmer's Manual Volume 2: System Programming
	Chapter 7.1: Memory-Access Ordering
	Chapter 7.4: Buffering and Combining Memory Writes

IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
	Chapter 7.1: Locked Atomic Operations
	Chapter 7.2: Memory Ordering
	Chapter 7.4: Serializing Instructions

The SPARC Architecture Manual, Version 9
	Chapter 8: Memory Models
	Appendix D: Formal Specification of the Memory Models
	Appendix J: Programming with the Memory Models

UltraSPARC Programmer Reference Manual
	Chapter 5: Memory Accesses and Cacheability
	Chapter 15: Sparc-V9 Memory Models

UltraSPARC III Cu User's Manual
	Chapter 9: Memory Models

UltraSPARC IIIi Processor User's Manual
	Chapter 8: Memory Models

UltraSPARC Architecture 2005
	Chapter 9: Memory
	Appendix D: Formal Specifications of the Memory Models

UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
	Chapter 8: Memory Models
	Appendix F: Caches and Cache Coherency

Solaris Internals, Core Kernel Architecture, p63-68:
	Chapter 3.3: Hardware Considerations for Locks and Synchronization

Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
for Kernel Programmers:
	Chapter 13: Other Memory Models

Intel Itanium Architecture Software Developer's Manual: Volume 1:
	Section 2.6: Speculation
	Section 4.4: Memory Access