 When do you need to notify inside page table lock ?
 ===================================================

 When clearing a pte/pmd we can choose whether to notify the event under the
 page table lock, by using the notify version of \*_clear_flush, which calls
 mmu_notifier_invalidate_range(). That notification is not necessary in all
 cases.

 For a secondary TLB (non-CPU TLB) such as an IOMMU TLB or a device TLB (when
 the device uses something like ATS/PASID to get the IOMMU to walk the CPU page
 table and access a process virtual address space), there are only two cases
 in which you need to notify the secondary TLB while holding the page table
 lock when clearing a pte/pmd:

   A) the page backing the address is freed before
      mmu_notifier_invalidate_range_end()
   B) a page table entry is updated to point to a new page (COW, write fault
      on zero page, __replace_page(), ...)

 Case A is obvious: you do not want to risk the device writing to a page that
 might now be used by a completely different task.

 Case B is more subtle. For correctness it requires the following sequence to
 happen:

   - take page table lock
   - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
   - set page table entry to point to new page

 If clearing the page table entry is not followed by a notification before the
 new pte/pmd value is set, then you can break a memory model such as C11 or
 C++11 for the device.
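
 The three-step sequence for case B can be sketched in userspace C. This is a
 toy model under stated assumptions, not kernel code: ``struct toy_mm``,
 ``toy_dev_read()``, ``toy_secondary_tlb_invalidate()`` and
 ``toy_replace_page()`` are hypothetical names, the page table lock is reduced
 to a flag, and the secondary TLB is a single cached pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (assumption): one page table entry, one secondary TLB entry. */
struct toy_mm {
    bool ptl_held;       /* stands in for the page table lock */
    const int *pte;      /* page currently mapped by the CPU page table */
    const int *dev_tlb;  /* translation cached by the device, NULL if none */
};

/* Device access: use the cached translation, else walk the page table. */
static int toy_dev_read(struct toy_mm *mm)
{
    if (!mm->dev_tlb)
        mm->dev_tlb = mm->pte;
    assert(mm->dev_tlb != NULL);
    return *mm->dev_tlb;
}

/* Hypothetical stand-in for the notify half of *_clear_flush_notify(). */
static void toy_secondary_tlb_invalidate(struct toy_mm *mm)
{
    assert(mm->ptl_held);  /* the notification happens under the lock */
    mm->dev_tlb = NULL;
}

/* Case B: replace the mapped page, optionally notifying under the lock. */
static void toy_replace_page(struct toy_mm *mm, const int *new_page,
                             bool notify)
{
    mm->ptl_held = true;                   /* take page table lock */
    mm->pte = NULL;                        /* clear page table entry */
    if (notify)
        toy_secondary_tlb_invalidate(mm);  /* ... and notify */
    mm->pte = new_page;                    /* point entry at the new page */
    mm->ptl_held = false;                  /* release page table lock */
}
```

 In this model, a device that cached the old translation keeps reading the old
 page after ``toy_replace_page(..., false)``, and only sees the new page once
 the replacement is done with ``notify`` set to ``true``.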

 Consider the following scenario (the device uses a feature similar to
 ATS/PASID):

 Take two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE,
 and assume both are write protected for COW (the other cases of B apply too).

 ::

  [Time N] --------------------------------------------------------------------
  CPU-thread-0  {try to write to addrA}
  CPU-thread-1  {try to write to addrB}
  CPU-thread-2  {}
  CPU-thread-3  {}
  DEV-thread-0  {read addrA and populate device TLB}
  DEV-thread-2  {read addrB and populate device TLB}
  [Time N+1] ------------------------------------------------------------------
  CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
  CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
  CPU-thread-2  {}
  CPU-thread-3  {}
  DEV-thread-0  {}
  DEV-thread-2  {}
  [Time N+2] ------------------------------------------------------------------
  CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
  CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
  CPU-thread-2  {}
  CPU-thread-3  {}
  DEV-thread-0  {}
  DEV-thread-2  {}
  [Time N+3] ------------------------------------------------------------------
  CPU-thread-0  {preempted}
  CPU-thread-1  {preempted}
  CPU-thread-2  {write to addrA which is a write to new page}
  CPU-thread-3  {}
  DEV-thread-0  {}
  DEV-thread-2  {}
  [Time N+4] ------------------------------------------------------------------
  CPU-thread-0  {preempted}
  CPU-thread-1  {preempted}
  CPU-thread-2  {}
  CPU-thread-3  {write to addrB which is a write to new page}
  DEV-thread-0  {}
  DEV-thread-2  {}
  [Time N+5] ------------------------------------------------------------------
  CPU-thread-0  {preempted}
  CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
  CPU-thread-2  {}
  CPU-thread-3  {}
  DEV-thread-0  {}
  DEV-thread-2  {}
  [Time N+6] ------------------------------------------------------------------
  CPU-thread-0  {preempted}
  CPU-thread-1  {}
  CPU-thread-2  {}
  CPU-thread-3  {}
  DEV-thread-0  {read addrA from old page}
  DEV-thread-2  {read addrB from new page}

 So here, because at time N+2 the clearing of the page table entry was not
 paired with a notification to invalidate the secondary TLB, the device sees
 the new value for addrB before seeing the new value for addrA. This breaks
 total memory ordering for the device.
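
 The scenario can be replayed with a small userspace model. This is only a
 sketch under stated assumptions: ``struct toy_mm``, ``toy_dev_read()``,
 ``toy_cow_update()``, ``toy_range_end()`` and ``toy_replay()`` are
 hypothetical names, each address has a single page table entry and a single
 device TLB entry, and the interleaving of threads is written out sequentially.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ADDR_A, ADDR_B, NR_ADDRS };

/* Toy model (assumption): one PTE and one device TLB entry per address. */
struct toy_mm {
    int *pte[NR_ADDRS];
    int *dev_tlb[NR_ADDRS];  /* NULL means no cached translation */
};

/* Device access: use the cached translation, else walk the page table. */
static int toy_dev_read(struct toy_mm *mm, int addr)
{
    if (!mm->dev_tlb[addr])
        mm->dev_tlb[addr] = mm->pte[addr];
    assert(mm->dev_tlb[addr] != NULL);
    return *mm->dev_tlb[addr];
}

/* COW step: copy the old page, then repoint the PTE (under the lock). */
static void toy_cow_update(struct toy_mm *mm, int addr, int *new_page,
                           bool notify)
{
    *new_page = *mm->pte[addr];    /* copy old contents */
    mm->pte[addr] = NULL;          /* clear page table entry */
    if (notify)
        mm->dev_tlb[addr] = NULL;  /* notify while the entry is clear */
    mm->pte[addr] = new_page;      /* point entry at the new page */
}

/* Stand-in for mmu_notifier_invalidate_range_end() on one address. */
static void toy_range_end(struct toy_mm *mm, int addr)
{
    mm->dev_tlb[addr] = NULL;
}

/* Replays the timeline and reports what the device reads at the end. */
static void toy_replay(bool notify, int *seen_a, int *seen_b)
{
    int old_a = 0, old_b = 0, new_a, new_b;
    struct toy_mm mm = { { &old_a, &old_b }, { NULL, NULL } };

    (void)toy_dev_read(&mm, ADDR_A);  /* Time N: populate device TLB */
    (void)toy_dev_read(&mm, ADDR_B);
    /* Time N+1: range_start carries no state in this model. */
    toy_cow_update(&mm, ADDR_A, &new_a, notify);  /* Time N+2 */
    toy_cow_update(&mm, ADDR_B, &new_b, notify);
    *mm.pte[ADDR_A] = 1;         /* CPU-thread-2 writes addrA (new page) */
    *mm.pte[ADDR_B] = 1;         /* CPU-thread-3 writes addrB (new page) */
    toy_range_end(&mm, ADDR_B);  /* only CPU-thread-1 reaches range_end */
    *seen_a = toy_dev_read(&mm, ADDR_A);
    *seen_b = toy_dev_read(&mm, ADDR_B);
}
```

 With ``notify`` set to ``false`` the model reproduces the ordering violation:
 the device reads the old value of addrA and the new value of addrB. With
 ``notify`` set to ``true`` both reads return the new values, even though
 CPU-thread-0 never reaches its range_end call.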

 When changing a pte to write protect it, or to point it to a new write
 protected page with the same content (KSM), it is fine to delay the
 mmu_notifier_invalidate_range() call to mmu_notifier_invalidate_range_end(),
 outside the page table lock. This is true even if the thread doing the page
 table update is preempted right after releasing the page table lock but before
 calling mmu_notifier_invalidate_range_end().