.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows (see the sketch after
this list for the resulting nesting):

- cpus_read_lock() is taken outside kvm_lock

- kvm_usage_lock is taken outside cpus_read_lock()

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side when modifying memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.

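For example, code that needs both a VM-wide lock and a vCPU's mutex must
nest them per the order above.  This is a minimal illustrative sketch of
the "kvm->lock is taken outside vcpu->mutex" rule, not a real KVM code
path::

    mutex_lock(&kvm->lock);           /* VM-scoped lock first */
    mutex_lock(&vcpu->mutex);         /* then the vCPU's mutex */

    /* ... operate on the vCPU while both locks are held ... */

    mutex_unlock(&vcpu->mutex);       /* release in reverse order */
    mutex_unlock(&kvm->lock);
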
cpus_read_lock() vs kvm_lock:

- Taking cpus_read_lock() outside of kvm_lock is problematic, despite that
  being the official ordering, as it is quite easy to unknowingly take
  cpus_read_lock() while holding kvm_lock.  Use caution when walking vm_list,
  e.g. avoid complex operations when possible.

For SRCU:

- ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
  for kvm->lock, vcpu->mutex and kvm->slots_lock.  These locks _cannot_
  be taken inside a kvm->srcu read-side critical section; that is, the
  following is broken::

      srcu_read_lock(&kvm->srcu);
      mutex_lock(&kvm->slots_lock);

- kvm->slots_arch_lock instead is released before the call to
  ``synchronize_srcu()``.  It _can_ therefore be taken inside a
  kvm->srcu read-side critical section, for example while processing
  a vmexit.

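The allowed nesting is the inverse: take the mutex first, then enter the
kvm->srcu read side.  A minimal illustrative sketch (not a real call
site)::

    int idx;

    mutex_lock(&kvm->slots_lock);
    idx = srcu_read_lock(&kvm->srcu);

    /* ... access the memslots ... */

    srcu_read_unlock(&kvm->srcu, idx);
    mutex_unlock(&kvm->slots_lock);
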
On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

- kvm->arch.mmu_lock is an rwlock; critical sections for
  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
  also take kvm->arch.mmu_lock

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking. That means we need to restore the saved R/X bits. This is
   described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect. That means we just need to change the W bit of the spte.

What we use to avoid all the races is the Host-writable bit and MMU-writable
bit on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and
  in its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the
saved R/X bits for an access-tracked spte, or both. This is safe because
any concurrent change to these bits is detected by the cmpxchg.

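Sketched in code, the update looks roughly as follows; the names are
illustrative, and the real logic lives in fast_page_fault() in
arch/x86/kvm/mmu/mmu.c::

    u64 old_spte = READ_ONCE(*sptep);
    u64 new_spte = old_spte | PT_WRITABLE_MASK;     /* set the W bit */

    /*
     * If any bit changed under us, the cmpxchg fails and the fault
     * falls back to the slow path under mmu_lock.
     */
    if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
        return false;                               /* retry/slow path */
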
But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg. This is an ABA problem; for example, the
following case will happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|       gpte = gfn1                                                       |
|       gfn1 is mapped to pfn1 on host                                    |
|       spte is the shadow page table entry corresponding with gpte and   |
|       spte = pfn1                                                       |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W)                               |
|       mark_page_dirty(vcpu->kvm, gfn1)                                  |
|            OOPS!!!                                                      |
+------------------------------------------------------------------------+

We dirty-log for gfn1; that means gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
gfn_to_pfn_memslot_atomic, before the cmpxchg.  After the pinning:

- We have held the refcount of pfn; that means the pfn can not be freed and
  be reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different
  gfns by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

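A sketch of that pinning approach, assuming the gfn_to_pfn_memslot_atomic()
helper named above (the actual spte update is elided)::

    kvm_pfn_t pfn = gfn_to_pfn_memslot_atomic(slot, gfn);

    if (!is_error_pfn(pfn)) {
        /* gfn->pfn can no longer change: cmpxchg the spte safely. */

        kvm_release_pfn_clean(pfn);   /* drop the reference afterwards */
    }
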
2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But it is not true after fast page fault since the spte can be marked
writable between reading the spte and updating the spte. Like the case
below:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|       spte.W = 0                                                        |
|       spte.Accessed = 1                                                 |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|       old_spte.W == 0)             |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
|  ::                                |                                   |
|                                    |                                   |
|   else                             |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|   if (old_spte.Accessed == 1)      |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|   if (old_spte.Dirty == 1)         |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
the spte is always atomically updated in this case.

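In code, the rule looks roughly like mmu_spte_clear_track_bits(); this is
a simplified sketch, not the exact kernel source::

    if (!spte_has_volatile_bits(old_spte))
        /* No bit can change under us: a plain write is enough. */
        __update_clear_spte_fast(sptep, 0ull);
    else
        /* Accessed/Dirty may be set concurrently: use an atomic xchg. */
        old_spte = __update_clear_spte_slow(sptep, 0ull);
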
3) Flush TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since it
is a common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, and the race caused by fast page fault can be
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

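As a sketch, the audit boils down to "did the W bit go away while a
writable translation may still be cached?".  is_writable_pte() is the
kernel's real predicate; the wrapper below is illustrative::

    static bool flush_needed_on_update(u64 old_spte, u64 new_spte)
    {
        /* A writable -> read-only transition requires flushing TLBs. */
        return is_writable_pte(old_spte) && !is_writable_pte(new_spte);
    }
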
Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in
more unused/ignored bits. When the VM tries to access the page later on, a
fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking, and during restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.

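A sketch of the marking step, using the EPT bit positions (R = bit 0,
W = bit 1, X = bit 2) and an illustrative shift for the ignored-bit area;
the real masks live in arch/x86/kvm/mmu/spte.h::

    #define SPTE_RWX_MASK     0x7ull
    #define SPTE_SAVED_RX     0x5ull   /* save R and X, never W */
    #define SPTE_SAVED_SHIFT  54       /* an ignored-bit region */

    static u64 mark_spte_for_access_track(u64 spte)
    {
        u64 saved = (spte & SPTE_SAVED_RX) << SPTE_SAVED_SHIFT;

        /* Clearing RWX makes the next guest access fault. */
        return (spte & ~SPTE_RWX_MASK) | saved;
    }
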
3. Reference
------------

``kvm_lock``
^^^^^^^^^^^^

:Type:          mutex
:Arch:          any
:Protects:      - vm_list

``kvm_usage_lock``
^^^^^^^^^^^^^^^^^^

:Type:          mutex
:Arch:          any
:Protects:      - kvm_usage_count
                - hardware virtualization enable/disable
:Comment:       Exists to allow taking cpus_read_lock() while kvm_usage_count
                is protected, which simplifies the virtualization enabling
                logic.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:          spinlock_t
:Arch:          any
:Protects:      mn_active_invalidate_count, mn_memslots_update_rcuwait

``kvm_arch::tsc_write_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:          raw_spinlock_t
:Arch:          x86
:Protects:      - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
                - tsc offset in vmcb
:Comment:       'raw' because updating the tsc offsets must not be preempted.

``kvm->mmu_lock``
^^^^^^^^^^^^^^^^^
:Type:          spinlock_t or rwlock_t
:Arch:          any
:Protects:      - shadow page/shadow tlb entry
:Comment:       it is a spinlock since it is used in the mmu notifier.

``kvm->srcu``
^^^^^^^^^^^^^
:Type:          srcu lock
:Arch:          any
:Protects:      - kvm->memslots
                - kvm->buses
:Comment:       The srcu read lock must be held while accessing memslots (e.g.
                when using gfn_to_* functions) and while accessing the
                in-kernel MMIO/PIO address->device structure mapping
                (kvm->buses).  The srcu index can be stored in
                kvm_vcpu->srcu_idx per vcpu if it is needed by multiple
                functions.

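For example, a reader that resolves a gfn wraps the lookup in the
read-side section; this sketch stores the index in a local variable
(kvm_vcpu->srcu_idx is the alternative mentioned above)::

    int idx = srcu_read_lock(&kvm->srcu);

    /* The memslots cannot be freed while inside the read-side section. */
    struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

    srcu_read_unlock(&kvm->srcu, idx);
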
``kvm->slots_arch_lock``
^^^^^^^^^^^^^^^^^^^^^^^^
:Type:          mutex
:Arch:          any (only needed on x86 though)
:Protects:      any arch-specific fields of memslots that have to be modified
                in a ``kvm->srcu`` read-side critical section.
:Comment:       must be held before reading the pointer to the current
                memslots, until after all changes to the memslots are complete

``wakeup_vcpus_on_cpu_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type:          spinlock_t
:Arch:          x86
:Protects:      wakeup_vcpus_on_cpu
:Comment:       This is a per-CPU lock and it is used for VT-d
                posted-interrupts.  When VT-d posted-interrupts are supported
                and the VM has assigned devices, we put the blocked vCPU on
                the list blocked_vcpu_on_cpu protected by
                blocked_vcpu_on_cpu_lock.  When VT-d hardware issues a wakeup
                notification event because external interrupts from the
                assigned devices arrive, we find the vCPU on the list and
                wake it up.

``vendor_module_lock``
^^^^^^^^^^^^^^^^^^^^^^
:Type:          mutex
:Arch:          x86
:Protects:      loading a vendor module (kvm_amd or kvm_intel)
:Comment:       Exists because using kvm_lock leads to deadlock.  kvm_lock is
    taken in notifiers, e.g. __kvmclock_cpufreq_notifier(), that may run while
    cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
    operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
    updating static calls.
