~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/trace/ring-buffer-design.rst

Version: ~ [ linux-6.11.5 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.58 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.114 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.169 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.228 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.284 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.322 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.9 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/trace/ring-buffer-design.rst (Version linux-6.11.5) and /Documentation/trace/ring-buffer-design.rst (Version unix-v6-master)


  1 .. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.    
  2                                                   
  3 ===========================                       
  4 Lockless Ring Buffer Design                       
  5 ===========================                       
  6                                                   
  7 Copyright 2009 Red Hat Inc.                       
  8                                                   
  9 :Author:   Steven Rostedt <srostedt@redhat.com>    
 10 :License:  The GNU Free Documentation License,    
 11            (dual licensed under the GPL v2)       
 12 :Reviewers:  Mathieu Desnoyers, Huang Ying, Hi    
 13              and Frederic Weisbecker.             
 14                                                   
 15                                                   
 16 Written for: 2.6.31                               
 17                                                   
 18 Terminology used in this Document                 
 19 ---------------------------------                 
 20                                                   
 21 tail                                              
 22         - where new writes happen in the ring     
 23                                                   
 24 head                                              
 25         - where new reads happen in the ring b    
 26                                                   
 27 producer                                          
 28         - the task that writes into the ring b    
 29                                                   
 30 writer                                            
 31         - same as producer                        
 32                                                   
 33 consumer                                          
 34         - the task that reads from the buffer     
 35                                                   
 36 reader                                            
 37         - same as consumer.                       
 38                                                   
 39 reader_page                                       
 40         - A page outside the ring buffer used     
 41           by the reader.                          
 42                                                   
 43 head_page                                         
 44         - a pointer to the page that the reade    
 45                                                   
 46 tail_page                                         
 47         - a pointer to the page that will be w    
 48                                                   
 49 commit_page                                       
 50         - a pointer to the page with the last     
 51                                                   
 52 cmpxchg                                           
 53         - hardware-assisted atomic transaction    
 54                                                   
 55             A = B if previous A == C              
 56                                                   
 57             R = cmpxchg(A, C, B) is saying tha    
 58                 if current A is equal to C, an    
 59                 A into R                          
 60                                                   
 61             R gets the previous A regardless i    
 62                                                   
 63           To see if the update was successful     
 64           may be used.                            
 65                                                   
 66 The Generic Ring Buffer                           
 67 -----------------------                           
 68                                                   
 69 The ring buffer can be used in either an overw    
 70 producer/consumer mode.                           
 71                                                   
 72 Producer/consumer mode is where if the produce    
 73 buffer before the consumer could free up anyth    
 74 will stop writing to the buffer. This will los    
 75                                                   
 76 Overwrite mode is where if the producer were t    
 77 before the consumer could free up anything, th    
 78 overwrite the older data. This will lose the o    
 79                                                   
 80 No two writers can write at the same time (on     
 81 but a writer may interrupt another writer, but    
 82 before the previous writer may continue. This     
 83 algorithm. The writers act like a "stack". The    
 84 enforces this behavior::                          
 85                                                   
 86                                                   
 87   writer1 start                                   
 88      <preempted> writer2 start                    
 89          <preempted> writer3 start                
 90                      writer3 finishes             
 91                  writer2 finishes                 
 92   writer1 finishes                                
 93                                                   
 94 This is very much like a writer being preempte    
 95 the interrupt doing a write as well.              
 96                                                   
 97 Readers can happen at any time. But no two rea    
 98 same time, nor can a reader preempt/interrupt     
 99 cannot preempt/interrupt a writer, but it may     
100 buffer at the same time as a writer is writing    
101 on another processor to do so. A reader may re    
102 and can be preempted by a writer.                 
103                                                   
104 A writer can preempt a reader, but a reader ca    
105 But a reader can read the buffer at the same t    
106 as a writer.                                      
107                                                   
108 The ring buffer is made up of a list of pages     
109                                                   
110 At initialization a reader page is allocated f    
111 part of the ring buffer.                          
112                                                   
113 The head_page, tail_page and commit_page are a    
114 to the same page.                                 
115                                                   
116 The reader page is initialized to have its nex    
117 the head page, and its previous pointer pointi    
118 the head page.                                    
119                                                   
120 The reader has its own page to use. At start u    
121 allocated but is not attached to the list. Whe    
122 to read from the buffer, if its page is empty     
123 it will swap its page with the head_page. The     
124 become part of the ring buffer and the head_pa    
125 The page after the inserted page (old reader_p    
126 new head page.                                    
127                                                   
128 Once the new page is given to the reader, the     
129 it wants with it, as long as a writer has left    
130                                                   
131 A sample of how the reader page is swapped: No    
132 show the head page in the buffer, it is for de    
133 only.                                             
134                                                   
135 ::                                                
136                                                   
137   +------+                                        
138   |reader|          RING BUFFER                   
139   |page  |                                        
140   +------+                                        
141                   +---+   +---+   +---+           
142                   |   |-->|   |-->|   |           
143                   |   |<--|   |<--|   |           
144                   +---+   +---+   +---+           
145                    ^ |             ^ |            
146                    | +-------------+ |            
147                    +-----------------+            
148                                                   
149                                                   
150   +------+                                        
151   |reader|          RING BUFFER                   
152   |page  |-------------------+                    
153   +------+                   v                    
154     |             +---+   +---+   +---+           
155     |             |   |-->|   |-->|   |           
156     |             |   |<--|   |<--|   |<-+        
157     |             +---+   +---+   +---+  |        
158     |              ^ |             ^ |   |        
159     |              | +-------------+ |   |        
160     |              +-----------------+   |        
161     +------------------------------------+        
162                                                   
163   +------+                                        
164   |reader|          RING BUFFER                   
165   |page  |-------------------+                    
166   +------+ <---------------+ v                    
167     |  ^          +---+   +---+   +---+           
168     |  |          |   |-->|   |-->|   |           
169     |  |          |   |   |   |<--|   |<-+        
170     |  |          +---+   +---+   +---+  |        
171     |  |             |             ^ |   |        
172     |  |             +-------------+ |   |        
173     |  +-----------------------------+   |        
174     +------------------------------------+        
175                                                   
176   +------+                                        
177   |buffer|          RING BUFFER                   
178   |page  |-------------------+                    
179   +------+ <---------------+ v                    
180     |  ^          +---+   +---+   +---+           
181     |  |          |   |   |   |-->|   |           
182     |  |  New     |   |   |   |<--|   |<-+        
183     |  | Reader   +---+   +---+   +---+  |        
184     |  |  page ----^                 |   |        
185     |  |                             |   |        
186     |  +-----------------------------+   |        
187     +------------------------------------+        
188                                                   
189                                                   
190                                                   
191 It is possible that the page swapped is the co    
192 if what is in the ring buffer is less than wha    
193                                                   
194 ::                                                
195                                                   
196             reader page    commit page   tail     
197                 |              |             |    
198                 v              |             |    
199                +---+           |             |    
200                |   |<----------+             |    
201                |   |<------------------------+    
202                |   |------+                       
203                +---+      |                       
204                           |                       
205                           v                       
206       +---+    +---+    +---+    +---+            
207   <---|   |--->|   |--->|   |--->|   |--->        
208   --->|   |<---|   |<---|   |<---|   |<---        
209       +---+    +---+    +---+    +---+            
210                                                   
211 This case is still valid for this algorithm.      
212 When the writer leaves the page, it simply goe    
213 since the reader page still points to the next    
214 buffer.                                           
215                                                   
216                                                   
217 The main pointers:                                
218                                                   
219   reader page                                     
220             - The page used solely by the read    
221               of the ring buffer (may be swapp    
222                                                   
223   head page                                       
224             - the next page in the ring buffer    
225               with the reader page.               
226                                                   
227   tail page                                       
228             - the page where the next write wi    
229                                                   
230   commit page                                     
231             - the page that last finished a wr    
232                                                   
233 The commit page only is updated by the outermo    
234 writer stack. A writer that preempts another w    
235 commit page.                                      
236                                                   
237 When data is written into the ring buffer, a p    
238 in the ring buffer and passed back to the writ    
239 is finished writing data into that position, i    
240                                                   
241 Another write (or a read) may take place at an    
242 transaction. If another write happens it must     
243 with the previous write.                          
244                                                   
245                                                   
246    Write reserve::                                
247                                                   
248        Buffer page                                
249       +---------+                                 
250       |written  |                                 
251       +---------+  <--- given back to writer (    
252       |reserved |                                 
253       +---------+ <--- tail pointer               
254       | empty   |                                 
255       +---------+                                 
256                                                   
257    Write commit::                                 
258                                                   
259        Buffer page                                
260       +---------+                                 
261       |written  |                                 
262       +---------+                                 
263       |written  |                                 
264       +---------+  <--- next position for writ    
265       | empty   |                                 
266       +---------+                                 
267                                                   
268                                                   
269  If a write happens after the first reserve::     
270                                                   
271        Buffer page                                
272       +---------+                                 
273       |written  |                                 
274       +---------+  <-- current commit             
275       |reserved |                                 
276       +---------+  <--- given back to second w    
277       |reserved |                                 
278       +---------+ <--- tail pointer               
279                                                   
280   After second writer commits::                   
281                                                   
282                                                   
283        Buffer page                                
284       +---------+                                 
285       |written  |                                 
286       +---------+  <--(last full commit)          
287       |reserved |                                 
288       +---------+                                 
289       |pending  |                                 
290       |commit   |                                 
291       +---------+ <--- tail pointer               
292                                                   
293   When the first writer commits::                 
294                                                   
295        Buffer page                                
296       +---------+                                 
297       |written  |                                 
298       +---------+                                 
299       |written  |                                 
300       +---------+                                 
301       |written  |                                 
302       +---------+  <--(last full commit and ta    
303                                                   
304                                                   
305 The commit pointer points to the last write lo    
306 committed without preempting another write. Wh    
307 preempted another write is committed, it only     
308 and will not be a full commit until all writes    
309                                                   
310 The commit page points to the page that has th    
311 The tail page points to the page with the last    
312 committing).                                      
313                                                   
314 The tail page is always equal to or after the     
315 be several pages ahead. If the tail page catch    
316 page then no more writes may take place (regar    
317 of the ring buffer: overwrite and produce/cons    
318                                                   
319 The order of pages is::                           
320                                                   
321  head page                                        
322  commit page                                      
323  tail page                                        
324                                                   
325 Possible scenario::                               
326                                                   
327                                tail page          
328     head page         commit page  |              
329         |                 |        |              
330         v                 v        v              
331       +---+    +---+    +---+    +---+            
332   <---|   |--->|   |--->|   |--->|   |--->        
333   --->|   |<---|   |<---|   |<---|   |<---        
334       +---+    +---+    +---+    +---+            
335                                                   
336 There is a special case that the head page is     
337 and possibly the tail page. That is when the c    
338 swapped with the reader page. This is because     
339 part of the ring buffer, but the reader page i    
340 has been less than a full page that has been c    
341 and a reader swaps out a page, it will be swap    
342                                                   
343 ::                                                
344                                                   
345             reader page    commit page   tail     
346                 |              |             |    
347                 v              |             |    
348                +---+           |             |    
349                |   |<----------+             |    
350                |   |<------------------------+    
351                |   |------+                       
352                +---+      |                       
353                           |                       
354                           v                       
355       +---+    +---+    +---+    +---+            
356   <---|   |--->|   |--->|   |--->|   |--->        
357   --->|   |<---|   |<---|   |<---|   |<---        
358       +---+    +---+    +---+    +---+            
359                           ^                       
360                           |                       
361                       head page                   
362                                                   
363                                                   
364 In this case, the head page will not move when    
365 move back into the ring buffer.                   
366                                                   
367 The reader cannot swap a page into the ring bu    
368 is still on that page. If the read meets the l    
369 not pending or reserved), then there is nothin    
370 The buffer is considered empty until another f    
371                                                   
372 When the tail meets the head page, if the buff    
373 the head page will be pushed ahead one. If the    
374 mode, the write will fail.                        
375                                                   
376 Overwrite mode::                                  
377                                                   
378               tail page                           
379                  |                                
380                  v                                
381       +---+    +---+    +---+    +---+            
382   <---|   |--->|   |--->|   |--->|   |--->        
383   --->|   |<---|   |<---|   |<---|   |<---        
384       +---+    +---+    +---+    +---+            
385                           ^                       
386                           |                       
387                       head page                   
388                                                   
389                                                   
390               tail page                           
391                  |                                
392                  v                                
393       +---+    +---+    +---+    +---+            
394   <---|   |--->|   |--->|   |--->|   |--->        
395   --->|   |<---|   |<---|   |<---|   |<---        
396       +---+    +---+    +---+    +---+            
397                                    ^              
398                                    |              
399                                head page          
400                                                   
401                                                   
402                       tail page                   
403                           |                       
404                           v                       
405       +---+    +---+    +---+    +---+            
406   <---|   |--->|   |--->|   |--->|   |--->        
407   --->|   |<---|   |<---|   |<---|   |<---        
408       +---+    +---+    +---+    +---+            
409                                    ^              
410                                    |              
411                                head page          
412                                                   
413 Note, the reader page will still point to the     
414 But when a swap takes place, it will use the m    
415                                                   
416                                                   
417 Making the Ring Buffer Lockless:                  
418 --------------------------------                  
419                                                   
420 The main idea behind the lockless algorithm is    
421 of the head_page pointer with the swapping of     
422 State flags are placed inside the pointer to t    
423 each page must be aligned in memory by 4 bytes    
424 least significant bits of the address to be us    
425 they will always be zero for the address. To g    
426 simply mask out the flags::                       
427                                                   
428   MASK = ~3                                       
429                                                   
430   address & MASK                                  
431                                                   
432 Two flags will be kept by these two bits:         
433                                                   
434    HEADER                                         
435         - the page being pointed to is a head     
436                                                   
437    UPDATE                                         
438         - the page being pointed to is being u    
439           and was or is about to be a head pag    
440                                                   
441 ::                                                
442                                                   
443               reader page                         
444                   |                               
445                   v                               
446                 +---+                             
447                 |   |------+                      
448                 +---+      |                      
449                             |                     
450                             v                     
451         +---+    +---+    +---+    +---+          
452     <---|   |--->|   |-H->|   |--->|   |--->      
453     --->|   |<---|   |<---|   |<---|   |<---      
454         +---+    +---+    +---+    +---+          
455                                                   
456                                                   
457 The above pointer "-H->" would have the HEADER    
458 the next page is the next page to be swapped o    
459 This pointer means the next page is the head p    
460                                                   
461 When the tail page meets the head pointer, it     
462 change the pointer to the UPDATE state::          
463                                                   
464                                                   
465               tail page                           
466                  |                                
467                  v                                
468       +---+    +---+    +---+    +---+            
469   <---|   |--->|   |-H->|   |--->|   |--->        
470   --->|   |<---|   |<---|   |<---|   |<---        
471       +---+    +---+    +---+    +---+            
472                                                   
473               tail page                           
474                  |                                
475                  v                                
476       +---+    +---+    +---+    +---+            
477   <---|   |--->|   |-U->|   |--->|   |--->        
478   --->|   |<---|   |<---|   |<---|   |<---        
479       +---+    +---+    +---+    +---+            
480                                                   
481 "-U->" represents a pointer in the UPDATE stat    
482                                                   
483 Any access to the reader will need to take som    
484 the readers. But the writers will never take a    
485 ring buffer. This means we only need to worry     
486 and writes only preempt in "stack" formation.     
487                                                   
488 When the reader tries to swap the page with th    
489 will also use cmpxchg. If the flag bit in the     
490 head page does not have the HEADER flag set, t    
491 and the reader will need to look for the new h    
492 Note, the flags UPDATE and HEADER are never se    
493                                                   
494 The reader swaps the reader page as follows::     
495                                                   
496   +------+                                        
497   |reader|          RING BUFFER                   
498   |page  |                                        
499   +------+                                        
500                   +---+    +---+    +---+         
501                   |   |--->|   |--->|   |         
502                   |   |<---|   |<---|   |         
503                   +---+    +---+    +---+         
504                    ^ |               ^ |          
505                    | +---------------+ |          
506                    +-----H-------------+          
507                                                   
508 The reader sets the reader page next pointer a    
509 the head page::                                   
510                                                   
511                                                   
512   +------+                                        
513   |reader|          RING BUFFER                   
514   |page  |-------H-----------+                    
515   +------+                   v                    
516     |             +---+    +---+    +---+         
517     |             |   |--->|   |--->|   |         
518     |             |   |<---|   |<---|   |<-+      
519     |             +---+    +---+    +---+  |      
520     |              ^ |               ^ |   |      
521     |              | +---------------+ |   |      
522     |              +-----H-------------+   |      
523     +--------------------------------------+      
524                                                   
525 It does a cmpxchg with the pointer to the prev    
526 point to the reader page. Note that the new po    
527 flag set.  This action atomically moves the he    
528                                                   
529   +------+                                        
530   |reader|          RING BUFFER                   
531   |page  |-------H-----------+                    
532   +------+                   v                    
533     |  ^          +---+   +---+   +---+           
534     |  |          |   |-->|   |-->|   |           
535     |  |          |   |<--|   |<--|   |<-+        
536     |  |          +---+   +---+   +---+  |        
537     |  |             |             ^ |   |        
538     |  |             +-------------+ |   |        
539     |  +-----------------------------+   |        
540     +------------------------------------+        
541                                                   
542 After the new head page is set, the previous p    
543 updated to the reader page::                      
544                                                   
545   +------+                                        
546   |reader|          RING BUFFER                   
547   |page  |-------H-----------+                    
548   +------+ <---------------+ v                    
549     |  ^          +---+   +---+   +---+           
550     |  |          |   |-->|   |-->|   |           
551     |  |          |   |   |   |<--|   |<-+        
552     |  |          +---+   +---+   +---+  |        
553     |  |             |             ^ |   |        
554     |  |             +-------------+ |   |        
555     |  +-----------------------------+   |        
556     +------------------------------------+        
557                                                   
558   +------+                                        
559   |buffer|          RING BUFFER                   
560   |page  |-------H-----------+  <--- New head     
561   +------+ <---------------+ v                    
562     |  ^          +---+   +---+   +---+           
563     |  |          |   |   |   |-->|   |           
564     |  |  New     |   |   |   |<--|   |<-+        
565     |  | Reader   +---+   +---+   +---+  |        
566     |  |  page ----^                 |   |        
567     |  |                             |   |        
568     |  +-----------------------------+   |        
569     +------------------------------------+        
570                                                   
571 Another important point: The page that the rea    
572 by its previous pointer (the one that now poin    
573 never points back to the reader page. That is     
574 not part of the ring buffer. Traversing the ri    
575 will always stay in the ring buffer. Traversin    
576 prev pointers may not.                            
577                                                   
578 Note, the way to determine a reader page is si    
579 pointer of the page. If the next pointer of th    
580 point back to the original page, then the orig    
581                                                   
582                                                   
583              +--------+                           
584              | reader |  next   +----+            
585              |  page  |-------->|    |<======     
586              +--------+         +----+            
587                  |                | ^             
588                  |                v | next        
589             prev |              +----+            
590                  +------------->|    |            
591                                 +----+            
592                                                   
593 The way the head page moves forward:              
594                                                   
595 When the tail page meets the head page and the    
596 and more writes take place, the head page must    
597 writer may move the tail page. The way this is    
598 performs a cmpxchg to convert the pointer to t    
599 flag to have the UPDATE flag set. Once this is    
600 not be able to swap the head page from the buf    
601 move the head page, until the writer is finish    
602                                                   
603 This eliminates any races that the reader can     
604 must spin, and this is why the reader cannot p    
605                                                   
606               tail page                           
607                  |                                
608                  v                                
609       +---+    +---+    +---+    +---+            
610   <---|   |--->|   |-H->|   |--->|   |--->        
611   --->|   |<---|   |<---|   |<---|   |<---        
612       +---+    +---+    +---+    +---+            
613                                                   
614               tail page                           
615                  |                                
616                  v                                
617       +---+    +---+    +---+    +---+            
618   <---|   |--->|   |-U->|   |--->|   |--->        
619   --->|   |<---|   |<---|   |<---|   |<---        
620       +---+    +---+    +---+    +---+            
621                                                   
622 The following page will be made into the new h    
623                                                   
624              tail page                            
625                  |                                
626                  v                                
627       +---+    +---+    +---+    +---+            
628   <---|   |--->|   |-U->|   |-H->|   |--->        
629   --->|   |<---|   |<---|   |<---|   |<---        
630       +---+    +---+    +---+    +---+            
631                                                   
632 After the new head page has been set, we can s    
633 pointer back to NORMAL::                          
634                                                   
635              tail page                            
636                  |                                
637                  v                                
638       +---+    +---+    +---+    +---+            
639   <---|   |--->|   |--->|   |-H->|   |--->        
640   --->|   |<---|   |<---|   |<---|   |<---        
641       +---+    +---+    +---+    +---+            
642                                                   
643 After the head page has been moved, the tail p    
644                                                   
645                       tail page                   
646                           |                       
647                           v                       
648       +---+    +---+    +---+    +---+            
649   <---|   |--->|   |--->|   |-H->|   |--->        
650   --->|   |<---|   |<---|   |<---|   |<---        
651       +---+    +---+    +---+    +---+            
652                                                   
653                                                   
654 The above are the trivial updates. Now for the    
655                                                   
656                                                   
657 As stated before, if enough writes preempt the    
658 tail page may make it all the way around the b    
659 page. At this time, we must start dropping wri    
660 of warning to the user). But what happens if t    
661 reader page? The commit page is not part of th    
662 must account for this::                           
663                                                   
664                                                   
665             reader page    commit page            
666                 |              |                  
667                 v              |                  
668                +---+           |                  
669                |   |<----------+                  
670                |   |                              
671                |   |------+                       
672                +---+      |                       
673                           |                       
674                           v                       
675       +---+    +---+    +---+    +---+            
676   <---|   |--->|   |-H->|   |--->|   |--->        
677   --->|   |<---|   |<---|   |<---|   |<---        
678       +---+    +---+    +---+    +---+            
679                  ^                                
680                  |                                
681              tail page                            
682                                                   
683 If the tail page were to simply push the head     
684 leaving the reader page would not be pointing     
685                                                   
686 The solution to this is to test if the commit     
687 before pushing the head page. If it is, then i    
688 tail page wrapped the buffer, and we must drop    
689                                                   
690 This is not a race condition, because the comm    
691 by the outermost writer (the writer that was p    
692 This means that the commit will not move while    
693 tail page. The reader cannot swap the reader p    
694 used as the commit page. The reader can simply    
695 is off the reader page. Once the commit page l    
696 it will never go back on it unless a reader do    
697 buffer page that is also the commit page.         
698                                                   
699                                                   
700 Nested writes                                     
701 -------------                                     
702                                                   
703 In the pushing forward of the tail page we mus    
704 the head page if the head page is the next pag    
705 is not the next page, the tail page is simply     
706                                                   
707 Only writers move the tail page. This must be     
708 against nested writers::                          
709                                                   
710   temp_page = tail_page                           
711   next_page = temp_page->next                     
712   cmpxchg(tail_page, temp_page, next_page)        
713                                                   
714 The above will update the tail page if it is s    
715 page. If this fails, a nested write pushed it     
716 does not need to push it::                        
717                                                   
718                                                   
719              temp page                            
720                  |                                
721                  v                                
722               tail page                           
723                  |                                
724                  v                                
725       +---+    +---+    +---+    +---+            
726   <---|   |--->|   |--->|   |--->|   |--->        
727   --->|   |<---|   |<---|   |<---|   |<---        
728       +---+    +---+    +---+    +---+            
729                                                   
730 Nested write comes in and moves the tail page     
731                                                   
732                       tail page (moved by nest    
733               temp page   |                       
734                  |        |                       
735                  v        v                       
736       +---+    +---+    +---+    +---+            
737   <---|   |--->|   |--->|   |--->|   |--->        
738   --->|   |<---|   |<---|   |<---|   |<---        
739       +---+    +---+    +---+    +---+            
740                                                   
741 The above would fail the cmpxchg, but since th    
742 been moved forward, the writer will just try a    
743 on the new tail page.                             
744                                                   
745 But the moving of the head page is a bit more     
746                                                   
747               tail page                           
748                  |                                
749                  v                                
750       +---+    +---+    +---+    +---+            
751   <---|   |--->|   |-H->|   |--->|   |--->        
752   --->|   |<---|   |<---|   |<---|   |<---        
753       +---+    +---+    +---+    +---+            
754                                                   
755 The write converts the head page pointer to UP    
756                                                   
757               tail page                           
758                  |                                
759                  v                                
760       +---+    +---+    +---+    +---+            
761   <---|   |--->|   |-U->|   |--->|   |--->        
762   --->|   |<---|   |<---|   |<---|   |<---        
763       +---+    +---+    +---+    +---+            
764                                                   
765 But if a nested writer preempts here, it will     
766 page is a head page, but it is also nested. It    
767 it is nested and will save that information. T    
768 fact that it sees the UPDATE flag instead of a    
769 pointer.                                          
770                                                   
771 The nested writer will set the new head page p    
772                                                   
773              tail page                            
774                  |                                
775                  v                                
776       +---+    +---+    +---+    +---+            
777   <---|   |--->|   |-U->|   |-H->|   |--->        
778   --->|   |<---|   |<---|   |<---|   |<---        
779       +---+    +---+    +---+    +---+            
780                                                   
781 But it will not reset the update back to norma    
782 that converted a pointer from HEAD to UPDATE w    
783 to NORMAL::                                       
784                                                   
785                       tail page                   
786                           |                       
787                           v                       
788       +---+    +---+    +---+    +---+            
789   <---|   |--->|   |-U->|   |-H->|   |--->        
790   --->|   |<---|   |<---|   |<---|   |<---        
791       +---+    +---+    +---+    +---+            
792                                                   
793 After the nested writer finishes, the outermos    
794 the UPDATE pointer to NORMAL::                    
795                                                   
796                                                   
797                       tail page                   
798                           |                       
799                           v                       
800       +---+    +---+    +---+    +---+            
801   <---|   |--->|   |--->|   |-H->|   |--->        
802   --->|   |<---|   |<---|   |<---|   |<---        
803       +---+    +---+    +---+    +---+            
804                                                   
805                                                   
806 It can be even more complex if several nested     
807 the tail page ahead several pages::               
808                                                   
809                                                   
810   (first writer)                                  
811                                                   
812               tail page                           
813                  |                                
814                  v                                
815       +---+    +---+    +---+    +---+            
816   <---|   |--->|   |-H->|   |--->|   |--->        
817   --->|   |<---|   |<---|   |<---|   |<---        
818       +---+    +---+    +---+    +---+            
819                                                   
820 The write converts the head page pointer to UP    
821                                                   
822               tail page                           
823                  |                                
824                  v                                
825       +---+    +---+    +---+    +---+            
826   <---|   |--->|   |-U->|   |--->|   |--->        
827   --->|   |<---|   |<---|   |<---|   |<---        
828       +---+    +---+    +---+    +---+            
829                                                   
830 Next writer comes in, and sees the update and     
831 head page::                                       
832                                                   
833   (second writer)                                 
834                                                   
835              tail page                            
836                  |                                
837                  v                                
838       +---+    +---+    +---+    +---+            
839   <---|   |--->|   |-U->|   |-H->|   |--->        
840   --->|   |<---|   |<---|   |<---|   |<---        
841       +---+    +---+    +---+    +---+            
842                                                   
843 The nested writer moves the tail page forward.    
844 update page to NORMAL because it is not the ou    
845                                                   
846                       tail page                   
847                           |                       
848                           v                       
849       +---+    +---+    +---+    +---+            
850   <---|   |--->|   |-U->|   |-H->|   |--->        
851   --->|   |<---|   |<---|   |<---|   |<---        
852       +---+    +---+    +---+    +---+            
853                                                   
854 Another writer preempts and sees the page afte    
855 It changes it from HEAD to UPDATE::               
856                                                   
857   (third writer)                                  
858                                                   
859                       tail page                   
860                           |                       
861                           v                       
862       +---+    +---+    +---+    +---+            
863   <---|   |--->|   |-U->|   |-U->|   |--->        
864   --->|   |<---|   |<---|   |<---|   |<---        
865       +---+    +---+    +---+    +---+            
866                                                   
867 The writer will move the head page forward::      
868                                                   
869                                                   
870   (third writer)                                  
871                                                   
872                       tail page                   
873                           |                       
874                           v                       
875       +---+    +---+    +---+    +---+            
876   <---|   |--->|   |-U->|   |-U->|   |-H->        
877   --->|   |<---|   |<---|   |<---|   |<---        
878       +---+    +---+    +---+    +---+            
879                                                   
880 But now that the third writer did change the H    
881 will convert it to normal::                       
882                                                   
883                                                   
884   (third writer)                                  
885                                                   
886                       tail page                   
887                           |                       
888                           v                       
889       +---+    +---+    +---+    +---+            
890   <---|   |--->|   |-U->|   |--->|   |-H->        
891   --->|   |<---|   |<---|   |<---|   |<---        
892       +---+    +---+    +---+    +---+            
893                                                   
894                                                   
895 Then it will move the tail page, and return ba    
896                                                   
897                                                   
898   (second writer)                                 
899                                                   
900                                tail page          
901                                    |              
902                                    v              
903       +---+    +---+    +---+    +---+            
904   <---|   |--->|   |-U->|   |--->|   |-H->        
905   --->|   |<---|   |<---|   |<---|   |<---        
906       +---+    +---+    +---+    +---+            
907                                                   
908                                                   
909 The second writer will fail to move the tail p    
910 moved, so it will try again and add its data t    
911 It will return to the first writer::              
912                                                   
913                                                   
914   (first writer)                                  
915                                                   
916                                tail page          
917                                    |              
918                                    v              
919       +---+    +---+    +---+    +---+            
920   <---|   |--->|   |-U->|   |--->|   |-H->        
921   --->|   |<---|   |<---|   |<---|   |<---        
922       +---+    +---+    +---+    +---+            
923                                                   
924 The first writer cannot know atomically if the    
925 while it updates the HEAD page. It will then u    
926 what it thinks is the new head page::             
927                                                   
928                                                   
929   (first writer)                                  
930                                                   
931                                tail page          
932                                    |              
933                                    v              
934       +---+    +---+    +---+    +---+            
935   <---|   |--->|   |-U->|   |-H->|   |-H->        
936   --->|   |<---|   |<---|   |<---|   |<---        
937       +---+    +---+    +---+    +---+            
938                                                   
939 Since the cmpxchg returns the old value of the    
940 will see it succeeded in updating the pointer     
941 But as we can see, this is not good enough. It    
942 if the tail page is either where it use to be     
943                                                   
944                                                   
945   (first writer)                                  
946                                                   
947                  A        B    tail page          
948                  |        |        |              
949                  v        v        v              
950       +---+    +---+    +---+    +---+            
951   <---|   |--->|   |-U->|   |-H->|   |-H->        
952   --->|   |<---|   |<---|   |<---|   |<---        
953       +---+    +---+    +---+    +---+            
954                                                   
955 If tail page != A and tail page != B, then it     
956 back to NORMAL. The fact that it only needs to    
957 writers means that it only needs to check this    
958                                                   
959                                                   
960   (first writer)                                  
961                                                   
962                  A        B    tail page          
963                  |        |        |              
964                  v        v        v              
965       +---+    +---+    +---+    +---+            
966   <---|   |--->|   |-U->|   |--->|   |-H->        
967   --->|   |<---|   |<---|   |<---|   |<---        
968       +---+    +---+    +---+    +---+            
969                                                   
970 Now the writer can update the head page. This     
971 remain in UPDATE and only reset by the outermo    
972 the reader from seeing the incorrect head page    
973                                                   
974                                                   
975   (first writer)                                  
976                                                   
977                  A        B    tail page          
978                  |        |        |              
979                  v        v        v              
980       +---+    +---+    +---+    +---+            
981   <---|   |--->|   |--->|   |--->|   |-H->        
982   --->|   |<---|   |<---|   |<---|   |<---        
983       +---+    +---+    +---+    +---+            
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php