.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the most common uses of RCU is protecting read-mostly linked lists
(``struct list_head`` in list.h).  One big advantage of this approach is
that all of the required memory ordering is provided by the list macros.
This document describes several list-based RCU use cases.

When iterating a list while holding the rcu_read_lock(), writers may
modify the list.  The reader is guaranteed to see all of the elements
which were added to the list before they acquired the rcu_read_lock()
and are still on the list when they drop the rcu_read_unlock().
Elements which are added to, or removed from, the list may or may not
be seen.  If the writer calls list_replace_rcu(), the reader may see
either the old element or the new element; they will not see both,
nor will they see neither.
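
To make the list_replace_rcu() guarantee concrete, here is a minimal
update-side sketch; ``struct foo``, ``foo_replace()``, and the field names
are hypothetical, for illustration only, not taken from kernel source::

        struct foo {
                int value;
                struct list_head list;
        };

        /* Caller holds the update-side lock protecting the list. */
        int foo_replace(struct foo *e, int newval)
        {
                struct foo *ne;

                ne = kmalloc(sizeof(*ne), GFP_KERNEL);
                if (!ne)
                        return -ENOMEM;
                *ne = *e;               /* Copy the old element... */
                ne->value = newval;     /* ...then update the copy. */
                list_replace_rcu(&e->list, &ne->list);
                synchronize_rcu();      /* Wait for pre-existing readers. */
                kfree(e);
                return 0;
        }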


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used use case for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node
that links all the processes. The list can be traversed in parallel to any
list additions or removals.

The traversal of the list is done using ``for_each_process()``, which is
defined by these two macros::

        #define next_task(p) \
                list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

        #define for_each_process(p) \
                for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

        rcu_read_lock();
        for_each_process(p) {
                /* Do something with p */
        }
        rcu_read_unlock();

The simplified and heavily inlined code for removing a process from a
task list is::

        void release_task(struct task_struct *p)
        {
                write_lock(&tasklist_lock);
                list_del_rcu(&p->tasks);
                write_unlock(&tasklist_lock);
                call_rcu(&p->rcu, delayed_put_task_struct);
        }

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
via __exit_signal() and __unhash_process() under ``tasklist_lock``
writer lock protection.  The list_del_rcu() invocation removes
the task from the list of all tasks. The ``tasklist_lock``
prevents concurrent list additions/removals from corrupting the
list. Readers using ``for_each_process()`` are not protected with the
``tasklist_lock``. To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more
grace periods elapse, with the help of call_rcu(), which is invoked via
put_task_struct_rcu_user(). This deferring of destruction ensures that
any readers traversing the list will see valid ``p->tasks.next`` pointers
and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU refrains
from invoking the delayed_put_task_struct() callback until all
existing readers finish, which guarantees that the ``task_struct``
object in question will remain in existence until after the completion
of all RCU readers that might possibly have a reference to that object.
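
For reference, the RCU callback itself just drops the task's final
reference once the grace period has elapsed. A simplified sketch (the
real function in kernel/exit.c performs additional unrelated cleanup)
might read::

        static void delayed_put_task_struct(struct rcu_head *rhp)
        {
                struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

                /* All pre-existing readers are done; freeing is now safe. */
                put_task_struct(tsk);
        }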


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

Some reader-writer locking use cases compute a value while holding
the read-side lock, but continue to use that value after that lock is
released.  These use cases are often good candidates for conversion
to RCU.  One prominent example involves network packet routing.
Because the packet-routing data tracks the state of equipment outside
of the computer, it will at times contain stale data.  Therefore, once
the route has been computed, there is no need to hold the routing table
static during transmission of the packet.  After all, you can hold the
routing table static all you want, but that won't keep the external
Internet from changing, and it is the state of the external Internet
that really matters.  In addition, routing entries are typically added
or deleted, rather than being modified in place.  This is a rare example
of the finite speed of light and the non-zero size of atoms actually
helping make synchronization be lighter weight.

A straightforward example of this type of RCU use case may be found in
the system-call auditing support.  For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
        {
                struct audit_entry *e;
                enum audit_state   state;

                read_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                if (state == AUDIT_STATE_RECORD)
                                        *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
                                read_unlock(&auditsc_lock);
                                return state;
                        }
                }
                read_unlock(&auditsc_lock);
                return AUDIT_BUILD_CONTEXT;
        }

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned.  By the time that this value is acted
on, the list may well have been modified.  This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
        {
                struct audit_entry *e;
                enum audit_state   state;

                rcu_read_lock();
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                if (state == AUDIT_STATE_RECORD)
                                        *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
                                rcu_read_unlock();
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

The read_lock() and read_unlock() calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry()
has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
primitives add READ_ONCE() and diagnostic checks for incorrect use
outside of an RCU read-side critical section.
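
For the curious, lightly simplified versions of the traversal machinery
from include/linux/rculist.h show where the READ_ONCE() comes in (the
lockdep-based ``__list_check_rcu()`` diagnostics are elided here)::

        #define list_entry_rcu(ptr, type, member) \
                container_of(READ_ONCE(ptr), type, member)

        #define list_for_each_entry_rcu(pos, head, member)                        \
                for (pos = list_entry_rcu((head)->next, typeof(*pos), member);    \
                     &pos->member != (head);                                      \
                     pos = list_entry_rcu(pos->member.next, typeof(*pos), member))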

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion in these simplified
versions of audit_del_rule() and audit_add_rule()::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                write_lock(&auditsc_lock);
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del(&e->list);
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT;         /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                write_lock(&auditsc_lock);
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add(&entry->list, list);
                } else {
                        list_add_tail(&entry->list, list);
                }
                write_unlock(&auditsc_lock);
                return 0;
        }

Following are the RCU equivalents for these two functions::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this is the only
                 * deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                list_del_rcu(&e->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

        static inline int audit_add_rule(struct audit_entry *entry,
                                         struct list_head *list)
        {
                if (entry->rule.flags & AUDIT_PREPEND) {
                        entry->rule.flags &= ~AUDIT_PREPEND;
                        list_add_rcu(&entry->list, list);
                } else {
                        list_add_tail_rcu(&entry->list, list);
                }
                return 0;
        }

Normally, the write_lock() and write_unlock() would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates
the need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are
needed on weakly ordered CPUs.  The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
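
To see why, compare lightly simplified versions of the two deletion
primitives, based on include/linux/list.h and include/linux/rculist.h::

        static inline void list_del(struct list_head *entry)
        {
                __list_del_entry(entry);
                /* A concurrent reader following ->next would now crash: */
                entry->next = LIST_POISON1;
                entry->prev = LIST_POISON2;
        }

        static inline void list_del_rcu(struct list_head *entry)
        {
                __list_del_entry(entry);
                /* ->next is left intact so that readers currently on this
                 * element can still find their way back into the list. */
                entry->prev = LIST_POISON2;
        }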

So, when readers can tolerate stale data and when entries are either added or
deleted, without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place.  However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming only ``field_count`` is updated, otherwise, the added fields would
need to be filled in)::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;
                struct audit_entry *ne;

                write_lock(&auditsc_lock);
                /* Note: audit_filter_mutex held by caller. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                e->rule.action = newaction;
                                e->rule.field_count = newfield_count;
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT;         /* No matching rule */
        }

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry.  This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name.

The RCU version of audit_upd_rule() is as follows::

        static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry *e;
                struct audit_entry *ne;

                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
                                if (ne == NULL)
                                        return -ENOMEM;
                                audit_copy_rule(&ne->rule, &e->rule);
                                ne->rule.action = newaction;
                                ne->rule.field_count = newfield_count;
                                list_replace_rcu(&e->list, &ne->list);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

Again, this assumes that the caller holds ``audit_filter_mutex``.  Normally, the
writer lock would become a spinlock in this sort of code.

The update_lsm_rule() does something very similar, for those who would
prefer to look at real Linux-kernel code.

Another use of this pattern can be found in the openvswitch driver's *connection
tracking table* code in ``ct_limit_set()``.  The table holds connection tracking
entries and has a limit on the maximum entries.  There is one such table
per-zone and hence one *limit* per zone.  The zones are mapped to their limits
through a hashtable using an RCU-managed hlist for the hash chains. When a new
limit is set, a new limit object is allocated and ``ct_limit_set()`` is called
to replace the old limit object with the new one using hlist_replace_rcu().
The old limit object is then freed after a grace period using kfree_rcu().
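
A lightly simplified sketch of ``ct_limit_set()`` (based on
net/openvswitch/conntrack.c; details such as error handling may differ
from the real function) shows the same replace-then-defer-free shape::

        static void ct_limit_set(const struct ovs_ct_limit_info *info,
                                 struct ovs_ct_limit *new_ct_limit)
        {
                struct ovs_ct_limit *ct_limit;
                struct hlist_head *head;

                head = ct_limit_hash_bucket(info, new_ct_limit->zone);
                hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
                        if (ct_limit->zone == new_ct_limit->zone) {
                                /* Replace in place, then defer the free. */
                                hlist_replace_rcu(&ct_limit->hlist_node,
                                                  &new_ct_limit->hlist_node);
                                kfree_rcu(ct_limit, rcu);
                                return;
                        }
                }

                /* No existing limit for this zone: plain RCU insertion. */
                hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
        }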


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state.  After all, there is a delay from the
time the external state changes to the time Linux becomes aware of the
change, so, as noted earlier, a small quantity of additional RCU-induced
staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock()
function in ipc/shm.c).  This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist.  For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
        For the deleted-flag technique to be helpful, why is it necessary
        to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
``audit_entry`` structure, and modify audit_filter_task() as follows::

        static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
        {
                struct audit_entry *e;
                enum audit_state   state;

                rcu_read_lock();
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                spin_lock(&e->lock);
                                if (e->deleted) {
                                        spin_unlock(&e->lock);
                                        rcu_read_unlock();
                                        return AUDIT_BUILD_CONTEXT;
                                }
                                rcu_read_unlock();
                                if (state == AUDIT_STATE_RECORD)
                                        *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

        static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry *e;

                /* No need to use the _rcu iterator here, since this
                 * is the only deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                spin_lock(&e->lock);
                                list_del_rcu(&e->list);
                                e->deleted = 1;
                                spin_unlock(&e->lock);
                                call_rcu(&e->rcu, audit_free_rule);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

This too assumes that the caller holds ``audit_filter_mutex``.

Note that this example assumes that entries are only added and deleted.
Additional mechanism is required to deal correctly with the update-in-place
performed by audit_upd_rule().  For one thing, audit_upd_rule() would
need to hold the locks of both the old ``audit_entry`` and its replacement
while executing the list_replace_rcu().
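
As a hedged sketch, the replacement step of such an audit_upd_rule() might
look like the following (only the body of the matching case is shown, the
names follow the examples above, and lock ordering and lockdep nesting
annotations are glossed over)::

        /* Readers must never see a replaced entry without its flag set. */
        spin_lock(&e->lock);
        spin_lock(&ne->lock);
        list_replace_rcu(&e->list, &ne->list);
        e->deleted = 1;
        spin_unlock(&ne->lock);
        spin_unlock(&e->lock);
        call_rcu(&e->rcu, audit_free_rule);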


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping
stale objects during read-side list traversal, where stale objects
are those that will be removed and destroyed after one or more grace
periods. One such example can be found in the timerfd subsystem. When a
``CLOCK_REALTIME`` clock is reprogrammed (for example due to setting
of the system time) then all programmed ``timerfds`` that depend on
this clock get triggered and processes waiting on them are woken up in
advance of their scheduled expiry. To facilitate this, all such timers
are added to an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::

        static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
        {
                spin_lock(&ctx->cancel_lock);
                if ((ctx->clockid == CLOCK_REALTIME ||
                     ctx->clockid == CLOCK_REALTIME_ALARM) &&
                    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
                        if (!ctx->might_cancel) {
                                ctx->might_cancel = true;
                                spin_lock(&cancel_lock);
                                list_add_rcu(&ctx->clist, &cancel_list);
                                spin_unlock(&cancel_lock);
                        }
                } else {
                        __timerfd_remove_cancel(ctx);
                }
                spin_unlock(&ctx->cancel_lock);
        }
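
The ``__timerfd_remove_cancel()`` helper invoked above undoes that
registration. A lightly simplified version, based on fs/timerfd.c, reads::

        static void __timerfd_remove_cancel(struct timerfd_ctx *ctx)
        {
                if (ctx->might_cancel) {
                        ctx->might_cancel = false;
                        spin_lock(&cancel_lock);
                        list_del_rcu(&ctx->clist);
                        spin_unlock(&cancel_lock);
                }
        }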

When a timerfd is freed (its fd is closed), the ``might_cancel``
flag of the timerfd object is cleared, and the object is removed from the
``cancel_list`` and destroyed, as shown in this simplified and inlined
version of timerfd_release()::

        int timerfd_release(struct inode *inode, struct file *file)
        {
                struct timerfd_ctx *ctx = file->private_data;

                spin_lock(&ctx->cancel_lock);
                if (ctx->might_cancel) {
                        ctx->might_cancel = false;
                        spin_lock(&cancel_lock);
                        list_del_rcu(&ctx->clist);
                        spin_unlock(&cancel_lock);
                }
                spin_unlock(&ctx->cancel_lock);

                if (isalarm(ctx))
                        alarm_cancel(&ctx->t.alarm);
                else
                        hrtimer_cancel(&ctx->t.tmr);
                kfree_rcu(ctx, rcu);
                return 0;
        }

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()`` which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

        void timerfd_clock_was_set(void)
        {
                ktime_t moffs = ktime_mono_to_real(0);
                struct timerfd_ctx *ctx;
                unsigned long flags;

                rcu_read_lock();
                list_for_each_entry_rcu(ctx, &cancel_list, clist) {
                        if (!ctx->might_cancel)
                                continue;
                        spin_lock_irqsave(&ctx->wqh.lock, flags);
                        if (ctx->moffs != moffs) {
                                ctx->moffs = KTIME_MAX;
                                ctx->ticks++;
                                wake_up_locked_poll(&ctx->wqh, EPOLLIN);
                        }
                        spin_unlock_irqrestore(&ctx->wqh.lock, flags);
                }
                rcu_read_unlock();
        }

The key point is that because RCU-protected traversal of the
``cancel_list`` happens concurrently with object addition and removal,
sometimes the traversal can access an object that has been removed from
the list. In this example, a flag is used to skip such objects.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU.  The simplest case is where entries are
either added or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.

.. _quick_quiz_answer:

Answer to Quick Quiz:
        For the deleted-flag technique to be helpful, why is it necessary
        to hold the per-entry lock while returning from the search function?

        If the search function drops the per-entry lock before returning,
        then the caller will be processing stale data in any case.  If it
        is really OK to be processing stale data, then you don't need a
        *deleted* flag.  If processing stale data really is a problem,
        then you need to hold the per-entry lock across all of the code
        that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`
