
TOMOYO Linux Cross Reference
Linux/Documentation/kernel-hacking/false-sharing.rst


.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache coherence mechanism that keeps
copies of one cache line consistent across multiple CPUs' caches; an
academic definition can be found in [1]_. Consider a struct with a
refcount and a string::

        struct foo {
                refcount_t refcount;
                ...
                char name[16];
        } ____cacheline_internodealigned_in_smp;

Member 'refcount'(A) and 'name'(B) _share_ one cache line like below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified.  When many CPUs access 'foo' at
the same time, with 'refcount' being frequently bumped by only one CPU
and 'name' being read by other CPUs, all those reading CPUs have to
reload the whole cache line over and over due to the 'sharing', even
though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing.  One of these involves the rw_semaphore 'mmap_lock'
inside struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
Back when a platform had only one or a few CPUs, hot data members
could be purposely put in the same cache line to make them cache hot
and save cachelines/TLB entries, like a lock and the data protected
by it.  But on recent large systems with hundreds of CPUs, this may
not work well when the lock is heavily contended, as the lock owner
CPU writes to the data while other CPUs are busy spinning on the lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* lock (spinlock/mutex/semaphore) and data protected by it are
  purposely put in one cache line.
* global data being put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* data members of a big data structure randomly sitting together
  without being noticed (cache line is usually 64 bytes or more),
  like 'mem_cgroup' struct.

The 'Possible Mitigations' section below provides real-world examples.

False sharing can easily creep in unless it is deliberately checked
for, and it is valuable to run dedicated tools on performance-critical
workloads to detect cases where false sharing affects performance and
to optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the data structures possibly
involved in false sharing.  'addr2line' is also good at decoding
instruction pointers when there are multiple layers of inline
functions.

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file and line number) accessing those
cache lines, and the in-line offset of the data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When the above is run while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is [3]_.

'pahole' decodes data structure layouts at cache line granularity.
Users can match the offset in perf-c2c output with pahole's decoding
to locate the exact data members.  For global data, users can search
for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated.  False sharing
mitigations should balance performance gains with complexity and
space consumption.  Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

Cases of false sharing hurting performance are seen more frequently as
core counts increase.  Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged.  Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure and separate the interfering members
  into different cache lines.  One downside is that it may introduce
  new false sharing of other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variable, use compare(read)-then-write instead of
  an unconditional write. For example, use::

        if (!test_bit(XXX))
                set_bit(XXX);

  instead of directly "set_bit(XXX);", and similarly for atomic_t
  data::

        if (atomic_read(XXX) == AAA)
                atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when
  possible, or reasonably increase the threshold for syncing per-cpu
  data to global data, to reduce or postpone the 'write' to that
  global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Surely, all mitigations should be carefully verified to not cause side
effects.  To avoid introducing false sharing when coding, it's better
to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields onto
  different cache lines.

and it is better to add a comment stating the false sharing
consideration.
191 
One note is that sometimes, even after a severe false sharing problem
is detected and solved, the performance may still show no obvious
improvement, as the hotspot simply shifts to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
layout randomization mechanism, which also randomizes which data
members end up sharing a cache line.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/
