.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
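The two flag rules above (FOLL_PIN and FOLL_GET are mutually exclusive per call, and FOLL_LONGTERM needs FOLL_PIN) can be sketched as a small userspace check. This is only an illustrative model: the ``MODEL_FOLL_*`` bit values and the function names are hypothetical and are not the kernel's definitions or its actual validation code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model only, not the kernel's definitions. */
#define MODEL_FOLL_GET      (1u << 0)
#define MODEL_FOLL_PIN      (1u << 1)
#define MODEL_FOLL_LONGTERM (1u << 2)

/* Return true when a single call's flags obey the two rules stated above. */
bool model_flags_valid(unsigned int flags)
{
    /* FOLL_PIN and FOLL_GET are mutually exclusive for a given call. */
    if ((flags & MODEL_FOLL_PIN) && (flags & MODEL_FOLL_GET))
        return false;
    /* FOLL_LONGTERM requires FOLL_PIN as a prerequisite. */
    if ((flags & MODEL_FOLL_LONGTERM) && !(flags & MODEL_FOLL_PIN))
        return false;
    return true;
}

/* Usage: a few combinations and the verdict this model gives them. */
void model_flags_examples(void)
{
    assert(model_flags_valid(MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM));
    assert(!model_flags_valid(MODEL_FOLL_PIN | MODEL_FOLL_GET));
    assert(!model_flags_valid(MODEL_FOLL_LONGTERM));
}
```

Note that the check applies to one call at a time; nothing in it prevents two different call sites from using FOLL_PIN and FOLL_GET on the same page.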

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null array
of struct page pointers, and the function then pins pages by incrementing the
refcount of each page by a special value: GUP_PIN_COUNTING_BIAS.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below.

Summary for the pin_user_pages*() functions::

 Function                Flag behavior
 --------                -------------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null array of struct page pointers, then the
function sets FOLL_GET for you, and proceeds to pin pages by incrementing the
refcount of each page by +1.

Summary for the get_user_pages*() functions::

 Function                 Flag behavior
 --------                 -------------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
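The difference between the two wrapper families can be sketched in userspace as follows. This is a simplified model under stated assumptions: the ``MODEL_FOLL_*`` values, ``struct model_page`` and both function names are hypothetical, and rejecting FOLL_GET in the pin path is this sketch's reading of the "mutually exclusive" rule rather than a copy of the kernel's checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for this model only. */
#define MODEL_FOLL_GET (1u << 0)
#define MODEL_FOLL_PIN (1u << 1)

struct model_page;  /* opaque stand-in for struct page */

/*
 * pin_user_pages*() rule as described above: a pages array is required,
 * and FOLL_PIN is OR'd into whatever flags the caller provided.
 * Returns 0 on success, -1 when the arguments are rejected.
 */
int model_pin_flags(unsigned int *flags, struct model_page **pages)
{
    assert(flags != NULL);
    if (pages == NULL || (*flags & MODEL_FOLL_GET))
        return -1;
    *flags |= MODEL_FOLL_PIN;
    return 0;
}

/*
 * get_user_pages*() rule as described above: FOLL_GET is added only when
 * the caller passed a pages array; otherwise the flags are left alone.
 */
void model_get_flags(unsigned int *flags, struct model_page **pages)
{
    assert(flags != NULL);
    if (pages != NULL)
        *flags |= MODEL_FOLL_GET;
}
```

In this model the pin path always ends with FOLL_PIN set, while the get path ends with FOLL_GET set only when there is an array to fill, which is the "always" versus "sometimes" distinction in the two summaries.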

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

  This also leads to limitations: there are only 31 - 10 == 21 bits available
  for a counter that increments 10 bits at a time.

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need).  The
  unpinning functions also don't do anything to a zero page.  This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.
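The biased-refcount scheme and its arithmetic can be sketched with a small userspace model. Everything here is hypothetical naming (``struct model_page``, ``model_pin()``, ``MODEL_PIN_BIAS``); only the value 1024 and the 31 - 10 == 21 bit budget come from the text above, and real kernel code uses atomic operations rather than a plain int.

```c
#include <assert.h>
#include <stdbool.h>

/* The value the text gives for GUP_PIN_COUNTING_BIAS (10 bits). */
#define MODEL_PIN_BIAS 1024

/* Hypothetical stand-in for a small (non-large-folio) page. */
struct model_page {
    int refcount;
    bool is_zero_page;
};

/* A pin adds the bias to the ordinary refcount; the zero page is skipped. */
void model_pin(struct model_page *p)
{
    if (p->is_zero_page)
        return;
    p->refcount += MODEL_PIN_BIAS;
}

/* An unpin removes exactly one bias; the zero page is skipped again. */
void model_unpin(struct model_page *p)
{
    if (p->is_zero_page)
        return;
    assert(p->refcount >= MODEL_PIN_BIAS);
    p->refcount -= MODEL_PIN_BIAS;
}

/* Pins that fit in the 31 non-sign bits: 2^31 / 2^10 == 2^21. */
long long model_max_pins(void)
{
    return (1LL << 31) / MODEL_PIN_BIAS;
}
```

With this model, one pin on a page whose refcount is 1 leaves it at 1025, and the counter has room for 2^21 (2097152) pins in total, which is why pages that are pinned very widely, such as the zero page, need the special handling described above.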

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with folio_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with folio_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

.. _mmu-notifier-registration-case:

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(folio_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls)::

    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls)::

    get_user_pages()
    write to the data within the pages
    put_page()
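The five cases above can be condensed into a lookup, sketched here as a userspace function. The enum, the ``MODEL_FOLL_*`` bit values and the function name are all hypothetical, and the sketch returns FOLL_PIN explicitly only to make the mapping visible; real call sites get FOLL_PIN by calling a pin_user_pages*() wrapper, not by setting the flag themselves.

```c
#include <assert.h>

/* Hypothetical bit values for this model only. */
#define MODEL_FOLL_PIN      (1u << 1)
#define MODEL_FOLL_LONGTERM (1u << 2)

/* One enumerator per case described in this section. */
enum model_gup_case {
    MODEL_CASE_DIO = 1,
    MODEL_CASE_RDMA,
    MODEL_CASE_MMU_NOTIFIER,
    MODEL_CASE_STRUCT_PAGE_ONLY,
    MODEL_CASE_WRITE_PAGE_DATA,
};

/* Map a case to the pin-related flags the text recommends for it. */
unsigned int model_flags_for_case(enum model_gup_case c)
{
    switch (c) {
    case MODEL_CASE_DIO:
    case MODEL_CASE_WRITE_PAGE_DATA:
        return MODEL_FOLL_PIN;
    case MODEL_CASE_RDMA:
        return MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM;
    case MODEL_CASE_MMU_NOTIFIER:
    case MODEL_CASE_STRUCT_PAGE_ONLY:
        return 0;   /* neither flag: plain get_user_pages*() */
    }
    assert(!"unknown case");
    return 0;
}
```

Cases 1 and 5 land on the same answer because both touch the data inside the page for a short time; Case 2 adds FOLL_LONGTERM only because of how long the pin is held.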

folio_maybe_dma_pinned(): the whole point of pinning
====================================================

The whole point of marking folios as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this folio DMA-pinned?" That allows code such as folio_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a folio cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool folio_maybe_dma_pinned(struct folio *folio)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two entries (lines) in the /proc/vmstat
file: ::

    nr_foll_pin_acquired
    nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.
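A minimal userspace sketch for reading the two counters is shown below. It assumes the usual ``name value`` line format of /proc/vmstat; the helper names ``vmstat_lookup()`` and ``vmstat_read()`` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Search vmstat-style text ("name value" per line) for one entry.
 * Returns 0 and stores the value on success, -1 if it is absent.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Match the whole name, followed by the separating space. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline to the next entry */
    }
    return -1;
}

/* Read /proc/vmstat into buf as a NUL-terminated string. 0 on success. */
int vmstat_read(char *buf, size_t size)
{
    FILE *f;
    size_t n;

    assert(size > 0);
    f = fopen("/proc/vmstat", "r");
    if (!f)
        return -1;
    n = fread(buf, 1, size - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would fill a buffer with ``vmstat_read()``, look up both ``nr_foll_pin_acquired`` and ``nr_foll_pin_released``, and compare them; a persistent difference on an otherwise idle system suggests long-term pins or a missing unpin.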

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general.  Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019
