Linux/Documentation/filesystems/idmappings.rst

  1 .. SPDX-License-Identifier: GPL-2.0
  2 
  3 Idmappings
  4 ==========
  5 
  6 Most filesystem developers will have encountered idmappings. They are used when
  7 reading from or writing ownership to disk, reporting ownership to userspace, or
  8 for permission checking. This document is aimed at filesystem developers that
  9 want to know how idmappings work.
 10 
 11 Formal notes
 12 ------------
 13 
 14 An idmapping is essentially a translation of a range of ids into another or the
 15 same range of ids. The notational convention for idmappings that is widely used
 16 in userspace is::
 17 
 18  u:k:r
 19 
 20 ``u`` indicates the first element in the upper idmapset ``U`` and ``k``
 21 indicates the first element in the lower idmapset ``K``. The ``r`` parameter
 22 indicates the range of the idmapping, i.e. how many ids are mapped. From now
 23 on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
 24 we're talking about an id in the upper or lower idmapset.
 25 
 26 To see what this looks like in practice, let's take the following idmapping::
 27 
 28  u22:k10000:r3
 29 
 30 and write down the mappings it will generate::
 31 
 32  u22 -> k10000
 33  u23 -> k10001
 34  u24 -> k10002
 35 
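To make the notation concrete, here is a minimal Python sketch that lists the pairs an idmapping ``u:k:r`` generates. It is a userspace model, not kernel code, and the helper name ``enumerate_idmapping`` is made up for illustration:

```python
def enumerate_idmapping(u, k, r):
    """Return the (upper id, lower id) pairs generated by the idmapping u:k:r."""
    return [(u + offset, k + offset) for offset in range(r)]


# The idmapping u22:k10000:r3 from the text.
pairs = enumerate_idmapping(22, 10000, 3)
for upper, lower in pairs:
    print(f"u{upper} -> k{lower}")
```
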
 36 From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
 37 idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
 38 order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
 39 the set of all possible ids usable on a given system.
 40 
 41 Looking at this mathematically briefly will help us highlight some properties
 42 that make it easier to understand how we can translate between idmappings. For
 43 example, we know that the inverse idmapping is an order isomorphism as well::
 44 
 45  k10000 -> u22
 46  k10001 -> u23
 47  k10002 -> u24
 48 
 49 Given that we are dealing with order isomorphisms plus the fact that we're
 50 dealing with subsets we can embed idmappings into each other, i.e. we can
 51 sensibly translate between different idmappings. For example, assume we've been
 52 given the three idmappings::
 53 
 54  1. u0:k10000:r10000
 55  2. u0:k20000:r10000
 56  3. u0:k30000:r10000
 57 
 58 and id ``k11000`` which has been generated by the first idmapping by mapping
 59 ``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
 60 
 61 Because we're dealing with order isomorphic subsets it is meaningful to ask
 62 what id ``k11000`` corresponds to in the second or third idmapping. The
 63 straightforward algorithm to use is to apply the inverse of the first idmapping,
 64 mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
 65 either the second or the third idmapping. The second idmapping would map
 66 ``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
 67 ``k31000``.
 68 
 69 If we were given the same task for the following three idmappings::
 70 
 71  1. u0:k10000:r10000
 72  2. u0:k20000:r200
 73  3. u0:k30000:r300
 74 
 75 we would fail to translate as the sets aren't order isomorphic over the full
 76 range of the first idmapping anymore (however, they are order isomorphic over
 77 the full range of the second idmapping). Neither the second nor third idmapping
 78 contains ``u1000`` in the upper idmapset ``U``. This is equivalent to not having
 79 an id mapped. We can simply say that ``u1000`` is unmapped in the second and
 80 third idmapping. The kernel will report unmapped ids as the overflowuid
 81 ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
 82 
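The translation between idmappings described above can be sketched in Python. This is a userspace model with made-up names: ``translate`` and the tuple representation ``(u, k, r)`` are assumptions for illustration, not kernel interfaces. The sketch returns ``None`` for an unmapped id rather than an overflow id:

```python
def translate(src, dst, kid):
    """Translate lower id `kid` from idmapping `src` into idmapping `dst`.

    Idmappings are (u, k, r) tuples. Returns None if the id is unmapped.
    """
    src_u, src_k, src_r = src
    dst_u, dst_k, dst_r = dst
    # Apply the inverse of the source idmapping: map the lower id up.
    if not src_k <= kid < src_k + src_r:
        return None
    uid = kid - src_k + src_u
    # Apply the destination idmapping: map the upper id down.
    if not dst_u <= uid < dst_u + dst_r:
        return None
    return uid - dst_u + dst_k


first = (0, 10000, 10000)
full_second, full_third = (0, 20000, 10000), (0, 30000, 10000)
short_second, short_third = (0, 20000, 200), (0, 30000, 300)

print(translate(first, full_second, 11000))   # 21000, i.e. k21000
print(translate(first, full_third, 11000))    # 31000, i.e. k31000
print(translate(first, short_second, 11000))  # None: u1000 is unmapped
print(translate(first, short_third, 11000))   # None: u1000 is unmapped
```
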
 83 The algorithm to calculate what a given id maps to is pretty simple. First, we
 84 need to verify that the range can contain our target id. We will skip this step
 85 for simplicity. After that if we want to know what ``id`` maps to we can do
 86 simple calculations:
 87 
 88 - If we want to map from left to right::
 89 
 90    u:k:r
 91    id - u + k = n
 92 
 93 - If we want to map from right to left::
 94 
 95    u:k:r
 96    id - k + u = n
 97 
 98 Instead of "left to right" we can also say "down" and instead of "right to
 99 left" we can also say "up". Obviously mapping down and up invert each other.
100 
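The two formulas, together with the range check we skipped, can be written as a short Python sketch. The names ``map_down`` and ``map_up`` and the ``(u, k, r)`` tuple are assumptions made for this illustration; an unmapped id is modelled as ``None``:

```python
def map_down(idmapping, uid):
    """Map an id from the upper idmapset to the lower one (left to right)."""
    u, k, r = idmapping
    if not u <= uid < u + r:  # the range check the text skips
        return None
    return uid - u + k


def map_up(idmapping, kid):
    """Map an id from the lower idmapset to the upper one (right to left)."""
    u, k, r = idmapping
    if not k <= kid < k + r:
        return None
    return kid - k + u


idmapping = (22, 10000, 3)  # u22:k10000:r3
assert map_down(idmapping, 23) == 10001
assert map_up(idmapping, 10001) == 23
# Mapping down and then up returns the original id for every mapped id.
assert all(map_up(idmapping, map_down(idmapping, uid)) == uid for uid in range(22, 25))
# Ids outside the range are unmapped.
assert map_down(idmapping, 25) is None
```
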
101 To see whether the simple formulas above work, consider the following two
102 idmappings::
103 
104  1. u0:k20000:r10000
105  2. u500:k30000:r10000
106 
107 Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
108 want to know what id this was mapped from in the upper idmapset of the first
109 idmapping. So we're mapping up in the first idmapping::
110 
111  id     - k      + u  = n
112  k21000 - k20000 + u0 = u1000
113 
114 Now assume we are given the id ``u1100`` in the upper idmapset of the second
115 idmapping and we want to know what this id maps down to in the lower idmapset
116 of the second idmapping. This means we're mapping down in the second
117 idmapping::
118 
119  id    - u    + k      = n
120  u1100 - u500 + k30000 = k30600
121 
122 General notes
123 -------------
124 
125 In the context of the kernel an idmapping can be interpreted as mapping a range
126 of userspace ids into a range of kernel ids::
127 
128  userspace-id:kernel-id:range
129 
130 A userspace id is always an element in the upper idmapset of an idmapping of
131 type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
132 idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
133 "userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
134 types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
135 
136 The kernel is mostly concerned with kernel ids. They are used when performing
137 permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
138 A userspace id on the other hand is an id that is reported to userspace by the
139 kernel, or is passed by userspace to the kernel, or a raw device id that is
140 written or read from disk.
141 
142 Note that we are only concerned with idmappings as the kernel stores them, not
143 how userspace would specify them.
144 
145 For the rest of this document we will prefix all userspace ids with ``u`` and
146 all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
147 an idmapping will be written as ``u0:k10000:r10000``.
148 
149 For example, within this idmapping, the id ``u1000`` is an id in the upper
150 idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
151 ``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
152 starting with ``k10000``.
153 
154 A kernel id is always created by an idmapping. Such idmappings are associated
155 with user namespaces. Since we mainly care about how idmappings work we're not
156 going to be concerned with how idmappings are created nor how they are used
157 outside of the filesystem context. This is best left to an explanation of user
158 namespaces.
159 
160 The initial user namespace is special. It always has an idmapping of the
161 following form::
162 
163  u0:k0:r4294967295
164 
165 which is an identity idmapping over the full range of ids available on this
166 system.
167 
168 Other user namespaces usually have non-identity idmappings such as::
169 
170  u0:k10000:r10000
171 
172 When a process creates or wants to change ownership of a file, or when the
173 ownership of a file is read from disk by a filesystem, the userspace id is
174 immediately translated into a kernel id according to the idmapping associated
175 with the relevant user namespace.
176 
177 For instance, consider a file that is stored on disk by a filesystem as being
178 owned by ``u1000``:
179 
180 - If a filesystem were to be mounted in the initial user namespace (as most
181   filesystems are) then the initial idmapping will be used. As we saw this is
182   simply the identity idmapping. This would mean id ``u1000`` read from disk
183   would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
184   would contain ``k1000``.
185 
186 - If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
187   then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
188   ``i_uid`` and ``i_gid`` would contain ``k11000``.
189 
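The two cases above can be reproduced with a small Python model. Here ``make_kuid`` is only an arithmetic stand-in for the kernel function of the same name (which operates on a user namespace and typed ids), and the tuple form ``(u, k, r)`` is an assumption of this sketch:

```python
INITIAL_IDMAPPING = (0, 0, 4294967295)  # u0:k0:r4294967295


def make_kuid(idmapping, uid):
    """Model of mapping a userspace id down into a kernel id; None if unmapped."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


on_disk_owner = 1000  # u1000 as stored by the filesystem

# Filesystem mounted with the initial (identity) idmapping.
i_uid_initial = make_kuid(INITIAL_IDMAPPING, on_disk_owner)
# Filesystem mounted with u0:k10000:r10000.
i_uid_mapped = make_kuid((0, 10000, 10000), on_disk_owner)

print(f"k{i_uid_initial}", f"k{i_uid_mapped}")
```
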
190 Translation algorithms
191 ----------------------
192 
193 We've already seen briefly that it is possible to translate between different
194 idmappings. We'll now take a closer look at how that works.
195 
196 Crossmapping
197 ~~~~~~~~~~~~
198 
199 This translation algorithm is used by the kernel in quite a few places. For
200 example, it is used when reporting back the ownership of a file to userspace
201 via the ``stat()`` system call family.
202 
203 If we've been given ``k11000`` from one idmapping we can map that id up in
204 another idmapping. In order for this to work both idmappings need to contain
205 the same kernel id in their kernel idmapsets. For example, consider the
206 following idmappings::
207 
208  1. u0:k10000:r10000
209  2. u20000:k10000:r10000
210 
211 and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
212 then translate ``k11000`` into a userspace id in the second idmapping using the
213 kernel idmapset of the second idmapping::
214 
215  /* Map the kernel id up into a userspace id in the second idmapping. */
216  from_kuid(u20000:k10000:r10000, k11000) = u21000
217 
218 Note, how we can get back to the kernel id in the first idmapping by inverting
219 the algorithm::
220 
221  /* Map the userspace id down into a kernel id in the second idmapping. */
222  make_kuid(u20000:k10000:r10000, u21000) = k11000
223 
224  /* Map the kernel id up into a userspace id in the first idmapping. */
225  from_kuid(u0:k10000:r10000, k11000) = u1000
226 
227 This algorithm allows us to answer the question what userspace id a given
228 kernel id corresponds to in a given idmapping. In order to be able to answer
229 this question both idmappings need to contain the same kernel id in their
230 respective kernel idmapsets.
231 
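As a rough Python sketch of crossmapping, under the assumption that idmappings are plain ``(u, k, r)`` tuples and that an unmapped id is represented by ``None``: the functions below only model the arithmetic of ``make_kuid()`` and ``from_kuid()``, and ``crossmap`` is a hypothetical helper name.

```python
def make_kuid(idmapping, uid):
    """Model: map a userspace id down into a kernel id (None if unmapped)."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


def from_kuid(idmapping, kuid):
    """Model: map a kernel id up into a userspace id (None if unmapped)."""
    u, k, r = idmapping
    return kuid - k + u if k <= kuid < k + r else None


def crossmap(first, second, uid):
    """Map `uid` down in `first`, then map the kernel id up in `second`."""
    kuid = make_kuid(first, uid)
    return None if kuid is None else from_kuid(second, kuid)


first = (0, 10000, 10000)       # u0:k10000:r10000
second = (20000, 10000, 10000)  # u20000:k10000:r10000
disjoint = (0, 30000, 10000)    # shares no kernel id with `first`

print(crossmap(first, second, 1000))    # 21000, i.e. u21000
print(crossmap(second, first, 21000))   # 1000, the inverse direction
print(crossmap(first, disjoint, 1000))  # None: k11000 is not in the kernel idmapset
```
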
232 For example, when the kernel reads a raw userspace id from disk it maps it down
233 into a kernel id according to the idmapping associated with the filesystem.
234 Let's assume the filesystem was mounted with an idmapping of
235 ``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
236 means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
237 the inode's ``i_uid`` and ``i_gid`` fields.
238 
239 When someone in userspace calls ``stat()`` or a related function to get
240 ownership information about the file the kernel can't simply map the id back up
241 according to the filesystem's idmapping as this would give the wrong owner if
242 the caller is using an idmapping.
243 
244 So the kernel will map the id back up in the idmapping of the caller. Let's
245 assume the caller has the somewhat unconventional idmapping
246 ``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
247 Consequently the user would see that this file is owned by ``u4000``.
248 
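To see why the caller's idmapping has to be used, the following Python sketch compares both candidates for the ``stat()`` example. It models only the arithmetic; the variable names and the ``(u, k, r)`` tuples are assumptions made for illustration:

```python
def make_kuid(idmapping, uid):
    """Model: map a userspace id down into a kernel id (None if unmapped)."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


def from_kuid(idmapping, kuid):
    """Model: map a kernel id up into a userspace id (None if unmapped)."""
    u, k, r = idmapping
    return kuid - k + u if k <= kuid < k + r else None


fs_idmapping = (0, 20000, 10000)         # u0:k20000:r10000
caller_idmapping = (3000, 20000, 10000)  # u3000:k20000:r10000

i_uid = make_kuid(fs_idmapping, 1000)            # what the inode stores
via_filesystem = from_kuid(fs_idmapping, i_uid)  # owner as stored on disk
via_caller = from_kuid(caller_idmapping, i_uid)  # owner as the caller sees it

print(f"i_uid=k{i_uid} filesystem view=u{via_filesystem} caller view=u{via_caller}")
```
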
249 Remapping
250 ~~~~~~~~~
251 
252 It is possible to translate a kernel id from one idmapping to another one via
253 the userspace idmapset of the two idmappings. This is equivalent to remapping
254 a kernel id.
255 
256 Let's look at an example. We are given the following two idmappings::
257 
258  1. u0:k10000:r10000
259  2. u0:k20000:r10000
260 
261 and we are given ``k11000`` in the first idmapping. In order to translate this
262 kernel id in the first idmapping into a kernel id in the second idmapping we
263 need to perform two steps:
264 
265 1. Map the kernel id up into a userspace id in the first idmapping::
266 
267     /* Map the kernel id up into a userspace id in the first idmapping. */
268     from_kuid(u0:k10000:r10000, k11000) = u1000
269 
270 2. Map the userspace id down into a kernel id in the second idmapping::
271 
272     /* Map the userspace id down into a kernel id in the second idmapping. */
273     make_kuid(u0:k20000:r10000, u1000) = k21000
274 
275 As you can see we used the userspace idmapset in both idmappings to translate
276 the kernel id in one idmapping to a kernel id in another idmapping.
277 
278 This allows us to answer the question what kernel id we would need to use to
279 get the same userspace id in another idmapping. In order to be able to answer
280 this question both idmappings need to contain the same userspace id in their
281 respective userspace idmapsets.
282 
283 Note, how we can easily get back to the kernel id in the first idmapping by
284 inverting the algorithm:
285 
286 1. Map the kernel id up into a userspace id in the second idmapping::
287 
288     /* Map the kernel id up into a userspace id in the second idmapping. */
289     from_kuid(u0:k20000:r10000, k21000) = u1000
290 
291 2. Map the userspace id down into a kernel id in the first idmapping::
292 
293     /* Map the userspace id down into a kernel id in the first idmapping. */
294     make_kuid(u0:k10000:r10000, u1000) = k11000
295 
296 Another way to look at this translation is to treat it as inverting one
297 idmapping and applying another idmapping if both idmappings have the relevant
298 userspace id mapped. This will come in handy when working with idmapped mounts.
299 
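The remapping steps can be sketched in Python as follows. As before, this is a model under stated assumptions: idmappings are ``(u, k, r)`` tuples, unmapped ids are ``None``, and ``remap_kuid`` is a hypothetical helper name, not a kernel function:

```python
def make_kuid(idmapping, uid):
    """Model: map a userspace id down into a kernel id (None if unmapped)."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


def from_kuid(idmapping, kuid):
    """Model: map a kernel id up into a userspace id (None if unmapped)."""
    u, k, r = idmapping
    return kuid - k + u if k <= kuid < k + r else None


def remap_kuid(first, second, kuid):
    """Invert `first` (map up), then apply `second` (map down)."""
    uid = from_kuid(first, kuid)
    return None if uid is None else make_kuid(second, uid)


first = (0, 10000, 10000)   # u0:k10000:r10000
second = (0, 20000, 10000)  # u0:k20000:r10000
narrow = (5000, 20000, 100)  # does not contain u1000 in its userspace idmapset

print(remap_kuid(first, second, 11000))  # 21000, i.e. k21000
print(remap_kuid(second, first, 21000))  # 11000, the inverse
print(remap_kuid(first, narrow, 11000))  # None: u1000 is unmapped in `narrow`
```
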
300 Invalid translations
301 ~~~~~~~~~~~~~~~~~~~~
302 
303 It is never valid to use an id in the kernel idmapset of one idmapping as the
304 id in the userspace idmapset of another or the same idmapping. While the kernel
305 idmapset always indicates an idmapset in the kernel id space the userspace
306 idmapset indicates a userspace id. So the following translations are forbidden::
307 
308  /* Map the userspace id down into a kernel id in the first idmapping. */
309  make_kuid(u0:k10000:r10000, u1000) = k11000
310 
311  /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
312  make_kuid(u10000:k20000:r10000, k11000) = k21000
313                                  ~~~~~~
314 
315 and equally wrong::
316 
317  /* Map the kernel id up into a userspace id in the first idmapping. */
318  from_kuid(u0:k10000:r10000, k11000) = u1000
319 
320  /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
321  from_kuid(u20000:k0:r10000, u1000) = u21000
322                              ~~~~~
323 
324 Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
325 ``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
326 conflated. So the two examples above would cause a compilation failure.
327 
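Python has no compile-time type checking of this kind, but the idea of keeping the two id spaces apart can be imitated with distinct wrapper classes and a runtime check. The classes ``Uid`` and ``Kuid`` below are stand-ins invented for this sketch; in the kernel the separation is enforced by the compiler, not at runtime:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uid:
    """Stand-in for a userspace id (uid_t)."""
    val: int


@dataclass(frozen=True)
class Kuid:
    """Stand-in for a kernel id (kuid_t)."""
    val: int


def make_kuid(idmapping, uid):
    """Model: map a userspace id down; refuses anything that is not a Uid."""
    if not isinstance(uid, Uid):
        raise TypeError("make_kuid() expects a userspace id")
    u, k, r = idmapping
    return Kuid(uid.val - u + k) if u <= uid.val < u + r else None


kuid = make_kuid((0, 10000, 10000), Uid(1000))  # valid: u1000 -> k11000

try:
    # INVALID: a kernel id is passed where a userspace id is required.
    make_kuid((10000, 20000, 10000), kuid)
    conflation_detected = False
except TypeError:
    conflation_detected = True

print(kuid, conflation_detected)
```
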
328 Idmappings when creating filesystem objects
329 -------------------------------------------
330 
331 The concepts of mapping an id down or mapping an id up are expressed in the two
332 kernel functions filesystem developers are rather familiar with and which we've
333 already used in this document::
334 
335  /* Map the userspace id down into a kernel id. */
336  make_kuid(idmapping, uid)
337 
338  /* Map the kernel id up into a userspace id. */
339  from_kuid(idmapping, kuid)
340 
341 We will take an abbreviated look into how idmappings figure into creating
342 filesystem objects. For simplicity we will only look at what happens when the
343 VFS has already completed path lookup right before it calls into the filesystem
344 itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
345 called. We will also assume that the directory we're creating filesystem
346 objects in is readable and writable for everyone.
347 
348 When creating a filesystem object the kernel will look at the caller's
349 filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
350 but they are exclusively used when determining file ownership which is why they
351 are called "filesystem ids". They are usually identical to the uid and gid of
352 the caller but can differ. We will just assume they are always identical to not
353 get lost in too many details.
354 
355 When the caller enters the kernel two things happen:
356 
357 1. Map the caller's userspace ids down into kernel ids in the caller's
358    idmapping.
359    (To be precise, the kernel will simply look at the kernel ids stashed in the
360    credentials of the current task but for our education we'll pretend this
361    translation happens just in time.)
362 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
363    filesystem's idmapping.
364 
365 The second step is important as regular filesystems will ultimately need to map
366 the kernel id back up into a userspace id when writing to disk.
367 So with the second step the kernel guarantees that a valid userspace id can be
368 written to disk. If it can't the kernel will refuse the creation request to not
369 even remotely risk filesystem corruption.
370 
371 The astute reader will have realized that this is simply a variation of the
372 crossmapping algorithm we introduced in a previous section. First, the
373 kernel maps the caller's userspace id down into a kernel id according to the
374 caller's idmapping and then maps that kernel id up according to the
375 filesystem's idmapping.
376 
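The two steps can be modelled in a few lines of Python. This is only a sketch of the arithmetic: ``owner_written_to_disk`` is a hypothetical name, idmappings are ``(u, k, r)`` tuples, and a refused creation request is represented by ``None``:

```python
def make_kuid(idmapping, uid):
    """Model: map a userspace id down into a kernel id (None if unmapped)."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


def from_kuid(idmapping, kuid):
    """Model: map a kernel id up into a userspace id (None if unmapped)."""
    u, k, r = idmapping
    return kuid - k + u if k <= kuid < k + r else None


def owner_written_to_disk(caller_idmapping, fs_idmapping, caller_uid):
    """Return the userspace id that would land on disk, or None if refused."""
    # Step 1: map the caller's userspace id down in the caller's idmapping.
    kuid = make_kuid(caller_idmapping, caller_uid)
    if kuid is None:
        return None
    # Step 2: verify the kernel id maps up in the filesystem's idmapping.
    return from_kuid(fs_idmapping, kuid)


INITIAL = (0, 0, 4294967295)  # u0:k0:r4294967295

print(owner_written_to_disk(INITIAL, INITIAL, 1000))                        # 1000
print(owner_written_to_disk((0, 10000, 10000), (0, 20000, 10000), 1000))  # None
print(owner_written_to_disk((0, 10000, 10000), INITIAL, 1000))             # 11000
```
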
377 From an implementation point of view it's worth mentioning how idmappings are
378 represented. All idmappings are taken from the corresponding user namespace.
379 
380     - caller's idmapping (usually taken from ``current_user_ns()``)
381     - filesystem's idmapping (``sb->s_user_ns``)
382     - mount's idmapping (``mnt_idmap(vfsmnt)``)
383 
384 Let's see some examples with caller/filesystem idmapping but without mount
385 idmappings. This will exhibit some problems we can hit. After that we will
386 revisit/reconsider these examples, this time using mount idmappings, to see how
387 they can solve the problems we observed before.
388 
389 Example 1
390 ~~~~~~~~~
391 
392 ::
393 
394  caller id:            u1000
395  caller idmapping:     u0:k0:r4294967295
396  filesystem idmapping: u0:k0:r4294967295
397 
398 Both the caller and the filesystem use the identity idmapping:
399 
400 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
401 
402     make_kuid(u0:k0:r4294967295, u1000) = k1000
403 
404 2. Verify that the caller's kernel ids can be mapped to userspace ids in the
405    filesystem's idmapping.
406 
407    For this second step the kernel will call the function
408    ``fsuidgid_has_mapping()`` which ultimately boils down to calling
409    ``from_kuid()``::
410 
411     from_kuid(u0:k0:r4294967295, k1000) = u1000
412 
413 In this example both idmappings are the same so there's nothing exciting going
414 on. Ultimately the userspace id that lands on disk will be ``u1000``.
415 
416 Example 2
417 ~~~~~~~~~
418 
419 ::
420 
421  caller id:            u1000
422  caller idmapping:     u0:k10000:r10000
423  filesystem idmapping: u0:k20000:r10000
424 
425 1. Map the caller's userspace ids down into kernel ids in the caller's
426    idmapping::
427 
428     make_kuid(u0:k10000:r10000, u1000) = k11000
429 
430 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
431    filesystem's idmapping::
432 
433     from_kuid(u0:k20000:r10000, k11000) = u-1
434 
435 It's immediately clear that while the caller's userspace id could be
436 successfully mapped down into kernel ids in the caller's idmapping the kernel
437 ids could not be mapped up according to the filesystem's idmapping. So the
438 kernel will deny this creation request.
439 
440 Note that while this example is less common, because most filesystems can't be
441 mounted with non-initial idmappings, this is a general problem as we can see in
442 the next examples.
443 
444 Example 3
445 ~~~~~~~~~
446 
447 ::
448 
449  caller id:            u1000
450  caller idmapping:     u0:k10000:r10000
451  filesystem idmapping: u0:k0:r4294967295
452 
453 1. Map the caller's userspace ids down into kernel ids in the caller's
454    idmapping::
455 
456     make_kuid(u0:k10000:r10000, u1000) = k11000
457 
458 2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
459    filesystem's idmapping::
460 
461     from_kuid(u0:k0:r4294967295, k11000) = u11000
462 
463 We can see that the translation always succeeds. The userspace id that the
464 filesystem will ultimately put to disk will always be identical to the value of
465 the kernel id that was created in the caller's idmapping. This has mainly two
466 consequences.
467 
468 First, that we can't allow a caller to ultimately write to disk with another
469 userspace id. We could only do this if we were to mount the whole filesystem
470 with the caller's or another idmapping. But that solution is limited to a few
471 filesystems and not very flexible. But this is a use-case that is pretty
472 important in containerized workloads.
473 
474 Second, the caller will usually not be able to create any files or access
475 directories that have stricter permissions because none of the filesystem's
476 kernel ids map up into valid userspace ids in the caller's idmapping:
477 
478 1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
479 
480     make_kuid(u0:k0:r4294967295, u1000) = k1000
481 
482 2. Map kernel ids up to userspace ids in the caller's idmapping::
483 
484     from_kuid(u0:k10000:r10000, k1000) = u-1
485 
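One way to look at this second consequence is to compute which on-disk userspace ids a caller can see as valid owners at all. The Python sketch below does that by intersecting the two kernel idmapsets; ``visible_on_disk_range`` is a made-up helper and the ``(u, k, r)`` tuples are an assumption of this model. For the idmappings of this example only on-disk ids ``u10000`` to ``u19999`` qualify, so a typical owner such as ``u1000`` does not:

```python
def visible_on_disk_range(fs_idmapping, caller_idmapping):
    """Return the range of on-disk userspace ids whose kernel ids are also
    mapped in the caller's idmapping, or None if there is no such id."""
    fs_u, fs_k, fs_r = fs_idmapping
    _, caller_k, caller_r = caller_idmapping
    # Kernel ids that are present in both kernel idmapsets.
    low = max(fs_k, caller_k)
    high = min(fs_k + fs_r, caller_k + caller_r)
    if low >= high:
        return None
    # Map the shared kernel ids up in the filesystem's idmapping.
    return range(low - fs_k + fs_u, high - fs_k + fs_u)


INITIAL = (0, 0, 4294967295)  # u0:k0:r4294967295
caller = (0, 10000, 10000)    # u0:k10000:r10000

visible = visible_on_disk_range(INITIAL, caller)
print(visible, 1000 in visible)
```
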
486 Example 4
487 ~~~~~~~~~
488 
489 ::
490 
491  file id:              u1000
492  caller idmapping:     u0:k10000:r10000
493  filesystem idmapping: u0:k0:r4294967295
494 
495 In order to report ownership to userspace the kernel uses the crossmapping
496 algorithm introduced in a previous section:
497 
498 1. Map the userspace id on disk down into a kernel id in the filesystem's
499    idmapping::
500 
501     make_kuid(u0:k0:r4294967295, u1000) = k1000
502 
503 2. Map the kernel id up into a userspace id in the caller's idmapping::
504 
505     from_kuid(u0:k10000:r10000, k1000) = u-1
506 
507 The crossmapping algorithm fails in this case because the kernel id in the
508 filesystem idmapping cannot be mapped up to a userspace id in the caller's
509 idmapping. Thus, the kernel will report the ownership of this file as the
510 overflowid.
511 
512 Example 5
513 ~~~~~~~~~
514 
515 ::
516 
517  file id:              u1000
518  caller idmapping:     u0:k10000:r10000
519  filesystem idmapping: u0:k20000:r10000
520 
521 In order to report ownership to userspace the kernel uses the crossmapping
522 algorithm introduced in a previous section:
523 
524 1. Map the userspace id on disk down into a kernel id in the filesystem's
525    idmapping::
526 
527     make_kuid(u0:k20000:r10000, u1000) = k21000
528 
529 2. Map the kernel id up into a userspace id in the caller's idmapping::
530 
531     from_kuid(u0:k10000:r10000, k21000) = u-1
532 
533 Again, the crossmapping algorithm fails in this case because the kernel id in
534 the filesystem idmapping cannot be mapped to a userspace id in the caller's
535 idmapping. Thus, the kernel will report the ownership of this file as the
536 overflowid.
537 
538 Note how in the last two examples things would be simple if the caller were
539 using the initial idmapping. For a filesystem mounted with the initial
540 idmapping it would be trivial. So we only consider a filesystem with an
541 idmapping of ``u0:k20000:r10000``:
542 
543 1. Map the userspace id on disk down into a kernel id in the filesystem's
544    idmapping::
545 
546     make_kuid(u0:k20000:r10000, u1000) = k21000
547 
548 2. Map the kernel id up into a userspace id in the caller's idmapping::
549 
550     from_kuid(u0:k0:r4294967295, k21000) = u21000
551 
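The reporting cases discussed above can be compared side by side with a small Python model. The helper ``reported_owner``, the scenario labels, and the string ``"overflowid"`` used for an unmapped owner are all choices made for this sketch, not kernel behaviour in detail:

```python
def make_kuid(idmapping, uid):
    """Model: map a userspace id down into a kernel id (None if unmapped)."""
    u, k, r = idmapping
    return uid - u + k if u <= uid < u + r else None


def from_kuid(idmapping, kuid):
    """Model: map a kernel id up into a userspace id (None if unmapped)."""
    u, k, r = idmapping
    return kuid - k + u if k <= kuid < k + r else None


INITIAL = (0, 0, 4294967295)  # u0:k0:r4294967295


def reported_owner(fs_idmapping, caller_idmapping, on_disk_uid):
    """Return the owner a caller would see, or "overflowid" if it is unmapped."""
    kuid = make_kuid(fs_idmapping, on_disk_uid)
    uid = None if kuid is None else from_kuid(caller_idmapping, kuid)
    return "overflowid" if uid is None else f"u{uid}"


scenarios = {
    "initial filesystem, idmapped caller": (INITIAL, (0, 10000, 10000)),
    "idmapped filesystem, idmapped caller": ((0, 20000, 10000), (0, 10000, 10000)),
    "idmapped filesystem, initial caller": ((0, 20000, 10000), INITIAL),
}
results = {
    name: reported_owner(fs, caller, 1000)
    for name, (fs, caller) in scenarios.items()
}
for name, owner in results.items():
    print(f"{name}: {owner}")
```
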
552 Idmappings on idmapped mounts
553 -----------------------------
554 
555 The examples we've seen in the previous section where the caller's idmapping
556 and the filesystem's idmapping are incompatible cause various issues for
557 workloads. For a more complex but common example, consider two containers
558 started on the host. To completely prevent the two containers from affecting
559 each other, an administrator may often use different non-overlapping idmappings
560 for the two containers::
561 
562  container1 idmapping:  u0:k10000:r10000
563  container2 idmapping:  u0:k20000:r10000
564  filesystem idmapping:  u0:k30000:r10000
565 
566 An administrator wanting to provide easy read-write access to the following set
567 of files::
568 
569  dir id:       u0
570  dir/file1 id: u1000
571  dir/file2 id: u2000
572 
573 to both containers currently can't.
574 
575 Of course the administrator has the option to recursively change ownership via
576 ``chown()``. For example, they could change ownership so that ``dir`` and all
577 files below it can be crossmapped from the filesystem's into the container's
578 idmapping. Let's assume they change ownership so it is compatible with the
579 first container's idmapping::
580 
581  dir id:       u10000
582  dir/file1 id: u11000
583  dir/file2 id: u12000
584 
585 This would still leave ``dir`` rather useless to the second container. In fact,
586 ``dir`` and all files below it would continue to appear owned by the overflowid
587 for the second container.
588 
589 Or consider another increasingly popular example. Some service managers such as
590 systemd implement a concept called "portable home directories". A user may want
591 to use their home directories on different machines where they are assigned
592 different login userspace ids. Most users will have ``u1000`` as the login id
593 on their machine at home and all files in their home directory will usually be
594 owned by ``u1000``. At uni or at work they may have another login id such as
595 ``u1125``. This makes it rather difficult to interact with their home directory
596 on their work machine.
597 
598 In both cases changing ownership recursively has grave implications. The most
599 obvious one is that ownership is changed globally and permanently. In the home
600 directory case this change in ownership would even need to happen every time the
601 user switches from their home to their work machine. For really large sets of
602 files this becomes increasingly costly.
603 
604 If the user is lucky, they are dealing with a filesystem that is mountable
605 inside user namespaces. But this would also change ownership globally and the
606 change in ownership is tied to the lifetime of the filesystem mount, i.e. the
607 superblock. The only way to change ownership is to completely unmount the
608 filesystem and mount it again in another user namespace. This is usually
609 impossible because it would mean that all users currently accessing the
610 filesystem can't anymore. And it means that ``dir`` still can't be shared
611 between two containers with different idmappings.
612 But usually the user doesn't even have this option since most filesystems
613 aren't mountable inside containers. And not having them mountable might be
614 desirable as it doesn't require the filesystem to deal with malicious
615 filesystem images.
616 
617 But the usecases mentioned above and more can be handled by idmapped mounts.
618 They allow exposing the same set of dentries with different ownership at
619 different mounts. This is achieved by marking the mounts with a user namespace
620 through the ``mount_setattr()`` system call. The idmapping associated with it
621 is then used to translate from the caller's idmapping to the filesystem's
622 idmapping and vice versa using the remapping algorithm we introduced above.
623 
624 Idmapped mounts make it possible to change ownership in a temporary and
625 localized way. The ownership changes are restricted to a specific mount and the
626 ownership changes are tied to the lifetime of the mount. All other users and
627 locations where the filesystem is exposed are unaffected.
628 
629 Filesystems that support idmapped mounts don't have any real reason to support
630 being mountable inside user namespaces. A filesystem could be exposed
631 completely under an idmapped mount to get the same effect. This has the
632 advantage that filesystems can leave the creation of the superblock to
633 privileged users in the initial user namespace.
634 
635 However, it is perfectly possible to combine idmapped mounts with filesystems
636 mountable inside user namespaces. We will touch on this further below.
637 
638 Filesystem types vs idmapped mount types
639 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
640 
641 With the introduction of idmapped mounts we need to distinguish between
642 filesystem ownership and mount ownership of a VFS object such as an inode. The
643 owner of an inode might be different when looked at from a filesystem
644 perspective than when looked at from an idmapped mount. Such fundamental
645 conceptual distinctions should almost always be clearly expressed in the code.
646 So, to distinguish idmapped mount ownership from filesystem ownership separate
647 types have been introduced.
648 
649 If a uid or gid has been generated using the filesystem or caller's idmapping
650 then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
651 has been generated using a mount idmapping then we will be using the dedicated
652 ``vfsuid_t`` and ``vfsgid_t`` types.
653 
654 All VFS helpers that generate or take uids and gids as arguments use the
655 ``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
656 to catch errors that originate from conflating filesystem and VFS uids and gids.
657 
658 The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
659 and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
660 from and to ``uid_t`` and ``gid_t`` types::
661 
662  uid_t <--> kuid_t <--> vfsuid_t
663  gid_t <--> kgid_t <--> vfsgid_t
664 
665 Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
666 e.g., during ``stat()``, or store ownership information in a shared VFS object
667 based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
668 use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.
669 
670 To illustrate why these helpers currently exist, consider what happens when we
671 change ownership of an inode from an idmapped mount. After we generated
672 a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to
673 this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem wide ownership.
674 Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
675 or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
676 ``vfsgid_into_kgid()``.
677 
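The following Python sketch imitates this commit step. It assumes that the conversion keeps the numeric value and only changes the type; the classes ``Kuid``, ``Vfsuid`` and ``Inode`` are stand-ins invented for this illustration, the id ``11000`` is an arbitrary example value, and the runtime check replaces what the compiler enforces in the kernel:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kuid:
    """Stand-in for a filesystem-wide kernel id (kuid_t)."""
    val: int


@dataclass(frozen=True)
class Vfsuid:
    """Stand-in for a mount-specific VFS id (vfsuid_t)."""
    val: int


def vfsuid_into_kuid(vfsuid):
    """Model: commit a VFS id as the new filesystem-wide kernel id."""
    if not isinstance(vfsuid, Vfsuid):
        raise TypeError("vfsuid_into_kuid() expects a VFS id")
    return Kuid(vfsuid.val)


class Inode:
    """Stand-in for a shared VFS object that may only store kernel ids."""

    def __init__(self, i_uid):
        self.set_owner(i_uid)

    def set_owner(self, kuid):
        if not isinstance(kuid, Kuid):
            raise TypeError("shared VFS objects store kernel ids only")
        self.i_uid = kuid


inode = Inode(Kuid(1000))
new_owner = Vfsuid(11000)  # hypothetical id generated via a mount idmapping

try:
    inode.set_owner(new_owner)  # storing the VFS id directly is rejected
    stored_vfsuid_directly = True
except TypeError:
    stored_vfsuid_directly = False

inode.set_owner(vfsuid_into_kuid(new_owner))
print(inode.i_uid, stored_vfsuid_directly)
```
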
Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information, a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.
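
To make the distinction concrete, here is a minimal userspace model in Python
(not kernel code). ``Idmap``, ``map_up()``, ``map_down()``, ``Inode`` and
``view_through_mount()`` are hypothetical names, ``None`` models an id that
cannot be mapped, and the second mount idmapping is invented for illustration.
The inode caches one global kernel id while each mount derives its own VFS id
on demand.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


class Inode:
    """Shared object: caches only the global kernel id."""
    def __init__(self, i_uid):
        self.i_uid = i_uid


def view_through_mount(inode, fs, mount):
    """Per-mount VFS id, derived on demand and never stored in the inode."""
    return map_down(mount, map_up(fs, inode.i_uid))


fs = Idmap(0, 20000, 10000)       # u0:k20000:r10000
mount_a = Idmap(0, 10000, 10000)  # u0:v10000:r10000
mount_b = Idmap(0, 30000, 10000)  # hypothetical second mount, u0:v30000:r10000
inode = Inode(21000)
view_a = view_through_mount(inode, fs, mount_a)
view_b = view_through_mount(inode, fs, mount_b)
```

The two mounts see different VFS ids for the same inode, and neither lookup
changes the cached ``i_uid``.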
682 
683 We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
684 on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types, or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types, filesystem idmappings and mount idmappings are different
types as well.
689 
All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping of type ``struct mnt_idmap`` to be passed. Passing
a filesystem or caller idmapping will cause a compilation error.
693 
694 Similar to how we prefix all userspace ids in this document with ``u`` and all
695 kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
696 idmapping will be written as: ``u0:v10000:r10000``.
697 
698 Remapping helpers
699 ~~~~~~~~~~~~~~~~~
700 
701 Idmapping functions were added that translate between idmappings. They make use
702 of the remapping algorithm we've introduced earlier. We're going to look at:
703 
704 - ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
705 
  The ``i_*id_into_vfs*id()`` functions translate the filesystem's kernel ids
  into VFS ids in the mount's idmapping::

   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
   from_kuid(filesystem, kid) = uid

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(mount, uid) = kuid
714 
715 - ``mapped_fsuid()`` and ``mapped_fsgid()``
716 
717   The ``mapped_fs*id()`` functions translate the caller's kernel ids into
718   kernel ids in the filesystem's idmapping. This translation is achieved by
719   remapping the caller's VFS ids using the mount's idmapping::
720 
721    /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
722    from_kuid(mount, kid) = uid
723 
724    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
725    make_kuid(filesystem, uid) = kuid
726 
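- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

  Whenever ownership is reported or stored in a shared VFS object based on
  a ``vfsuid_t`` or ``vfsgid_t``, these helpers turn it into a ``kuid_t`` or
  ``kgid_t``.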
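
The two-step structure of the remapping helpers listed above can be sketched
as a userspace model in Python. This is not the kernel implementation:
``Idmap``, ``map_up()`` and ``map_down()`` are hypothetical names, ``None``
models an id that cannot be mapped, and the argument lists are simplified and
do not match the kernel prototypes.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


def i_uid_into_vfsuid(mount, fs, kid):
    """Filesystem kernel id -> VFS id: up through fs, down through mount."""
    return map_down(mount, map_up(fs, kid))


def mapped_fsuid(mount, fs, caller_kid):
    """Caller id -> filesystem kernel id: up through mount, down through fs."""
    return map_down(fs, map_up(mount, caller_kid))


fs = Idmap(0, 20000, 10000)     # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)  # u0:v10000:r10000
```

With these idmappings ``i_uid_into_vfsuid(mount, fs, 21000)`` yields ``11000``
and ``mapped_fsuid(mount, fs, 11000)`` yields ``21000``. An id outside the
relevant range, such as ``5000`` for ``mapped_fsuid()``, yields ``None``.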
730 
Note that ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` invert each other.
Consider the following idmappings::
733 
734  caller idmapping:     u0:k10000:r10000
735  filesystem idmapping: u0:k20000:r10000
736  mount idmapping:      u0:v10000:r10000
737 
738 Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
739 to ``k21000`` according to its idmapping. This is what is stored in the
740 inode's ``i_uid`` and ``i_gid`` fields.
741 
742 When the caller queries the ownership of this file via ``stat()`` the kernel
743 would usually simply use the crossmapping algorithm and map the filesystem's
744 kernel id up to a userspace id in the caller's idmapping.
745 
746 But when the caller is accessing the file on an idmapped mount the kernel will
747 first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
748 id into a VFS id in the mount's idmapping::
749 
750  i_uid_into_vfsuid(k21000):
751    /* Map the filesystem's kernel id up into a userspace id. */
752    from_kuid(u0:k20000:r10000, k21000) = u1000
753 
754    /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
755    make_kuid(u0:v10000:r10000, u1000) = v11000
756 
757 Finally, when the kernel reports the owner to the caller it will turn the
758 VFS id in the mount's idmapping into a userspace id in the caller's
759 idmapping::
760 
761   k11000 = vfsuid_into_kuid(v11000)
762   from_kuid(u0:k10000:r10000, k11000) = u1000
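
The difference between the plain crossmapping and the idmapped mount path for
``stat()`` can be replayed in a small Python model. It is a sketch under
stated assumptions, not kernel code: ``Idmap``, ``map_up()``, ``map_down()``
and the two ``stat_uid_*()`` functions are hypothetical names, and ``None``
models an id that cannot be mapped.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


caller = Idmap(0, 10000, 10000)  # u0:k10000:r10000
fs = Idmap(0, 20000, 10000)      # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)   # u0:v10000:r10000


def stat_uid_plain(i_uid):
    """No idmapped mount: crossmap straight into the caller's idmapping."""
    return map_up(caller, i_uid)


def stat_uid_idmapped(i_uid):
    """Idmapped mount: remap through the mount before reporting."""
    vfsuid = map_down(mount, map_up(fs, i_uid))  # k21000 -> u1000 -> v11000
    kuid = vfsuid                                # vfsuid_into_kuid() keeps the value
    return map_up(caller, kuid)                  # k11000 -> u1000
```

For ``i_uid = 21000`` the plain path has no mapping in the caller's idmapping,
while the idmapped path reports ``1000``.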
763 
764 We can test whether this algorithm really works by verifying what happens when
765 we create a new file. Let's say the user is creating a file with ``u1000``.
766 
767 The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
768 kernel would now apply the crossmapping, verifying that ``k11000`` can be
769 mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
770 be mapped up in the filesystem's idmapping directly this creation request
771 fails.
772 
773 But when the caller is accessing the file on an idmapped mount the kernel will
774 first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
775 a VFS id according to the mount's idmapping::
776 
777  mapped_fsuid(k11000):
778     /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
779     from_kuid(u0:k10000:r10000, k11000) = u1000
780 
781     /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
782     make_kuid(u0:v20000:r10000, u1000) = v21000
783 
784 When finally writing to disk the kernel will then map ``v21000`` up into a
785 userspace id in the filesystem's idmapping::
786 
787    k21000 = vfsuid_into_kuid(v21000)
788    from_kuid(u0:k20000:r10000, k21000) = u1000
789 
790 As we can see, we end up with an invertible and therefore information
791 preserving algorithm. A file created from ``u1000`` on an idmapped mount will
also be reported as being owned by ``u1000`` and vice versa.
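
The invertibility claim can be checked exhaustively for the idmappings used
in this walk-through. The following Python is a userspace model only:
``Idmap``, ``map_up()``, ``map_down()``, ``create()`` and ``report()`` are
hypothetical names, and ``None`` models an id that cannot be mapped.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


caller = Idmap(0, 10000, 10000)  # u0:k10000:r10000
fs = Idmap(0, 20000, 10000)      # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)   # u0:v10000:r10000


def create(uid):
    """Caller's userspace id -> userspace id written to disk."""
    kid = map_down(caller, uid)
    fs_kid = map_down(fs, map_up(mount, kid))  # the mapped_fsuid() remapping
    return map_up(fs, fs_kid)


def report(disk_uid):
    """Userspace id on disk -> userspace id reported to the caller."""
    i_uid = map_down(fs, disk_uid)
    vfsuid = map_down(mount, map_up(fs, i_uid))  # the i_uid_into_vfsuid() remapping
    return map_up(caller, vfsuid)


# Every id in the caller's range survives a create-then-report round trip.
round_trip_ok = all(report(create(u)) == u for u in range(10000))
```

Ids outside the caller's range, for example ``u10000``, have no mapping in
this model and ``create()`` returns ``None`` for them.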
793 
794 Let's now briefly reconsider the failing examples from earlier in the context
795 of idmapped mounts.
796 
797 Example 2 reconsidered
798 ~~~~~~~~~~~~~~~~~~~~~~
799 
800 ::
801 
802  caller id:            u1000
803  caller idmapping:     u0:k10000:r10000
804  filesystem idmapping: u0:k20000:r10000
805  mount idmapping:      u0:v10000:r10000
806 
807 When the caller is using a non-initial idmapping the common case is to attach
808 the same idmapping to the mount. We now perform three steps:
809 
810 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
811 
812     make_kuid(u0:k10000:r10000, u1000) = k11000
813 
814 2. Translate the caller's VFS id into a kernel id in the filesystem's
815    idmapping::
816 
817     mapped_fsuid(v11000):
818       /* Map the VFS id up into a userspace id in the mount's idmapping. */
819       from_kuid(u0:v10000:r10000, v11000) = u1000
820 
821       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
822       make_kuid(u0:k20000:r10000, u1000) = k21000
823 
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::
826 
827     from_kuid(u0:k20000:r10000, k21000) = u1000
828 
829 So the ownership that lands on disk will be ``u1000``.
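
The three steps above, together with the failing case without an idmapped
mount, can be replayed in a short Python model. It is a sketch, not kernel
code: ``Idmap``, ``map_up()``, ``map_down()`` and ``on_disk_owner()`` are
hypothetical names, and ``None`` models an id that cannot be mapped.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


def on_disk_owner(uid, caller, fs, mount=None):
    """Userspace id written to disk when `uid` creates a file, or None."""
    kid = map_down(caller, uid)                 # step 1
    if mount is not None:
        kid = map_down(fs, map_up(mount, kid))  # step 2: mapped_fsuid()
    return map_up(fs, kid)                      # step 3


caller = Idmap(0, 10000, 10000)  # u0:k10000:r10000
fs = Idmap(0, 20000, 10000)      # u0:k20000:r10000
mount = Idmap(0, 10000, 10000)   # u0:v10000:r10000

with_mount = on_disk_owner(1000, caller, fs, mount)
without_mount = on_disk_owner(1000, caller, fs)
```

With the mount idmapping the result is ``1000``. Without it ``k11000`` has no
mapping in the filesystem's idmapping and the model returns ``None``.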
830 
831 Example 3 reconsidered
832 ~~~~~~~~~~~~~~~~~~~~~~
833 
834 ::
835 
836  caller id:            u1000
837  caller idmapping:     u0:k10000:r10000
838  filesystem idmapping: u0:k0:r4294967295
839  mount idmapping:      u0:v10000:r10000
840 
841 The same translation algorithm works with the third example.
842 
843 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
844 
845     make_kuid(u0:k10000:r10000, u1000) = k11000
846 
847 2. Translate the caller's VFS id into a kernel id in the filesystem's
848    idmapping::
849 
850     mapped_fsuid(v11000):
851        /* Map the VFS id up into a userspace id in the mount's idmapping. */
852        from_kuid(u0:v10000:r10000, v11000) = u1000
853 
854        /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
855        make_kuid(u0:k0:r4294967295, u1000) = k1000
856 
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000
861 
862 So the ownership that lands on disk will be ``u1000``.
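
Comparing this example with the previous one shows that the filesystem's
idmapping only changes the intermediate kernel id, not the id that lands on
disk. The Python model below illustrates this under stated assumptions:
``Idmap``, ``map_up()``, ``map_down()`` and ``create()`` are hypothetical
names, and ``None`` models an id that cannot be mapped.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


def create(uid, caller, mount, fs):
    """Return (kernel id in the filesystem's idmapping, userspace id on disk)."""
    kid = map_down(caller, uid)
    fs_kid = map_down(fs, map_up(mount, kid))  # mapped_fsuid()
    return fs_kid, map_up(fs, fs_kid)


caller = Idmap(0, 10000, 10000)  # u0:k10000:r10000
mount = Idmap(0, 10000, 10000)   # u0:v10000:r10000
filesystems = {
    "example 2": Idmap(0, 20000, 10000),   # u0:k20000:r10000
    "example 3": Idmap(0, 0, 4294967295),  # u0:k0:r4294967295
}
results = {name: create(1000, caller, mount, fs)
           for name, fs in filesystems.items()}
```

Both entries end with ``1000`` as the on-disk id, while the kernel id is
``21000`` for the second example's filesystem and ``1000`` for the third's.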
863 
864 Example 4 reconsidered
865 ~~~~~~~~~~~~~~~~~~~~~~
866 
867 ::
868 
869  file id:              u1000
870  caller idmapping:     u0:k10000:r10000
871  filesystem idmapping: u0:k0:r4294967295
872  mount idmapping:      u0:v10000:r10000
873 
874 In order to report ownership to userspace the kernel now does three steps using
875 the translation algorithm we introduced earlier:
876 
877 1. Map the userspace id on disk down into a kernel id in the filesystem's
878    idmapping::
879 
880     make_kuid(u0:k0:r4294967295, u1000) = k1000
881 
882 2. Translate the kernel id into a VFS id in the mount's idmapping::
883 
884     i_uid_into_vfsuid(k1000):
885       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
886       from_kuid(u0:k0:r4294967295, k1000) = u1000
887 
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
889       make_kuid(u0:v10000:r10000, u1000) = v11000
890 
891 3. Map the VFS id up into a userspace id in the caller's idmapping::
892 
893     k11000 = vfsuid_into_kuid(v11000)
894     from_kuid(u0:k10000:r10000, k11000) = u1000
895 
Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as
owned by ``u1000`` according to the mount's idmapping.
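
The reporting path can be modeled in Python as well, including what userspace
sees when the id cannot be mapped. This is a sketch and not kernel code:
``Idmap``, ``map_up()``, ``map_down()`` and ``reported_owner()`` are
hypothetical names, and the overflow id of ``65534`` is an assumption based on
the usual default of ``/proc/sys/kernel/overflowuid``.

```python
from typing import NamedTuple

OVERFLOW_UID = 65534  # assumed default overflow id reported for unmapped ids


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


def reported_owner(disk_uid, caller, fs, mount=None):
    """Userspace id shown to the caller for a file owned by `disk_uid`."""
    kid = map_down(fs, disk_uid)                # step 1
    if mount is not None:
        # step 2: i_uid_into_vfsuid(), then vfsuid_into_kuid() keeps the value
        kid = map_down(mount, map_up(fs, kid))
    uid = map_up(caller, kid)                   # step 3
    return OVERFLOW_UID if uid is None else uid


caller = Idmap(0, 10000, 10000)  # u0:k10000:r10000
fs = Idmap(0, 0, 4294967295)     # u0:k0:r4294967295
mount = Idmap(0, 10000, 10000)   # u0:v10000:r10000

without_mount = reported_owner(1000, caller, fs)
with_mount = reported_owner(1000, caller, fs, mount)
```

Without the idmapped mount the model falls back to the overflow id. With it
the owner is reported as ``1000``.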
900 
901 Example 5 reconsidered
902 ~~~~~~~~~~~~~~~~~~~~~~
903 
904 ::
905 
906  file id:              u1000
907  caller idmapping:     u0:k10000:r10000
908  filesystem idmapping: u0:k20000:r10000
909  mount idmapping:      u0:v10000:r10000
910 
911 Again, in order to report ownership to userspace the kernel now does three
912 steps using the translation algorithm we introduced earlier:
913 
914 1. Map the userspace id on disk down into a kernel id in the filesystem's
915    idmapping::
916 
917     make_kuid(u0:k20000:r10000, u1000) = k21000
918 
919 2. Translate the kernel id into a VFS id in the mount's idmapping::
920 
921     i_uid_into_vfsuid(k21000):
922       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
923       from_kuid(u0:k20000:r10000, k21000) = u1000
924 
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
926       make_kuid(u0:v10000:r10000, u1000) = v11000
927 
928 3. Map the VFS id up into a userspace id in the caller's idmapping::
929 
930     k11000 = vfsuid_into_kuid(v11000)
931     from_kuid(u0:k10000:r10000, k11000) = u1000
932 
Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as
owned by ``u1000`` according to the mount's idmapping.
937 
938 Changing ownership on a home directory
939 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
940 
We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of usecases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. The consequence is, as we have seen, that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem doesn't work because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.
949 
950 As we've seen above idmapped mounts provide a solution to this by remapping the
951 caller's or filesystem's idmapping according to the mount's idmapping.
952 
953 Aside from containerized workloads, idmapped mounts have the advantage that
954 they also work when both the caller and the filesystem use the initial
955 idmapping which means users on the host can change the ownership of directories
956 and files on a per-mount basis.
957 
958 Consider our previous example where a user has their home directory on portable
959 storage. At home they have id ``u1000`` and all files in their home directory
960 are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
961 
962 Taking their home directory with them becomes problematic. They can't easily
963 access their files, they might not be able to write to disk without applying
964 lax permissions or ACLs and even if they can, they will end up with an annoying
965 mix of files and directories owned by ``u1000`` and ``u1125``.
966 
Idmapped mounts make it possible to solve this problem. A user can create an
idmapped mount for their home directory on their work computer or their
computer at home depending on what ownership they would prefer to end up on
the portable storage itself.
971 
Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::
977 
978  caller id:            u1125
979  caller idmapping:     u0:k0:r4294967295
980  filesystem idmapping: u0:k0:r4294967295
981  mount idmapping:      u1000:v1125:r1
982 
983 1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
984 
985     make_kuid(u0:k0:r4294967295, u1125) = k1125
986 
987 2. Translate the caller's VFS id into a kernel id in the filesystem's
988    idmapping::
989 
990     mapped_fsuid(v1125):
991       /* Map the VFS id up into a userspace id in the mount's idmapping. */
992       from_kuid(u1000:v1125:r1, v1125) = u1000
993 
994       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
995       make_kuid(u0:k0:r4294967295, u1000) = k1000
996 
3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::
999 
1000     from_kuid(u0:k0:r4294967295, k1000) = u1000
1001 
1002 So ultimately the file will be created with ``u1000`` on disk.
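
Because the mount idmapping in this scenario covers a single id, only the
caller with ``u1125`` can create files through it. The Python model below
replays the creation steps for a few caller ids. It is only a sketch:
``Idmap``, ``map_up()``, ``map_down()`` and ``on_disk_owner()`` are
hypothetical names, and ``None`` models an id that cannot be mapped.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


caller = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295
fs = Idmap(0, 0, 4294967295)      # u0:k0:r4294967295
mount = Idmap(1000, 1125, 1)      # u1000:v1125:r1


def on_disk_owner(uid):
    """Userspace id written to disk when `uid` creates a file, or None."""
    kid = map_down(caller, uid)                # step 1
    fs_kid = map_down(fs, map_up(mount, kid))  # step 2: mapped_fsuid()
    return map_up(fs, fs_kid)                  # step 3


owners = {uid: on_disk_owner(uid) for uid in (1000, 1125, 1126)}
```

Only ``u1125`` maps to an on-disk owner, namely ``u1000``. The other two
caller ids fall outside the one-element range of the mount idmapping.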
1003 
1004 Now let's briefly look at what ownership the caller with id ``u1125`` will see
1005 on their work computer:
1006 
1007 ::
1008 
1009  file id:              u1000
1010  caller idmapping:     u0:k0:r4294967295
1011  filesystem idmapping: u0:k0:r4294967295
1012  mount idmapping:      u1000:v1125:r1
1013 
1014 1. Map the userspace id on disk down into a kernel id in the filesystem's
1015    idmapping::
1016 
1017     make_kuid(u0:k0:r4294967295, u1000) = k1000
1018 
1019 2. Translate the kernel id into a VFS id in the mount's idmapping::
1020 
1021     i_uid_into_vfsuid(k1000):
1022       /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
1023       from_kuid(u0:k0:r4294967295, k1000) = u1000
1024 
      /* Map the userspace id down into a VFS id in the mount's idmapping. */
1026       make_kuid(u1000:v1125:r1, u1000) = v1125
1027 
1028 3. Map the VFS id up into a userspace id in the caller's idmapping::
1029 
1030     k1125 = vfsuid_into_kuid(v1125)
1031     from_kuid(u0:k0:r4294967295, k1125) = u1125
1032 
So ultimately the file will be reported to the caller as belonging to ``u1125``
which is the caller's userspace id on their workstation in our example.
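
The reverse direction can be modeled the same way. In the Python sketch below
(not kernel code) ``Idmap``, ``map_up()``, ``map_down()`` and
``reported_owner()`` are hypothetical names, and ``None`` models an id that
has no mapping through the mount.

```python
from typing import NamedTuple


class Idmap(NamedTuple):
    upper: int  # first id of the upper range (the "u" side)
    lower: int  # first id of the lower range (the "k" or "v" side)
    count: int  # number of ids in the range (the "r" part)


def map_up(m, low_id):
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low_id is None or not 0 <= low_id - m.lower < m.count:
        return None
    return m.upper + (low_id - m.lower)


def map_down(m, up_id):
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up_id is None or not 0 <= up_id - m.upper < m.count:
        return None
    return m.lower + (up_id - m.upper)


caller = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295
fs = Idmap(0, 0, 4294967295)      # u0:k0:r4294967295
mount = Idmap(1000, 1125, 1)      # u1000:v1125:r1


def reported_owner(disk_uid):
    """Userspace id reported to the caller, or None if unmapped."""
    i_uid = map_down(fs, disk_uid)               # step 1
    vfsuid = map_down(mount, map_up(fs, i_uid))  # step 2: i_uid_into_vfsuid()
    kuid = vfsuid                                # vfsuid_into_kuid() keeps the value
    return map_up(caller, kuid)                  # step 3


reported = {uid: reported_owner(uid) for uid in (0, 1000, 1125)}
```

Only files owned by ``u1000`` on disk show up as ``u1125``. Files with any
other on-disk owner have no mapping through this minimal mount idmapping.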
1035 
1036 The raw userspace id that is put on disk is ``u1000`` so when the user takes
1037 their home directory back to their home computer where they are assigned
1038 ``u1000`` using the initial idmapping and mount the filesystem with the initial
1039 idmapping they will see all those files owned by ``u1000``.
