unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, makes no distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they need to disassociate
from the default shared namespace. The following lists two use-cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.

2.2 unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.
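
A minimal sketch of this scenario, assuming the glibc clone() wrapper and a
stack that grows downward: a worker is created sharing the server's file
descriptor table, unshares it part-way through, and then closes a descriptor
without affecting the server. The names worker() and
fd_survives_worker_close() and the 64 KiB stack size are illustrative choices.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Worker: make the descriptor table private, then close a descriptor. */
static int worker(void *arg)
{
    int fd = *(int *)arg;

    if (unshare(CLONE_FILES) == -1)
        return 2;       /* could not unshare: leave the shared table alone */
    close(fd);          /* affects only the worker's private copy */
    return 0;
}

/*
 * Returns 1 if the parent's descriptor survived the worker's close(),
 * 0 if it did not, and -1 if the demonstration could not be run.
 */
int fd_survives_worker_close(void)
{
    int status, result = -1;
    pid_t pid;
    int fd = open("/dev/null", O_RDONLY);
    char *stack = malloc(STACK_SIZE);

    if (fd == -1 || stack == NULL)
        goto out;

    /* CLONE_FILES: the worker starts out sharing our descriptor table. */
    pid = clone(worker, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, &fd);
    if (pid == -1)
        goto out;

    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = (fcntl(fd, F_GETFD) != -1);
out:
    if (fd != -1)
        close(fd);
    free(stack);
    return result;
}
```

Without the unshare() call in worker(), the close() would also remove the
descriptor from the parent's table, because both tasks would still be using
the same table.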

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions used by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have an interface similar to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes. Part
        of the execution context, such as the namespace, is shared by default
        when a new process is created using fork(2), while other parts,
        such as the virtual memory, open file descriptors, etc., may be
        shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one or more of the following
        constants, bitwise-or'ed together.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned. On failure, -1 is returned and errno is
        set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)
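
A short usage sketch for the interface specified above, assuming glibc
(which declares unshare() in <sched.h> when _GNU_SOURCE is defined). The
wrapper name unshare_verbose() and the message texts are hypothetical; only
the three errors listed in this specification are given specific messages.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical wrapper: call unshare() and describe the documented
 * errors. Returns 0 on success, -1 on failure with errno preserved.
 */
int unshare_verbose(int flags)
{
    int saved;

    if (unshare(flags) == 0)
        return 0;

    saved = errno;      /* fprintf() may overwrite errno */
    switch (saved) {
    case EPERM:
        fprintf(stderr, "unshare: permission denied (CAP_SYS_ADMIN needed?)\n");
        break;
    case ENOMEM:
        fprintf(stderr, "unshare: out of memory while copying context\n");
        break;
    case EINVAL:
        fprintf(stderr, "unshare: invalid flag in 0x%x\n", (unsigned int)flags);
        break;
    default:
        fprintf(stderr, "unshare: unexpected errno %d\n", saved);
        break;
    }
    errno = saved;
    return -1;
}
```

A caller might invoke unshare_verbose(CLONE_FS) before privately changing
its working directory or umask, or pass CLONE_FS | CLONE_FILES to unshare
both structures in one call.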

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task. Therefore unshare() has to take appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare() system call,
on the other hand, performs the following:

  1) Check flags to force missing, but implied, flags

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
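
The allocate-everything-then-commit ordering of steps 2 to 4 can be
sketched in user space as follows. This is a simplified model, not kernel
code: struct ctx, struct task, dup_ctx(), put_ctx() and unshare_model() are
hypothetical stand-ins, reference counting is reduced to a plain integer,
and the locking is only indicated by a comment.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel context structures; not kernel code. */
struct ctx {
    int refs;   /* number of tasks sharing this structure */
    int data;   /* placeholder for the real contents */
};

struct task {
    struct ctx *fs;
    struct ctx *files;
};

/* Allocate a private copy; the caller owns the single reference. */
static struct ctx *dup_ctx(const struct ctx *old)
{
    struct ctx *copy = malloc(sizeof(*copy));

    if (copy) {
        copy->refs = 1;
        copy->data = old->data;
    }
    return copy;
}

/* Drop one reference and free the structure when nobody uses it. */
static void put_ctx(struct ctx *c)
{
    if (c && --c->refs == 0)
        free(c);
}

int unshare_model(struct task *t, int want_fs, int want_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL, *old;

    /* Step 2: duplicate everything requested before touching the task. */
    if (want_fs && !(new_fs = dup_ctx(t->fs)))
        goto fail;
    if (want_files && !(new_files = dup_ctx(t->files)))
        goto fail;

    /* Step 3: commit; the kernel would hold task_lock() around the swaps. */
    if (new_fs) {
        old = t->fs;
        t->fs = new_fs;
        new_fs = old;
    }
    if (new_files) {
        old = t->files;
        t->files = new_files;
        new_files = old;
    }

    /* Step 4: new_fs and new_files now point at the old shared copies. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;

fail:
    /* Nothing was committed, so the task still uses the shared copies. */
    free(new_fs);
    free(new_files);
    return -ENOMEM;
}
```

Because no task pointer is modified until every allocation has succeeded,
the failure path only has to free private copies and never needs to go back
to a shared structure that may already have been released.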

7) Low Level Design
-------------------

Implementation of unshare() can be grouped in the following 4 different
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure. The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.
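
The split can be pictured with a small user-space model. The structures and
the names dup_fs_model() and copy_fs_model() are hypothetical and only
loosely modelled on the filesystem context; the actual kernel functions have
different signatures and handle locking and reference counting properly.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a context structure and a task structure. */
struct fs_model {
    int users;          /* number of tasks sharing this structure */
    int umask;
    char cwd[64];
};

struct task_model {
    struct fs_model *fs;
};

/* First component: allocate and duplicate. Knows nothing about tasks. */
struct fs_model *dup_fs_model(const struct fs_model *old)
{
    struct fs_model *fs = malloc(sizeof(*fs));

    if (fs) {
        *fs = *old;     /* copy the contents */
        fs->users = 1;  /* the copy starts out unshared */
    }
    return fs;
}

/* Second component: share the structure or link a fresh copy to tsk. */
int copy_fs_model(int share, struct fs_model *current_fs,
                  struct task_model *tsk)
{
    if (share) {
        current_fs->users++;
        tsk->fs = current_fs;
        return 0;
    }

    tsk->fs = dup_fs_model(current_fs);
    return tsk->fs ? 0 : -ENOMEM;
}
```

With the duplication isolated in dup_fs_model(), a caller that operates on
an already running task can reuse it without going through the task-linking
logic in copy_fs_model().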

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       * Check flags
         Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
         helper routine with flags passed into the system call and a
         reference to a pointer to the new unshared structure.

       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.

       * For all newly unshared structures, release the corresponding
         older, shared, structures.
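
The flag-forcing rules in the first step translate almost directly into
code. In this sketch the function name force_implied_flags() and the
signals_shared parameter are assumptions; the parameter stands in for
whatever test the kernel uses to decide that signals are currently shared.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Sketch of the "check flags" step. signals_shared is a hypothetical
 * stand-in for the kernel's test of whether signals are being shared
 * with another task.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* Unsharing the thread group implies unsharing virtual memory. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* Unsharing virtual memory implies unsharing signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Unsharing handlers while signals are shared implies CLONE_THREAD. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* Unsharing the namespace implies unsharing filesystem information. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

The order of the checks matters: CLONE_THREAD must be tested before
CLONE_VM, and CLONE_VM before CLONE_SIGHAND, so that each implied flag is
visible to the rule that depends on it.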

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.
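
Both kinds of helper can be modelled in a few lines. The names
unshare_sighand_model() and unshare_fs_model() and the struct are
hypothetical; the sketch follows the description above and omits any
checks a real helper would need beyond the flag test.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shareable context structure. */
struct fs_model {
    int users;
    int umask;
};

/* Helper for a context whose unsharing is not implemented yet. */
int unshare_sighand_model(unsigned long flags)
{
    return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Helper for a supported context. On success *new_fsp is either NULL
 * (unsharing not requested) or a freshly duplicated private structure.
 */
int unshare_fs_model(unsigned long flags, const struct fs_model *cur,
                     struct fs_model **new_fsp)
{
    struct fs_model *copy;

    *new_fsp = NULL;
    if (!(flags & CLONE_FS))
        return 0;               /* nothing to do for this context */

    copy = malloc(sizeof(*copy));   /* plays the role of dup_*() */
    if (!copy)
        return -ENOMEM;

    *copy = *cur;
    copy->users = 1;
    *new_fsp = copy;
    return 0;
}
```

Returning the new structure through a pointer argument leaves the return
value free for an error code, so the caller can collect all new structures
first and commit them only if every helper returned zero.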

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four supported kinds of unsharing (namespace,
     filesystem, files and vm), verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
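
As one possible shape for the filesystem case of item 3, the sketch below
creates a child that shares filesystem information with its parent, has the
child unshare it and change directory, and then checks that the parent's
working directory is unchanged. It assumes the glibc clone() wrapper and a
downward-growing stack; test_unshare_fs(), child_fn() and the choice of
target directories are illustrative, and this is not the LTP test itself.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Child: unshare filesystem information, then change directory. */
static int child_fn(void *arg)
{
    if (unshare(CLONE_FS) == -1)
        return 2;       /* do not chdir while still sharing */
    return chdir((const char *)arg) == -1 ? 3 : 0;
}

/* Returns 1 on pass, 0 on fail, -1 if the test could not be run. */
int test_unshare_fs(void)
{
    char before[PATH_MAX], after[PATH_MAX];
    const char *target;
    char *stack;
    int status, result = -1;
    pid_t pid;

    if (!getcwd(before, sizeof(before)))
        return -1;

    /* Pick a directory that differs from the parent's current one. */
    target = strcmp(before, "/") ? "/" : "/tmp";

    stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;

    /* CLONE_FS: the child starts out sharing our cwd, root and umask. */
    pid = clone(child_fn, stack + STACK_SIZE, CLONE_FS | SIGCHLD,
                (void *)target);
    if (pid != -1 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
        getcwd(after, sizeof(after)))
        result = (strcmp(before, after) == 0);

    free(stack);
    return result;
}
```

A fuller test would also run the same child without the unshare() call and
confirm that the parent's directory does change, which shows that the two
tasks really were sharing filesystem information to begin with.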

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().