unshare system call
===================

This document describes the new system call, unshare(). The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log
----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents
--------
        1) Overview
        2) Benefits
        3) Cost
        4) Requirements
        5) Functional Specification
        6) High Level Design
        7) Low Level Design
        8) Test Specification
        9) Future Work

1) Overview
-----------

Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, does not make a distinction
between processes and "threads".
The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.

The unshare() system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare() was conceptualized by
Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare() augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process.
unshare() is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------

unshare() would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare() can benefit
even non-threaded applications if they have a need to disassociate
from the default shared namespace. The following are two use cases
where unshare() can be used.

2.1 Per-security context namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unshare() can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instances of /tmp, /var/tmp or
a per-security context instance of a user's home directory, isolate user
processes when working with these directories.
Using unshare(), a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile; however, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.

2.2 unsharing of virtual memory and/or open files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare(), the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare() gives the server the ability to
disassociate parts of the context during the servicing of the
request.
For large and complex middleware application frameworks, this
ability to unshare() after the process was created can be very
useful.

3) Cost
-------

In order to not duplicate code and to handle the fact that unshare()
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare() had to make minor reorganizational
changes to the copy_* functions utilized by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare() test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------

unshare() reverses sharing that was done using the clone(2) system call,
so unshare() should have a similar interface to clone(2). That is,
since flags in clone(int flags, void \*stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared.
Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare() interface should accommodate the possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare() design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------

NAME
        unshare - disassociate parts of the process execution context

SYNOPSIS
        #include <sched.h>

        int unshare(int flags);

DESCRIPTION
        unshare() allows a process to disassociate parts of its execution
        context that are currently being shared with other processes.
        Part of the execution context, such as the namespace, is shared by
        default when a new process is created using fork(2), while other
        parts, such as the virtual memory, open file descriptors, etc, may
        be shared by explicit request to share them when creating a process
        using clone(2).

        The main use of unshare() is to allow a process to control its
        shared execution context without creating a new process.

        The flags argument specifies one or more of the following
        constants, bitwise-or'ed together.

        CLONE_FS
                If CLONE_FS is set, file system information of the caller
                is disassociated from the shared file system information.

        CLONE_FILES
                If CLONE_FILES is set, the file descriptor table of the
                caller is disassociated from the shared file descriptor
                table.

        CLONE_NEWNS
                If CLONE_NEWNS is set, the namespace of the caller is
                disassociated from the shared namespace.

        CLONE_VM
                If CLONE_VM is set, the virtual memory of the caller is
                disassociated from the shared virtual memory.

RETURN VALUE
        On success, zero is returned.
        On failure, -1 is returned and errno is
        set to indicate the error.

ERRORS
        EPERM   CLONE_NEWNS was specified by a non-root process (process
                without CAP_SYS_ADMIN).

        ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
                context that need to be unshared.

        EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
        The unshare() call is Linux-specific and should not be used
        in programs intended to be portable.

SEE ALSO
        clone(2), fork(2)

6) High Level Design
--------------------

Depending on the flags argument, the unshare() system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates newly duplicated structures
with the current task structure and releases corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare() because of the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare() operates on the current active
     task.
     Therefore unshare() has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare() has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     a new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed.
The unshare() system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags

  2) For each context structure, call the corresponding unshare()
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
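
The ordering described above (duplicate everything first, then install,
then release) can be illustrated with a small user-space sketch. This is
not the kernel implementation: struct ctx, struct toy_task, dup_ctx(),
put_ctx(), toy_unshare(), the TOY_UNSHARE_* flags and the fail_after test
hook are all hypothetical names invented for this illustration, and the
locking that the kernel does with task_lock() is only indicated by a
comment.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted kernel context structure. */
struct ctx {
	int refcount;
	int data;
};

/* Hypothetical stand-in for a task with two shareable contexts. */
struct toy_task {
	struct ctx *vm;
	struct ctx *ns;
};

#define TOY_UNSHARE_VM 0x1u
#define TOY_UNSHARE_NS 0x2u

/* Test hook: dup_ctx() calls that succeed before one fails; -1 disables. */
static int fail_after = -1;

/* Allocate a private copy of a shared context; NULL on failure. */
static struct ctx *dup_ctx(const struct ctx *old)
{
	struct ctx *new_ctx;

	if (fail_after == 0)
		return NULL;
	if (fail_after > 0)
		fail_after--;

	new_ctx = malloc(sizeof(*new_ctx));
	if (!new_ctx)
		return NULL;
	new_ctx->refcount = 1;
	new_ctx->data = old->data;
	return new_ctx;
}

/* Drop one reference and free the structure once it is unused. */
static void put_ctx(struct ctx *c)
{
	if (!c)
		return;
	assert(c->refcount > 0);
	if (--c->refcount == 0)
		free(c);
}

/* Returns 0 on success, -1 on failure with the task left untouched. */
static int toy_unshare(struct toy_task *t, unsigned int flags)
{
	struct ctx *new_vm = NULL, *new_ns = NULL;
	struct ctx *old_vm = NULL, *old_ns = NULL;

	/* Phase 1: duplicate everything before touching the task. */
	if (flags & TOY_UNSHARE_VM) {
		new_vm = dup_ctx(t->vm);
		if (!new_vm)
			goto fail;
	}
	if (flags & TOY_UNSHARE_NS) {
		new_ns = dup_ctx(t->ns);
		if (!new_ns)
			goto fail;
	}

	/* Phase 2: swap pointers (the kernel would hold task_lock() here). */
	if (new_vm) {
		old_vm = t->vm;
		t->vm = new_vm;
	}
	if (new_ns) {
		old_ns = t->ns;
		t->ns = new_ns;
	}

	/* Phase 3: release the older, shared structures. */
	put_ctx(old_vm);
	put_ctx(old_ns);
	return 0;

fail:
	/* Nothing was installed, so only the private copies are dropped. */
	put_ctx(new_vm);
	put_ctx(new_ns);
	return -1;
}
```

Because no pointer in the task is changed until every duplication has
succeeded, the failure path never has to restore an older shared
structure, which is the situation reason 2) above warns about.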

7) Low Level Design
-------------------

Implementation of unshare() can be grouped in the following 4 different
items:

  a) Reorganization of existing copy_* functions

  b) unshare() system call service function

  c) unshare() helper functions for each different process context

  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each copy function such as copy_mm, copy_namespace, copy_files,
etc, had roughly two components. The first component allocated
and duplicated the appropriate structure and the second component
linked it to the task structure passed in as an argument to the copy
function. The first component was split into its own function.
These dup_* functions allocated and duplicated the appropriate
context structure.
The reorganized copy_* functions invoked
their corresponding dup_* functions and then linked the newly
duplicated structures to the task structure with which the
copy function was called.

7.2) unshare() system call service function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

       * Check flags
         Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
         If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
         set and signals are also being shared, force CLONE_THREAD. If
         CLONE_NEWNS is set, force CLONE_FS.

       * For each context flag, invoke the corresponding unshare_*
         helper routine with flags passed into the system call and a
         reference to a pointer to the new unshared structure.

       * If any new structures are created by unshare_* helper
         functions, take the task_lock() on the current task,
         modify appropriate context pointers, and release the
         task lock.

       * For all newly unshared structures, release the corresponding
         older, shared, structures.
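
The flag-forcing rules in the first step above can be written down as a
short user-space sketch. The function name force_implied_flags() and its
signals_shared parameter are assumptions made for this illustration (the
parameter stands in for whatever check the kernel uses to decide that
signals are being shared), and the CLONE_* values are copied from the
Linux headers only so that the sketch is standalone.

```c
#include <assert.h>

/* Values as defined in the Linux headers, repeated to stay standalone. */
#ifndef CLONE_VM
#define CLONE_VM      0x00000100UL
#define CLONE_FS      0x00000200UL
#define CLONE_FILES   0x00000400UL
#define CLONE_SIGHAND 0x00000800UL
#define CLONE_THREAD  0x00010000UL
#define CLONE_NEWNS   0x00020000UL
#endif

/*
 * Add the flags implied by the ones the caller passed, applying the
 * rules once, in the order listed in section 7.2. signals_shared is a
 * hypothetical parameter: non-zero means signals are currently shared.
 */
static unsigned long force_implied_flags(unsigned long flags,
					 int signals_shared)
{
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;
	return flags;
}

/* Usage example: a namespace request also pulls in CLONE_FS. */
void implied_flags_example(void)
{
	unsigned long f = force_implied_flags(CLONE_NEWNS, 0);

	assert(f & CLONE_FS);
	assert(!(f & CLONE_VM));
}
```

Since the rules are applied in a single pass, a request for CLONE_SIGHAND
alone with shared signals gains CLONE_THREAD in this sketch but does not
loop back to add CLONE_VM; that is a property of following the listed
order literally.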

7.3) unshare_* helper functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
and CLONE_THREAD, return -EINVAL since they are not implemented yet.
For others, check the flag value to see if the unsharing is
required for that structure. If it is, invoke the corresponding
dup_* function to allocate and duplicate the structure and return
a pointer to it.

7.4) Finally
~~~~~~~~~~~~

Appropriately modify architecture specific code to register the
new system call.

8) Test Specification
---------------------

The test for unshare() should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing
     namespace without specifying unsharing of filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.

9) Future Work
--------------

The current implementation of unshare() does not allow unsharing of
signals and signal handlers. Signals are complex to begin with and
to unshare signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare() without affecting legacy
applications using unshare().