TOMOYO Linux Cross Reference
Linux/Documentation/admin-guide/syscall-user-dispatch.rst

Differences between /Documentation/admin-guide/syscall-user-dispatch.rst (Version linux-6.12-rc7) and /Documentation/admin-guide/syscall-user-dispatch.rst (Version linux-6.8.12)


.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support to efficiently
filter syscalls based on memory regions, and it doesn't support removing
filters.  Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crosses, which is achieved by not executing a syscall to change
personality every time the compatibility layer executes.  Instead, a
userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.
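
The flip switch can be pictured as a single byte that the application
updates with ordinary stores.  The following is a minimal sketch, assuming
the byte has already been registered with the kernel as the selector.  The
names ``sud_selector``, ``enter_foreign_code()`` and ``enter_native_code()``
are hypothetical, and the two constants repeat the values from
``<linux/prctl.h>``:

```c
/* Selector values; these repeat the definitions in <linux/prctl.h> so the
 * sketch also builds against older headers. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical selector byte.  A real compatibility layer would keep one
 * per thread and pass its address to the kernel when enabling dispatch. */
static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Called when control crosses into the non-native code: from now on,
 * syscalls issued outside the always-allowed region are redirected. */
static inline void enter_foreign_code(void)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
}

/* Called when control returns to native code: syscalls run directly. */
static inline void enter_native_code(void)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
}
```

Neither helper enters the kernel, which is what keeps the boundary
crossing cheap.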

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to do the filtering.  It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
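
The call can be wrapped as follows.  This is a sketch rather than a
reference implementation: ``sud_set()`` and ``sud_disable()`` are
hypothetical helper names, and the fallback constants are assumed to match
the kernel's ``<linux/prctl.h>`` for systems whose libc headers predate the
feature:

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback values, assumed to match the kernel's <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical wrapper around the prctl described above.
 * Returns 0 on success or a negative errno value on failure. */
static int sud_set(unsigned long op, unsigned long offset,
                   unsigned long length, volatile char *selector)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, op, offset, length,
              (unsigned long)selector) == -1)
        return -errno;
    return 0;
}

/* Disabling the mechanism: every field other than <op> must be zero. */
static int sud_disable(void)
{
    return sud_set(PR_SYS_DISPATCH_OFF, 0, 0, 0);
}
```

Passing a non-zero offset together with PR_SYS_DISPATCH_OFF is expected to
fail with EINVAL, in line with the rule above.  Enabling the mechanism
should only be attempted once the allowed region and the selector are set
up, since every syscall outside that region is affected from then on.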

[<offset>, <offset>+<length>) delimit a memory region interval
from which syscalls are always executed directly, regardless of the
userspace selector.  This provides a fast path for the C library, which
includes the most common syscall dispatchers in the native code
applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, for syscalls that implement the
trampoline code on the vDSO, that trampoline is never intercepted.

[selector] is a pointer to a char-sized region in the process memory
region, that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value should terminate the program with a SIGSYS.
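
The interplay between the allowed region and the selector can be
summarised by a small userspace model.  It is only an illustration of the
rules stated above, not kernel code: ``sud_model()`` and the ``sud_outcome``
values are hypothetical names, the vDSO trampoline exception is left out,
and a selector is assumed to have been supplied:

```c
#include <stdint.h>

/* Selector values; these repeat the definitions in <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical outcomes of the model. */
enum sud_outcome {
    SUD_EXECUTE,  /* syscall is executed directly              */
    SUD_SIGSYS,   /* syscall is redirected through a SIGSYS    */
    SUD_KILL      /* invalid selector: program is terminated   */
};

/* Model of the decision for a syscall issued at address ip. */
static enum sud_outcome sud_model(uintptr_t ip, uintptr_t offset,
                                  uintptr_t length, unsigned char selector)
{
    /* Half-open interval [offset, offset + length), written so that
     * offset + length cannot overflow. */
    if (ip >= offset && ip - offset < length)
        return SUD_EXECUTE;

    switch (selector) {
    case SYSCALL_DISPATCH_FILTER_ALLOW:
        return SUD_EXECUTE;
    case SYSCALL_DISPATCH_FILTER_BLOCK:
        return SUD_SIGSYS;
    default:
        return SUD_KILL;
    }
}
```

In this model the region check comes first, so code inside the allowed
region, such as the C library or the signal trampoline, is never
redirected, whatever the selector holds.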

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests. This is useful for checkpoint/restart software.
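
A tracer could read the configuration roughly as sketched below.  Several
details are assumptions and should be checked against the system headers:
the request number, the layout of the local ``struct sud_config`` (meant to
mirror the kernel's ``struct ptrace_sud_config``), and the convention that
the ptrace ``addr`` argument carries the structure size.  ``sud_peek()`` is
a hypothetical helper name:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Assumed request number, used only when the libc header lacks it. */
#ifndef PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG
#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
#endif

/* Local mirror of the kernel's struct ptrace_sud_config (assumed layout). */
struct sud_config {
    uint64_t mode;      /* PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF */
    uint64_t selector;  /* address of the selector byte in the tracee */
    uint64_t offset;    /* start of the always-allowed region */
    uint64_t len;       /* length of the always-allowed region */
};

/* Hypothetical helper: fetch the configuration of an attached, stopped
 * tracee.  Returns 0 on success or a negative errno value. */
static int sud_peek(pid_t tracee, struct sud_config *cfg)
{
    /* addr carries the structure size, data points at the buffer. */
    if (ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG, tracee,
               (void *)sizeof(*cfg), cfg) == -1)
        return -errno;
    return 0;
}
```

The call only succeeds for a task that the caller is already tracing; for
any other task it returns an error.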

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
