=========================================
user_events: User-based Event Tracing
=========================================

:Author: Beau Belgrave

Overview
--------
User based trace events allow user processes to create events and trace data
that can be viewed via existing tools, such as ftrace and perf.
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.

Programs can view status of the events via
/sys/kernel/debug/tracing/user_events_status and can both register and write
data out via /sys/kernel/debug/tracing/user_events_data.

Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
delete user based events via the u: prefix. The format of the command to
dynamic_events is the same as the ioctl with the u: prefix applied.

Typically programs will register a set of events that they wish to expose to
tools that can read trace_events (such as ftrace and perf). The registration
process gives back two ints to the program for each event. The first int is
the status bit. This describes which bit in little-endian format in the
/sys/kernel/debug/tracing/user_events_status file represents this event. The
second int is the write index which describes the data when a write() or
writev() is called on the /sys/kernel/debug/tracing/user_events_data file.

The structures referenced in this document are contained within the
/include/uapi/linux/user_events.h file in the source tree.

**NOTE:** *Both user_events_status and user_events_data are under the tracefs
filesystem and may be mounted at different paths than above.*

Registering
-----------
Registering within a user process is done via ioctl() out to the
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
DIAG_IOCSREG.

This command takes a packed struct user_reg as an argument::

  struct user_reg {
        u32 size;
        u64 name_args;
        u32 status_bit;
        u32 write_index;
  };

The struct user_reg requires two inputs, the first is the size of the structure
to ensure forward and backward compatibility. The second is the command string
to issue for registering. Upon success two outputs are set, the status bit
and the write index.
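
The registration steps described in this section could be sketched in C roughly
as follows. This is a minimal sketch under stated assumptions, not the
reference implementation: the struct is a local copy of the four-field
user_reg layout with size, name_args, status_bit and write_index (real code
would include the uapi header from the matching kernel), the DIAG_IOC_MAGIC
and DIAG_IOCSREG definitions are assumed to mirror
include/uapi/linux/user_events.h and should be checked against that header,
and the helper names user_reg_init() and user_event_register() are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copy of the packed four-field user_reg layout (assumption: real
 * code should use <linux/user_events.h> from the matching kernel). */
struct user_reg {
	uint32_t size;
	uint64_t name_args;
	uint32_t status_bit;
	uint32_t write_index;
} __attribute__((__packed__));

/* Assumed to mirror the uapi header; verify against your kernel. */
#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg *)

/* Hypothetical helper: fill in the two inputs, size and command string. */
void user_reg_init(struct user_reg *reg, const char *name_args)
{
	assert(reg != NULL && name_args != NULL);
	memset(reg, 0, sizeof(*reg));
	reg->size = sizeof(*reg);
	reg->name_args = (uint64_t)(uintptr_t)name_args;
}

/*
 * Hypothetical helper: register an event on an open user_events_data fd.
 * Returns 0 and fills the two outputs on success, -1 (errno set) on failure.
 */
int user_event_register(int data_fd, const char *name_args,
			uint32_t *status_bit, uint32_t *write_index)
{
	struct user_reg reg;

	user_reg_init(&reg, name_args);

	if (ioctl(data_fd, DIAG_IOCSREG, &reg) == -1)
		return -1;

	*status_bit = reg.status_bit;
	*write_index = reg.write_index;
	return 0;
}
```

A caller would typically open the user_events_data file with O_RDWR, pass a
command string such as "test u32 count", and keep the file descriptor open,
since the returned write_index is only meaningful for that descriptor.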

User based events show up under tracefs like any other event under the
subsystem named "user_events". This means tools that wish to attach to the
events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
or perf record -e user_events:[name] when attaching/recording.

**NOTE:** *The write_index returned is only valid for the FD that was used*

Command Format
^^^^^^^^^^^^^^
The command string format is as follows::

  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]

Supported Flags
^^^^^^^^^^^^^^^
None yet

Field Format
^^^^^^^^^^^^
::

  type name [size]

Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
User programs are encouraged to use clearly sized types like u32.

**NOTE:** *Long is not supported since size can vary between user and kernel.*

The size is only valid for types that start with a struct prefix.
This allows user programs to describe custom structs out to tools, if required.

For example, a struct in C that looks like this::

  struct mytype {
    char data[20];
  };

Would be represented by the following field::

  struct mytype myname 20

Deleting
--------
Deleting an event from within a user process is done via ioctl() out to the
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
DIAG_IOCSDEL.

This command only requires a single string specifying the event to delete by
its name. Delete will only succeed if there are no references left to the
event (in both user and kernel space). User programs should use a separate file
to request deletes than the one used for registration due to this.
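
A delete request could look roughly like the sketch below. It opens its own
file descriptor for the request, following the advice to keep deletes separate
from the descriptor used for registration. The DIAG_IOCSDEL definition is
assumed to mirror include/uapi/linux/user_events.h and should be checked
against that header; the helper name user_event_delete() and its data_path
parameter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumed to mirror the uapi header; verify against your kernel. */
#define DIAG_IOC_MAGIC '*'
#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char *)

/*
 * Hypothetical helper: ask the kernel to delete the event called "name".
 * data_path is wherever user_events_data is mounted on this system.
 * Returns 0 on success, -1 (errno set) on failure, for example when the
 * file cannot be opened or the event still has references.
 */
int user_event_delete(const char *data_path, const char *name)
{
	int fd, ret, saved_errno;

	assert(data_path != NULL && name != NULL);

	/* Open a separate fd that was never used to register the event. */
	fd = open(data_path, O_RDWR);
	if (fd == -1)
		return -1;

	ret = ioctl(fd, DIAG_IOCSDEL, name);

	/* Preserve the ioctl() error code across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret == -1 ? -1 : 0;
}
```

The caller passes only the event name, for instance "test", not the full
command string that was used at registration time.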

Status
------
When tools attach/record user based events the status of the event is updated
in realtime. This allows user programs to only incur the cost of the write() or
writev() calls when something is actively attached to the event.

User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
check the status for each event that is registered. The bit to check in the
file is given back after the register ioctl() via user_reg.status_bit. The bit
is always in little-endian format. Programs can check if the bit is set either
using a byte-wise index with a mask or a long-wise index with a little-endian
mask.

Currently the size of user_events_status is a single page; however, custom
kernel configurations can change this size to allow more user based events. In
all cases the size of the file is a multiple of a page size.

For example, if the register ioctl() gives back a status_bit of 3 you would
check byte 0 (3 / 8) of the returned mmap data and then AND the result with 8
(1 << (3 % 8)) to see if anything is attached to that event.

A byte-wise index check is performed as follows::

  int index, mask;
  char *status_page;

  index = status_bit / 8;
  mask = 1 << (status_bit % 8);

  ...

  if (status_page[index] & mask) {
        /* Enabled */
  }

A long-wise index check is performed as follows::

  #include <asm/bitsperlong.h>
  #include <endian.h>

  #if __BITS_PER_LONG == 64
  #define endian_swap(x) htole64(x)
  #else
  #define endian_swap(x) htole32(x)
  #endif

  long index, mask, *status_page;

  index = status_bit / __BITS_PER_LONG;
  mask = 1L << (status_bit % __BITS_PER_LONG);
  mask = endian_swap(mask);

  ...

  if (status_page[index] & mask) {
        /* Enabled */
  }

Administrators can easily check the status of all registered events by reading
the user_events_status file directly via a terminal. The output is as follows::

  Byte:Name [# Comments]
  ...

  Active: ActiveCount
  Busy: BusyCount
  Max: MaxCount

For example, on a system that has a single event the output looks like this::

  1:test

  Active: 1
  Busy: 0
  Max: 32768

If a user enables the user event via ftrace, the output would change to this::

  1:test # Used by ftrace

  Active: 1
  Busy: 1
  Max: 32768

**NOTE:** *A status bit of 0 will never be returned. This allows user programs
to have a bit that can be used on error cases.*

Writing Data
------------
After registering an event the same fd that was used to register can be used
to write an entry for that event. The write_index returned must be at the start
of the data, then the remaining data is treated as the payload of the event.

For example, if the write_index returned was 1 and I wanted to write out an int
payload of the event, then the data would have to be 8 bytes (2 ints) in size,
with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
value I want as the payload.

In memory this would look like this::

  int index;
  int payload;

User programs might have well known structs that they wish to use to emit out
as payloads. In those cases writev() can be used, with the first vector being
the index and the following vector(s) being the actual event payload.

For example, if I have a struct like this::

  struct payload {
        int src;
        int dst;
        int flags;
  };

It's advised for user programs to do the following::

  struct iovec io[2];
  struct payload e;

  io[0].iov_base = &write_index;
  io[0].iov_len = sizeof(write_index);
  io[1].iov_base = &e;
  io[1].iov_len = sizeof(e);

  writev(fd, (const struct iovec*)io, 2);

**NOTE:** *The write_index is not emitted out into the trace being recorded.*

Example Code
------------
See sample code in samples/user_events.
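
As a rough illustration of how the Status and Writing Data sections fit
together, the sketch below checks the status bit byte-wise and only calls
writev() when something is attached. It is a sketch under assumptions rather
than the sample from the source tree: status_page is assumed to be the
mmap()ed contents of user_events_status, data_fd the descriptor used for
registration, and the helper names user_event_enabled() and user_event_write()
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: byte-wise check of one bit in the status page. */
int user_event_enabled(const unsigned char *status_page, uint32_t status_bit)
{
	assert(status_page != NULL);
	return (status_page[status_bit / 8] & (1u << (status_bit % 8))) != 0;
}

/*
 * Hypothetical helper: emit one event if anything is attached.
 * Returns 0 when nothing is attached, otherwise the writev() result
 * (bytes written, or -1 with errno set).
 */
ssize_t user_event_write(int data_fd, const unsigned char *status_page,
			 uint32_t status_bit, uint32_t write_index,
			 const void *payload, size_t len)
{
	struct iovec io[2];

	/* Skip the system call entirely when no tool is listening. */
	if (!user_event_enabled(status_page, status_bit))
		return 0;

	/* First vector is the write index, second is the payload. */
	io[0].iov_base = &write_index;
	io[0].iov_len = sizeof(write_index);
	io[1].iov_base = (void *)payload;
	io[1].iov_len = len;

	return writev(data_fd, io, 2);
}
```

With a status_bit of 3, the check reads byte 0 of the page and tests it against
the mask 8, matching the worked example in the Status section.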