=========================================
user_events: User-based Event Tracing
=========================================

:Author: Beau Belgrave

Overview
--------
User based trace events allow user processes to create events and trace data
that can be viewed via existing tools, such as ftrace and perf.
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.

Programs can view the status of the events via
/sys/kernel/debug/tracing/user_events_status and can both register and write
data out via /sys/kernel/debug/tracing/user_events_data.

Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
delete user based events via the u: prefix. The format of the command to
dynamic_events is the same as the ioctl with the u: prefix applied.

Typically programs will register a set of events that they wish to expose to
tools that can read trace_events (such as ftrace and perf). The registration
process gives back two ints to the program for each event. The first int is the
status index.
This index describes which byte in the
/sys/kernel/debug/tracing/user_events_status file represents this event. The
second int is the write index. This index describes the data when a write() or
writev() is called on the /sys/kernel/debug/tracing/user_events_data file.

The structures referenced in this document are contained within the
/include/uapi/linux/user_events.h file in the source tree.

**NOTE:** *Both user_events_status and user_events_data are under the tracefs
filesystem and may be mounted at different paths than above.*

Registering
-----------
Registering within a user process is done via ioctl() out to the
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
DIAG_IOCSREG.

This command takes a struct user_reg as an argument::

  struct user_reg {
        u32 size;
        u64 name_args;
        u32 status_index;
        u32 write_index;
  };

The struct user_reg requires two inputs. The first is the size of the structure,
which ensures forward and backward compatibility. The second is the command
string to issue for registering.
Upon success, two outputs are set: the status index and the write index.
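The registration steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the kernel's sample code: ``struct user_reg_sketch`` is a hypothetical local mirror of the struct user_reg layout shown above, ``user_reg_prepare()`` and ``user_event_register()`` are illustrative helper names, and the ioctl request is passed in by the caller (normally DIAG_IOCSREG from the kernel header) rather than defined here. Real programs should use the definitions from the kernel header instead of mirroring them.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical local mirror of the struct user_reg layout shown above. */
struct user_reg_sketch {
	uint32_t size;         /* input: size of this structure */
	uint64_t name_args;    /* input: pointer to the command string */
	uint32_t status_index; /* output: byte to check in user_events_status */
	uint32_t write_index;  /* output: index to put in front of written data */
};

/* Fill in the two inputs and clear the outputs. */
static void user_reg_prepare(struct user_reg_sketch *reg, const char *command)
{
	memset(reg, 0, sizeof(*reg));
	reg->size = sizeof(*reg);
	reg->name_args = (uint64_t)(uintptr_t)command;
}

/*
 * Register one event on an already opened user_events_data descriptor.
 * The request value is supplied by the caller (assumed to be DIAG_IOCSREG).
 * Returns 0 on success and -1 on failure, with errno set by ioctl().
 */
static int user_event_register(int data_fd, unsigned long request,
			       const char *command,
			       uint32_t *status_index, uint32_t *write_index)
{
	struct user_reg_sketch reg;

	user_reg_prepare(&reg, command);

	if (ioctl(data_fd, request, &reg) == -1)
		return -1;

	*status_index = reg.status_index;
	*write_index = reg.write_index;
	return 0;
}
```

A caller that has opened user_events_data as ``data_fd`` might then call ``user_event_register(data_fd, DIAG_IOCSREG, "test u32 count", &status_index, &write_index)``. The field name ``count`` is made up for illustration.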

User based events show up under tracefs like any other event under the
subsystem named "user_events". This means tools that wish to attach to the
events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
or perf record -e user_events:[name] when attaching/recording.

**NOTE:** *The write_index returned is only valid for the FD that was used*

Command Format
^^^^^^^^^^^^^^
The command string format is as follows::

  name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]

Supported Flags
^^^^^^^^^^^^^^^
None yet

Field Format
^^^^^^^^^^^^
::

  type name [size]

Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
User programs are encouraged to use clearly sized types like u32.

**NOTE:** *Long is not supported since size can vary between user and kernel.*

The size is only valid for types that start with a struct prefix.
This allows user programs to describe custom structs out to tools, if required.

For example, a struct in C that looks like this::

  struct mytype {
    char data[20];
  };

Would be represented by the following field::

  struct mytype myname 20

Deleting
--------
Deleting an event from within a user process is done via ioctl() out to the
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
DIAG_IOCSDEL.

This command only requires a single string specifying the event to delete by
its name. Delete will only succeed if there are no references left to the
event (in both user and kernel space). Because of this, user programs should
request deletes through a separate file from the one used for registration.
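The delete request described above can be sketched as follows. This is only an illustration under stated assumptions: ``user_event_delete()`` is a hypothetical helper name, the path and the ioctl request are supplied by the caller (normally the tracefs user_events_data path and DIAG_IOCSDEL from the kernel header), and a fresh descriptor is opened for the request so that it is separate from the one used for registration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the kernel to delete the event called name.
 * data_path: path to user_events_data (depends on where tracefs is mounted).
 * request:   ioctl request supplied by the caller (assumed DIAG_IOCSDEL).
 * Returns 0 on success and -1 on failure, with errno describing the error.
 */
static int user_event_delete(const char *data_path, unsigned long request,
			     const char *name)
{
	int fd, ret, saved_errno;

	/* Use a descriptor of our own, separate from any registration fd. */
	fd = open(data_path, O_RDWR);
	if (fd == -1)
		return -1;

	ret = ioctl(fd, request, name);

	/* Keep the errno from ioctl() even if close() changes it. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return ret == -1 ? -1 : 0;
}
```

The delete is expected to fail while any reference to the event is still held, so a program would normally close its registration descriptor before calling this helper.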

Status
------
When tools attach/record user based events the status of the event is updated
in realtime.
This allows user programs to only incur the cost of the write() or
writev() calls when something is actively attached to the event.

User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
check the status for each event that is registered. The byte to check in the
file is given back after the register ioctl() via user_reg.status_index.
Currently the size of user_events_status is a single page; however, custom
kernel configurations can change this size to allow more user based events. In
all cases the size of the file is a multiple of the page size.

For example, if the register ioctl() gives back a status_index of 3 you would
check byte 3 of the returned mmap data to see if anything is attached to that
event.

Administrators can easily check the status of all registered events by reading
the user_events_status file directly via a terminal. The output is as follows::

  Byte:Name [# Comments]
  ...

  Active: ActiveCount
  Busy: BusyCount
  Max: MaxCount

For example, on a system that has a single event the output looks like this::

  1:test

  Active: 1
  Busy: 0
  Max: 4096

If a user enables the user event via ftrace, the output would change to this::

  1:test # Used by ftrace

  Active: 1
  Busy: 1
  Max: 4096

**NOTE:** *A status index of 0 will never be returned. This allows user
programs to have an index that can be used on error cases.*

Status Bits
^^^^^^^^^^^
The byte being checked will be non-zero if anything is attached. Programs can
check specific bits in the byte to see what mechanism has been attached.

The following values are defined to aid in checking what has been attached:

**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0).

**EVENT_STATUS_PERF** - Bit set if perf has been attached (Bit 1).

Writing Data
------------
After registering an event the same fd that was used to register can be used
to write an entry for that event. The write_index returned must be at the start
of the data, then the remaining data is treated as the payload of the event.

For example, if the write_index returned was 1 and I wanted to write out an int
payload for the event, then the data would have to be 8 bytes (2 ints) in size,
with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
value I want as the payload.

In memory this would look like this::

  int index;
  int payload;

User programs might have well known structs that they wish to use to emit out
as payloads.
In those cases writev() can be used, with the first vector being
the index and the following vector(s) being the actual event payload.

For example, if I have a struct like this::

  struct payload {
        int src;
        int dst;
        int flags;
  };

It's advised for user programs to do the following::

  struct iovec io[2];
  struct payload e;

  io[0].iov_base = &write_index;
  io[0].iov_len = sizeof(write_index);
  io[1].iov_base = &e;
  io[1].iov_len = sizeof(e);

  writev(fd, (const struct iovec*)io, 2);

**NOTE:** *The write_index is not emitted out into the trace being recorded.*

Example Code
------------
See sample code in samples/user_events.
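Independently of the in-tree sample, the status check and the write path described in this document can be sketched together as below. This is a rough sketch under stated assumptions: ``status_map()``, ``status_test()`` and ``event_write()`` are hypothetical helper names, the ``SKETCH_STATUS_*`` macros are local stand-ins for the EVENT_STATUS_FTRACE and EVENT_STATUS_PERF bits (bit 0 and bit 1), and only a single page of the status file is mapped, which matches the default size mentioned in the Status section.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Local stand-ins for the bit values listed under "Status Bits". */
#define SKETCH_STATUS_FTRACE (1 << 0)
#define SKETCH_STATUS_PERF   (1 << 1)

/*
 * Map the first page of an open user_events_status file read-only.
 * Assumption: one page is enough (the default size of the file).
 * Returns NULL on failure; on success *len holds the mapped length.
 */
static const volatile unsigned char *status_map(int status_fd, size_t *len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return NULL;

	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, status_fd, 0);
	if (p == MAP_FAILED)
		return NULL;

	*len = (size_t)page;
	return p;
}

/* Non-zero if any bit in mask is set in the status byte of this event. */
static int status_test(const volatile unsigned char *status,
		       uint32_t status_index, unsigned char mask)
{
	return (status[status_index] & mask) != 0;
}

/* Emit one event: the write index first, then the payload. */
static ssize_t event_write(int data_fd, uint32_t write_index,
			   const void *payload, size_t len)
{
	struct iovec io[2];

	io[0].iov_base = &write_index;
	io[0].iov_len = sizeof(write_index);
	io[1].iov_base = (void *)payload; /* writev() only reads from it */
	io[1].iov_len = len;

	return writev(data_fd, io, 2);
}
```

A program would typically guard each write with a check such as ``if (status_test(status, status_index, 0xff)) event_write(data_fd, write_index, &e, sizeof(e));`` so that the cost of writev() is only paid while something is attached.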