=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct.  per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks allows the data received by one listener
to be limited and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics.
Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
   attribute payload. The cpumask is specified as an ascii string of
   comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
   the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
   in cpus before closing the listening socket, the kernel cleans up its interest
   set over time. However, for the sake of efficiency, an explicit deregistration
   is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates
      a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
      are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits.
   The payload consists of a
   series of attributes of the following type:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the statistics of all live threads in
the group are summed and added to the accumulated total for previously exited
threads of the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).

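
The second approach depends on userspace walking the attributes one at a time
and skipping any type it does not recognise. The following is a minimal sketch
of that idea, not code taken from getdelays.c. It assumes the standard netlink
attribute layout (a u16 length and a u16 type in host byte order, followed by
the payload, padded to a 4-byte boundary). The type numbers and the
``KNOWN_HANDLERS`` table are hypothetical and chosen only for illustration;
real values must be taken from taskstats.h.

```python
import struct

NLA_HDRLEN = 4   # u16 nla_len + u16 nla_type
NLA_ALIGNTO = 4  # attributes are padded to 4-byte boundaries


def nla_align(n):
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(nla_type, payload):
    """Build one attribute (used here only to construct test input)."""
    nla_len = NLA_HDRLEN + len(payload)
    pad = nla_align(nla_len) - nla_len
    return struct.pack("=HH", nla_len, nla_type) + payload + b"\x00" * pad


def iter_attrs(buf):
    """Yield (type, payload) for each attribute in a flat attribute buffer."""
    off = 0
    while off + NLA_HDRLEN <= len(buf):
        nla_len, nla_type = struct.unpack_from("=HH", buf, off)
        if nla_len < NLA_HDRLEN or off + nla_len > len(buf):
            break  # malformed or truncated attribute: stop parsing
        yield nla_type, buf[off + NLA_HDRLEN:off + nla_len]
        off += nla_align(nla_len)


# Hypothetical attribute type: 1 carries a u32 pid. Not taken from taskstats.h.
KNOWN_HANDLERS = {
    1: lambda payload: ("pid", struct.unpack("=I", payload)[0]),
}


def decode(buf):
    """Decode the attributes this reader understands and ignore the rest."""
    decoded = []
    for nla_type, payload in iter_attrs(buf):
        handler = KNOWN_HANDLERS.get(nla_type)
        if handler is None:
            continue  # unknown type, e.g. added by a newer kernel
        decoded.append(handler(payload))
    return decoded
```

A reader written this way keeps working when a newer kernel appends an
attribute type it has never seen: the unknown attribute is skipped by its
length and the following attributes are still decoded.
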
Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener.
  In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
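
The measures above can be sketched in userspace roughly as follows. This is an
illustrative outline under stated assumptions, not the getdelays.c
implementation. The helper names (``parse_cpumask``, ``pin_to_cpus``,
``grow_rcvbuf``, ``recv_or_count_overflow``) are hypothetical. The helpers
accept any socket object; a real listener would pass its NETLINK_GENERIC
socket. ``os.sched_setaffinity`` is Linux-specific, and the kernel may adjust
or cap the buffer size actually granted, so the value is read back rather than
assumed.

```python
import errno
import os
import socket


def parse_cpumask(mask):
    """Turn a cpumask string such as "1-3,5,7-8" into a set of cpu numbers."""
    cpus = set()
    for part in mask.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_cpus(mask):
    """Restrict the calling process to the cpus it listens to (Linux only)."""
    os.sched_setaffinity(0, parse_cpumask(mask))


def grow_rcvbuf(sock, nbytes):
    """Request a larger receive buffer and return the size actually in effect."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def recv_or_count_overflow(sock, bufsize, stats):
    """Receive one message; on ENOBUFS record the loss and return None."""
    try:
        return sock.recv(bufsize)
    except OSError as err:
        if err.errno == errno.ENOBUFS:
            # The receive buffer overflowed and some exit records were dropped.
            stats["overflows"] += 1
            return None
        raise
```

Counting overflows instead of aborting lets the listener keep running and
report how often data was lost, which in turn indicates whether a larger
buffer or more listeners are needed.
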