======================================================
hrtimers - subsystem for high-resolution kernel timers
======================================================

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask the question: we already have a timer subsystem
(kernel/timers.c), why do we need two timer subsystems? After a lot of
back and forth trying to integrate high-resolution and high-precision
features into the existing timer framework, and after testing various
such high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this'), and spent a considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration is hard/impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess.
  The timers.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, there's zero problems with it in its
  current usage - but it is simply not suitable to be extended for
  high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high resolution timers, which
  in turn decreases robustness. Such a design still leads to rather large
  timing inaccuracies. Cascading is a fundamental property of the timer
  wheel concept, it cannot be 'designed out' without inevitably
  degrading other portions of the timers.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on top of
  the timer wheel has already introduced a quite complex handling of
  the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: that the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is most optimal for use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc.
  For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

The primary users of precision timers are user-space applications that
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
users like drivers and subsystems which require precise timed events
(e.g. multimedia) can benefit from the availability of a separate
high-resolution timer subsystem as well.

While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.
The increasing demand for realtime and multimedia applications along
with other potential users for precise timers gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation allows even more
special-purpose optimization of the existing timer wheel for the low
resolution and low precision use cases - once the precision-sensitive
APIs are separated from the timer wheel and are migrated over to
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details
----------------------------------------

The basic design considerations were:

- simplicity

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red-black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g.
memory management and
file systems. The rbtree is solely used for time sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll introduce
high-resolution clocks, where we need separate pending and expired
queues while keeping the time-order intact.)

Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though, it also simplifies the handling of
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
implementation needed to keep an extra list of all armed absolute
CLOCK_REALTIME timers along with complex locking. In case of
settimeofday and NTP, all the timers (!) had to be dequeued, the
time-changing code had to fix them up one by one, and all of them had to
be enqueued again. The time-ordered enqueueing and the storage of the
expiry time in absolute time units removes all this complex and poorly
scaling code from the posix-timer implementation - the clock can simply
be set without having to touch the rbtree.
This also makes the handling
of posix-timers simpler in general.

The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to timer_delete() and timer_delete_sync()] - so there's no direct
1:1 mapping between them on the algorithmic level, and thus no real
potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution 64bit type: ktime_t.
(Originally, the kernel-internal representation of ktime_t values and
operations was implemented via macros and inline functions, and could be
switched between a "hybrid union" type and a plain "scalar" 64bit
nanoseconds representation (at compile time). This was abandoned in the
context of the Y2038 work.)
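
As a rough illustration of the design points above (64-bit nanosecond
expiry values, ordering at activation time, and absolute expiry times
that need no fix-up when the clock is set), a minimal user-space sketch
could look like the following. It is not the kernel implementation: a
sorted singly linked list stands in for the rbtree and its companion
list, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ktime_t: a scalar 64-bit nanosecond count. */
typedef int64_t sketch_ktime_t;

struct sketch_timer {
	sketch_ktime_t expires;		/* absolute expiry time, in ns */
	struct sketch_timer *next;	/* next timer in expiry order */
};

struct sketch_queue {
	struct sketch_timer *first;	/* earliest-expiring timer, or NULL */
};

/*
 * Enqueue at activation time so the queue is always sorted by expiry.
 * Timers with equal expiry keep their insertion order.
 * (Assumption: a linear walk replaces the kernel's rbtree insertion.)
 */
void sketch_enqueue(struct sketch_queue *q, struct sketch_timer *t)
{
	struct sketch_timer **link;

	assert(q != NULL && t != NULL);
	link = &q->first;
	while (*link && (*link)->expires <= t->expires)
		link = &(*link)->next;
	t->next = *link;
	*link = t;
}

/*
 * Remove and return the earliest timer if it has expired at 'now',
 * otherwise return NULL. Only the head of the queue is inspected.
 */
struct sketch_timer *sketch_pop_expired(struct sketch_queue *q,
					sketch_ktime_t now)
{
	struct sketch_timer *t;

	assert(q != NULL);
	t = q->first;
	if (!t || t->expires > now)
		return NULL;
	q->first = t->next;
	t->next = NULL;
	return t;
}
```

With timers enqueued for 1000, 2000 and 3000 ns in any order, calling
sketch_pop_expired() with now = 2500 hands back the first two in expiry
order and leaves the third queued. Setting the clock only changes the
'now' value passed in; the queue itself is not touched.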

hrtimers - rounding of timer values
-----------------------------------

The hrtimer code will round timer events to lower-resolution clocks
because it has to. Otherwise it will do no artificial rounding at all.

One question is, what resolution value should be returned to the user by
the clock_getres() interface. This will return whatever real resolution
a given clock has - be it low-res, high-res, or artificially-low-res.

hrtimers - testing and verification
-----------------------------------

We used the high-resolution clock subsystem on top of hrtimers to verify
the hrtimer implementation details in practice, and we also ran the posix
timer tests in order to ensure specification compliance. We also ran
tests on low-resolution clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

- nanosleep
- itimers
- posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.
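
From user space, the converted interfaces can be exercised through the
standard POSIX calls. The fragment below is only a sketch under stated
assumptions: the sketch_* helper names are hypothetical, CLOCK_MONOTONIC
is an arbitrary choice of clock, and the values returned depend on the
clock source of the running system. It reads the resolution reported by
clock_getres() and measures a relative clock_nanosleep().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Resolution reported for 'clk' in nanoseconds, or -1 on error. */
int64_t sketch_clock_res_ns(clockid_t clk)
{
	struct timespec res;

	if (clock_getres(clk, &res) != 0)
		return -1;
	return (int64_t)res.tv_sec * SKETCH_NSEC_PER_SEC + res.tv_nsec;
}

/*
 * Sleep for 'ns' nanoseconds relative to now on CLOCK_MONOTONIC and
 * return the elapsed time measured on the same clock, or -1 on error.
 * A sleep interrupted by a signal is resumed with the remaining time.
 */
int64_t sketch_sleep_ns(int64_t ns)
{
	struct timespec start, end, req, rem;
	int err;

	assert(ns >= 0);
	req.tv_sec = ns / SKETCH_NSEC_PER_SEC;
	req.tv_nsec = ns % SKETCH_NSEC_PER_SEC;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;
	/* clock_nanosleep() returns the error number directly. */
	while ((err = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem)) == EINTR)
		req = rem;
	if (err != 0)
		return -1;
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;

	return (int64_t)(end.tv_sec - start.tv_sec) * SKETCH_NSEC_PER_SEC +
	       (end.tv_nsec - start.tv_nsec);
}
```

The measured time of sketch_sleep_ns() should never be shorter than the
request; how far it overshoots depends on the resolution of the clock
and on scheduling latency.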

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with a
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

	Thomas Gleixner, Ingo Molnar