.. SPDX-License-Identifier: GPL-2.0

===========================
The KVM halt polling system
===========================

The KVM halt polling system provides a feature within KVM whereby the latency
of a guest can, under some circumstances, be reduced by polling in the host
for some time period after the guest has elected to no longer run by ceding.
That is, when a guest vcpu has ceded, or in the case of powerpc when all of the
vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions
before giving up the cpu to the scheduler in order to let something else run.

Polling provides a latency advantage in cases where the guest can be run again
very quickly by at least saving us a trip through the scheduler, normally on
the order of a few microseconds, although performance benefits are workload
dependent. In the event that no wakeup source arrives during the polling
interval or some other task on the runqueue is runnable the scheduler is
invoked.
Thus halt polling is especially useful on workloads with very short
wakeup periods where the time spent halt polling is minimised and the time
savings of not invoking the scheduler are distinguishable.

The generic halt polling code is implemented in:

        virt/kvm/kvm_main.c: kvm_vcpu_block()

The powerpc kvm-hv specific case is implemented in:

        arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()

Halt Polling Interval
=====================

The maximum time for which to poll before invoking the scheduler, referred to
as the halt polling interval, is increased and decreased based on the perceived
effectiveness of the polling in an attempt to limit pointless polling.
This value is stored in either the vcpu struct:

        kvm_vcpu->halt_poll_ns

or in the case of powerpc kvm-hv, in the vcore struct:

        kvmppc_vcore->halt_poll_ns

Thus this is a per vcpu (or vcore) value.

During polling if a wakeup source is received within the halt polling interval,
the interval is left unchanged.
In the event that a wakeup source isn't
received during the polling interval (and thus schedule is invoked) there are
two options, either the polling interval and total block time[0] were less than
the global max polling interval (see module params below), or the total block
time was greater than the global max polling interval.

In the event that both the polling interval and total block time were less than
the global max polling interval then the polling interval can be increased in
the hope that next time during the longer polling interval the wakeup source
will be received while the host is polling and the latency benefits will be
received. The polling interval is grown in the function grow_halt_poll_ns() and
is multiplied by the module parameter halt_poll_ns_grow, or set to
halt_poll_ns_grow_start when growing from zero.

In the event that the total block time was greater than the global max polling
interval then the host will never poll for long enough (limited by the global
max) to wake up during the polling interval so it may as well be shrunk in order
to avoid pointless polling.
The polling interval is shrunk in the function
shrink_halt_poll_ns() and is divided by the module parameter
halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0.

It is worth noting that this adjustment process attempts to home in on some
steady state polling interval but will only really do a good job for wakeups
which come at an approximately constant rate, otherwise there will be constant
adjustment of the polling interval.

[0] total block time:
                      the time between when the halt polling function is
                      invoked and a wakeup source received (irrespective of
                      whether the scheduler is invoked within that function).

Module Parameters
=================

The kvm module has 4 tunable module parameters to adjust the global max polling
interval, the initial value (to grow from 0), and the rate at which the polling
interval is grown and shrunk. These variables are defined in
include/linux/kvm_host.h and as module parameters in virt/kvm/kvm_main.c, or
arch/powerpc/kvm/book3s_hv.c in the powerpc kvm-hv case.
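
The way these four parameters drive the adjustment described in the previous
section can be sketched in userspace C. This is a simplified model written for
illustration, not the kernel implementation: the struct and the sketch_*
function names are hypothetical, and the handling of a zero grow factor (leave
the interval unchanged) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container mirroring the four module parameters. */
struct halt_poll_params {
	uint64_t max_ns;     /* halt_poll_ns: global ceiling */
	uint64_t grow;       /* halt_poll_ns_grow */
	uint64_t grow_start; /* halt_poll_ns_grow_start */
	uint64_t shrink;     /* halt_poll_ns_shrink */
};

/* Multiply the interval, start at grow_start when growing from zero,
 * and never exceed the global ceiling. */
static uint64_t sketch_grow(uint64_t cur, const struct halt_poll_params *p)
{
	uint64_t val;

	if (p->grow == 0)
		return cur; /* assumption: a zero factor disables growing */
	val = cur * p->grow;
	if (val < p->grow_start)
		val = p->grow_start;
	if (val > p->max_ns)
		val = p->max_ns;
	return val;
}

/* Divide the interval, or drop it to 0 when the divisor is 0. */
static uint64_t sketch_shrink(uint64_t cur, const struct halt_poll_params *p)
{
	return p->shrink ? cur / p->shrink : 0;
}

/* Return the next polling interval given the last total block time. */
static uint64_t sketch_update(uint64_t cur, uint64_t block_ns,
			      const struct halt_poll_params *p)
{
	assert(p != NULL);
	if (block_ns <= cur)
		return cur; /* wakeup arrived while polling: unchanged */
	if (block_ns > p->max_ns)
		return sketch_shrink(cur, p); /* polling could never have won */
	if (cur < p->max_ns && block_ns < p->max_ns)
		return sketch_grow(cur, p);
	return cur;
}
```

With a ceiling of 200000 ns, factors of 2 and a start value of 10000 ns, a vcpu
that repeatedly blocks for 15000 ns would go from 0 to 10000 to 20000 ns and
then stay there, while a single block longer than the ceiling halves the
interval again.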

+-----------------------+---------------------------+-------------------------+
|Module Parameter       | Description               | Default Value           |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns           | The global max polling    | KVM_HALT_POLL_NS_DEFAULT|
|                       | interval which defines    |                         |
|                       | the ceiling value of the  |                         |
|                       | polling interval for      | (per arch value)        |
|                       | each vcpu.                |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_grow      | The value by which the    | 2                       |
|                       | halt polling interval is  |                         |
|                       | multiplied in the         |                         |
|                       | grow_halt_poll_ns()       |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_grow_start| The initial value to grow | 10000                   |
|                       | to from zero in the       |                         |
|                       | grow_halt_poll_ns()       |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+
|halt_poll_ns_shrink    | The value by which the    | 2                       |
|                       | halt polling interval is  |                         |
|                       | divided in the            |                         |
|                       | shrink_halt_poll_ns()     |                         |
|                       | function.                 |                         |
+-----------------------+---------------------------+-------------------------+

These module parameters can be set from the sysfs files in:

        /sys/module/kvm/parameters/

Note: these module parameters are system-wide values and are not able to
      be tuned on a per vm basis.

Any changes to these parameters will be picked up by new and existing vCPUs the
next time they halt, with the notable exception of VMs using KVM_CAP_HALT_POLL
(see next section).

KVM_CAP_HALT_POLL
=================

KVM_CAP_HALT_POLL is a VM capability that allows userspace to override halt_poll_ns
on a per-VM basis. VMs using KVM_CAP_HALT_POLL ignore halt_poll_ns completely (but
still obey halt_poll_ns_grow, halt_poll_ns_grow_start, and halt_poll_ns_shrink).

See Documentation/virt/kvm/api.rst for more information on this capability.

Further Notes
=============

- Care should be taken when setting the halt_poll_ns module parameter as a large value
  has the potential to drive the cpu usage to 100% on a machine which would be almost
  entirely idle otherwise. This is because even if a guest has wakeups during which very
  little work is done and which are quite far apart, if the period is shorter than the
  global max polling interval (halt_poll_ns) then the host will always poll for the
  entire block time and thus cpu utilisation will go to 100%.

- Halt polling essentially presents a trade-off between power usage and latency and
  the module parameters should be used to tune the affinity for this. Idle cpu time is
  essentially converted to host kernel time with the aim of decreasing latency when
  entering the guest.

- Halt polling will only be conducted by the host when no other tasks are runnable on
  that cpu, otherwise the polling will cease immediately and schedule will be invoked to
  allow that other task to run. Thus this doesn't allow a guest to cause denial of service
  of the cpu.
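
Before tuning along the lines of the notes above, it can help to check what the
current system-wide values are. The following is a minimal sketch of a reader
for the sysfs parameter files; the helper name read_kvm_param is hypothetical,
and the directory is passed in so that the same code can be pointed at
/sys/module/kvm/parameters or at a test directory.

```c
#include <stdio.h>

/*
 * Read one unsigned integer parameter file "<dir>/<name>".
 * Returns 0 and stores the value in *out on success, -1 on failure.
 */
static int read_kvm_param(const char *dir, const char *name,
			  unsigned long long *out)
{
	char path[256];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1; /* path did not fit in the buffer */

	f = fopen(path, "r");
	if (!f)
		return -1; /* module not loaded or file not readable */

	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, read_kvm_param("/sys/module/kvm/parameters", "halt_poll_ns", &v)
would fetch the global max polling interval on a host with the kvm module
loaded.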