==========================
BFQ (Budget Fair Queueing)
==========================

BFQ is a proportional-share I/O scheduler, with some extra
low-latency capabilities. In addition to cgroups support (blkio or io
controllers), BFQ's main features are:

- BFQ guarantees a high system and application responsiveness, and a
  low latency for time-sensitive applications, such as audio or video
  players;
- BFQ distributes bandwidth, and not just time, among processes or
  groups (switching back to time distribution when needed to keep
  throughput high).

In its default configuration, BFQ privileges latency over
throughput. So, when needed for achieving a lower latency, BFQ builds
schedules that may lead to a lower throughput. If your main or only
goal, for a given device, is to achieve the maximum-possible
throughput at all times, then do switch off all low-latency heuristics
for that device, by setting low_latency to 0. See Section 3 for
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.
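
As a minimal sketch of the low_latency setting mentioned above: the
snippet assumes that the BFQ tunables of a device are exposed as files
under ``/sys/block/<device>/queue/iosched/`` (the usual sysfs location
for I/O-scheduler tunables, which this section does not state), that
BFQ is the active scheduler for the device, and that the caller is
allowed to write there. The helper names are hypothetical.

```python
from pathlib import Path


def tunable_path(device: str, name: str, sysfs_root: str = "/sys") -> Path:
    """Return the assumed sysfs path of an I/O-scheduler tunable."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / name


def read_tunable(device: str, name: str, sysfs_root: str = "/sys") -> str:
    """Read the current value of a tunable as a stripped string."""
    return tunable_path(device, name, sysfs_root).read_text().strip()


def write_tunable(device: str, name: str, value: int,
                  sysfs_root: str = "/sys") -> None:
    """Write an integer value to a tunable (normally requires root)."""
    tunable_path(device, name, sysfs_root).write_text(f"{value}\n")
```

With a hypothetical device name, ``write_tunable("sda", "low_latency", 0)``
would switch off the low-latency heuristics for that device, and
``read_tunable("sda", "low_latency")`` would read the setting back.
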

Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
processing. To give an idea of this overhead, the total,
single-lock-protected, per-request processing time of BFQ---i.e., the
sum of the execution times of the request insertion, dispatch and
completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
(dated CPU for notebooks; time measured with simple code
instrumentation, and using the throughput-sync.sh script of the S
suite [1], in performance-profiling mode). To put this result into
context, the total, single-lock-protected, per-request execution time
of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).

Scheduling overhead further limits the maximum IOPS that a CPU can
process (already limited by the execution of the rest of the I/O
stack). To give an idea of the limits with BFQ, on slow or average
CPUs, here are, first, the limits of BFQ for three different CPUs, on,
respectively, an average laptop, an old desktop, and a cheap embedded
system, in case full hierarchical support is enabled (i.e.,
CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
set (Section 4-2):

- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM CortexTM-A53 Octa-core: 80 KIOPS

If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
support is enabled), then the sustainable throughput with BFQ
decreases, because all blkio.bfq* statistics are created and updated
(Section 4-2). For BFQ, this leads to the following maximum
sustainable throughputs, on the same systems as above:

- Intel i7-4850HQ: 310 KIOPS
- AMD A8-3850: 200 KIOPS
- ARM CortexTM-A53 Octa-core: 56 KIOPS

BFQ works for multi-queue devices too.

.. The table of contents follows. Impatient readers can just jump to Section 3.

.. CONTENTS

   1. When may BFQ be useful?
    1-1 Personal systems
    1-2 Server systems
   2. How does BFQ work?
   3. What are BFQ's tunables and how to properly configure BFQ?
   4. BFQ group scheduling
    4-1 Service guarantees provided
    4-2 Interface

1. When may BFQ be useful?
==========================

BFQ provides the following benefits on personal and server systems.

1-1 Personal systems
--------------------

Low latency for interactive applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it was idle. For example, even if one or more of the following
background workloads are being executed:

- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,

starting an application or loading a file from within an application
takes about the same time as if the storage device was idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Soft real-time applications, such as audio and video
players/streamers, also enjoy a low latency and a low drop rate,
regardless of the background I/O workload. As a consequence, these
applications suffer almost no glitches due to the background workload.

Higher speed for code-development tasks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, etc.) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput
^^^^^^^^^^^^^^^

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.

1-2 Server systems
------------------

Most benefits for server systems follow from the same service
properties as above. In particular, regardless of whether additional,
possibly heavy workloads are being served, BFQ guarantees:

* audio and video-streaming with zero or very low jitter and drop
  rate;

* fast retrieval of web pages and embedded objects;

* real-time recording of data in live-dumping applications (e.g.,
  packet logging);

* responsiveness in local and remote access to a server.


2. How does BFQ work?
=====================

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  `(bfq_)queue`.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices and on non-queueing flash-based
      devices, if processes do synchronous and sequential I/O. In
      addition, under BFQ, device idling is also instrumental in
      guaranteeing the desired throughput fraction to processes
      issuing sync requests (see the description of the slice_idle
      tunable in this document, or [1, 2], for more details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever
        idling the device. Throughput is thus as high as possible in
        this common scenario.

    - On flash-based storage with internal queueing of commands
      (typically NCQ), device idling is always detrimental
      to throughput. So, with these devices, BFQ performs idling
      only when strictly needed for service guarantees, i.e., for
      guaranteeing low latency or fairness. In these cases, overall
      throughput may be sub-optimal. No solution currently exists to
      provide both strong service guarantees and optimal throughput
      on devices with internal queueing.

  - If low-latency mode is enabled (default configuration), BFQ
    executes some special heuristics to detect interactive and soft
    real-time applications (e.g., video or audio players/streamers),
    and to reduce their latency. The most important action taken to
    achieve this goal is to give to the queues associated with these
    applications more than their fair share of the device
    throughput. For brevity, we simply call "weight-raising" the whole
    set of actions taken by BFQ to privilege these queues. In
    particular, BFQ provides a milder form of weight-raising for
    interactive applications, and a stronger form for soft real-time
    applications.

  - BFQ automatically deactivates idling for queues born in a burst of
    queue creations. In fact, these queues are usually associated with
    the processes of applications and services that benefit mostly
    from a high throughput. Examples are systemd during boot, or git
    grep.

  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
    performing random I/O that becomes mostly sequential if
    merged. Unlike CFQ, BFQ achieves this goal with a more
    reactive mechanism, called Early Queue Merge (EQM). EQM is so
    responsive in detecting interleaved I/O (cooperating processes),
    that it enables BFQ to achieve a high throughput, by queue
    merging, even for queues for which CFQ needs a different
    mechanism, preemption, to get a high throughput. As such, EQM is a
    unified mechanism to achieve a high throughput with interleaved
    I/O.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling, details in Section 4.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight, not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        they have got access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues. Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to
  the Idle class, to prevent it from starving.


3. What are BFQ's tunables and how to properly configure BFQ?
=============================================================

Most BFQ tunables affect service guarantees (basically latency and
fairness) and throughput. For full details on how to choose the
desired tradeoff between service guarantees and throughput, see the
parameters slice_idle, strict_guarantees and low_latency. For details
on how to maximize throughput, see slice_idle, timeout_sync and
max_budget. The other performance-related parameters have been
inherited from, and have been preserved mostly for compatibility with
CFQ.
So far, no performance improvement has been reported after
changing the latter parameters in BFQ.

In particular, the tunables back_seek_max, back_seek_penalty,
fifo_expire_async and fifo_expire_sync below are the same as in
CFQ. Their description is just copied from that for CFQ. Some
considerations in the description of slice_idle are copied from CFQ
too.

per-process ioprio and weight
-----------------------------

Unless the cgroups interface is used (see "4. BFQ group scheduling"),
weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = (IOPRIO_BE_NR - ioprio) * 10.

Beware that, if low_latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.

slice_idle
----------

This parameter specifies how long BFQ should idle waiting for the next
I/O request, when certain sync BFQ queues become empty. By default
slice_idle is a non-zero value. Idling has a double purpose: boosting
throughput and making sure that the desired throughput distribution is
respected (see the description of how BFQ works, and, if needed, the
papers referred to there).

As for throughput, idling can be very helpful on highly seeky media
like single-spindle SATA/SAS disks, where it cuts down on the overall
number of seeks and improves throughput.

Setting slice_idle to 0 will remove all the idling on queues and one
should see an overall improved throughput on faster storage devices
like multiple SATA/SAS disks in hardware RAID configuration, as well
as flash-based storage with internal command queueing (and
parallelism).

So depending on storage and workload, it might be useful to set
slice_idle=0. In general, for SATA/SAS disks and software RAID of
SATA/SAS disks, keeping slice_idle enabled should be useful. For any
configurations where there are multiple spindles behind a single LUN
(host-based hardware RAID controller or storage arrays), or with
flash-based fast storage, setting slice_idle=0 might end up in better
throughput and acceptable latencies.

Idling is however necessary to have service guarantees enforced in
case of differentiated weights or differentiated I/O-request lengths.
To see why, suppose that a given BFQ queue A must get several I/O
requests served for each request served for another queue B. Idling
ensures that, if A makes a new I/O request slightly after becoming
empty, then no request of B is dispatched in the middle, and thus A
does not lose the possibility to get more than one request dispatched
before the next request of B is dispatched. Note that idling
guarantees the desired differentiated treatment of queues only in
terms of I/O-request dispatches. To guarantee that the actual service
order then corresponds to the dispatch order, the strict_guarantees
tunable must be set too.
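
To make the role of differentiated weights concrete, the following
sketch combines the weight relation from the "per-process ioprio and
weight" section with the proportional sharing described in Section 2.
It assumes that IOPRIO_BE_NR is 8, as in the Linux ioprio definitions
(the value is not stated in this document), and the function names are
hypothetical. It models only the ideal distribution among queues of the
same class, not the actual scheduling performed by BFQ.

```python
IOPRIO_BE_NR = 8  # assumption: number of best-effort priority levels in Linux


def weight_from_ioprio(ioprio: int) -> int:
    """Apply the relation weight = (IOPRIO_BE_NR - ioprio) * 10."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError(f"ioprio must be in [0, {IOPRIO_BE_NR - 1}]")
    return (IOPRIO_BE_NR - ioprio) * 10


def ideal_shares(ioprios: dict[str, int]) -> dict[str, float]:
    """Ideal throughput fraction of each queue, proportional to its weight."""
    weights = {name: weight_from_ioprio(p) for name, p in ioprios.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Queue A at ioprio 0 (weight 80) and queue B at ioprio 7 (weight 10).
shares = ideal_shares({"A": 0, "B": 7})
```

Under these assumptions A is entitled to 8/9 of the throughput, i.e.,
eight equally sized requests for each request of B. This is the kind of
ratio that idling protects when A briefly becomes empty.
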

There is an important flip side to idling: apart from the above cases
where it is beneficial also for throughput, idling can severely impact
throughput. One important case is random workloads. Because of this
issue, BFQ tends to avoid idling as much as possible, when it is not
beneficial also for throughput (as detailed in Section 2). As a
consequence of this behavior, and of further issues described for the
strict_guarantees tunable, short-term service guarantees may be
occasionally violated. And, in some cases, these guarantees may be
more important than guaranteeing maximum throughput. For example, in
video playing/streaming, a very low drop rate may be more important
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.

slice_idle_us
-------------

Controls the same tuning parameter as slice_idle, but in microseconds.
Either tunable can be used to set idling behavior. Afterwards, the
other tunable will reflect the newly set value in sysfs.
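
The relation between the two tunables can be pictured as two views of
one stored value. This is only a toy model: it assumes that slice_idle
is expressed in milliseconds (the unit is not stated in this section),
it does not reproduce any rounding the kernel may apply, and the class
name is hypothetical.

```python
class IdleSetting:
    """Toy model: one idling interval exposed in two units."""

    def __init__(self, microseconds: int = 0) -> None:
        self._us = microseconds  # single underlying value

    @property
    def slice_idle(self) -> int:
        # Assumed millisecond view; this model truncates any remainder.
        return self._us // 1000

    @slice_idle.setter
    def slice_idle(self, ms: int) -> None:
        self._us = ms * 1000

    @property
    def slice_idle_us(self) -> int:
        return self._us

    @slice_idle_us.setter
    def slice_idle_us(self, us: int) -> None:
        self._us = us


setting = IdleSetting()
setting.slice_idle = 8          # set through the millisecond view
via_us = setting.slice_idle_us  # the other view reflects the new value
setting.slice_idle_us = 0       # disable idling through the microsecond view
via_ms = setting.slice_idle
```

In this model, writing either attribute changes what the other one
reports, which mirrors the behavior described for the two sysfs files.
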

strict_guarantees
-----------------

If this parameter is set (default: unset), then BFQ

- always performs idling when the in-service queue becomes empty;

- forces the device to serve one I/O request at a time, by dispatching a
  new request only if there is no outstanding request.

In the presence of differentiated weights or I/O-request sizes, both
the above conditions are needed to guarantee that every BFQ queue
receives its allotted share of the bandwidth. The first condition is
needed for the reasons explained in the description of the slice_idle
tunable. The second condition is needed because all modern storage
devices reorder internally-queued requests, which may trivially break
the service guarantees enforced by the I/O scheduler.

Setting strict_guarantees may evidently affect throughput.

back_seek_max
-------------

This specifies, given in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
-----------------

This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of the distance
of a "front" request, then the seeking cost of the two requests is
considered equivalent.

So the scheduler will not bias toward one or the other request (otherwise
the scheduler will bias toward the front request). The default value of
back_seek_penalty is 2.

fifo_expire_async
-----------------

This parameter is used to set the timeout of asynchronous requests. The
default value is 250ms.

fifo_expire_sync
----------------

This parameter is used to set the timeout of synchronous requests. The
default value is 125ms.
To favor synchronous requests over asynchronous ones, decrease this
value relative to fifo_expire_async.

low_latency
-----------

This parameter is used to enable/disable BFQ's low latency mode. By
default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.

DISABLE this mode if you need full control over bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.

In addition, as already highlighted at the beginning of this document,
DISABLE this mode if your only goal is to achieve a high throughput.
In fact, privileging the I/O of some application over the rest may
entail a lower throughput. To achieve the highest-possible throughput
on a non-rotational device, setting slice_idle to 0 may be needed too
(at the cost of giving up any strong guarantee on fairness and low
latency).
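
The throughput-oriented setup described above can be sketched in Python
as follows. This is only an illustration under stated assumptions: the
tunables are assumed to live in /sys/block/<device>/queue/iosched/, the
function names and the sysfs_root parameter are hypothetical, and
whether slice_idle=0 actually helps depends on the device and workload.

```python
from pathlib import Path


def throughput_settings(rotational):
    """Tunable values that favor throughput over latency guarantees."""
    settings = {"low_latency": 0}  # switch off the low-latency heuristics
    if not rotational:
        # On non-rotational devices, disabling idling may be needed too.
        settings["slice_idle"] = 0
    return settings


def apply_settings(device, settings, sysfs_root="/sys"):
    """Write each tunable to the scheduler directory of the device."""
    iosched = Path(sysfs_root) / "block" / device / "queue" / "iosched"
    for name, value in settings.items():
        (iosched / name).write_text(f"{value}\n")
```

For example, apply_settings("nvme0n1", throughput_settings(False)) would
disable both low_latency and idling on a flash device (the device name
is an example; writing to sysfs requires root).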

timeout_sync
------------

Maximum amount of device time that can be given to a task (queue) once
it has been selected for service. On devices with costly seeks,
increasing this time usually increases maximum throughput. On the
other hand, increasing this time coarsens the granularity of the
short-term bandwidth and latency guarantees, especially if the
following parameter is set to zero.

max_budget
----------

Maximum amount of service, measured in sectors, that can be provided
to a BFQ queue once it is set in service (of course within the limits
of the above timeout). According to what was said in the description of
the algorithm, larger values increase the throughput in proportion to
the percentage of sequential I/O requests issued. The price of larger
values is that they coarsen the granularity of short-term bandwidth
and latency guarantees.
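
To make the interplay between the budget and the timeout concrete, the
following Python sketch computes which of the two limits ends a service
slot first. It is a simplified model, not BFQ's actual code: it assumes
the queue is served at a constant rate, and the function name and all
numbers are illustrative.

```python
def service_slot(max_budget_sectors, timeout_sync_ms, rate_sectors_per_s):
    """Return (sectors served, limit reached first) for one service slot."""
    if rate_sectors_per_s <= 0:
        raise ValueError("rate must be positive")
    # Sectors the device can transfer before the timeout fires.
    sectors_within_timeout = rate_sectors_per_s * timeout_sync_ms // 1000
    if max_budget_sectors <= sectors_within_timeout:
        return max_budget_sectors, "max_budget"
    return sectors_within_timeout, "timeout_sync"
```

With a budget of 16384 sectors and a 125 ms timeout, a queue served at
200000 sectors/s exhausts its budget first, while a slow, seeky queue
served at 2000 sectors/s is stopped by the timeout after 250 sectors.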

The default value is 0, which enables auto-tuning: BFQ sets max_budget
to the maximum number of sectors that can be served during
timeout_sync, according to the estimated peak rate.

For specific devices, some users have occasionally reported reaching
a higher throughput by setting max_budget explicitly, i.e., by
setting max_budget to a higher value than 0. In particular, they have
set max_budget to higher values than those to which BFQ would have set
it with auto-tuning. An alternative way to achieve this goal is to
just increase the value of timeout_sync, leaving max_budget equal to 0.

weights
-------

Read-only parameter, used to show the weights of the currently active
BFQ queues.


4. Group scheduling with BFQ
============================

BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
blkio and io. In particular, BFQ supports weight-based proportional
share. To activate cgroups support, set BFQ_GROUP_IOSCHED.

4-1 Service guarantees provided
-------------------------------

With BFQ, proportional share means true proportional share of the
device bandwidth, according to group weights. For example, a group
with weight 200 gets twice the bandwidth, and not just twice the time,
of a group with weight 100.

BFQ supports hierarchies (group trees) of any depth. Bandwidth is
distributed among groups and processes in the expected way: for each
group, the children of the group share the whole bandwidth of the
group in proportion to their weights. In particular, this implies
that, for each leaf group, every process of the group receives the
same share of the whole group bandwidth, unless the ioprio of the
process is modified.

The resource-sharing guarantee for a group may partially or totally
switch from bandwidth to time, if providing bandwidth guarantees to
the group lowers the throughput too much.
This switch occurs on a per-process basis: if serving a process of a
leaf group in such a way that it receives its share of the bandwidth
causes throughput loss, then BFQ switches back to just time-based
proportional share for that process.

4-2 Interface
-------------

To get proportional sharing of bandwidth with BFQ for a given device,
BFQ must of course be the active scheduler for that device.

Within each group directory, the names of the files associated with
BFQ-specific cgroup parameters and stats begin with the "bfq."
prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
parameter to set the weight of a group with BFQ is blkio.bfq.weight
or io.bfq.weight.

As for cgroups-v1 (blkio controller), the exact set of stat files
created, and kept up-to-date by bfq, depends on whether
CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
the stat files documented in
Documentation/admin-guide/cgroup-v1/blkio-controller.rst.
If, instead,
CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files::

  blkio.bfq.io_service_bytes
  blkio.bfq.io_service_bytes_recursive
  blkio.bfq.io_serviced
  blkio.bfq.io_serviced_recursive

The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.

Parameters to set
-----------------

For each group, there is only the following parameter to set.

weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
group inside its parent. Available values: 1..1000 (default 100).
The linear mapping between ioprio and weights, described at the beginning
of the tunable section, is still valid, but all weights higher than
IOPRIO_BE_NR*10 are mapped to ioprio 0.

Recall that, if low-latency is set, then BFQ automatically raises the
weight of the queues associated with interactive and soft real-time
applications. Unset this tunable if you need/want to control weights.


[1]
    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.

    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[2]
    P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.

    Slightly extended version:

    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

[3]
    https://github.com/Algodev-github/S