1 ========================== 2 BFQ (Budget Fair Queueing) 3 ========================== 4 5 BFQ is a proportional-share I/O scheduler, wit 6 low-latency capabilities. In addition to cgrou 7 controllers), BFQ's main features are: 8 9 - BFQ guarantees a high system and application 10 low latency for time-sensitive applications, 11 players; 12 - BFQ distributes bandwidth, not just time, am 13 groups (switching back to time distribution 14 throughput high). 15 16 In its default configuration, BFQ privileges l 17 throughput. So, when needed for achieving a lo 18 schedules that may lead to a lower throughput. 19 goal, for a given device, is to achieve the ma 20 throughput at all times, then do switch off al 21 for that device, by setting low_latency to 0. 22 details on how to configure BFQ for the desire 23 latency and throughput, or on how to maximize 24 25 As every I/O scheduler, BFQ adds some overhead 26 processing. To give an idea of this overhead, 27 single-lock-protected, per-request processing 28 sum of the execution times of the request inse 29 completion hooks---is, e.g., 1.9 us on an Inte 30 (dated CPU for notebooks; time measured with s 31 instrumentation, and using the throughput-sync 32 suite [1], in performance-profiling mode). To 33 context, the total, single-lock-protected, per 34 of the lightest I/O scheduler available in blk 35 us (mq-deadline is ~800 LOC, against ~10500 LO 36 37 Scheduling overhead further limits the maximum 38 process (already limited by the execution of t 39 stack). To give an idea of the limits with BFQ 40 CPUs, here are, first, the limits of BFQ for t 41 respectively, an average laptop, an old deskto 42 system, in case full hierarchical support is e 43 CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_B 44 set (Section 4-2): 45 - Intel i7-4850HQ: 400 KIOPS 46 - AMD A8-3850: 250 KIOPS 47 - ARM CortexTM-A53 Octa-core: 80 KIOPS 48 49 If CONFIG_BFQ_CGROUP_DEBUG is set (and of cour 50 support is enabled), then the sustainable thro 51 decreases, because all blkio.bfq* statistics a 52 (Section 4-2). For BFQ, this leads to the foll 53 sustainable throughputs, on the same systems a 54 - Intel i7-4850HQ: 310 KIOPS 55 - AMD A8-3850: 200 KIOPS 56 - ARM CortexTM-A53 Octa-core: 56 KIOPS 57 58 BFQ works for multi-queue devices too. 59 60 .. The table of contents follow. Impatients ca 61 62 .. CONTENTS 63 64 1. When may BFQ be useful? 65 1-1 Personal systems 66 1-2 Server systems 67 2. How does BFQ work? 68 3. What are BFQ's tunables and how to prope 69 4. BFQ group scheduling 70 4-1 Service guarantees provided 71 4-2 Interface 72 73 1. When may BFQ be useful? 74 ========================== 75 76 BFQ provides the following benefits on persona 77 78 1-1 Personal systems 79 -------------------- 80 81 Low latency for interactive applications 82 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 83 84 Regardless of the actual background workload, 85 interactive tasks, the storage device is virtu 86 it was idle. For example, even if one or more 87 background workloads are being executed: 88 89 - one or more large files are being read, writ 90 - a tree of source files is being compiled, 91 - one or more virtual machines are performing 92 - a software update is in progress, 93 - indexing daemons are scanning filesystems an 94 databases, 95 96 starting an application or loading a file from 97 takes about the same time as if the storage de 98 comparison, with CFQ, NOOP or DEADLINE, and in 99 applications experience high latencies, or eve 100 until the background workload terminates (also 101 102 Low latency for soft real-time applications 103 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 104 Also soft real-time applications, such as audi 105 players/streamers, enjoy a low latency and a l 106 of the background I/O workload. As a consequen 107 do not suffer from almost any glitch due to th 108 109 Higher speed for code-development tasks 110 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 111 112 If some additional workload happens to be exec 113 BFQ executes the I/O-related components of typ 114 tasks (compilation, checkout, merge, etc.) muc 115 NOOP or DEADLINE. 116 117 High throughput 118 ^^^^^^^^^^^^^^^ 119 120 On hard disks, BFQ achieves up to 30% higher t 121 up to 150% higher throughput than DEADLINE and 122 sequential workloads considered in our tests. 123 and with all the workloads on flash-based devi 124 instead, about the same throughput as the othe 125 126 Strong fairness, bandwidth and delay guarantee 127 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 128 129 BFQ distributes the device throughput, and not 130 among I/O-bound applications in proportion to 131 workload and regardless of the device paramete 132 guarantees, it is possible to compute a tight 133 guarantees by a simple formula. If not configu 134 guarantees, BFQ switches to time-based resourc 135 applications that would otherwise cause a thro 136 137 1-2 Server systems 138 ------------------ 139 140 Most benefits for server systems follow from t 141 properties as above. In particular, regardless 142 possibly heavy workloads are being served, BFQ 143 144 * audio and video-streaming with zero or very 145 rate; 146 147 * fast retrieval of WEB pages and embedded obj 148 149 * real-time recording of data in live-dumping 150 packet logging); 151 152 * responsiveness in local and remote access to 153 154 155 2. How does BFQ work? 156 ===================== 157 158 BFQ is a proportional-share I/O scheduler, who 159 plus a lot of code, are borrowed from CFQ. 160 161 - Each process doing I/O on a device is associ 162 `(bfq_)queue`. 163 164 - BFQ grants exclusive access to the device, f 165 (process) at a time, and implements this ser 166 associating every queue with a budget, measu 167 sectors. 168 169 - After a queue is granted access to the dev 170 queue is decremented, on each request disp 171 request. 172 173 - The in-service queue is expired, i.e., its 174 only if one of the following events occurs 175 its budget, 2) the queue empties, 3) a "bu 176 177 - The budget timeout prevents processes do 178 holding the device for too long and dram 179 throughput. 180 181 - Actually, as in CFQ, a queue associated 182 sync requests may not be expired immedia 183 contrast, BFQ may idle the device for a 184 giving the process the chance to go on b 185 a new request in time. Device idling typ 186 throughput on rotational devices and on 187 devices, if processes do synchronous and 188 addition, under BFQ, device idling is al 189 guaranteeing the desired throughput frac 190 issuing sync requests (see the descripti 191 tunable in this document, or [1, 2], for 192 193 - With respect to idling for service gua 194 processes are competing for the device 195 all processes and groups have the same 196 guarantees the expected throughput dis 197 idling the device. Throughput is thus 198 this common scenario. 199 200 - On flash-based storage with internal qu 201 (typically NCQ), device idling happens 202 to throughput. So, with these devices, 203 only when strictly needed for service g 204 guaranteeing low latency or fairness. I 205 throughput may be sub-optimal. No solut 206 provide both strong service guarantees 207 on devices with internal queueing. 208 209 - If low-latency mode is enabled (default co 210 executes some special heuristics to detect 211 real-time applications (e.g., video or aud 212 and to reduce their latency. The most impo 213 achieve this goal is to give to the queues 214 applications more than their fair share of 215 throughput. For brevity, we call it just " 216 sets of actions taken by BFQ to privilege 217 particular, BFQ provides a milder form of 218 interactive applications, and a stronger f 219 applications. 220 221 - BFQ automatically deactivates idling for q 222 queue creations. In fact, these queues are 223 the processes of applications and services 224 from a high throughput. Examples are syste 225 grep. 226 227 - As CFQ, BFQ merges queues performing inter 228 performing random I/O that becomes mostly 229 merged. Differently from CFQ, BFQ achieves 230 reactive mechanism, called Early Queue Mer 231 responsive in detecting interleaved I/O (c 232 that it enables BFQ to achieve a high thro 233 merging, even for queues for which CFQ nee 234 mechanism, preemption, to get a high throu 235 unified mechanism to achieve a high throug 236 I/O. 237 238 - Queues are scheduled according to a varian 239 B-WF2Q+, and implemented using an augmente 240 O(log N) overall complexity. See [2] for 241 also ready for hierarchical scheduling, de 242 243 - B-WF2Q+ guarantees a tight deviation with 244 perfectly fair, and smooth service. In par 245 guarantees that each queue receives a frac 246 throughput proportional to its weight, eve 247 fluctuates, and regardless of: the device 248 workload and the budgets assigned to the q 249 250 - The last, budget-independence, property (a 251 counterintuitive in the first place) is de 252 the following reasons: 253 254 - First, with any proportional-share sched 255 deviation with respect to an ideal servi 256 the maximum budget (slice) assigned to q 257 BFQ can keep this deviation tight, not o 258 accurate service of B-WF2Q+, but also be 259 need to assign a larger budget to a queu 260 receive a higher fraction of the device 261 262 - Second, BFQ is free to choose, for every 263 budget that best fits the needs of the p 264 leverages the I/O pattern of the process 265 updates queue budgets with a simple feed 266 allows a high throughput to be achieved, 267 tight latency guarantees to time-sensiti 268 the in-service queue expires, this algor 269 budget of the queue so as to: 270 271 - Let large budgets be eventually assign 272 associated with I/O-bound applications 273 I/O: in fact, the longer these applica 274 got access to the device, the higher t 275 276 - Let small budgets be eventually assign 277 associated with time-sensitive applica 278 perform sporadic and short I/O), becau 279 budget assigned to a queue waiting for 280 B-WF2Q+ will serve that queue (Subsec 281 282 - If several processes are competing for the d 283 but all processes and groups have the same w 284 guarantees the expected throughput distribut 285 the device. It uses preemption instead. Thro 286 higher in this common scenario. 287 288 - ioprio classes are served in strict priority 289 lower-priority queues are not served as long 290 higher-priority queues. Among queues in the 291 bandwidth is distributed in proportion to th 292 queue. A very thin extra bandwidth is howeve 293 the Idle class, to prevent it from starving. 294 295 296 3. What are BFQ's tunables and how to properly 297 ============================================== 298 299 Most BFQ tunables affect service guarantees (b 300 fairness) and throughput. For full details on 301 desired tradeoff between service guarantees an 302 parameters slice_idle, strict_guarantees and l 303 on how to maximise throughput, see slice_idle, 304 max_budget. The other performance-related para 305 inherited from, and have been preserved mostly 306 CFQ. So far, no performance improvement has be 307 changing the latter parameters in BFQ. 308 309 In particular, the tunables back_seek-max, bac 310 fifo_expire_async and fifo_expire_sync below a 311 CFQ. Their description is just copied from tha 312 considerations in the description of slice_idl 313 too. 314 315 per-process ioprio and weight 316 ----------------------------- 317 318 Unless the cgroups interface is used (see "4. 319 weights can be assigned to processes only indi 320 priorities, and according to the relation: 321 weight = (IOPRIO_BE_NR - ioprio) * 10. 322 323 Beware that, if low-latency is set, then BFQ a 324 weight of the queues associated with interacti 325 applications. Unset this tunable if you need/w 326 327 slice_idle 328 ---------- 329 330 This parameter specifies how long BFQ should i 331 request, when certain sync BFQ queues become e 332 slice_idle is a non-zero value. Idling has a d 333 throughput and making sure that the desired th 334 respected (see the description of how BFQ work 335 papers referred there). 336 337 As for throughput, idling can be very helpful 338 like single spindle SATA/SAS disks where we ca 339 number of seeks and see improved throughput. 340 341 Setting slice_idle to 0 will remove all the id 342 should see an overall improved throughput on f 343 like multiple SATA/SAS disks in hardware RAID 344 as flash-based storage with internal command q 345 parallelism). 346 347 So depending on storage and workload, it might 348 slice_idle=0. In general for SATA/SAS disks a 349 SATA/SAS disks keeping slice_idle enabled shou 350 configurations where there are multiple spindl 351 (Host based hardware RAID controller or for st 352 flash-based fast storage, setting slice_idle=0 353 throughput and acceptable latencies. 354 355 Idling is however necessary to have service gu 356 case of differentiated weights or differentiat 357 To see why, suppose that a given BFQ queue A m 358 requests served for each request served for an 359 ensures that, if A makes a new I/O request sli 360 empty, then no request of B is dispatched in t 361 does not lose the possibility to get more than 362 before the next request of B is dispatched. No 363 guarantees the desired differentiated treatmen 364 terms of I/O-request dispatches. To guarantee 365 order then corresponds to the dispatch order, 366 tunable must be set too. 367 368 There is an important flip side to idling: apa 369 where it is beneficial also for throughput, id 370 throughput. One important case is random workl 371 issue, BFQ tends to avoid idling as much as po 372 beneficial also for throughput (as detailed in 373 consequence of this behavior, and of further i 374 strict_guarantees tunable, short-term service 375 occasionally violated. And, in some cases, the 376 more important than guaranteeing maximum throu 377 video playing/streaming, a very low drop rate 378 than maximum throughput. In these cases, consi 379 strict_guarantees parameter. 380 381 slice_idle_us 382 ------------- 383 384 Controls the same tuning parameter as slice_id 385 Either tunable can be used to set idling behav 386 other tunable will reflect the newly set value 387 388 strict_guarantees 389 ----------------- 390 391 If this parameter is set (default: unset), the 392 393 - always performs idling when the in-service q 394 395 - forces the device to serve one I/O request a 396 new request only if there is no outstanding 397 398 In the presence of differentiated weights or I 399 the above conditions are needed to guarantee t 400 receives its allotted share of the bandwidth. 401 needed for the reasons explained in the descri 402 tunable. The second condition is needed becau 403 devices reorder internally-queued requests, wh 404 the service guarantees enforced by the I/O sch 405 406 Setting strict_guarantees may evidently affect 407 408 back_seek_max 409 ------------- 410 411 This specifies, given in Kbytes, the maximum " 412 The distance is the amount of space from the c 413 sectors that are backward in terms of distance 414 415 This parameter allows the scheduler to anticip 416 direction and consider them as being the "next 417 distance from the current head location. 418 419 back_seek_penalty 420 ----------------- 421 422 This parameter is used to compute the cost of 423 backward distance of request is just 1/back_se 424 request, then the seeking cost of two requests 425 426 So scheduler will not bias toward one or the o 427 will bias toward front request). Default value 428 429 fifo_expire_async 430 ----------------- 431 432 This parameter is used to set the timeout of a 433 value of this is 250ms. 434 435 fifo_expire_sync 436 ---------------- 437 438 This parameter is used to set the timeout of s 439 value of this is 125ms. In case to favor synch 440 one, this value should be decreased relative t 441 442 low_latency 443 ----------- 444 445 This parameter is used to enable/disable BFQ's 446 default, low latency mode is enabled. If enabl 447 real-time applications are privileged and expe 448 as explained in more detail in the description 449 450 DISABLE this mode if you need full control on 451 distribution. In fact, if it is enabled, then 452 increases the bandwidth share of privileged ap 453 means to guarantee a lower latency to them. 454 455 In addition, as already highlighted at the beg 456 DISABLE this mode if your only goal is to achi 457 In fact, privileging the I/O of some applicati 458 entail a lower throughput. To achieve the high 459 on a non-rotational device, setting slice_idle 460 (at the cost of giving up any strong guarantee 461 latency). 462 463 timeout_sync 464 ------------ 465 466 Maximum amount of device time that can be give 467 it has been selected for service. On devices w 468 increasing this time usually increases maximum 469 opposite end, increasing this time coarsens th 470 short-term bandwidth and latency guarantees, e 471 following parameter is set to zero. 472 473 max_budget 474 ---------- 475 476 Maximum amount of service, measured in sectors 477 to a BFQ queue once it is set in service (of c 478 of the above timeout). According to what was s 479 the algorithm, larger values increase the thro 480 the percentage of sequential I/O requests issu 481 values is that they coarsen the granularity of 482 and latency guarantees. 483 484 The default value is 0, which enables auto-tun 485 to the maximum number of sectors that can be s 486 timeout_sync, according to the estimated peak 487 488 For specific devices, some users have occasion 489 reached a higher throughput by setting max_bud 490 setting max_budget to a higher value than 0. I 491 set max_budget to higher values than those to 492 it with auto-tuning. An alternative way to ach 493 just increase the value of timeout_sync, leavi 494 495 4. Group scheduling with BFQ 496 ============================ 497 498 BFQ supports both cgroups-v1 and cgroups-v2 io 499 blkio and io. In particular, BFQ supports weig 500 share. To activate cgroups support, set BFQ_GR 501 502 4-1 Service guarantees provided 503 ------------------------------- 504 505 With BFQ, proportional share means true propor 506 device bandwidth, according to group weights. 507 with weight 200 gets twice the bandwidth, and 508 of a group with weight 100. 509 510 BFQ supports hierarchies (group trees) of any 511 distributed among groups and processes in the 512 group, the children of the group share the who 513 group in proportion to their weights. In parti 514 that, for each leaf group, every process of th 515 same share of the whole group bandwidth, unles 516 process is modified. 517 518 The resource-sharing guarantee for a group may 519 switch from bandwidth to time, if providing ba 520 the group lowers the throughput too much. This 521 per-process basis: if a process of a leaf grou 522 if served in such a way to receive its share o 523 BFQ switches back to just time-based proportio 524 process. 525 526 4-2 Interface 527 ------------- 528 529 To get proportional sharing of bandwidth with 530 BFQ must of course be the active scheduler for 531 532 Within each group directory, the names of the 533 BFQ-specific cgroup parameters and stats begin 534 prefix. So, with cgroups-v1 or cgroups-v2, the 535 BFQ-specific files is "blkio.bfq." or "io.bfq. 536 parameter to set the weight of a group with BF 537 or io.bfq.weight. 538 539 As for cgroups-v1 (blkio controller), the exac 540 created, and kept up-to-date by bfq, depends o 541 CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, 542 the stat files documented in 543 Documentation/admin-guide/cgroup-v1/blkio-cont 544 CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq c 545 546 blkio.bfq.io_service_bytes 547 blkio.bfq.io_service_bytes_recursive 548 blkio.bfq.io_serviced 549 blkio.bfq.io_serviced_recursive 550 551 The value of CONFIG_BFQ_CGROUP_DEBUG greatly i 552 throughput sustainable with bfq, because updat 553 stats is rather costly, especially for some of 554 CONFIG_BFQ_CGROUP_DEBUG. 555 556 Parameters 557 ---------- 558 559 For each group, the following parameters can b 560 561 weight 562 This specifies the default weight for 563 Available values: 1..1000 (default: 10 564 565 For cgroup v1, it is set by writing th 566 567 For cgroup v2, it is set by writing th 568 (with an optional prefix of `default` 569 570 The linear mapping between ioprio and 571 of the tunable section, is still valid 572 IOPRIO_BE_NR*10 are mapped to ioprio 0 573 574 Recall that, if low-latency is set, th 575 weight of the queues associated with i 576 applications. Unset this tunable if yo 577 578 weight_device 579 This specifies a per-device weight for 580 `minor:major weight`. A weight of `0` 581 weight. 582 583 For cgroup v1, it is set by writing th 584 585 For cgroup v2, the file name is `io.bf 586 587 588 [1] 589 P. Valente, A. Avanzini, "Evolution of the 590 Scheduler", Proceedings of the First Works 591 Technologies (MST-2015), May 2015. 592 593 http://algogroup.unimore.it/people/paolo/d 594 595 [2] 596 P. Valente and M. Andreolini, "Improving A 597 Responsiveness with the BFQ Disk I/O Sched 598 the 5th Annual International Systems and S 599 (SYSTOR '12), June 2012. 600 601 Slightly extended version: 602 603 http://algogroup.unimore.it/people/paolo/d 604 605 [3] 606 https://github.com/Algodev-github/S
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.