.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression
rates of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first part is to recognize duplicate data; the second is to
avoid storing multiple copies of those duplicates. Therefore, dm-vdo has
two main parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than in most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.

Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.

Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.

Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.
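
The following sketch illustrates the zoning idea in miniature. The names,
the zone count, and the simple modulo mapping are all illustrative; they
are not the structures or the address-to-zone mapping used by the driver.
The point is only that each zone's structures are served by exactly one
queue (and therefore one thread), so work placed on that queue needs no
further locking::

  /*
   * Illustrative only: these are not the structures used in
   * drivers/md/dm-vdo, and vdo does not map addresses to zones with a
   * simple modulo of the address.
   */
  #include <stdio.h>
  #include <stdint.h>

  #define ZONE_COUNT 3

  /*
   * One queue per zone. In vdo, each queue is drained by a single
   * thread, so the portion of any zoned structure owned by that zone is
   * only ever touched by that thread.
   */
  struct zone {
          unsigned int zone_number;
          /* ... this zone's portion of some zoned structure ... */
  };

  static struct zone zones[ZONE_COUNT];

  /* Pick the zone that owns a given logical block address. */
  static struct zone *zone_for_lba(uint64_t lba)
  {
          return &zones[lba % ZONE_COUNT];
  }

  /* Stand-in for "add this request to the zone's work queue". */
  static void enqueue_for_zone(struct zone *z, uint64_t lba)
  {
          printf("lba %llu -> zone %u\n",
                 (unsigned long long)lba, z->zone_number);
  }

  int main(void)
  {
          unsigned int i;
          uint64_t lba;

          for (i = 0; i < ZONE_COUNT; i++)
                  zones[i].zone_number = i;
          for (lba = 0; lba < 6; lba++)
                  enqueue_for_zone(zone_for_lba(lba), lba);
          return 0;
  }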

The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data, and in general
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones. Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this happens
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision where two different blocks
have the same name. In practice, this is extremely unlikely, but because
vdo does not use a cryptographic hash, a malicious workload could be
constructed. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.

Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain
available in the index, while unreferenced block names are forgotten over
time.

In order to find records in older chapters, the index maintains a higher
level structure called the volume index, which contains entries mapping
each block name to the chapter containing its newest record. This mapping
is updated as records for the block name are copied or updated, ensuring
that only the newest record for a given block name can be found. An older
record for a block name will no longer be found even though it has not been
deleted from its chapter. Like the chapter index, the volume index uses
only a subset of the block name as its key and cannot definitively say that
a record exists for a name. It can only say which chapter would contain the
record if a record does exist. The volume index is stored entirely in
memory and is saved to storage only when the vdo target is shut down.

From the viewpoint of a request for a particular block name, it will first
look up the name in the volume index. This search will either indicate that
the name is new, or which chapter to search. If it returns a chapter, the
request looks up its name in the chapter index. This will indicate either
that the name is new, or which record page to search. Finally, if it is not
new, the request will look for its name in the indicated record page. This
process may require up to two page reads per request (one for the chapter
index page and one for the record page). However, recently accessed pages
are cached so that these page reads can be amortized across many block name
requests.

The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire block name (the key) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed using a Huffman code to take up even less space.
The entire sorted list of keys is called a delta list. This structure
allows the index to use many fewer bytes per entry than a traditional hash
table, but it is slightly more expensive to look up entries, because a
request must read every entry in a delta list to add up the deltas in order
to find the record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a fixed key
value, so that each individual list is short.
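
The space/time trade-off of a delta list can be seen in the following toy
example. It is purely illustrative: the real delta index stores
Huffman-coded deltas of partial block names split across many short
sub-lists, none of which is shown here::

  /*
   * Toy delta list: not the UDS implementation, which stores
   * Huffman-coded deltas of partial block names across many sub-lists.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define N 5

  int main(void)
  {
          /* Keys must be sorted before they can be delta-encoded. */
          const uint32_t keys[N] = { 3, 10, 11, 25, 40 };
          uint32_t deltas[N];
          uint32_t target = 25, sum = 0;
          bool found = false;
          int i;

          /* Store each key as the difference from its predecessor. */
          deltas[0] = keys[0];
          for (i = 1; i < N; i++)
                  deltas[i] = keys[i] - keys[i - 1];

          /*
           * A lookup must sum deltas from the start of the list, which
           * is why each individual list is kept short.
           */
          for (i = 0; i < N; i++) {
                  sum += deltas[i];
                  if (sum >= target) {
                          found = (sum == target);
                          break;
                  }
          }
          printf("%u %s\n", target, found ? "found" : "not found");
          return 0;
  }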

The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have already been removed. This means that if an application
writes a 200 GB file to a vdo target and then immediately writes it again,
the two copies will deduplicate perfectly. Doing the same with a 500 GB
file will result in no deduplication, because the beginning of the file
will no longer be in the index by the time the second write begins
(assuming there is no duplication within the file itself).

If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.) There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However, with sparse
indexing the memory requirements do not increase. The trade-off is slightly
more computation per request and a slight decrease in the amount of
deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.
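
The relationship between index capacity and deduplication window is simple
arithmetic, as the short program below illustrates (it assumes "64 million"
means 64 * 2^20 records, one record per 4K block; the sizes are
approximate)::

  /* Rough sizing of the deduplication window from the figures above. */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t records = 64ULL << 20;  /* default index capacity */
          uint64_t block_size = 4096;      /* one record per 4K block */
          uint64_t window = records * block_size;

          /* The default window covers roughly 256GB of writes. */
          printf("default window: %llu GB\n",
                 (unsigned long long)(window >> 30));

          /*
           * Doubling the index memory doubles the window (and the index
           * storage); sparse indexing multiplies the window by 10 with
           * no increase in memory.
           */
          printf("doubled index:  %llu GB\n",
                 (unsigned long long)((2 * window) >> 30));
          printf("sparse index:   %llu GB\n",
                 (unsigned long long)((10 * window) >> 30));
          return 0;
  }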

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.

Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.

There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.

The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally each slab
has a journal.

Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests they do so in order to allow the main recovery journal (see below)
to free up space. The slab journal is used both to ensure that the main
recovery journal can regularly free up space, and also to amortize the cost
of updating individual reference blocks. The reference counters are kept in
memory and are written out, a block at a time in oldest-dirtied order, only
when there is a need to reclaim slab journal space. These write operations
are performed in the background as needed so they do not add latency to
particular I/O operations.

Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.

The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry per slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the 14 possible
slots in the compressed block contains the data for this logical address.

In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in round
robin fashion. (If there are L logical zones, tree n will belong to zone n
mod L.) At each level, the trees are interleaved, so logical addresses
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
on. The interleaving is maintained all the way up to the 60 root nodes.
Choosing 60 trees results in an evenly distributed number of trees per zone
for a large number of possible logical zone counts. The storage for the 60
tree roots is allocated at format time. All other block map pages are
allocated out of the slabs as needed. This flexible allocation avoids the
need to pre-allocate space for the entire set of logical mappings and also
makes growing the logical size of a vdo relatively easy.

In operation, the block map maintains two caches. It is prohibitively
expensive to keep the entire leaf level of the trees in memory, so each
logical zone maintains its own cache of leaf pages. The size of this cache
is configurable at target start time. The second cache is allocated at
start time, and is large enough to hold all the non-leaf pages of the
entire block map. This cache is populated as pages are needed.
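
A mapping entry can be pictured as a 36-bit physical block number plus a
4-bit state, 40 bits (5 bytes) in total. The sketch below packs and unpacks
such an entry purely to illustrate the arithmetic; it is not the on-disk
encoding used by vdo::

  /*
   * Illustrative packing of a block map entry: a 36-bit physical block
   * number plus a 4-bit state. This is not the on-disk layout used by
   * dm-vdo.
   */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define PBN_BITS   36
  #define STATE_BITS 4
  #define PBN_MASK   ((1ULL << PBN_BITS) - 1)
  #define STATE_MASK ((1ULL << STATE_BITS) - 1)

  static uint64_t pack_entry(uint64_t pbn, unsigned int state)
  {
          assert(pbn <= PBN_MASK && state <= STATE_MASK);
          return (pbn << STATE_BITS) | state;
  }

  static void unpack_entry(uint64_t entry, uint64_t *pbn,
                           unsigned int *state)
  {
          *state = entry & STATE_MASK;
          *pbn = (entry >> STATE_BITS) & PBN_MASK;
  }

  int main(void)
  {
          /*
           * One state means "unmapped", one means "uncompressed", and
           * the other 14 identify a slot in a compressed block.
           */
          uint64_t entry = pack_entry(123456789, 1);
          uint64_t pbn;
          unsigned int state;

          unpack_entry(entry, &pbn, &state);
          printf("pbn=%llu state=%u (40 bits, 5 bytes per entry)\n",
                 (unsigned long long)pbn, state);
          return 0;
  }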

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.

Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.
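
As a mental model only (the structure and field names below are invented
for this document, not taken from the driver or its on-disk format), a data
remapping intent record carries just enough information to replay the
mapping change after a crash::

  /* Invented for illustration; not the on-disk recovery journal format. */
  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t block_number_t;

  struct data_remapping {
          block_number_t logical;       /* logical address being written */
          block_number_t old_physical;  /* mapping being replaced, 0 if none */
          block_number_t new_physical;  /* where the new data now lives */
  };

  int main(void)
  {
          /*
           * A block map remapping looks the same, except that "logical"
           * names a block map page and old_physical is always 0, because
           * block map pages are never reclaimed or repurposed.
           */
          struct data_remapping entry = {
                  .logical = 1000,
                  .old_physical = 0,
                  .new_physical = 52,
          };

          printf("remap lba %llu: %llu -> %llu\n",
                 (unsigned long long)entry.logical,
                 (unsigned long long)entry.old_physical,
                 (unsigned long long)entry.new_physical);
          return 0;
  }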

*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application requires
data to be stable on storage, it must issue a flush or write the data with
the FUA bit set like any other asynchronous I/O. Shutting down the vdo
target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1.  A data_vio is obtained from the data_vio pool and associated with the
    application bio. If there are no data_vios available, the incoming bio
    will block until a data_vio is available. This provides back pressure
    to the application. The data_vio pool is protected by a spin lock.

    The newly acquired data_vio is reset and the bio's data is copied into
    the data_vio if it is a write and the data is not all zeroes. The data
    must be copied because the application bio can be acknowledged before
    the data_vio processing is complete, which means later processing steps
    will no longer have access to the application bio. The application bio
    may also be smaller than 4K, in which case the data_vio will have
    already read the underlying block and the data is copied over the
    relevant portion of the larger block.

2.  The data_vio places a claim (the "logical lock") on the logical address
    of the bio. It is vital to prevent simultaneous modifications of the
    same logical address, because deduplication involves sharing blocks.
    This claim is implemented as an entry in a hashtable where the key is
    the logical address and the value is a pointer to the data_vio
    currently handling that address. (A simplified sketch of this claim
    appears after this list.)

    If a data_vio looks in the hashtable and finds that another data_vio is
    already operating on that logical address, it waits until the previous
    operation finishes. It also sends a message to inform the current lock
    holder that it is waiting. Most notably, a new data_vio waiting for a
    logical lock will flush the previous data_vio out of the compression
    packer (step 8e) rather than allowing it to continue waiting to be
    packed.

    This stage requires the data_vio to get an implicit lock on the
    appropriate logical zone to prevent concurrent modifications of the
    hashtable. This implicit locking is handled by the zone divisions
    described above.

3.  The data_vio traverses the block map tree to ensure that all the
    necessary internal tree nodes have been allocated, by trying to find
    the leaf page for its logical address. If any interior tree page is
    missing, it is allocated at this time out of the same physical storage
    pool used to store application data.

    a. If any page-node in the tree has not yet been allocated, it must be
       allocated before the write can continue. This step requires the
       data_vio to lock the page-node that needs to be allocated. This
       lock, like the logical block lock in step 2, is a hashtable entry
       that causes other data_vios to wait for the allocation process to
       complete.

       The implicit logical zone lock is released while the allocation is
       happening, in order to allow other operations in the same logical
       zone to proceed. The details of allocation are the same as in
       step 4. Once a new node has been allocated, that node is added to
       the tree using a similar process to adding a new data block mapping.
       The data_vio journals the intent to add the new node to the block
       map tree (step 10), updates the reference count of the new block
       (step 11), and reacquires the implicit logical zone lock to add the
       new mapping to the parent tree node (step 12). Once the tree is
       updated, the data_vio proceeds down the tree. Any other data_vios
       waiting on this allocation also proceed.

    b. In the steady-state case, the block map tree nodes will already be
       allocated, so the data_vio just traverses the tree until it finds
       the required leaf node. The location of the mapping (the "block map
       slot") is recorded in the data_vio so that later steps do not need
       to traverse the tree again. The data_vio then releases the implicit
       logical zone lock.

4.  If the block is a zero block, skip to step 9. Otherwise, an attempt is
    made to allocate a free data block. This allocation ensures that the
    data_vio can write its data somewhere even if deduplication and
    compression are not possible. This stage gets an implicit lock on a
    physical zone to search for free space within it.

    The data_vio will search each slab in a zone until it finds a free
    block or decides there are none. If the first zone has no free space,
    it will proceed to search the next physical zone by taking the implicit
    lock for that zone and releasing the previous one, until it finds a
    free block or runs out of zones to search. The data_vio will acquire a
    struct pbn_lock (the "physical block lock") on the free block. The
    struct pbn_lock also has several fields to record the various kinds of
    claims that data_vios can have on physical blocks. The pbn_lock is
    added to a hashtable like the logical block locks in step 2. This
    hashtable is also covered by the implicit physical zone lock. The
    reference count of the free block is updated to prevent any other
    data_vio from considering it free. The reference counters are a
    sub-component of the slab and are thus also covered by the implicit
    physical zone lock.

5.  If an allocation was obtained, the data_vio has all the resources it
    needs to complete the write. The application bio can safely be
    acknowledged at this point. The acknowledgment happens on a separate
    thread to prevent the application callback from blocking other data_vio
    operations.

    If an allocation could not be obtained, the data_vio continues to
    attempt to deduplicate or compress the data, but the bio is not
    acknowledged because the vdo device may be out of space.

6.  At this point vdo must determine where to store the application data.
    The data_vio's data is hashed and the hash (the "record name") is
    recorded in the data_vio.

7.  The data_vio reserves or joins a struct hash_lock, which manages all of
    the data_vios currently writing the same data. Active hash locks are
    tracked in a hashtable similar to the way logical block locks are
    tracked in step 2. This hashtable is covered by the implicit lock on
    the hash zone.

    If there is no existing hash lock for this data_vio's record name, the
    data_vio obtains a hash lock from the pool, adds it to the hashtable,
    and sets itself as the new hash lock's "agent." The hash lock pool is
    also covered by the implicit hash zone lock. The agent will do all the
    work to decide where the application data will be written. If a hash
    lock for the data_vio's record name already exists, and the data_vio's
    data is the same as the agent's data, the new data_vio will wait for
    the agent to complete its work and then share its result.

    In the rare case that a hash lock exists for the data_vio's record name
    but the data does not match the hash lock's agent, the data_vio skips
    to step 8h and attempts to write its data directly. This can happen if
    two different data blocks produce the same hash, for example.

8.  The hash lock agent attempts to deduplicate or compress its data with
    the following steps.

    a. The agent initializes and sends its embedded deduplication request
       (struct uds_request) to the deduplication index. This does not
       require the data_vio to get any locks because the index components
       manage their own locking. The data_vio waits until it either gets a
       response from the index or times out.

    b. If the deduplication index returns advice, the data_vio attempts to
       obtain a physical block lock on the indicated physical address, in
       order to read the data and verify that it is the same as the
       data_vio's data, and that it can accept more references. If the
       physical address is already locked by another data_vio, the data at
       that address may soon be overwritten so it is not safe to use that
       address for deduplication.

    c. If the data matches and the physical block can add references, the
       agent and any other data_vios waiting on it will record this
       physical block as their new physical address and proceed to step 9
       to record their new mapping. If there are more data_vios in the hash
       lock than there are references available, one of the remaining
       data_vios becomes the new agent and continues to step 8d as if no
       valid advice was returned.

    d. If no usable duplicate block was found, the agent first checks that
       it has an allocated physical block (from step 4) that it can write
       to. If the agent does not have an allocation, some other data_vio in
       the hash lock that does have an allocation takes over as agent. If
       none of the data_vios have an allocated block, these writes are out
       of space, so they proceed to step 13 for cleanup.

    e. The agent attempts to compress its data. If the data does not
       compress, the data_vio will continue to step 8h to write its data
       directly.

       If the compressed size is small enough, the agent will release the
       implicit hash zone lock and go to the packer (struct packer) where
       it will be placed in a bin (struct packer_bin) along with other
       data_vios. All compression operations require the implicit lock on
       the packer zone.

       The packer can combine up to 14 compressed blocks in a single 4K
       data block. Compression is only helpful if vdo can pack at least 2
       data_vios into a single data block. This means that a data_vio may
       wait in the packer for an arbitrarily long time for other data_vios
       to fill out the compressed block. There is a mechanism for vdo to
       evict waiting data_vios when continuing to wait would cause
       problems. Circumstances causing an eviction include an application
       flush, device shutdown, or a subsequent data_vio trying to overwrite
       the same logical block address. A data_vio may also be evicted from
       the packer if it cannot be paired with any other compressed block
       before more compressible blocks need to use its bin. An evicted
       data_vio will proceed to step 8h to write its data directly.

    f. If the agent fills a packer bin, either because all 14 of its slots
       are used or because it has no remaining space, the bin is written
       out using the allocated physical block from one of its data_vios.
       Step 8d has already ensured that an allocation is available.

    g. Each data_vio sets the compressed block as its new physical address.
       The data_vio obtains an implicit lock on the physical zone and
       acquires the struct pbn_lock for the compressed block, which is
       modified to be a shared lock. Then it releases the implicit physical
       zone lock and proceeds to step 8i.

    h. Any data_vio evicted from the packer will have an allocation from
       step 4. It will write its data to that allocated physical block.

    i. After the data is written, if the data_vio is the agent of a hash
       lock, it will reacquire the implicit hash zone lock and share its
       physical address with as many other data_vios in the hash lock as
       possible. Each data_vio will then proceed to step 9 to record its
       new mapping.

    j. If the agent actually wrote new data (whether compressed or not),
       the deduplication index is updated to reflect the location of the
       new data. The agent then releases the implicit hash zone lock.

9.  The data_vio determines the previous mapping of the logical address.
    There is a cache for block map leaf pages (the "block map cache"),
    because there are usually too many block map leaf pages to store
    entirely in memory. If the desired leaf page is not in the cache, the
    data_vio will reserve a slot in the cache and load the desired page
    into it, possibly evicting an older cached page. The data_vio then
    finds the current physical address for this logical address (the "old
    physical mapping"), if any, and records it. This step requires a lock
    on the block map cache structures, covered by the implicit logical
    zone lock.

10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping. Making this journal entry requires holding the implicit
    recovery journal lock. The data_vio will wait in the journal until all
    recovery blocks up to the one containing its entry have been written
    and flushed to ensure the transaction is stable on storage.

11. Once the recovery journal entry is stable, the data_vio makes two slab
    journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping. These two operations each require
    holding a lock on the affected physical slab, covered by its implicit
    physical zone lock. For correctness during recovery, the slab journal
    entries in any given slab journal must be in the same order as the
    corresponding recovery journal entries. Therefore, if the two entries
    are in different zones, they are made concurrently, and if they are in
    the same zone, the increment is always made before the decrement in
    order to avoid underflow. After each slab journal entry is made in
    memory, the associated reference count is also updated in memory.

12. Once both of the reference count updates are done, the data_vio
    acquires the implicit logical zone lock and updates the
    logical-to-physical mapping in the block map to point to the new
    physical block. At this point the write operation is complete.

13. If the data_vio has a hash lock, it acquires the implicit hash zone
    lock and releases its hash lock to the pool.

    The data_vio then acquires the implicit physical zone lock and releases
    the struct pbn_lock it holds for its allocated block. If it had an
    allocation that it did not use, it also sets the reference count for
    that block back to zero to free it for use by subsequent data_vios.

    The data_vio then acquires the implicit logical zone lock and releases
    the logical block lock acquired in step 2.

    The application bio is then acknowledged if it has not previously been
    acknowledged, and the data_vio is returned to the pool.
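
The sketch promised in step 2 is below. It is a deliberately tiny model of
the logical lock claim: a fixed-size table keyed by logical address, where
an occupied slot means some data_vio already owns that address. The names
are invented; the real driver keeps one such structure per logical zone,
stores full entries rather than treating bucket collisions as contention,
and queues waiters rather than failing::

  /*
   * Simplified model of the "logical lock" claim from step 2. Invented
   * names; collision handling and waiter notification are omitted.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define TABLE_SIZE 64

  struct data_vio {
          uint64_t logical;  /* address this data_vio is handling */
  };

  /* One slot per bucket; NULL means no claim on that address. */
  static struct data_vio *lock_table[TABLE_SIZE];

  static unsigned int bucket(uint64_t logical)
  {
          return logical % TABLE_SIZE;
  }

  /*
   * Returns true if the claim was taken; false means another data_vio
   * already owns the address, so the caller must wait for it (and, in
   * vdo, notify the holder so it does not linger in the packer).
   */
  static bool claim_logical_lock(struct data_vio *vio)
  {
          unsigned int b = bucket(vio->logical);

          if (lock_table[b] != NULL)
                  return false;
          lock_table[b] = vio;
          return true;
  }

  static void release_logical_lock(struct data_vio *vio)
  {
          lock_table[bucket(vio->logical)] = NULL;
  }

  int main(void)
  {
          struct data_vio first = { .logical = 7 };
          struct data_vio second = { .logical = 7 };

          printf("first claim:  %d\n", claim_logical_lock(&first));
          printf("second claim: %d\n", claim_logical_lock(&second));
          release_logical_lock(&first);
          printf("retry:        %d\n", claim_logical_lock(&second));
          return 0;
  }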

*Read Path*

An application read bio follows a much simpler set of steps. It does steps
1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that logical
address that is guaranteed to complete, the read data_vio will copy the
data from the write data_vio and return it. Otherwise, it will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the indicated data at the indicated
physical block address. A read data_vio will not allocate block map tree
nodes if they are missing. If the interior block map nodes do not exist
yet, the logical block map address must still be unmapped and the read
data_vio will return all zeroes. A read data_vio handles cleanup and
acknowledgment as in step 13, although it only needs to release the logical
lock and return itself to the pool.

*Small Writes*

All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are nearly identical to the normal read and write
operations, and a single data_vio is used throughout this operation.
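
The modify step of that read-modify-write amounts to copying the new
sectors over the right offset of the 4K block, as the illustrative fragment
below shows (the function name and the simulated read are invented for this
example)::

  /*
   * Illustration of the read-modify-write for a sub-4K write. The
   * function name and the simulated read are invented for this example.
   */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define VDO_BLOCK_SIZE 4096
  #define SECTOR_SIZE    512

  /* Copy "count" sectors of new data over the right part of the block. */
  static void apply_partial_write(uint8_t block[VDO_BLOCK_SIZE],
                                  const uint8_t *data,
                                  unsigned int first_sector,
                                  unsigned int count)
  {
          memcpy(block + (size_t)first_sector * SECTOR_SIZE,
                 data, (size_t)count * SECTOR_SIZE);
  }

  int main(void)
  {
          uint8_t block[VDO_BLOCK_SIZE];
          uint8_t new_data[SECTOR_SIZE];

          /* 1. Read the existing 4K block (simulated here as all 'A'). */
          memset(block, 'A', sizeof(block));

          /* 2. Overwrite one 512-byte sector within the copy. */
          memset(new_data, 'B', sizeof(new_data));
          apply_partial_write(block, new_data, 2, 1);

          /*
           * 3. The whole modified block is then written back as a normal
           *    4K vdo write, using the same data_vio throughout.
           */
          printf("sector 2 now starts with '%c'\n",
                 block[2 * SECTOR_SIZE]);
          return 0;
  }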

*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the start, the recovery
journal is read. The increment portion of valid entries is played into the
block map. Next, valid entries are played into the slab journals. Finally,
each physical zone attempts to replay at least one slab journal to
reconstruct the reference counts of one slab. Once each zone has some free
space (or has determined that it has none), the vdo comes back online,
while the remainder of the slab journals are used to reconstruct the rest
of the reference counts in the background.

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to attempt to rebuild as much metadata as
possible in order to return to a writable state. However, this is never
done automatically, due to the possibility that data has been lost. During
a read-only rebuild, the block map is recovered from the recovery journal
as before, but the reference counts are not rebuilt from the slab journals.
Instead, the reference counts are zeroed, the entire block map is
traversed, and the reference counts are updated from the block mappings.
While this may lose some data, it ensures that the block map and the
reference counts are consistent with each other. This allows vdo to resume
normal operation and accept further writes.
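
The rebuild of the reference counts can be pictured as the small sketch
below. The names and the flat arrays are invented stand-ins for the real
structures, and 0 is used here to mean "unmapped"::

  /*
   * Sketch of the read-only rebuild idea: zero every reference count,
   * then walk the block map and count each referenced physical block.
   */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define PHYSICAL_BLOCKS 16
  #define LOGICAL_BLOCKS  8
  #define UNMAPPED        0

  int main(void)
  {
          /* Logical-to-physical mappings recovered from the block map. */
          uint64_t block_map[LOGICAL_BLOCKS] = {
                  3, 3, UNMAPPED, 7, 7, 7, UNMAPPED, 9,
          };
          uint8_t ref_counts[PHYSICAL_BLOCKS];
          int lbn;

          /* Discard the old reference counts entirely... */
          memset(ref_counts, 0, sizeof(ref_counts));

          /*
           * ...and rebuild them from the mappings, so the block map and
           * the reference counts are guaranteed to agree with each other.
           */
          for (lbn = 0; lbn < LOGICAL_BLOCKS; lbn++) {
                  if (block_map[lbn] != UNMAPPED)
                          ref_counts[block_map[lbn]]++;
          }

          printf("refs: pbn 3 = %d, pbn 7 = %d, pbn 9 = %d\n",
                 ref_counts[3], ref_counts[7], ref_counts[9]);
          return 0;
  }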