~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

TOMOYO Linux Cross Reference
Linux/Documentation/admin-guide/device-mapper/vdo-design.rst

Version: ~ [ linux-6.12-rc7 ] ~ [ linux-6.11.7 ] ~ [ linux-6.10.14 ] ~ [ linux-6.9.12 ] ~ [ linux-6.8.12 ] ~ [ linux-6.7.12 ] ~ [ linux-6.6.60 ] ~ [ linux-6.5.13 ] ~ [ linux-6.4.16 ] ~ [ linux-6.3.13 ] ~ [ linux-6.2.16 ] ~ [ linux-6.1.116 ] ~ [ linux-6.0.19 ] ~ [ linux-5.19.17 ] ~ [ linux-5.18.19 ] ~ [ linux-5.17.15 ] ~ [ linux-5.16.20 ] ~ [ linux-5.15.171 ] ~ [ linux-5.14.21 ] ~ [ linux-5.13.19 ] ~ [ linux-5.12.19 ] ~ [ linux-5.11.22 ] ~ [ linux-5.10.229 ] ~ [ linux-5.9.16 ] ~ [ linux-5.8.18 ] ~ [ linux-5.7.19 ] ~ [ linux-5.6.19 ] ~ [ linux-5.5.19 ] ~ [ linux-5.4.285 ] ~ [ linux-5.3.18 ] ~ [ linux-5.2.21 ] ~ [ linux-5.1.21 ] ~ [ linux-5.0.21 ] ~ [ linux-4.20.17 ] ~ [ linux-4.19.323 ] ~ [ linux-4.18.20 ] ~ [ linux-4.17.19 ] ~ [ linux-4.16.18 ] ~ [ linux-4.15.18 ] ~ [ linux-4.14.336 ] ~ [ linux-4.13.16 ] ~ [ linux-4.12.14 ] ~ [ linux-4.11.12 ] ~ [ linux-4.10.17 ] ~ [ linux-4.9.337 ] ~ [ linux-4.4.302 ] ~ [ linux-3.10.108 ] ~ [ linux-2.6.32.71 ] ~ [ linux-2.6.0 ] ~ [ linux-2.4.37.11 ] ~ [ unix-v6-master ] ~ [ ccs-tools-1.8.12 ] ~ [ policy-sample ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

Diff markup

Differences between /Documentation/admin-guide/device-mapper/vdo-design.rst (Version linux-6.12-rc7) and /Documentation/admin-guide/device-mapper/vdo-design.rst (Version linux-4.19.323)


  1 .. SPDX-License-Identifier: GPL-2.0-only          
  2                                                   
  3 ================                                  
  4 Design of dm-vdo                                  
  5 ================                                  
  6                                                   
  7 The dm-vdo (virtual data optimizer) target pro    
  8 compression, zero-block elimination, and thin     
  9 can be backed by up to 256TB of storage, and c    
 10 up to 4PB. This target was originally develope    
 11 Corp. starting in 2009. It was first released     
 12 production environments ever since. It was mad    
 13 Permabit was acquired by Red Hat. This documen    
 14 dm-vdo. For usage, see vdo.rst in the same dir    
 15                                                   
 16 Because deduplication rates fall drastically a    
 17 vdo target has a maximum block size of 4K. How    
 18 deduplication rates of 254:1, i.e. up to 254 c    
 19 reference a single 4K of actual storage. It ca    
 20 of 14:1. All zero blocks consume no storage at    
 21                                                   
 22 Theory of Operation                               
 23 ===================                               
 24                                                   
 25 The design of dm-vdo is based on the idea that    
 26 problem. The first is to recognize duplicate d    
 27 storing multiple copies of those duplicates. T    
 28 parts: a deduplication index (called UDS) that    
 29 duplicate data, and a data store with a refere    
 30 maps from logical block addresses to the actua    
 31 data.                                             
 32                                                   
 33 Zones and Threading                               
 34 -------------------                               
 35                                                   
 36 Due to the complexity of data optimization, th    
 37 structures involved in a single write operatio    
 38 than most other targets. Furthermore, because     
 39 block sizes in order to achieve good deduplica    
 40 performance can only be achieved through paral    
 41 design attempts to be lock-free.                  
 42                                                   
 43 Most of a vdo's main data structures are desig    
 44 "zones" such that any given bio must only acce    
 45 structure. Safety with minimal locking is achi    
 46 normal operation, each zone is assigned to a s    
 47 thread will access the portion of the data str    
 48 Associated with each thread is a work queue. E    
 49 request object (the "data_vio") which will be     
 50 the next phase of its operation requires acces    
 51 zone associated with that queue.                  
 52                                                   
 53 Another way of thinking about this arrangement    
 54 each zone has an implicit lock on the structur    
 55 operations, because vdo guarantees that no oth    
 56 structures.                                       
 57                                                   
 58 Although each structure is divided into zones,    
 59 reflected in the on-disk representation of eac    
 60 the number of zones for each structure, and he    
 61 can be reconfigured each time a vdo target is     
 62                                                   
 63 The Deduplication Index                           
 64 -----------------------                           
 65                                                   
 66 In order to identify duplicate data efficientl    
 67 leverage some common characteristics of duplic    
 68 observations, we gathered two key insights. Th    
 69 sets with significant amounts of duplicate dat    
 70 have temporal locality. When a duplicate appea    
 71 other duplicates will be detected, and that th    
 72 written at about the same time. This is why th    
 73 temporal order. The second insight is that new    
 74 duplicate recent data than it is to duplicate     
 75 there are diminishing returns to looking furth    
 76 when the index is full, it should cull its old    
 77 new ones. Another important idea behind the de    
 78 ultimate goal of deduplication is to reduce st    
 79 trade-off between the storage saved and the re    
 80 those savings, vdo does not attempt to find ev    
 81 is sufficient to find and eliminate most of th    
 82                                                   
 83 Each block of data is hashed to produce a 16-b    
 84 record consists of this block name paired with    
 85 that data on the underlying storage. However,     
 86 guarantee that the index is accurate. In the m    
 87 because it is too costly to update the index w    
 88 or discarded. Doing so would require either st    
 89 with the blocks, which is difficult to do effi    
 90 storage, or reading and rehashing each block b    
 91 Inaccuracy can also result from a hash collisi    
 92 have the same name. In practice, this is extre    
 93 vdo does not use a cryptographic hash, a malic    
 94 constructed. Because of these inaccuracies, vd    
 95 index as hints, and reads each indicated block    
 96 a duplicate before sharing the existing block     
 97                                                   
 98 Records are collected into groups called chapt    
 99 the newest chapter, called the open chapter. T    
100 format optimized for adding and modifying reco    
101 open chapter is not finalized until it runs ou    
102 When the open chapter fills up, it is closed a    
103 created to collect new records.                   
104                                                   
105 Closing a chapter converts it to a different f    
106 reading. The records are written to a series o    
107 order in which they were received. This means     
108 locality should be on a small number of pages,    
109 retrieve them. The chapter also compiles an in    
110 record page contains any given name. This inde    
111 name can determine exactly which record page m    
112 without having to load the entire chapter from    
113 only a subset of the block name as its key, so    
114 index entry refers to the desired block name.     
115 there is a record for this name, it will be on    
116 chapters are read-only structures and their co    
117 any way.                                          
118                                                   
119 Once enough records have been written to fill     
120 space, the oldest chapter is removed to make s    
121 time a request finds a matching record in the     
122 into the open chapter. This ensures that usefu    
123 in the index, while unreferenced block names a    
124                                                   
125 In order to find records in older chapters, th    
126 higher level structure called the volume index    
127 mapping each block name to the chapter contain    
128 mapping is updated as records for the block na    
129 ensuring that only the newest record for a giv    
130 An older record for a block name will no longe    
131 not been deleted from its chapter. Like the ch    
132 uses only a subset of the block name as its ke    
133 say that a record exists for a name. It can on    
134 contain the record if a record exists. The vol    
135 in memory and is saved to storage only when th    
136                                                   
137 From the viewpoint of a request for a particul    
138 look up the name in the volume index. This sea    
139 the name is new, or which chapter to search. I    
140 request looks up its name in the chapter index    
141 that the name is new, or which record page to     
142 new, the request will look for its name in the    
143 This process may require up to two page reads     
144 chapter index page and one for the request pag    
145 accessed pages are cached so that these page r    
146 many block name requests.                         
147                                                   
148 The volume index and the chapter indexes are i    
149 memory-efficient structure called a delta inde    
150 entire block name (the key) for each entry, th    
151 and only the difference between adjacent keys     
152 Because we expect the hashes to be randomly di    
153 deltas follows an exponential distribution. Be    
154 the deltas are expressed using a Huffman code     
155 The entire sorted list of keys is called a del    
156 allows the index to use many fewer bytes per e    
157 table, but it is slightly more expensive to lo    
158 request must read every entry in a delta list     
159 to find the record it needs. The delta index r    
160 splitting its key space into many sub-lists, e    
161 value, so that each individual list is short.     
162                                                   
163 The default index size can hold 64 million rec    
164 256GB of data. This means that the index can i    
165 original data was written within the last 256G    
166 called the deduplication window. If new writes    
167 than that, the index will not be able to find     
168 older data have been removed. This means that     
169 200 GB file to a vdo target and then immediate    
170 copies will deduplicate perfectly. Doing the s    
171 result in no deduplication, because the beginn    
172 longer be in the index by the time the second     
173 is no duplication within the file itself).        
174                                                   
175 If an application anticipates a data workload     
176 deduplication beyond the 256GB threshold, vdo     
177 larger index with a correspondingly larger ded    
178 configuration can only be set when the target     
179 later. It is important to consider the expecte    
180 before configuring it.)  There are two ways to    
181                                                   
182 One way is to increase the memory size of the     
183 the amount of backing storage required. Doubli    
184 double the length of the deduplication window     
185 the storage size and the memory requirements.     
186                                                   
187 The other option is to enable sparse indexing.    
188 the deduplication window by a factor of 10, at    
189 increasing the storage size by a factor of 10.    
190 indexing, the memory requirements do not incre    
191 slightly more computation per request and a sl    
192 of deduplication detected. For most workloads     
193 duplicate data, sparse indexing will detect 97    
194 that a standard index will detect.                
195                                                   
196 The vio and data_vio Structures                   
197 -------------------------------                   
198                                                   
199 A vio (short for Vdo I/O) is conceptually simi    
200 fields and data to track vdo-specific informat    
201 pointer to a bio but also tracks other fields     
202 vdo. The vio is kept separate from its related    
203 circumstances where vdo completes the bio but     
204 related to deduplication or compression.          
205                                                   
206 Metadata reads and writes, and other writes th    
207 a struct vio directly. Application reads and w    
208 called a data_vio to track information about t    
209 data_vio contain a struct vio and also include    
210 related to deduplication and other vdo feature    
211 primary unit of application work in vdo. Each     
212 set of steps to handle the application data, a    
213 returned to a pool of data_vios for reuse.        
214                                                   
215 There is a fixed pool of 2048 data_vios. This     
216 the amount of work that is required to recover    
217 benchmarks have indicated that increasing the     
218 significantly improve performance.                
219                                                   
220 The Data Store                                    
221 --------------                                    
222                                                   
223 The data store is implemented by three main da    
224 work in concert to reduce or amortize metadata    
225 writes as possible.                               
226                                                   
227 *The Slab Depot*                                  
228                                                   
229 Most of the vdo volume belongs to the slab dep    
230 collection of slabs. The slabs can be up to 32    
231 three sections. Most of a slab consists of a l    
232 These blocks are used either to store data, or    
233 block map (see below). In addition to the data    
234 of reference counters, using 1 byte for each d    
235 has a journal.                                    
236                                                   
237 Reference updates are written to the slab jour    
238 written out either when they are full, or when    
239 requests they do so in order to allow the main    
240 to free up space. The slab journal is used bot    
241 recovery journal can regularly free up space,     
242 of updating individual reference blocks. The r    
243 memory and are written out, a block at a time     
244 when there is a need to reclaim slab journal s    
245 are performed in the background as needed so t    
246 particular I/O operations.                        
247                                                   
248 Each slab is independent of every other. They     
249 zones" in round-robin fashion. If there are P     
250 is assigned to zone n mod P.                      
251                                                   
252 The slab depot maintains an additional small d    
253 summary," which is used to reduce the amount o    
254 online after a crash. The slab summary maintai    
255 indicating whether or not the slab has ever be    
256 reference count updates have been persisted to    
257 how full it is. During recovery, each physical    
258 at least one slab, stopping whenever it has re    
259 free blocks. Once each zone has some space, or    
260 available, the target can resume normal operat    
261 and write requests can be serviced, perhaps wi    
262 while the remainder of the dirty slabs are rec    
263                                                   
264 *The Block Map*                                   
265                                                   
266 The block map contains the logical to physical    
267 of as an array with one entry per logical addr    
268 36 bits of which contain the physical block nu    
269 the given logical address. The other 4 bits ar    
270 of the mapping. Of the 16 possible states, one    
271 which is unmapped (i.e. it has never been writ    
272 one represents an uncompressed block, and the     
273 indicate that the mapped data is compressed, a    
274 slots in the compressed block contains the dat    
275                                                   
276 In practice, the array of mapping entries is d    
277 pages," each of which fits in a single 4K bloc    
278 consists of a header and 812 mapping entries.     
279 a leaf of a radix tree which consists of block    
280 There are 60 radix trees which are assigned to    
281 robin fashion. (If there are L logical zones,     
282 mod L.) At each level, the trees are interleav    
283 0-811 belong to tree 0, logical addresses 812-    
284 on. The interleaving is maintained all the way    
285 Choosing 60 trees results in an evenly distrib    
286 for a large number of possible logical zone co    
287 tree roots is allocated at format time. All ot    
288 allocated out of the slabs as needed. This fle    
289 need to pre-allocate space for the entire set     
290 makes growing the logical size of a vdo relati    
291                                                   
292 In operation, the block map maintains two cach    
293 the entire leaf level of the trees in memory,     
294 maintains its own cache of leaf pages. The siz    
295 configurable at target start time. The second     
296 time, and is large enough to hold all the non-    
297 block map. This cache is populated as pages ar    
298                                                   
299 *The Recovery Journal*                            
300                                                   
301 The recovery journal is used to amortize updat    
302 slab depot. Each write request causes an entry    
303 Entries are either "data remappings" or "block    
304 remapping, the journal records the logical add    
305 new physical mappings. For a block map remappi    
306 block map page number and the physical block a    
307 pages are never reclaimed or repurposed, so th    
308                                                   
309 Each journal entry is an intent record summari    
310 that are required for a data_vio. The recovery    
311 before each journal block write to ensure that    
312 new block mappings in that block are stable on    
313 writes are all issued with the FUA bit set to     
314 entries themselves are stable. The journal ent    
315 represents must be stable on disk before the o    
316 be updated to reflect the operation. These ent    
317 reconstruct the logical to physical mappings a    
318 interruption such as a loss of power.             
319                                                   
320 *Write Path*                                      
321                                                   
322 All write I/O to vdo is asynchronous. Each bio    
323 as vdo has done enough work to guarantee that     
324 eventually. Generally, the data for acknowledg    
325 can be treated as though it is cached in memor    
326 requires data to be stable on storage, it must    
327 data with the FUA bit set like any other async    
328 the vdo target will also flush any remaining I    
329                                                   
330 Application write bios follow the steps outlin    
331                                                   
332 1.  A data_vio is obtained from the data_vio p    
333     application bio. If there are no data_vios    
334     will block until a data_vio is available.     
335     to the application. The data_vio pool is p    
336                                                   
337     The newly acquired data_vio is reset and t    
338     the data_vio if it is a write and the data    
339     must be copied because the application bio    
340     the data_vio processing is complete, which    
341     will no longer have access to the applicat    
342     may also be smaller than 4K, in which case    
343     already read the underlying block and the     
344     the relevant portion of the larger block.     
345                                                   
346 2.  The data_vio places a claim (the "logical     
347     of the bio. It is vital to prevent simulta    
348     same logical address, because deduplicatio    
349     This claim is implemented as an entry in a    
350     the logical address and the value is a poi    
351     currently handling that address.              
352                                                   
353     If a data_vio looks in the hashtable and f    
354     already operating on that logical address,    
355     operation finishes. It also sends a messag    
356     lock holder that it is waiting. Most notab    
357     for a logical lock will flush the previous    
358     compression packer (step 8d) rather than a    
359     waiting to be packed.                         
360                                                   
361     This stage requires the data_vio to get an    
362     appropriate logical zone to prevent concur    
363     hashtable. This implicit locking is handle    
364     described above.                              
365                                                   
366 3.  The data_vio traverses the block map tree     
367     necessary internal tree nodes have been al    
368     the leaf page for its logical address. If     
369     missing, it is allocated at this time out     
370     pool used to store application data.          
371                                                   
372     a. If any page-node in the tree has not ye    
373        allocated before the write can continue    
374        data_vio to lock the page-node that nee    
375        lock, like the logical block lock in st    
376        that causes other data_vios to wait for    
377        complete.                                  
378                                                   
379        The implicit logical zone lock is relea    
380        happening, in order to allow other oper    
381        zone to proceed. The details of allocat    
382        step 4. Once a new node has been alloca    
383        the tree using a similar process to add    
384        The data_vio journals the intent to add    
385        map tree (step 10), updates the referen    
386        (step 11), and reacquires the implicit     
387        new mapping to the parent tree node (st    
388        updated, the data_vio proceeds down the    
389        waiting on this allocation also proceed    
390                                                   
391     b. In the steady-state case, the block map    
392        allocated, so the data_vio just travers    
393        the required leaf node. The location of    
394        slot") is recorded in the data_vio so t    
395        to traverse the tree again. The data_vi    
396        logical zone lock.                         
397                                                   
398 4.  If the block is a zero block, skip to step    
399     made to allocate a free data block. This a    
400     data_vio can write its data somewhere even    
401     compression are not possible. This stage g    
402     physical zone to search for free space wit    
403                                                   
404     The data_vio will search each slab in a zo    
405     block or decides there are none. If the fi    
406     it will proceed to search the next physica    
407     lock for that zone and releasing the previ    
408     free block or runs out of zones to search.    
409     struct pbn_lock (the "physical block lock"    
410     struct pbn_lock also has several fields to    
411     claims that data_vios can have on physical    
412     added to a hashtable like the logical bloc    
413     hashtable is also covered by the implicit     
414     reference count of the free block is updat    
415     data_vio from considering it free. The ref    
416     sub-component of the slab and are thus als    
417     physical zone lock.                           
418                                                   
419 5.  If an allocation was obtained, the data_vi    
420     needs to complete the write. The applicati    
421     acknowledged at this point. The acknowledg    
422     thread to prevent the application callback    
423     operations.                                   
424                                                   
425     If an allocation could not be obtained, th    
426     attempt to deduplicate or compress the dat    
427     acknowledged because the vdo device may be    
428                                                   
429 6.  At this point vdo must determine where to     
430     The data_vio's data is hashed and the hash    
431     recorded in the data_vio.                     
432                                                   
433 7.  The data_vio reserves or joins a struct ha    
434     the data_vios currently writing the same d    
435     tracked in a hashtable similar to the way     
436     tracked in step 2. This hashtable is cover    
437     the hash zone.                                
438                                                   
439     If there is no existing hash lock for this    
440     data_vio obtains a hash lock from the pool    
441     and sets itself as the new hash lock's "ag    
442     also covered by the implicit hash zone loc    
443     do all the work to decide where the applic    
444     written. If a hash lock for the data_vio's    
445     and the data_vio's data is the same as the    
446     data_vio will wait for the agent to comple    
447     its result.                                   
448                                                   
449     In the rare case that a hash lock exists f    
450     the data does not match the hash lock's ag    
451     step 8h and attempts to write its data dir    
452     different data blocks produce the same has    
453                                                   
454 8.  The hash lock agent attempts to deduplicat    
455     the following steps.                          
456                                                   
457     a. The agent initializes and sends its emb    
458        (struct uds_request) to the deduplicati    
459        require the data_vio to get any locks b    
460        manage their own locking. The data_vio     
461        response from the index or times out.      
462                                                   
463     b. If the deduplication index returns advi    
464        obtain a physical block lock on the ind    
465        order to read the data and verify that     
466        data_vio's data, and that it can accept    
467        physical address is already locked by a    
468        that address may soon be overwritten so    
469        address for deduplication.                 
470                                                   
471     c. If the data matches and the physical bl    
472        agent and any other data_vios waiting o    
473        physical block as their new physical ad    
474        to record their new mapping. If there a    
475        lock than there are references availabl    
476        data_vios becomes the new agent and con    
477        valid advice was returned.                 
478                                                   
479     d. If no usable duplicate block was found,    
480        it has an allocated physical block (fro    
481        to. If the agent does not have an alloc    
482        the hash lock that does have an allocat    
483        none of the data_vios have an allocated    
484        are out of space, so they proceed to st    
485                                                   
486     e. The agent attempts to compress its data    
487        compress, the data_vio will continue to    
488        directly.                                  
489                                                   
490        If the compressed size is small enough,    
491        implicit hash zone lock and go to the p    
492        it will be placed in a bin (struct pack    
493        data_vios. All compression operations r    
494        the packer zone.                           
495                                                   
496        The packer can combine up to 14 compres    
497        data block. Compression is only helpful    
498        data_vios into a single data block. Thi    
499        wait in the packer for an arbitrarily l    
500        to fill out the compressed block. There    
501        evict waiting data_vios when continuing    
502        problems. Circumstances causing an evic    
503        flush, device shutdown, or a subsequent    
504        the same logical block address. A data_    
505        the packer if it cannot be paired with     
506        before more compressible blocks need to    
507        data_vio will proceed to step 8h to wri    
508                                                   
509     f. If the agent fills a packer bin, either    
510        are used or because it has no remaining    
511        using the allocated physical block from    
512        8d has already ensured that an allocati    
513                                                   
514     g. Each data_vio sets the compressed block    
515        The data_vio obtains an implicit lock o    
516        acquires the struct pbn_lock for the co    
517        modified to be a shared lock. Then it r    
518        zone lock and proceeds to step 8i.         
519                                                   
520     h. Any data_vio evicted from the packer wi    
521        step 3. It will write its data to that     
522                                                   
523     i. After the data is written, if the data_    
524        lock, it will reacquire the implicit ha    
525        physical address with as many other dat    
526        possible. Each data_vio will then proce    
527        new mapping.                               
528                                                   
529     j. If the agent actually wrote new data (w    
530        the deduplication index is updated to r    
531        new data. The agent then releases the i    
532                                                   
533 9.  The data_vio determines the previous mappi    
534     There is a cache for block map leaf pages     
535     because there are usually too many block m    
536     entirely in memory. If the desired leaf pa    
537     data_vio will reserve a slot in the cache     
538     into it, possibly evicting an older cached    
539     finds the current physical address for thi    
540     physical mapping"), if any, and records it    
541     on the block map cache structures, covered    
542     lock.                                         
543                                                   
544 10. The data_vio makes an entry in the recover    
545     logical block address, the old physical ma    
546     mapping. Making this journal entry require    
547     recovery journal lock. The data_vio will w    
548     recovery blocks up to the one containing i    
549     and flushed to ensure the transaction is s    
550                                                   
551 11. Once the recovery journal entry is stable,    
552     journal entries: an increment entry for th    
553     decrement entry for the old mapping. These    
554     holding a lock on the affected physical sl    
555     physical zone lock. For correctness during    
556     entries in any given slab journal must be     
557     corresponding recovery journal entries. Th    
558     are in different zones, they are made conc    
559     the same zone, the increment is always mad    
560     order to avoid underflow. After each slab     
561     memory, the associated reference count is     
562                                                   
563 12. Once both of the reference count updates a    
564     acquires the implicit logical zone lock an    
565     logical-to-physical mapping in the block m    
566     physical block. At this point the write op    
567                                                   
568 13. If the data_vio has a hash lock, it acquir    
569     lock and releases its hash lock to the poo    
570                                                   
571     The data_vio then acquires the implicit ph    
572     the struct pbn_lock it holds for its alloc    
573     allocation that it did not use, it also se    
574     that block back to zero to free it for use    
575                                                   
576     The data_vio then acquires the implicit lo    
577     the logical block lock acquired in step 2.    
578                                                   
579     The application bio is then acknowledged i    
580     acknowledged, and the data_vio is returned    
581                                                   
582 *Read Path*                                       
583                                                   
584 An application read bio follows a much simpler    
585 1 and 2 in the write path to obtain a data_vio    
586 address. If there is already a write data_vio     
587 address that is guaranteed to complete, the re    
588 data from the write data_vio and return it. Ot    
589 logical-to-physical mapping by traversing the     
590 and then read and possibly decompress the indi    
591 physical block address. A read data_vio will n    
592 nodes if they are missing. If the interior blo    
593 yet, the logical block map address must still     
594 data_vio will return all zeroes. A read data_v    
595 acknowledgment as in step 13, although it only    
596 lock and return itself to the pool.               
597                                                   
598 *Small Writes*                                    
599                                                   
600 All storage within vdo is managed as 4KB block    
601 as small as 512 bytes. Processing a write that    
602 a read-modify-write operation that reads the r    
603 new data over the approriate sectors of the bl    
604 write operation for the modified data block. T    
605 this operation are nearly identical to the nor    
606 operations, and a single data_vio is used thro    
607                                                   
608 *Recovery*                                        
609                                                   
610 When a vdo is restarted after a crash, it will    
611 recovery journal. During the pre-resume phase     
612 recovery journal is read. The increment portio    
613 into the block map. Next, valid entries are pl    
614 into the slab journals. Finally, each physical    
615 least one slab journal to reconstruct the refe    
616 Once each zone has some free space (or has det    
617 the vdo comes back online, while the remainder    
618 used to reconstruct the rest of the reference     
619                                                   
620 *Read-only Rebuild*                               
621                                                   
622 If a vdo encounters an unrecoverable error, it    
623 This mode indicates that some previously ackno    
624 lost. The vdo may be instructed to rebuild as     
625 return to a writable state. However, this is n    
626 to the possibility that data has been lost. Du    
627 block map is recovered from the recovery journ    
628 reference counts are not rebuilt from the slab    
629 reference counts are zeroed, the entire block     
630 reference counts are updated from the block ma    
631 some data, it ensures that the block map and r    
632 consistent with each other. This allows vdo to    
633 accept further writes.                            
                                                      

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~

kernel.org | git.kernel.org | LWN.net | Project Home | SVN repository | Mail admin

Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.

sflogo.php