.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression rates
of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.

Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.

Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.

Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.

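A stripped-down userspace sketch makes the pattern concrete. Nothing below
is vdo's actual code: the zone count, the routing rule, the queue, and all
of the names are invented for illustration. The point it demonstrates is
the one above: only the handoff into a zone's queue is synchronized, while
the zone's own state is touched by exactly one thread and therefore needs
no lock::

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NUM_ZONES 4
  #define STOP ((unsigned long)-1)

  struct work_item {
      struct work_item *next;
      unsigned long lbn;           /* the block this phase operates on */
  };

  struct zone {
      pthread_mutex_t queue_lock;  /* protects only the queue handoff */
      pthread_cond_t wake;
      struct work_item *head, *tail;
      unsigned long updates;       /* zone state, touched only by its owner */
  };

  static struct zone zones[NUM_ZONES];

  static void enqueue(struct zone *zone, unsigned long lbn)
  {
      struct work_item *item = calloc(1, sizeof(*item));

      item->lbn = lbn;
      pthread_mutex_lock(&zone->queue_lock);
      if (zone->tail)
          zone->tail->next = item;
      else
          zone->head = item;
      zone->tail = item;
      pthread_cond_signal(&zone->wake);
      pthread_mutex_unlock(&zone->queue_lock);
  }

  static void *zone_thread(void *arg)
  {
      struct zone *zone = arg;

      for (;;) {
          pthread_mutex_lock(&zone->queue_lock);
          while (zone->head == NULL)
              pthread_cond_wait(&zone->wake, &zone->queue_lock);
          struct work_item *item = zone->head;
          zone->head = item->next;
          if (zone->head == NULL)
              zone->tail = NULL;
          pthread_mutex_unlock(&zone->queue_lock);

          unsigned long lbn = item->lbn;
          free(item);
          if (lbn == STOP)
              return NULL;
          zone->updates++;         /* no lock: this thread owns the zone */
      }
  }

  int main(void)
  {
      pthread_t threads[NUM_ZONES];

      for (int i = 0; i < NUM_ZONES; i++) {
          pthread_mutex_init(&zones[i].queue_lock, NULL);
          pthread_cond_init(&zones[i].wake, NULL);
          pthread_create(&threads[i], NULL, zone_thread, &zones[i]);
      }
      for (unsigned long lbn = 0; lbn < 1000; lbn++)
          enqueue(&zones[lbn % NUM_ZONES], lbn);  /* made-up routing rule */
      for (int i = 0; i < NUM_ZONES; i++) {
          enqueue(&zones[i], STOP);
          pthread_join(threads[i], NULL);
          printf("zone %d handled %lu blocks\n", i, zones[i].updates);
      }
      return 0;
  }
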
The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones. Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this occurs
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision where two different blocks
have the same name. In practice, this is extremely unlikely, but because
vdo does not use a cryptographic hash, a malicious workload could be
constructed. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.

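The record itself is conceptually tiny. The sketch below shows that shape
with invented field names (it is not the actual UDS record layout), along
with the verification step that turns a hint into a confirmed duplicate::

  #include <stdint.h>
  #include <string.h>

  struct block_name {              /* 16-byte hash of a 4K block's contents */
      uint8_t bytes[16];
  };

  struct index_record {
      struct block_name name;      /* the key                               */
      uint64_t presumed_pbn;       /* hint: where that data was last seen   */
  };

  /* The hint must be verified: read the 4K block at presumed_pbn and only
   * share it if its bytes really match the data being written. */
  static int is_true_duplicate(const uint8_t *new_data,
                               const uint8_t *old_data)
  {
      return memcmp(new_data, old_data, 4096) == 0;
  }
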
Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten over time.

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name can be found.
An older record for a block name will no longer be found even though it has
not been deleted from its chapter. Like the chapter index, the volume index
uses only a subset of the block name as its key and can not definitively
say that a record exists for a name. It can only say which chapter would
contain the record if a record exists. The volume index is stored entirely
in memory and is saved to storage only when the vdo target is shut down.

From the viewpoint of a request for a particular block name, it will first
look up the name in the volume index. This search will either indicate that
the name is new, or which chapter to search. If it returns a chapter, the
request looks up its name in the chapter index. This will indicate either
that the name is new, or which record page to search. Finally, if it is not
new, the request will look for its name in the indicated record page.
This process may require up to two page reads per request (one for the
chapter index page and one for the record page). However, recently
accessed pages are cached so that these page reads can be amortized across
many block name requests.

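That three-step search can be sketched as a single function. The helpers
below are stubs standing in for the volume index, chapter index, and record
pages described above; their names and signatures are invented for this
illustration::

  #include <stdbool.h>
  #include <stdint.h>

  /* Stubs standing in for the structures described above; they are not
   * the real UDS interfaces. */
  static bool volume_index_lookup(const uint8_t name[16], unsigned *chapter)
  {
      (void)name; (void)chapter;
      return false;                      /* pretend every name is new */
  }

  static bool chapter_index_lookup(unsigned chapter, const uint8_t name[16],
                                   unsigned *record_page)
  {
      (void)chapter; (void)name; (void)record_page;
      return false;
  }

  static bool record_page_search(unsigned chapter, unsigned record_page,
                                 const uint8_t name[16], uint64_t *pbn_hint)
  {
      (void)chapter; (void)record_page; (void)name; (void)pbn_hint;
      return false;
  }

  /* Returns true (and a location hint) if a record exists; false means
   * the block name is new to the index. */
  static bool index_lookup(const uint8_t name[16], uint64_t *pbn_hint)
  {
      unsigned chapter, record_page;

      if (!volume_index_lookup(name, &chapter))     /* in memory, no I/O */
          return false;
      if (!chapter_index_lookup(chapter, name, &record_page))
          return false;                             /* after page read #1 */
      /* page read #2: only here is the full block name actually compared */
      return record_page_search(chapter, record_page, name, pbn_hint);
  }
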
The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire block name (the key) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed using a Huffman code to take up even less space.
The entire sorted list of keys is called a delta list. This structure
allows the index to use many fewer bytes per entry than a traditional hash
table, but it is slightly more expensive to look up entries, because a
request must read every entry in a delta list to add up the deltas in order
to find the record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a fixed key
value, so that each individual list is short.

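A toy delta list shows the trade-off. The real index also Huffman-codes the
deltas and uses only partial block names as keys, both of which are omitted
here, and all of the names below are invented::

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct delta_list {
      uint32_t deltas[16];   /* deltas[0] is the gap from the list's base */
      unsigned count;
  };

  static bool delta_list_contains(const struct delta_list *list,
                                  uint64_t base, uint64_t key)
  {
      uint64_t current = base;

      /* A lookup walks the list, summing gaps, until it reaches the key. */
      for (unsigned i = 0; i < list->count; i++) {
          current += list->deltas[i];
          if (current == key)
              return true;
          if (current > key)
              break;              /* keys are sorted, so we have passed it */
      }
      return false;
  }

  int main(void)
  {
      /* Keys 1000, 1007, and 1019, stored as gaps from a base of 0. */
      struct delta_list list = { .deltas = { 1000, 7, 12 }, .count = 3 };

      printf("%d %d\n", delta_list_contains(&list, 0, 1019),
             delta_list_contains(&list, 0, 1010));    /* prints "1 0" */
      return 0;
  }
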
The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. This means that if an application writes a
200 GB file to a vdo target and then immediately writes it again, the two
copies will deduplicate perfectly. Doing the same with a 500 GB file will
result in no deduplication, because the beginning of the file will no
longer be in the index by the time the second write begins (assuming there
is no duplication within the file itself).

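As a rough check on those figures, each index record covers one 4K block,
so::

  64 million records x 4 KB/record  ≈  256 GB of unique data written
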
If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.)  There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However with sparse
indexing, the memory requirements do not increase. The trade-off is
slightly more computation per request and a slight decrease in the amount
of deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.

Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.

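In outline, the relationship looks like the following. This is the
conceptual shape only; the real definitions in the dm-vdo source carry many
more fields than are shown here::

  struct bio;                     /* kernel block I/O descriptor */

  struct vio {
      struct bio *bio;            /* the bio this vio is servicing */
      /* ... vdo-specific bookkeeping common to all vdo I/O ... */
  };

  struct data_vio {
      struct vio vio;             /* embedded, so a data_vio is also a vio */
      /* ... logical address, record name, hash lock, allocation, ... */
  };
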
There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.

The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally each slab
has a journal.

Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests they do so in order to allow the main recovery journal (see below)
to free up space. The slab journal is used both to ensure that the main
recovery journal can regularly free up space, and also to amortize the cost
of updating individual reference blocks. The reference counters are kept in
memory and are written out, a block at a time in oldest-dirtied-order, only
when there is a need to reclaim slab journal space. The write operations
are performed in the background as needed so they do not add latency to
particular I/O operations.

Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.

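Both relationships are simple arithmetic, sketched below with invented
names. The reference-counter comment works out the overhead for a maximal
slab; the real total is slightly smaller because part of the slab holds the
counters and the slab journal themselves::

  #include <stdint.h>

  /* Slab n is served by physical zone n mod P. */
  static unsigned slab_to_physical_zone(uint64_t slab_number,
                                        unsigned physical_zones)
  {
      return slab_number % physical_zones;
  }

  /* One byte of reference count per data block: a maximal 32GB slab holds
   * at most about 8 million data blocks (32GB / 4K), so on the order of
   * 8MB of reference counters. */
  static uint64_t refcount_bytes(uint64_t data_blocks_in_slab)
  {
      return data_blocks_in_slab;
  }
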
The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block contains the data for this logical address.

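Packed into 5 bytes, an entry looks roughly like this. The bit layout and
the numeric state values below are one plausible arrangement chosen for the
sketch, not dm-vdo's actual on-disk encoding::

  #include <stdint.h>

  enum mapping_state {                /* numeric values are illustrative   */
      MAPPING_UNMAPPED = 0,           /* never written, or discarded       */
      MAPPING_UNCOMPRESSED = 1,       /* the PBN holds the data directly   */
      /* 2..15: compressed; selects one of the 14 compression slots */
  };

  struct block_map_entry {
      uint8_t raw[5];                 /* 36-bit PBN + 4-bit state = 40 bits */
  };

  static struct block_map_entry pack_entry(uint64_t pbn, unsigned state)
  {
      uint64_t bits = ((pbn & 0xFFFFFFFFFULL) << 4) | (state & 0xF);
      struct block_map_entry entry;

      for (int i = 0; i < 5; i++)
          entry.raw[i] = bits >> (8 * i);     /* little-endian sketch */
      return entry;
  }

  static void unpack_entry(const struct block_map_entry *entry,
                           uint64_t *pbn, unsigned *state)
  {
      uint64_t bits = 0;

      for (int i = 0; i < 5; i++)
          bits |= (uint64_t)entry->raw[i] << (8 * i);
      *state = bits & 0xF;
      *pbn = bits >> 4;
  }
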
In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in round
robin fashion. (If there are L logical zones, tree n will belong to zone n
mod L.) At each level, the trees are interleaved, so logical addresses
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
on. The interleaving is maintained all the way up to the 60 root nodes.
Choosing 60 trees results in an evenly distributed number of trees per zone
for a large number of possible logical zone counts. The storage for the 60
tree roots is allocated at format time. All other block map pages are
allocated out of the slabs as needed. This flexible allocation avoids the
need to pre-allocate space for the entire set of logical mappings and also
makes growing the logical size of a vdo relatively easy.

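Because of the interleaving, the tree, the leaf slot, and the owning
logical zone for any logical block number can be computed directly. The
helper below just restates those rules; the names are invented::

  #include <stdint.h>

  enum {
      ENTRIES_PER_BLOCK_MAP_PAGE = 812,
      BLOCK_MAP_TREE_COUNT = 60,
  };

  struct tree_slot {
      unsigned tree;           /* which of the 60 radix trees            */
      unsigned logical_zone;   /* the zone that owns that tree           */
      unsigned slot;           /* entry index within the leaf page       */
  };

  static struct tree_slot locate_mapping(uint64_t lbn, unsigned logical_zones)
  {
      uint64_t leaf_page = lbn / ENTRIES_PER_BLOCK_MAP_PAGE;
      unsigned tree = leaf_page % BLOCK_MAP_TREE_COUNT;

      return (struct tree_slot) {
          .tree = tree,
          .logical_zone = tree % logical_zones,
          .slot = lbn % ENTRIES_PER_BLOCK_MAP_PAGE,
      };
  }
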
In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as pages are needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.

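Conceptually, each entry carries just enough information to replay one
mapping change. The struct below sketches that content with invented names;
it is not the on-disk journal format::

  #include <stdint.h>

  enum journal_operation {
      DATA_REMAPPING,          /* a logical address now maps to new data  */
      BLOCK_MAP_REMAPPING,     /* a new block map tree page was allocated */
  };

  struct recovery_journal_entry {
      enum journal_operation operation;
      uint64_t target;         /* logical address, or block map page number */
      uint64_t old_physical;   /* always 0 for block map remappings         */
      uint64_t new_physical;
  };
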
Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.

*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application
requires data to be stable on storage, it must issue a flush or write the
data with the FUA bit set like any other asynchronous I/O. Shutting down
the vdo target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1.  A data_vio is obtained from the data_vio pool and associated with the
    application bio. If there are no data_vios available, the incoming bio
    will block until a data_vio is available. This provides back pressure
    to the application. The data_vio pool is protected by a spin lock.

    The newly acquired data_vio is reset and the bio's data is copied into
    the data_vio if it is a write and the data is not all zeroes. The data
    must be copied because the application bio can be acknowledged before
    the data_vio processing is complete, which means later processing steps
    will no longer have access to the application bio. The application bio
    may also be smaller than 4K, in which case the data_vio will have
    already read the underlying block and the data is instead copied over
    the relevant portion of the larger block.

2.  The data_vio places a claim (the "logical lock") on the logical address
    of the bio. It is vital to prevent simultaneous modifications of the
    same logical address, because deduplication involves sharing blocks.
    This claim is implemented as an entry in a hashtable where the key is
    the logical address and the value is a pointer to the data_vio
    currently handling that address.

    If a data_vio looks in the hashtable and finds that another data_vio is
    already operating on that logical address, it waits until the previous
    operation finishes. It also sends a message to inform the current
    lock holder that it is waiting. Most notably, a new data_vio waiting
    for a logical lock will flush the previous lock holder out of the
    compression packer (step 8d) rather than allowing it to continue
    waiting to be packed.

    This stage requires the data_vio to get an implicit lock on the
    appropriate logical zone to prevent concurrent modifications of the
    hashtable. This implicit locking is handled by the zone divisions
    described above.

3.  The data_vio traverses the block map tree to ensure that all the
    necessary internal tree nodes have been allocated, by trying to find
    the leaf page for its logical address. If any interior tree page is
    missing, it is allocated at this time out of the same physical storage
    pool used to store application data.

    a. If any page-node in the tree has not yet been allocated, it must be
       allocated before the write can continue. This step requires the
       data_vio to lock the page-node that needs to be allocated. This
       lock, like the logical block lock in step 2, is a hashtable entry
       that causes other data_vios to wait for the allocation process to
       complete.

       The implicit logical zone lock is released while the allocation is
       happening, in order to allow other operations in the same logical
       zone to proceed. The details of allocation are the same as in
       step 4. Once a new node has been allocated, that node is added to
       the tree using a similar process to adding a new data block mapping.
       The data_vio journals the intent to add the new node to the block
       map tree (step 10), updates the reference count of the new block
       (step 11), and reacquires the implicit logical zone lock to add the
       new mapping to the parent tree node (step 12). Once the tree is
       updated, the data_vio proceeds down the tree. Any other data_vios
       waiting on this allocation also proceed.

    b. In the steady-state case, the block map tree nodes will already be
       allocated, so the data_vio just traverses the tree until it finds
       the required leaf node. The location of the mapping (the "block map
       slot") is recorded in the data_vio so that later steps do not need
       to traverse the tree again. The data_vio then releases the implicit
       logical zone lock.

4.  If the block is a zero block, skip to step 9. Otherwise, an attempt is
    made to allocate a free data block. This allocation ensures that the
    data_vio can write its data somewhere even if deduplication and
    compression are not possible. This stage gets an implicit lock on a
    physical zone to search for free space within that zone.

    The data_vio will search each slab in a zone until it finds a free
    block or decides there are none. If the first zone has no free space,
    it will proceed to search the next physical zone by taking the implicit
    lock for that zone and releasing the previous one until it finds a
    free block or runs out of zones to search. The data_vio will acquire a
    struct pbn_lock (the "physical block lock") on the free block. The
    struct pbn_lock also has several fields to record the various kinds of
    claims that data_vios can have on physical blocks. The pbn_lock is
    added to a hashtable like the logical block locks in step 2. This
    hashtable is also covered by the implicit physical zone lock. The
    reference count of the free block is updated to prevent any other
    data_vio from considering it free. The reference counters are a
    sub-component of the slab and are thus also covered by the implicit
    physical zone lock.

5.  If an allocation was obtained, the data_vio has all the resources it
    needs to complete the write. The application bio can safely be
    acknowledged at this point. The acknowledgment happens on a separate
    thread to prevent the application callback from blocking other data_vio
    operations.

    If an allocation could not be obtained, the data_vio continues to
    attempt to deduplicate or compress the data, but the bio is not
    acknowledged because the vdo device may be out of space.

6.  At this point vdo must determine where to store the application data.
    The data_vio's data is hashed and the hash (the "record name") is
    recorded in the data_vio.

7.  The data_vio reserves or joins a struct hash_lock, which manages all of
    the data_vios currently writing the same data. Active hash locks are
    tracked in a hashtable similar to the way logical block locks are
    tracked in step 2. This hashtable is covered by the implicit lock on
    the hash zone.

    If there is no existing hash lock for this data_vio's record_name, the
    data_vio obtains a hash lock from the pool, adds it to the hashtable,
    and sets itself as the new hash lock's "agent." The hash_lock pool is
    also covered by the implicit hash zone lock. The hash lock agent will
    do all the work to decide where the application data will be
    written. If a hash lock for the data_vio's record_name already exists,
    and the data_vio's data is the same as the agent's data, the new
    data_vio will wait for the agent to complete its work and then share
    its result.

    In the rare case that a hash lock exists for the data_vio's hash but
    the data does not match the hash lock's agent, the data_vio skips to
    step 8h and attempts to write its data directly. This can happen if two
    different data blocks produce the same hash, for example.

8.  The hash lock agent attempts to deduplicate or compress its data with
    the following steps.

    a. The agent initializes and sends its embedded deduplication request
       (struct uds_request) to the deduplication index. This does not
       require the data_vio to get any locks because the index components
       manage their own locking. The data_vio waits until it either gets a
       response from the index or times out.

    b. If the deduplication index returns advice, the data_vio attempts to
       obtain a physical block lock on the indicated physical address, in
       order to read the data and verify that it is the same as the
       data_vio's data, and that it can accept more references. If the
       physical address is already locked by another data_vio, the data at
       that address may soon be overwritten so it is not safe to use the
       address for deduplication.

    c. If the data matches and the physical block can add references, the
       agent and any other data_vios waiting on it will record this
       physical block as their new physical address and proceed to step 9
       to record their new mapping. If there are more data_vios in the hash
       lock than there are references available, one of the remaining
       data_vios becomes the new agent and continues to step 8d as if no
       valid advice was returned.

    d. If no usable duplicate block was found, the agent first checks that
       it has an allocated physical block (from step 4) that it can write
481        to. If the agent does not have an alloc    481        to. If the agent does not have an allocation, some other data_vio in
482        the hash lock that does have an allocat    482        the hash lock that does have an allocation takes over as agent. If
483        none of the data_vios have an allocated    483        none of the data_vios have an allocated physical block, these writes
484        are out of space, so they proceed to st    484        are out of space, so they proceed to step 13 for cleanup.
485                                                   485 
486     e. The agent attempts to compress its data    486     e. The agent attempts to compress its data. If the data does not
487        compress, the data_vio will continue to    487        compress, the data_vio will continue to step 8h to write its data
488        directly.                                  488        directly.
489                                                   489 
490        If the compressed size is small enough,    490        If the compressed size is small enough, the agent will release the
491        implicit hash zone lock and go to the p    491        implicit hash zone lock and go to the packer (struct packer) where
492        it will be placed in a bin (struct pack    492        it will be placed in a bin (struct packer_bin) along with other
493        data_vios. All compression operations r    493        data_vios. All compression operations require the implicit lock on
494        the packer zone.                           494        the packer zone.
495                                                   495 
       The packer can combine up to 14 compressed blocks in a single 4K
       data block. Compression is only helpful if vdo can pack at least 2
       data_vios into a single data block. This means that a data_vio may
       wait in the packer for an arbitrarily long time for other data_vios
       to fill out the compressed block. There is a mechanism for vdo to
       evict waiting data_vios when continuing to wait would cause
       problems. Circumstances causing an eviction include an application
       flush, device shutdown, or a subsequent data_vio trying to overwrite
       the same logical block address. A data_vio may also be evicted from
       the packer if it cannot be paired with any other compressed block
       before more compressible blocks need to use its bin. An evicted
       data_vio will proceed to step 8h to write its data directly. (A
       sketch of this bin-filling rule appears after this list.)

    f. If the agent fills a packer bin, either because all 14 of its slots
       are used or because it has no remaining space, the bin is written
       out using the allocated physical block from one of its data_vios.
       Step 8d has already ensured that an allocation is available.

    g. Each data_vio sets the compressed block as its new physical address.
       The data_vio obtains an implicit lock on the physical zone and
       acquires the struct pbn_lock for the compressed block, which is
       modified to be a shared lock. Then it releases the implicit physical
       zone lock and proceeds to step 8i.

    h. Any data_vio evicted from the packer will have an allocation from
       step 3. It will write its data to that allocated physical block.

    i. After the data is written, if the data_vio is the agent of a hash
       lock, it will reacquire the implicit hash zone lock and share its
       physical address with as many other data_vios in the hash lock as
       possible. Each data_vio will then proceed to step 9 to record its
       new mapping.

    j. If the agent actually wrote new data (whether compressed or not),
       the deduplication index is updated to reflect the location of the
       new data. The agent then releases the implicit hash zone lock.

9.  The data_vio determines the previous mapping of the logical address.
    There is a cache for block map leaf pages (the "block map cache"),
    because there are usually too many block map leaf nodes to store
    entirely in memory. If the desired leaf page is not in the cache, the
    data_vio will reserve a slot in the cache and load the desired page
    into it, possibly evicting an older cached page. The data_vio then
    finds the current physical address for this logical address (the "old
    physical mapping"), if any, and records it. This step requires a lock
    on the block map cache structures, covered by the implicit logical zone
    lock.

10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping. Making this journal entry requires holding the implicit
    recovery journal lock. The data_vio will wait in the journal until all
    recovery blocks up to the one containing its entry have been written
    and flushed to ensure the transaction is stable on storage.

11. Once the recovery journal entry is stable, the data_vio makes two slab
    journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping. These two operations each require
    holding a lock on the affected physical slab, covered by its implicit
    physical zone lock. For correctness during recovery, the slab journal
    entries in any given slab journal must be in the same order as the
    corresponding recovery journal entries. Therefore, if the two entries
    are in different zones, they are made concurrently, and if they are in
    the same zone, the increment is always made before the decrement in
    order to avoid underflow. After each slab journal entry is made in
    memory, the associated reference count is also updated in memory. (The
    ordering of these journal updates is sketched after this list.)

12. Once both of the reference count updates are done, the data_vio
    acquires the implicit logical zone lock and updates the
    logical-to-physical mapping in the block map to point to the new
    physical block. At this point the write operation is complete.

13. If the data_vio has a hash lock, it acquires the implicit hash zone
    lock and releases its hash lock to the pool.

    The data_vio then acquires the implicit physical zone lock and releases
    the struct pbn_lock it holds for its allocated block. If it had an
    allocation that it did not use, it also sets the reference count for
    that block back to zero to free it for use by subsequent data_vios.

    The data_vio then acquires the implicit logical zone lock and releases
    the logical block lock acquired in step 2.

    The application bio is then acknowledged if it has not previously been
    acknowledged, and the data_vio is returned to the pool.
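
The bin-filling behavior described in step 8e can be illustrated with a
small, self-contained sketch. The names below are hypothetical and do not
match the actual dm-vdo structures (the real code uses struct packer and
struct packer_bin), and the sketch ignores the per-block metadata of the
real packed block format. It only models the constraints stated above: a
bin holds at most 14 compressed fragments, the fragments must fit within a
single 4K data block, and writing a bin is only worthwhile once it holds
at least two fragments::

    #include <stdbool.h>
    #include <stddef.h>

    #define DATA_BLOCK_SIZE  4096  /* vdo data block size */
    #define MAX_BIN_SLOTS    14    /* compressed fragments per data block */
    #define MIN_USEFUL_SLOTS 2     /* packing fewer than 2 saves nothing */

    /* Hypothetical stand-in for one compressed data_vio waiting in a bin. */
    struct fragment {
            size_t compressed_size;
    };

    struct bin {
            struct fragment slots[MAX_BIN_SLOTS];
            unsigned int used;
            size_t bytes;   /* total compressed bytes held by the bin */
    };

    /* Try to add a fragment; failure means the bin is effectively full. */
    static bool bin_add(struct bin *bin, size_t compressed_size)
    {
            if (bin->used == MAX_BIN_SLOTS ||
                bin->bytes + compressed_size > DATA_BLOCK_SIZE)
                    return false;

            bin->slots[bin->used].compressed_size = compressed_size;
            bin->used++;
            bin->bytes += compressed_size;
            return true;
    }

    /* A bin is only worth writing as a compressed block if it packs >= 2. */
    static bool bin_worth_writing(const struct bin *bin)
    {
            return bin->used >= MIN_USEFUL_SLOTS;
    }

In these terms, a data_vio evicted from the packer corresponds to a
fragment removed from a bin that never reached two occupants; it falls
back to writing its data uncompressed as in step 8h.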
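
The ordering rules of step 11 can also be modeled with a short sketch,
using a hypothetical entry layout holding the three fields recorded in
step 10. Every type and helper below is illustrative only (slab_zone_of(),
slab_journal_increment(), and so on are not dm-vdo functions), and the
stub bodies stand in for vdo's real zone and journal machinery. The sketch
shows only that the increment for the new mapping precedes the decrement
for the old mapping when both land in the same physical zone::

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t lbn_t;     /* logical block number  */
    typedef uint64_t pbn_t;     /* physical block number */

    /* Hypothetical recovery journal entry: step 10 records these fields. */
    struct recovery_entry {
            lbn_t lbn;
            pbn_t old_pbn;      /* previous mapping, 0 if unmapped */
            pbn_t new_pbn;      /* mapping being established */
    };

    #define SLAB_BLOCKS    32768u   /* hypothetical blocks per slab */
    #define PHYSICAL_ZONES 4u       /* hypothetical zone count      */

    static unsigned int slab_zone_of(pbn_t pbn)
    {
            /* Hypothetical round-robin assignment of slabs to zones. */
            return (unsigned int)((pbn / SLAB_BLOCKS) % PHYSICAL_ZONES);
    }

    static void slab_journal_increment(pbn_t pbn)
    {
            (void)pbn;  /* a real implementation appends a +1 entry */
    }

    static void slab_journal_decrement(pbn_t pbn)
    {
            (void)pbn;  /* a real implementation appends a -1 entry */
    }

    /*
     * Step 11: once the recovery journal entry is stable, update the slab
     * journals. Entries within one slab journal must appear in recovery
     * journal order, and within a single zone the increment must precede
     * the decrement so a reference count can never underflow.
     */
    static void apply_slab_updates(const struct recovery_entry *entry)
    {
            bool same_zone = entry->old_pbn != 0 &&
                    slab_zone_of(entry->old_pbn) ==
                    slab_zone_of(entry->new_pbn);

            if (same_zone) {
                    /* Same zone: increment strictly before decrement. */
                    slab_journal_increment(entry->new_pbn);
                    slab_journal_decrement(entry->old_pbn);
            } else {
                    /*
                     * Different zones (or no old mapping): the two updates
                     * are independent and may be issued concurrently; here
                     * they are simply issued back to back.
                     */
                    slab_journal_increment(entry->new_pbn);
                    if (entry->old_pbn != 0)
                            slab_journal_decrement(entry->old_pbn);
            }
    }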

*Read Path*

An application read bio follows a much simpler set of steps. It does steps
1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that logical
address that is guaranteed to complete, the read data_vio will copy the
data from the write data_vio and return it. Otherwise, it will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the indicated data at the indicated
physical block address. A read data_vio will not allocate block map tree
nodes if they are missing. If the interior block map nodes do not exist
yet, the logical block address must still be unmapped and the read
data_vio will return all zeroes. A read data_vio handles cleanup and
acknowledgment as in step 13, although it only needs to release the logical
lock and return itself to the pool.
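
These decisions reduce to a short chain of checks. The following is an
illustrative user-space sketch only; none of these helpers exist in
dm-vdo, and the real code operates on bios and data_vios rather than flat
buffers. The stubs merely mark where the in-flight-write check, the block
map traversal, and the physical read or decompression would occur::

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define DATA_BLOCK_SIZE 4096

    typedef uint64_t lbn_t;
    typedef uint64_t pbn_t;

    /* Hypothetical stubs for the lookups described in the text. */
    static bool copy_from_in_flight_write(lbn_t lbn, void *out)
    {
            (void)lbn; (void)out;
            return false;   /* no guaranteed-to-complete write in progress */
    }

    static bool block_map_lookup(lbn_t lbn, pbn_t *pbn, bool *compressed)
    {
            (void)lbn; (void)pbn; (void)compressed;
            return false;   /* tree nodes missing or leaf entry unmapped */
    }

    static void read_physical_block(pbn_t pbn, void *out)
    {
            (void)pbn;
            memset(out, 0, DATA_BLOCK_SIZE);   /* placeholder read */
    }

    static void read_compressed_fragment(pbn_t pbn, void *out)
    {
            (void)pbn;
            memset(out, 0, DATA_BLOCK_SIZE);   /* placeholder read + decompress */
    }

    static void read_logical_block(lbn_t lbn, void *out)
    {
            pbn_t pbn;
            bool compressed;

            /* A concurrent write that is guaranteed to complete supplies
             * the data directly. */
            if (copy_from_in_flight_write(lbn, out))
                    return;

            /* Missing tree nodes or an unmapped address read as zeroes. */
            if (!block_map_lookup(lbn, &pbn, &compressed)) {
                    memset(out, 0, DATA_BLOCK_SIZE);
                    return;
            }

            if (compressed)
                    read_compressed_fragment(pbn, out);
            else
                    read_physical_block(pbn, out);
    }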

*Small Writes*

All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are nearly identical to the normal read and write
operations, and a single data_vio is used throughout this operation.
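
A minimal sketch of this read-modify-write sequence follows, with
hypothetical helpers standing in for the read and write phases (in dm-vdo
itself a single data_vio performs both phases); only the block and sector
sizes come from the text above::

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define VDO_BLOCK_SIZE 4096
    #define SECTOR_SIZE    512

    /* Hypothetical stubs for the read and write phases described above. */
    static void read_block(uint64_t lbn, uint8_t block[VDO_BLOCK_SIZE])
    {
            (void)lbn;
            memset(block, 0, VDO_BLOCK_SIZE);  /* placeholder contents */
    }

    static void write_block(uint64_t lbn, const uint8_t block[VDO_BLOCK_SIZE])
    {
            (void)lbn; (void)block;            /* normal write path */
    }

    /*
     * Handle a write of 'count' 512-byte sectors starting at sector
     * 'first_sector' within the 4K block holding logical block 'lbn'.
     */
    static void read_modify_write(uint64_t lbn, unsigned int first_sector,
                                  unsigned int count, const uint8_t *new_data)
    {
            uint8_t block[VDO_BLOCK_SIZE];

            assert((first_sector + count) * SECTOR_SIZE <= VDO_BLOCK_SIZE);

            /* Read: fetch the current contents of the whole 4K block. */
            read_block(lbn, block);

            /* Modify: copy the new data over the appropriate sectors. */
            memcpy(block + first_sector * SECTOR_SIZE, new_data,
                   count * SECTOR_SIZE);

            /* Write: launch a normal write of the modified block. */
            write_block(lbn, block);
    }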

*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of each valid entry is
played into the block map. Next, valid entries are played, in order as
required, into the slab journals. Finally, each physical zone attempts to
replay at least one slab journal to reconstruct the reference counts of
one slab. Once each zone has some free space (or has determined that it
has none), the vdo comes back online, while the remainder of the slab
journals are used to reconstruct the rest of the reference counts in the
background.
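
In outline, this replay makes two ordered passes over the valid recovery
journal entries before the slab journals themselves are replayed. The
sketch below is purely illustrative: the entry layout and helpers are
hypothetical rather than the dm-vdo structures, and real entries also
carry validity information that is omitted here::

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t lbn_t;
    typedef uint64_t pbn_t;

    struct recovery_entry {         /* hypothetical journal entry layout */
            lbn_t lbn;
            pbn_t old_pbn;          /* 0 if the logical block was unmapped */
            pbn_t new_pbn;
    };

    /* Hypothetical stand-ins for the block map and slab journal machinery. */
    static void block_map_set(lbn_t lbn, pbn_t pbn)    { (void)lbn; (void)pbn; }
    static void slab_journal_add(pbn_t pbn, int delta) { (void)pbn; (void)delta; }

    static void replay_recovery_journal(const struct recovery_entry *entries,
                                        size_t count)
    {
            size_t i;

            /* Pass 1: the increment side of each entry restores the map. */
            for (i = 0; i < count; i++)
                    block_map_set(entries[i].lbn, entries[i].new_pbn);

            /*
             * Pass 2: play the reference count changes into the slab
             * journals, preserving recovery journal order within each
             * slab. Each physical zone then replays at least one slab
             * journal before the device comes online; the rest are
             * replayed in the background.
             */
            for (i = 0; i < count; i++) {
                    slab_journal_add(entries[i].new_pbn, +1);
                    if (entries[i].old_pbn != 0)
                            slab_journal_add(entries[i].old_pbn, -1);
            }
    }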

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the possibility that data has been lost. During a read-only rebuild, the
block map is recovered from the recovery journal as before. However, the
reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other. This allows vdo to resume normal operation and
accept further writes.
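
The essential difference from normal recovery is that the reference counts
are derived entirely from the block map. A toy version of that traversal,
with a flat array of mappings standing in for a walk of the real block map
and hypothetical reference count storage (the actual on-disk formats
differ)::

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t pbn_t;

    /*
     * Zero every reference count, then credit one reference to the
     * physical block behind each mapped logical address. Unmapped
     * logical addresses are represented as 0 in 'mappings'.
     */
    static void rebuild_reference_counts(const pbn_t *mappings,
                                         size_t logical_blocks,
                                         uint32_t *ref_counts,
                                         size_t physical_blocks)
    {
            size_t i;

            for (i = 0; i < physical_blocks; i++)
                    ref_counts[i] = 0;

            for (i = 0; i < logical_blocks; i++) {
                    pbn_t pbn = mappings[i];

                    if (pbn != 0 && pbn < physical_blocks)
                            ref_counts[pbn]++;
            }
    }

Because the counts are reconstructed only from mappings that survive in
the block map, the result is internally consistent even though some
previously acknowledged data may be gone, which is why this rebuild is
only ever performed on explicit request.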