========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss).
For a 10 TB host-managed disk with 256 MB zones, dm-zoned memory usage per
disk instance is at most 4.5 MB and as little as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata. It can also use a regular block device together with the zoned
block device; in that case the regular block device will be split logically
in zones with the same size as the zoned block device. These zones will be
placed in front of the zones from the zoned block device and will be handled
just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may be used also for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of metadata
blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device.
The mapping table is indexed by chunk number and each mapping entry
indicates the zone number of the device storing the chunk of data. Each
mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is always valid only in the data zone mapping the
chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone.
In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible.
To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone freed for reuse.

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set. A generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place updates of
metadata blocks can be done in the primary metadata set. This ensures that
one of the sets is always consistent (all modifications committed or none
at all). Flush operations are used as a commit point.
Upon reception of
a flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while
a metadata flush is being executed.

If a regular device is used in conjunction with the zoned block device,
a third set of metadata (without the zone bitmaps) is written to the
start of the zoned block device. This metadata has a generation counter of
'0' and will never be updated during normal operation; it just serves for
identification purposes. The first and second copies of the metadata
are located at the start of the regular block device.

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

        dmzadm --format /dev/sdxx


If two drives are to be used, both devices must be specified, with the
regular block device as the first device.

Ex::

        dmzadm --format /dev/sdxx /dev/sdyy


Formatted device(s) can be started with the dmzadm utility, too:

Ex::

        dmzadm --start /dev/sdxx /dev/sdyy


Information about the internal layout and current usage of the zones can
be obtained with the 'status' callback from dmsetup:

Ex::

        dmsetup status /dev/dm-X

will return a line::

        0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number
of unmapped (i.e. free) random zones, <nr_rnd> the total number of random
zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq>
the total number of sequential zones.

Normally the reclaim process will be started once less than 50 percent of
the random zones are free.
In order to start the reclaim process manually
even before reaching this threshold the 'dmsetup message' function can be
used:

Ex::

        dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to sequential
zones.
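
The status line and the 50 percent threshold described above can be
combined in a small script. The following is only a sketch: the function
name dmz_free_random_pct and the sample status values are made up for
illustration, and the script assumes the status line has exactly the layout
documented in this section.

```sh
#!/bin/sh
# Sketch only: the function name and the sample values are illustrative
# assumptions, not part of dm-zoned or dmsetup.

# Print the percentage of unmapped (free) random zones found in one status
# line of the form:
#   0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential
dmz_free_random_pct() {
    # Split the status line on whitespace into $1..$9.
    set -- $1
    # Reject lines that do not have the documented layout.
    [ "$#" -eq 9 ] && [ "$3" = "zoned" ] && [ "$7" = "random" ] || return 1
    unmap_rnd=${6%/*}   # part before the slash: <nr_unmap_rnd>
    nr_rnd=${6#*/}      # part after the slash: <nr_rnd>
    # Avoid dividing by zero when no random zones are reported.
    [ "$nr_rnd" -gt 0 ] || return 1
    echo $(( unmap_rnd * 100 / nr_rnd ))
}

# Made-up status line: 10 of 40 random zones are unmapped.
sample="0 3906994176 zoned 7452 zones 10/40 random 7000/7412 sequential"
pct=$(dmz_free_random_pct "$sample")
if [ "$pct" -lt 50 ]; then
    echo "free random zones: ${pct}% (below the 50 percent reclaim threshold)"
else
    echo "free random zones: ${pct}%"
fi
```

On a running system the status line would presumably come from
'dmsetup status /dev/dm-X' instead of the sample string, for example
dmz_free_random_pct "$(dmsetup status /dev/dm-X)".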