=================
Thin provisioning
=================

Introduction
============

This document describes a collection of device
between them implement thin-provisioning and s

The main highlight of this implementation, com
implementation of snapshots, is that it allows
be stored on the same data volume.  This simpl
allows the sharing of data between volumes, th

Another significant feature is support for an
recursive snapshots (snapshots of snapshots of
previous implementation of snapshots did this
lookup tables, and so performance was O(depth)
implementation uses a single data structure to
with depth.  Fragmentation may still be an iss
scenarios.

Metadata is stored on a separate device from d
administrator some freedom, for example to:

- Improve metadata resilience by storing metad
  but data on a non-mirrored one.

- Improve performance by storing the metadata

Status
======

These targets are considered safe for producti
cases will have different performance characte
to fragmentation of the data volume.

If you find this software is not performing as
dm-devel@redhat.com with details and we'll try
things for you.

Userspace tools for checking and repairing the
developed and are available as 'thin_check' an
of the package that provides these utilities v
a Red Hat distribution it is named 'device-map

Cookbook
========

This section describes some quick recipes for
They use the dmsetup program to control the de
directly.  End users will be advised to use a
manager such as LVM2 once support has been add

Pool device
-----------

The pool device ties together the metadata vol
It maps I/O linearly to the data volume and up
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace whic
  virtual devices amongst other things.
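
As a rough illustration of the second mechanism, the sketch below prints
(but does not run) the dmsetup command lines that would send a message to a
pool.  It is a dry-run sketch only: ``pool_message`` and ``valid_dev_id`` are
hypothetical helper names, not part of dmsetup or of these targets.  The
``create_thin`` and ``create_snap`` message names and the 24-bit identifier
limit are taken from the Reference section of this document.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the dmsetup command that would send
# a message to a thin-pool device.  The helper names are hypothetical.

MAX_DEV_ID=16777215   # largest 24-bit value (2^24 - 1)

# Succeeds only if $1 is a non-empty decimal number that fits in 24 bits.
valid_dev_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le "$MAX_DEV_ID" ]
}

# Usage: pool_message <pool dev> <message> <dev id> [<origin id>]
pool_message() {
    pool=$1 msg=$2 dev_id=$3 origin_id=${4:-}
    valid_dev_id "$dev_id" || { echo "bad dev id: $dev_id" >&2; return 1; }
    case "$msg" in
        create_thin)
            echo "dmsetup message $pool 0 \"create_thin $dev_id\"" ;;
        create_snap)
            valid_dev_id "$origin_id" ||
                { echo "bad origin id: $origin_id" >&2; return 1; }
            echo "dmsetup message $pool 0 \"create_snap $dev_id $origin_id\"" ;;
        *)
            echo "unsupported message: $msg" >&2; return 1 ;;
    esac
}

pool_message /dev/mapper/pool create_thin 0
pool_message /dev/mapper/pool create_snap 1 0
```

Printing the command instead of running it keeps the sketch usable without
root privileges or a live pool; the printed line can be reviewed and then
executed by hand against a real pool device.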

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid meta
data device.  If you do not have an existing m
make one by zeroing the first 4k to indicate e

    dd if=/dev/zero of=$metadata_dev bs=4096 c

The amount of metadata you need will vary acco
are shared between thin devices (i.e. through
less sharing than average you'll need a larger

As a guide, we suggest you calculate the numbe
metadata device as 48 * $data_dev_size / $data
to 2MB if the answer is smaller.  If you're cr
snapshots which are recording large amounts of
need to increase this.

The largest size supported is 16GB: If the dev
a warning will be issued and the excess space

Reloading a pool table
----------------------

You may reload a pool's table, indeed this is
if it runs out of space.  (N.B. While specifyi
device when reloading is not forbidden at the
wrong if it does not route I/O to exactly the
previously.)

Using an existing pool device
-----------------------------

::

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadat
                 $data_block_size $low_water_m

$data_block_size gives the smallest unit of di
allocated at a time expressed in units of 512-
$data_block_size must be between 128 (64KB) an
multiple of 128 (64KB).  $data_block_size cann
thin-pool is created.  People primarily intere
may want to use a value such as 1024 (512KB).
snapshotting may want a smaller value such as
not zeroing newly-allocated data, a larger $da
region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size
free space on the data device drops below this
will be triggered which a userspace daemon sho
extend the pool device.  Only one such event w

No special event is triggered if a just resume
the low water mark.
However, resuming a device
event; a userspace daemon should verify that f
water mark when handling this event.

A low water mark for the metadata device is ma
will trigger a dm event if free space on the m
it.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLU
If no such requests are made then commits will
means the thin-provisioning target behaves lik
a volatile write cache.  If power is lost you
writes.  The metadata should always be consist

If data space is exhausted the pool will eithe
according to the configuration (see: error_if_
space is exhausted or a metadata operation fai
until the pool is taken offline and repair is
potential inconsistencies and 2) clear the fla
Once the pool's metadata device is repaired it
will allow the pool to return to normal operat
is flagged as needing repair, the pool's data
cannot be resized until repair is performed.
that when the pool's metadata space is exhaust
transaction is aborted.  Given that the pool w
completion may have already been acknowledged
(e.g. filesystem) it is strongly suggested tha
(e.g. fsck) be performed on those layers when
required.

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume y
  active pool device, /dev/mapper/pool in this

    dmsetup message /dev/mapper/pool 0 "create

  Here '0' is an identifier for the volume, a
  to the caller to allocate and manage these i
  identifier is already in use, the message wi

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated usi

    dmsetup create thin --table "0 2097152 thi

  The last parameter is the identifier for the

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message t

  N.B.  If the origin device that you wish to
  must suspend it before creating the snapshot
  This is NOT enforced at the moment, so pleas

  ::

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a
  identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry
  between the origin and the snapshot.  Indeed
  different from any other thinly-provisioned
  snapshotted itself via the same method.  It'
  have only one of them active, and there's no
  activating or removing them both.  (This dif
  device-mapper snapshots.)

  Activate it exactly the same way as any othe

    dmsetup create snap --table "0 2097152 thi

External snapshots
------------------

You can use an external **read only** device a
thinly-provisioned volume.  Any read to an unp
thin device will be passed through to the orig
the allocation of new blocks as usual.

One use case for this is VM hosts that want to
thinly-provisioned volumes but have the base i
(possibly shared between many VMs).

You must not write to the origin device if you
Of course, you may write to the thin device an
of the thin volume.

i) Creating a snapshot of an external device

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

  ::

    dmsetup message /dev/mapper/pool 0 "create

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target

    dmsetup create snap --table "0 2097152 thi

  N.B. All descendants (internal snapshots) of
  same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated b
can be.

::

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    ::

      thin-pool <metadata dev> <data dev> <dat
                <low water mark (blocks)> [<nu

    Optional feature arguments:

      skip_block_zeroing:
        Skip the zeroing of newly-provisioned

      ignore_discard:
        Disable discard support.

      no_discard_passdown:
        Don't pass discards down to the underl
        data device, but just remove the mappi

      read_only:
        Don't allow any changes to be
        metadata.  This mode is only
        thin-pool has been created an
        read/write mode.  It cannot b
        thin-pool creation.

      error_if_no_space:
        Error IOs, instead of queueing, if no

    Data block size must be between 64KB (128
    (2097152 sectors) inclusive.


ii) Status

    ::

      <transaction id> <used metadata blocks>/
      <used data blocks>/<total data blocks> <
      ro|rw|out_of_data_space [no_]discard_pas
      needs_check|- metadata_low_watermark

    transaction id:
        A 64-bit number used by userspace to h
        from volume managers.

    used data blocks / total data blocks
        If the number of free blocks drops bel
        dm event will be sent to userspace.  T
        it will occur only once after each res
        should register for the event and then

    held metadata root:
        The location, in blocks, of the metada
        'held' for userspace read access.  '-'
        held root.

    discard_passdown|no_discard_passdown
        Whether or not discards are actually b
        underlying device.  When this is enabl
        it can get disabled if the underlying

    ro|rw|out_of_data_space
        If the pool encounters certain types o
        drop into a read-only metadata mode in
        the pool metadata (like allocating new

        In serious cases where even a read-onl
        no further I/O will be permitted and t
        contain the string 'Fail'.
        The usersp
        should then be used.

    error_if_no_space|queue_if_no_space
        If the pool runs out of data or metada
        either queue or error the IO destined
        default is to queue the IO until more
        'no_space_timeout' expires.  The 'no_s
        module parameter can be used to change
        defaults to 60 seconds but may be disa

    needs_check
        A metadata operation has failed, resul
        flag being set in the metadata's super
        device must be deactivated and checked
        thin-pool can be made fully operationa
        needs_check is not set.

    metadata_low_watermark:
        Value of metadata low watermark in blo
        value internally but userspace needs t
        determine if an event was caused by cr

iii) Messages

    create_thin <dev id>
        Create a new thinly-provisioned device
        <dev id> is an arbitrary unique 24-bit
        the caller.

    create_snap <dev id> <origin id>
        Create a new snapshot of another thinl
        <dev id> is an arbitrary unique 24-bit
        the caller.
        <origin id> is the identifier of the t
        of which the new device will be a snap

    delete <dev id>
        Deletes a thin device.  Irreversible.

    set_transaction_id <current id> <new id>
        Userland volume managers, such as LVM,
        synchronise their external metadata wi
        pool target.  The thin-pool target off
        arbitrary 64-bit transaction id and re
        status line.  To avoid races you must
        the current transaction id is when you
        compare-and-swap message.

    reserve_metadata_snap
        Reserve a copy of the data mapping btr
        This allows userland to inspect the ma
        this message was executed.  Use the po
        get the root block associated with the

    release_metadata_snap
        Release a previously reserved copy of

'thin' target
-------------

i) Constructor

    ::

        thin <pool dev> <dev id> [<external or

    pool dev:
        the thin-pool device, e.g.
        /dev/mapper

    dev id:
        the internal device identifier of the
        activated.

    external origin dev:
        an optional block device outside the p
        read-only snapshot origin: reads to un
        thin target will be mapped to this dev

The pool doesn't store any size against the th
load a thin target that is smaller than you've
then you'll have no access to blocks mapped be
load a target that is bigger than before, then
provisioned as and when needed.

ii) Status

    <nr mapped sectors> <highest mapped sector
        If the pool has encountered device err
        will just contain the string 'Fail'.
        tools should then be used.

    In the case where <nr mapped sectors> is 0
    mapped sector and the value of <highest ma