.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the autho
        Please try to start sentences on separ
        sentence changes don't bleed colors in
        Heading decorations are documented in

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling com
The library has two layers:

 1. A lower layer that provides an iterator ov
    This layer tries to obtain mappings of eac
    from the filesystem, but the storage infor
    required.

 2. An upper layer that acts upon the space ma
    lower layer iterator.

The iteration can involve mappings of file's l
physical extents, but the storage layer inform
required, e.g. for walking cached file informa
The library exports various APIs for implement
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O p
has now been extended to cover several other o

Who Should Read This?
=====================

The target audience for this document are file
pagecache programmers and code reviewers.

If you are working on PCI, machine architectur
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which break
units (generally memory pages or blocks) and l
the basis of that unit, the iomap model asks t
largest space mappings that it can create for
initiates operations on that basis.
This strategy improves the filesystem's visibi
operation being performed, which enables it to
larger space allocations when possible.
Larger space mappings improve runtime performa
of mapping function calls into the filesystem
data.

At a high level, an iomap operation `looks lik
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@d

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to
         So far only the pagecache operations

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``,

Each iomap operation will be covered in more d
This library was covered previously by an `LWN
<https://lwn.net/Articles/935934/>`_ and a `Ke
<https://kernelnewbies.org/KernelProjects/ioma

The goal of this document is to provide a brie
design and capabilities of iomap, followed by
of the interfaces presented by iomap.
If you change iomap, please update this design

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the

 * ``fsblock``: The block size of a file, also

 * ``i_rwsem``: The VFS ``struct inode`` rwsem
   Processes hold this in shared mode to read
   Some filesystems may allow shared mode for
   Processes often hold this in exclusive mode
   contents.

 * ``invalidate_lock``: The pagecache ``struct
   rwsemaphore that protects against folio ins
   filesystems that support punching out folio
   Processes wishing to insert folios must hol
   mode to prevent removal, though concurrent
   Processes wishing to remove folios must hol
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that d
   device pre-shutdown hook from returning bef
   released resources.

 * **filesystem mapping lock**: This synchroni
   internal to the filesystem and must protect
   from updates while a mapping is being sampl
   The filesystem author must determine how th
   happen; it does not need to be an actual lo

 * **iomap internal operation lock**: This is
   synchronization primitives that iomap funct
   mapping.
   A specific example would be taking the foli
   writing the pagecache.

 * **pure overwrite**: A write operation that
   metadata or zeroing operations to perform d
   or completion.
   This implies that the filesystem must have
   on disk as ``IOMAP_MAPPED`` and the filesys
   constraints on IO alignment or size.
   The only constraints on I/O alignment are d
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap itera
byte ranges of a file to byte ranges of a stor
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the rang
   bytes, covered by this mapping.
   These fields must always be set by the file

 * ``type`` describes the type of the space ma

   * **IOMAP_HOLE**: No storage has been alloc
     This type must never be returned in respo
     operation because writes must allocate an
     the mapping.
     The ``addr`` field must be set to ``IOMAP
     iomap does not support writing (whether v
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate
     ("delayed allocation").
     If the filesystem returns IOMAP_F_NEW her
     ``->iomap_end`` function must delete the
     The ``addr`` field must be set to ``IOMAP

   * **IOMAP_MAPPED**: The file range maps to
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned

   * **IOMAP_UNWRITTEN**: The file range maps
     storage device, but the space has not yet
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned
     Reads from this type of mapping will retu
     For a write or writeback operation, the i
     mapping to MAPPED.
     Refer to the sections about ioends for mo

   * **IOMAP_INLINE**: The file range maps to
     specified by ``inline_data``.
     For write operation, the ``->iomap_end``
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP

 * ``flags`` describe the status of the space
   These flags should be set by the filesystem

   * **IOMAP_F_NEW**: The space under the mapp
     Areas that will not be written to must be
     If a write fails and the mapping is a spa
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have un
     to access any data written.
     fdatasync is required to commit these cha
     storage.
     This needs to take into account metadata
     at I/O completion, such as file size upda

   * **IOMAP_F_SHARED**: The space under the m
     Copy on write is necessary to avoid corru

   * **IOMAP_F_BUFFER_HEAD**: This mapping req
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous b
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for ext
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_PRIVATE**: Starting with this v
     be set by the filesystem for its own purp

   These flags can be set by iomap itself duri
   The filesystem should supply an ``->iomap_end``
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size h
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found
     iomap will call ``->iomap_end`` on this m
     ``->iomap_begin`` to obtain a new mapping

   Currently, these flags are only set by page

 * ``addr`` describes the device address, in b

 * ``bdev`` describes the block device for thi
   This only needs to be set for mapped or unw

 * ``dax_dev`` describes the DAX device for th
   This only needs to be set for mapped or unw
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer f
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping

 * ``private`` is a pointer to `filesystem-pri
   <https://lore.kernel.org/all/20180619164137.
   This value will be passed unchanged to ``->

 * ``folio_ops`` will be covered in the sectio

 * ``validity_cookie`` is a magic freshness va
   that should be used to detect stale mapping
   For pagecache operations this is critical f
   because page faults can occur, which implie
   should not be held between ``->iomap_begin``
   Filesystems with completely static mappings
   Only pagecache operations revalidate mappin
   ``iomap_valid`` for details.

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem t
structure to obtain a mapping and (optionally)

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obt
the range of bytes specified by ``pos`` and ``
``inode``.
This mapping should be returned through the ``
The mapping must cover at least the first byte
range, but it does not need to cover the entir

Each iomap operation describes the requested o
``flags`` argument.
The exact value of ``flags`` will be documente
operation-specific sections below.
These flags can, at least in principle, apply
operations:

 * ``IOMAP_DIRECT`` is set when the caller wis
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wis
   effort attempt to avoid any operation that
   the submitting task.
   This is similar in intent to ``O_NONBLOCK``
   intended for asynchronous applications to k
   instead of waiting for the specific unavail
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` s
   trylock algorithms.
   They need to be able to satisfy the entire
   single iomap mapping.
   They need to avoid reading or writing metad
   They need to avoid blocking memory allocati
   They need to avoid waiting on transaction r
   modifications to take place.
   They probably should not be allocating new
   And so on.
   If there is any doubt in the filesystem dev
   whether any specific ``IOMAP_NOWAIT`` opera
   then they should return ``-EAGAIN`` as earl
   start the operation and force the submittin
   ``IOMAP_NOWAIT`` is often set on behalf of
   ``RWF_NOWAIT``.
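
The ``IOMAP_NOWAIT`` rules in the list above (trylock instead of waiting, one mapping for the whole request, ``-EAGAIN`` when in doubt) can be sketched in userspace C.
This is only an illustrative model, not kernel code: every ``demo_`` and ``DEMO_`` name is hypothetical, the flag and type values are arbitrary, and the spinning ``atomic_flag`` merely stands in for whatever filesystem mapping lock a real implementation would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel constants; values are arbitrary. */
#define DEMO_IOMAP_NOWAIT   (1u << 0)
#define DEMO_IOMAP_HOLE     0u
#define DEMO_IOMAP_MAPPED   2u
#define DEMO_NULL_ADDR      UINT64_MAX

/* Cut-down model of a mapping: file range plus device address. */
struct demo_iomap {
    uint64_t addr;    /* device address in bytes */
    int64_t  offset;  /* file offset in bytes */
    uint64_t length;  /* bytes covered by this mapping */
    uint16_t type;
};

/* Toy inode: a single mapped extent covering [0, extent_len). */
struct demo_inode {
    atomic_flag map_lock;   /* models the filesystem mapping lock */
    uint64_t    disk_base;  /* device address of file offset 0 */
    uint64_t    extent_len; /* bytes in the mapped extent */
};

/*
 * Model of a ->iomap_begin style callback.
 * Returns 0 on success or a negative errno.
 */
int demo_iomap_begin(struct demo_inode *ip, int64_t pos, uint64_t length,
                     unsigned flags, struct demo_iomap *iomap)
{
    assert(ip && iomap && pos >= 0 && length > 0);

    if (flags & DEMO_IOMAP_NOWAIT) {
        /* Trylock: bail out instead of waiting for the lock. */
        if (atomic_flag_test_and_set(&ip->map_lock))
            return -EAGAIN;
    } else {
        /* Busy-wait; stands in for a lock that may sleep. */
        while (atomic_flag_test_and_set(&ip->map_lock))
            ;
    }

    int ret = 0;
    uint64_t upos = (uint64_t)pos;

    if (upos >= ip->extent_len) {
        /* Nothing allocated here: report a hole for the whole range. */
        iomap->type = DEMO_IOMAP_HOLE;
        iomap->addr = DEMO_NULL_ADDR;
        iomap->offset = pos;
        iomap->length = length;
    } else {
        uint64_t avail = ip->extent_len - upos;

        if (avail < length && (flags & DEMO_IOMAP_NOWAIT)) {
            /* One mapping cannot satisfy the whole request. */
            ret = -EAGAIN;
        } else {
            /* Cover at least the first byte; trim to the extent. */
            iomap->type = DEMO_IOMAP_MAPPED;
            iomap->addr = ip->disk_base + upos;
            iomap->offset = pos;
            iomap->length = avail < length ? avail : length;
        }
    }

    atomic_flag_clear(&ip->map_lock);
    return ret;
}
```

In this model a caller that does not pass the hypothetical ``DEMO_IOMAP_NOWAIT`` flag receives a mapping trimmed to the extent and is expected to call again for the remainder, whereas a non-blocking caller gets ``-EAGAIN`` and must retry from a context that is allowed to wait.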

If it is necessary to read existing file conte
<https://lore.kernel.org/all/20191008071527.293
device or address range on a device, the files
information via ``srcmap``.
Only pagecache and fsdax operations support re
writing to another.

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end``
is called to signal that iomap is finished wit
Typically, implementations will use this funct
context that were set up in ``->iomap_begin``.
For example, a write might wish to commit the
that were operated upon and unreserve any spac
upon.
``written`` might be zero if no bytes were tou
``flags`` will contain the same value passed t
iomap ops for reads are not likely to need to

Both functions should return a negative errno
success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to
and file state before initiating an I/O operat
It does not handle obtaining filesystem freeze
timestamps, stripping privileges, or access co

Locking Hierarchy
=================

iomap requires that filesystems supply their o
There are three categories of synchronization
iomap is concerned:

 * The **upper** level primitive is provided b
   coordinate access to different iomap operat
   The exact primitive is specific to the file
   but is often a VFS inode, pagecache invalid
   For example, a filesystem might take ``i_rw
   ``iomap_file_buffered_write`` and ``iomap_f
   these two file operations from clobbering e
   Pagecache writeback may lock a folio to pre
   accessing the folio until writeback is unde

 * The **lower** level primitive is taken by
   ``->iomap_begin`` and ``->iomap_end`` fun
   access to the file space mapping informat
   The fields of the iomap object should be
   this primitive.
   The upper level synchronization primitive
   while acquiring the lower level synchroni
   For example, XFS takes ``ILOCK_EXCL`` and
   while sampling mappings.
   Filesystems with immutable mapping inform
   synchronization here.

 * The **operation** primitive is taken by a
   coordinate access to its own internal dat
   The upper level synchronization primitive
   while acquiring this primitive.
   The lower level primitive is not held whi
   primitive.
   For example, pagecache write operations w
   then grab and lock a folio to copy new co
   It may also lock an internal folio state

The exact locking requirements are specific to
certain operations, some of these locks can be
All further mentions of locking are *recommend
Each filesystem author must figure out the loc

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the
 * Does iomap *actually* work for non-regular

Patches welcome!