.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origins of this library are the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model, which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations to perform during either submission
   or completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on IO alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails,
     the ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to ``IOMAP_MAPPED``.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For write operations, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
     be set by the filesystem for its own purposes.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for a fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``folio_ops`` will be covered in the section on pagecache operations.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.
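
As a rough illustration of the field rules above, the userspace sketch
below fills in a mapping the way a ``->iomap_begin`` implementation
might, and compares a sampled ``validity_cookie`` against a sequence
counter.
The ``toy_*`` types and functions are hypothetical and exist only for
this example.
The typedefs, the trimmed ``struct iomap``, and the ``IOMAP_*`` constants
are local stand-ins so that the sketch compiles outside the kernel; real
filesystems take them from ``include/linux/iomap.h``.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Stand-ins for kernel definitions (assumptions for this sketch). */
   typedef int64_t  loff_t;
   typedef uint64_t u64;
   typedef uint16_t u16;

   #define IOMAP_HOLE      0
   #define IOMAP_MAPPED    2
   #define IOMAP_NULL_ADDR ((u64)-1)

   /* Trimmed copy of struct iomap; pointer fields are omitted. */
   struct iomap {
       u64    addr;
       loff_t offset;
       u64    length;
       u16    type;
       u16    flags;
       u64    validity_cookie;
   };

   /* Hypothetical toy filesystem: one extent per file. */
   struct toy_inode {
       loff_t ext_off;   /* first file byte backed by storage */
       u64    ext_len;   /* number of bytes backed by storage */
       u64    ext_addr;  /* device address of the extent, in bytes */
       u64    map_seq;   /* bumped whenever the mapping changes */
   };

   /* Describe the mapping that covers the first byte of [pos, pos + length). */
   static int toy_fill_iomap(const struct toy_inode *ip, loff_t pos,
                             u64 length, struct iomap *iomap)
   {
       loff_t ext_end = ip->ext_off + (loff_t)ip->ext_len;

       iomap->flags = 0;
       iomap->validity_cookie = ip->map_seq;

       if (pos >= ip->ext_off && pos < ext_end) {
           /* Mapped space: report the whole extent and its address. */
           iomap->type = IOMAP_MAPPED;
           iomap->offset = ip->ext_off;
           iomap->length = ip->ext_len;
           iomap->addr = ip->ext_addr;
           return 0;
       }

       /* Hole: no storage, so addr must be IOMAP_NULL_ADDR. */
       iomap->type = IOMAP_HOLE;
       iomap->addr = IOMAP_NULL_ADDR;
       iomap->offset = pos;
       iomap->length = length;
       if (pos < ip->ext_off && (u64)(ip->ext_off - pos) < length)
           iomap->length = (u64)(ip->ext_off - pos); /* stop at the extent */
       return 0;
   }

   /* A mapping is stale once the sequence counter has moved on. */
   static bool toy_iomap_valid(const struct toy_inode *ip,
                               const struct iomap *iomap)
   {
       return iomap->validity_cookie == ip->map_seq;
   }

The mapping returned for a mapped range may start before ``pos``, which
is fine as long as it covers the first requested byte.
A sequence counter is only one possible source for the cookie; the
filesystem is free to pick any value that changes when mappings change.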

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
   effort attempt to avoid any operation that would result in blocking
   the submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.

Both functions should return a negative errno code on error, or zero on
success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
It does not handle obtaining filesystem freeze protection, updating of
timestamps, stripping privileges, or access control.

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themself.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!