TOMOYO Linux Cross Reference
Linux/Documentation/filesystems/iomap/design.rst

Differences between /Documentation/filesystems/iomap/design.rst (Version linux-6.12-rc7) and /Documentation/filesystems/iomap/design.rst (Version linux-4.16.18)


  1 .. SPDX-License-Identifier: GPL-2.0               
  2 .. _iomap_design:                                 
  3                                                   
  4 ..                                                
  5         Dumb style notes to maintain the autho    
  6         Please try to start sentences on separ    
  7         sentence changes don't bleed colors in    
  8         Heading decorations are documented in     
  9                                                   
 10 ==============                                    
 11 Library Design                                    
 12 ==============                                    
 13                                                   
 14 .. contents:: Table of Contents                   
 15    :local:                                        
 16                                                   
 17 Introduction                                      
 18 ============                                      
 19                                                   
 20 iomap is a filesystem library for handling com    
 21 The library has two layers:                       
 22                                                   
 23  1. A lower layer that provides an iterator ov    
 24     This layer tries to obtain mappings of eac    
 25     from the filesystem, but the storage infor    
 26     required.                                     
 27                                                   
 28  2. An upper layer that acts upon the space ma    
 29     lower layer iterator.                         
 30                                                   
 31 The iteration can involve mappings of a file's l
 32 physical extents, but the storage layer inform    
 33 required, e.g. for walking cached file informa    
 34 The library exports various APIs for implement    
 35 as:                                               
 36                                                   
 37  * Pagecache reads and writes                     
 38  * Folio write faults to the pagecache            
 39  * Writeback of dirty folios                      
 40  * Direct I/O reads and writes                    
 41  * fsdax I/O reads, writes, loads, and stores     
 42  * FIEMAP                                         
 43  * lseek ``SEEK_DATA`` and ``SEEK_HOLE``          
 44  * swapfile activation                            
 45                                                   
 46 The origins of this library are the file I/O p
 47 has now been extended to cover several other o    
 48                                                   
 49 Who Should Read This?                             
 50 =====================                             
 51                                                   
 52 The target audience for this document is file
 53 pagecache programmers and code reviewers.         
 54                                                   
 55 If you are working on PCI, machine architectur    
 56 are most likely in the wrong place.               
 57                                                   
 58 How Is This Better?                               
 59 ===================                               
 60                                                   
 61 Unlike the classic Linux I/O model which break    
 62 units (generally memory pages or blocks) and l    
 63 the basis of that unit, the iomap model asks t    
 64 largest space mappings that it can create for     
 65 initiates operations on that basis.               
 66 This strategy improves the filesystem's visibi    
 67 operation being performed, which enables it to    
 68 larger space allocations when possible.           
 69 Larger space mappings improve runtime performa    
 70 of mapping function calls into the filesystem     
 71 data.                                             
 72                                                   
 73 At a high level, an iomap operation `looks lik    
 74 <https://lore.kernel.org/all/ZGbVaewzcCysclPt@d    
 75                                                   
 76 1. For each byte in the operation range...        
 77                                                   
 78    1. Obtain a space mapping via ``->iomap_beg    
 79                                                   
 80    2. For each sub-unit of work...                
 81                                                   
 82       1. Revalidate the mapping and go back to    
 83          So far only the pagecache operations     
 84                                                   
 85       2. Do the work                              
 86                                                   
 87    3. Increment operation cursor                  
 88                                                   
 89    4. Release the mapping via ``->iomap_end``,    
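
The numbered steps above can be sketched as a small userspace loop.
This is only an illustration under stated assumptions: every ``mock_*`` name is hypothetical, mappings are modelled as fixed-size aligned extents, and the "work" is a no-op.
It is not the kernel iterator, but it shows how revalidation sends the loop back to the mapping step without losing the cursor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct mock_iomap {
    uint64_t offset;          /* file offset where the mapping starts */
    uint64_t length;          /* number of bytes the mapping covers */
    uint64_t validity_cookie; /* freshness value sampled with the mapping */
};

struct mock_file {
    uint64_t extent_size; /* each mapping covers one aligned extent */
    uint64_t generation;  /* bumped whenever the mappings change */
    unsigned begin_calls; /* how often a mapping was obtained */
    unsigned end_calls;   /* how often a mapping was released */
};

/* Step 1.1: obtain the mapping that contains pos. */
static int mock_begin(struct mock_file *f, uint64_t pos, struct mock_iomap *m)
{
    m->offset = pos - pos % f->extent_size;
    m->length = f->extent_size;
    m->validity_cookie = f->generation;
    f->begin_calls++;
    return 0;
}

/* Step 1.4: release the mapping. */
static int mock_end(struct mock_file *f, const struct mock_iomap *m)
{
    (void)m;
    f->end_calls++;
    return 0;
}

/*
 * Walk [pos, pos + len) in sub-units of at most `unit` bytes.
 * change_at simulates a concurrent mapping change just before that byte is
 * processed; pass UINT64_MAX for "never".
 * Returns the number of bytes processed, or a negative error.
 */
static int64_t mock_iterate(struct mock_file *f, uint64_t pos, uint64_t len,
                            uint64_t unit, uint64_t change_at)
{
    const uint64_t start = pos, end = pos + len;

    assert(unit > 0 && f->extent_size > 0);
    while (pos < end) {
        struct mock_iomap m;
        uint64_t map_end, done = 0;
        int ret = mock_begin(f, pos, &m);

        if (ret)
            return ret;
        map_end = m.offset + m.length < end ? m.offset + m.length : end;

        /* Step 1.2: for each sub-unit of work... */
        while (pos + done < map_end) {
            uint64_t left = map_end - (pos + done);

            if (pos + done == change_at) {
                f->generation++; /* simulated concurrent change */
                change_at = UINT64_MAX;
            }
            /* Step 1.2.1: a stale mapping sends us back to step 1.1. */
            if (m.validity_cookie != f->generation)
                break;
            /* Step 1.2.2: do the work (a no-op in this sketch). */
            done += left < unit ? left : unit;
        }
        pos += done; /* Step 1.3: increment the operation cursor. */
        ret = mock_end(f, &m);
        if (ret)
            return ret;
    }
    return (int64_t)(pos - start);
}
```

Calling ``mock_iterate(&f, 1000, 10000, 512, UINT64_MAX)`` on a file with 4096-byte extents obtains three mappings; a simulated change at an extent boundary costs one extra begin/end pair and no progress is lost.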
 90                                                   
 91 Each iomap operation will be covered in more d    
 92 This library was covered previously by an `LWN    
 93 <https://lwn.net/Articles/935934/>`_ and a `Ke    
 94 <https://kernelnewbies.org/KernelProjects/ioma    
 95                                                   
 96 The goal of this document is to provide a brie    
 97 design and capabilities of iomap, followed by     
 98 of the interfaces presented by iomap.             
 99 If you change iomap, please update this design    
100                                                   
101 File Range Iterator                               
102 ===================                               
103                                                   
104 Definitions                                       
105 -----------                                       
106                                                   
107  * **buffer head**: Shattered remnants of the     
108                                                   
109  * ``fsblock``: The block size of a file, also    
110                                                   
111  * ``i_rwsem``: The VFS ``struct inode`` rwsem    
112    Processes hold this in shared mode to read     
113    Some filesystems may allow shared mode for     
114    Processes often hold this in exclusive mode    
115    contents.                                      
116                                                   
117  * ``invalidate_lock``: The pagecache ``struct    
118    rwsemaphore that protects against folio ins    
119    filesystems that support punching out folio    
120    Processes wishing to insert folios must hol    
121    mode to prevent removal, though concurrent     
122    Processes wishing to remove folios must hol    
123    mode to prevent insertions.                    
124    Concurrent removals are not allowed.           
125                                                   
126  * ``dax_read_lock``: The RCU read lock that d    
127    device pre-shutdown hook from returning bef    
128    released resources.                            
129                                                   
130  * **filesystem mapping lock**: This synchroni    
131    internal to the filesystem and must protect    
132    from updates while a mapping is being sampl    
133    The filesystem author must determine how th    
134    happen; it does not need to be an actual lo    
135                                                   
136  * **iomap internal operation lock**: This is     
137    synchronization primitives that iomap funct    
138    mapping.                                       
139    A specific example would be taking the foli    
140    writing the pagecache.                         
141                                                   
142  * **pure overwrite**: A write operation that     
143    metadata or zeroing operations to perform d    
144    or completion.                                 
145    This implies that the filesystem must have     
146    on disk as ``IOMAP_MAPPED`` and the filesys    
147    constraints on IO alignment or size.           
148    The only constraints on I/O alignment are d    
149    size and alignment, typically sector size).    
150                                                   
151 ``struct iomap``                                  
152 ----------------                                  
153                                                   
154 The filesystem communicates to the iomap itera    
155 byte ranges of a file to byte ranges of a stor    
156 structure below:                                  
157                                                   
.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };
173                                                   
174 The fields are as follows:                        
175                                                   
176  * ``offset`` and ``length`` describe the rang    
177    bytes, covered by this mapping.                
178    These fields must always be set by the file    
179                                                   
180  * ``type`` describes the type of the space ma    
181                                                   
182    * **IOMAP_HOLE**: No storage has been alloc    
183      This type must never be returned in respo    
184      operation because writes must allocate an    
185      the mapping.                                 
186      The ``addr`` field must be set to ``IOMAP    
187      iomap does not support writing (whether v    
188      I/O) to a hole.                              
189                                                   
190    * **IOMAP_DELALLOC**: A promise to allocate    
191      ("delayed allocation").                      
192      If the filesystem returns IOMAP_F_NEW her    
193      ``->iomap_end`` function must delete the     
194      The ``addr`` field must be set to ``IOMAP    
195                                                   
196    * **IOMAP_MAPPED**: The file range maps to     
197      storage device.                              
198      The device is returned in ``bdev`` or ``d    
199      The device address, in bytes, is returned    
200                                                   
201    * **IOMAP_UNWRITTEN**: The file range maps     
202      storage device, but the space has not yet    
203      The device is returned in ``bdev`` or ``d    
204      The device address, in bytes, is returned    
205      Reads from this type of mapping will retu    
206      For a write or writeback operation, the i    
207      mapping to MAPPED.                           
208      Refer to the sections about ioends for mo    
209                                                   
210    * **IOMAP_INLINE**: The file range maps to     
211      specified by ``inline_data``.                
212      For a write operation, the ``->iomap_end``
213      handles persisting the data.                 
214      The ``addr`` field must be set to ``IOMAP    
215                                                   
216  * ``flags`` describe the status of the space     
217    These flags should be set by the filesystem    
218                                                   
219    * **IOMAP_F_NEW**: The space under the mapp    
220      Areas that will not be written to must be    
221      If a write fails and the mapping is a spa    
222      reservation must be deleted.                 
223                                                   
224    * **IOMAP_F_DIRTY**: The inode will have un    
225      to access any data written.                  
226      fdatasync is required to commit these cha    
227      storage.                                     
228      This needs to take into account metadata     
229      at I/O completion, such as file size upda    
230                                                   
231    * **IOMAP_F_SHARED**: The space under the m    
232      Copy on write is necessary to avoid corru    
233                                                   
234    * **IOMAP_F_BUFFER_HEAD**: This mapping req    
235      heads for pagecache operations.              
236      Do not add more uses of this.                
237                                                   
238    * **IOMAP_F_MERGED**: Multiple contiguous b    
239      coalesced into this single mapping.          
240      This is only useful for FIEMAP.              
241                                                   
242    * **IOMAP_F_XATTR**: The mapping is for ext    
243      regular file data.                           
244      This is only useful for FIEMAP.              
245                                                   
246    * **IOMAP_F_PRIVATE**: Starting with this v    
247      be set by the filesystem for its own purp    
248                                                   
249    These flags can be set by iomap itself duri    
250    The filesystem should supply an ``->iomap_e    
251    to observe these flags:                        
252                                                   
253    * **IOMAP_F_SIZE_CHANGED**: The file size h    
254      using this mapping.                          
255                                                   
256    * **IOMAP_F_STALE**: The mapping was found     
257      iomap will call ``->iomap_end`` on this m    
258      ``->iomap_begin`` to obtain a new mapping    
259                                                   
260    Currently, these flags are only set by page    
261                                                   
262  * ``addr`` describes the device address, in b    
263                                                   
264  * ``bdev`` describes the block device for thi    
265    This only needs to be set for mapped or unw    
266                                                   
267  * ``dax_dev`` describes the DAX device for th    
268    This only needs to be set for mapped or unw    
269    only for a fsdax operation.                    
270                                                   
271  * ``inline_data`` points to a memory buffer f    
272    ``IOMAP_INLINE`` mappings.                     
273    This value is ignored for all other mapping    
274                                                   
275  * ``private`` is a pointer to `filesystem-pri    
276    <https://lore.kernel.org/all/20180619164137.    
277    This value will be passed unchanged to ``->    
278                                                   
279  * ``folio_ops`` will be covered in the sectio    
280                                                   
281  * ``validity_cookie`` is a magic freshness va    
282    that should be used to detect stale mapping    
283    For pagecache operations this is critical f    
284    because page faults can occur, which implie    
285    should not be held between ``->iomap_begin`    
286    Filesystems with completely static mappings    
287    Only pagecache operations revalidate mappin    
288    ``iomap_valid`` for details.                   
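
The per-type rules in the field list above can be collected into a single consistency check.
The following is a minimal userspace sketch, not kernel code: the ``mock_*`` names and ``MOCK_NULL_ADDR`` are hypothetical stand-ins for the mapping types and the sentinel address named in the text, and ``has_device`` stands in for "``bdev`` or ``dax_dev`` is set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mapping types described above. */
enum mock_type {
    MOCK_HOLE,
    MOCK_DELALLOC,
    MOCK_MAPPED,
    MOCK_UNWRITTEN,
    MOCK_INLINE,
};

/* Hypothetical sentinel meaning "no device address". */
#define MOCK_NULL_ADDR UINT64_MAX

struct mock_iomap {
    uint64_t addr;           /* device address in bytes, or the sentinel */
    uint64_t offset;         /* file offset of the mapping */
    uint64_t length;         /* bytes covered; must always be set */
    enum mock_type type;
    bool has_device;         /* stand-in for bdev or dax_dev being set */
    const void *inline_data; /* only meaningful for MOCK_INLINE */
};

/* Return true if the mapping obeys the per-type rules listed above. */
static bool mock_iomap_is_sane(const struct mock_iomap *m, bool for_write)
{
    assert(m != NULL);
    if (m->length == 0)
        return false;

    switch (m->type) {
    case MOCK_HOLE:
        /* A hole has no address and must never be returned for a write. */
        return !for_write && m->addr == MOCK_NULL_ADDR;
    case MOCK_DELALLOC:
        /* Only a promise to allocate later, so no address yet. */
        return m->addr == MOCK_NULL_ADDR;
    case MOCK_MAPPED:
    case MOCK_UNWRITTEN:
        /* Both need a device and a real device address. */
        return m->has_device && m->addr != MOCK_NULL_ADDR;
    case MOCK_INLINE:
        /* Data lives in a memory buffer rather than at a device address. */
        return m->inline_data != NULL && m->addr == MOCK_NULL_ADDR;
    }
    return false;
}
```

A filesystem author might run a check of this shape in debug builds at the end of ``->iomap_begin``; whether that is worthwhile is a design choice for the filesystem.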
289                                                   
290 ``struct iomap_ops``                              
291 --------------------                              
292                                                   
293 Every iomap function requires the filesystem t    
294 structure to obtain a mapping and (optionally)    
295                                                   
296 .. code-block:: c                                 
297                                                   
298  struct iomap_ops {                               
299      int (*iomap_begin)(struct inode *inode, l    
300                         unsigned flags, struct    
301                         struct iomap *srcmap);    
302                                                   
303      int (*iomap_end)(struct inode *inode, lof    
304                       ssize_t written, unsigne    
305                       struct iomap *iomap);       
306  };                                               
307                                                   
308 ``->iomap_begin``                                 
309 ~~~~~~~~~~~~~~~~~                                 
310                                                   
311 iomap operations call ``->iomap_begin`` to obt    
312 the range of bytes specified by ``pos`` and ``    
313 ``inode``.                                        
314 This mapping should be returned through the ``    
315 The mapping must cover at least the first byte    
316 range, but it does not need to cover the entir    
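
As a rough illustration of the coverage rule, the sketch below looks up a position in a sorted extent table and returns one mapping that contains the first requested byte, without trying to cover the whole range.
Everything here is an assumption made for the example: the ``mock_*`` types, the simplified argument list (no ``flags`` or ``srcmap``), and the choice to report gaps as holes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum mock_type { MOCK_HOLE, MOCK_MAPPED };

/* Hypothetical on-disk extent record: file range -> device address. */
struct mock_extent {
    uint64_t offset;
    uint64_t length;
    uint64_t addr;
};

/* Hypothetical inode with extents sorted by offset and not overlapping. */
struct mock_inode {
    const struct mock_extent *extents;
    size_t nr_extents;
};

struct mock_iomap {
    uint64_t offset;
    uint64_t length;
    uint64_t addr; /* only meaningful for MOCK_MAPPED in this sketch */
    enum mock_type type;
};

/*
 * Fill *iomap with one mapping that covers at least byte `pos`.
 * The mapping may start before pos and may end before pos + length.
 */
static int mock_iomap_begin(const struct mock_inode *inode, uint64_t pos,
                            uint64_t length, struct mock_iomap *iomap)
{
    uint64_t hole_start = 0, hole_end = pos + length;

    assert(inode != NULL && iomap != NULL);
    if (length == 0)
        return -EINVAL;

    for (size_t i = 0; i < inode->nr_extents; i++) {
        const struct mock_extent *e = &inode->extents[i];

        if (pos >= e->offset && pos < e->offset + e->length) {
            /* pos is inside this extent: return the whole extent. */
            iomap->offset = e->offset;
            iomap->length = e->length;
            iomap->addr = e->addr;
            iomap->type = MOCK_MAPPED;
            return 0;
        }
        if (e->offset + e->length <= pos)
            hole_start = e->offset + e->length; /* extent ends before pos */
        else if (e->offset < hole_end)
            hole_end = e->offset;               /* extent starts after pos */
    }

    /* No extent contains pos: report the surrounding gap as a hole. */
    iomap->offset = hole_start;
    iomap->length = hole_end - hole_start;
    iomap->addr = 0;
    iomap->type = MOCK_HOLE;
    return 0;
}
```

The caller is expected to call again at the end of the returned mapping if the requested range extends past it.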
317                                                   
318 Each iomap operation describes the requested o    
319 ``flags`` argument.                               
320 The exact value of ``flags`` will be documente    
321 operation-specific sections below.                
322 These flags can, at least in principle, apply     
323 operations:                                       
324                                                   
325  * ``IOMAP_DIRECT`` is set when the caller wis    
326    block storage.                                 
327                                                   
328  * ``IOMAP_DAX`` is set when the caller wishes    
329    memory-like storage.                           
330                                                   
331  * ``IOMAP_NOWAIT`` is set when the caller wis    
332    effort attempt to avoid any operation that     
333    the submitting task.                           
334    This is similar in intent to ``O_NONBLOCK``    
335    intended for asynchronous applications to k    
336    instead of waiting for the specific unavail    
337    to become available.                           
338    Filesystems implementing ``IOMAP_NOWAIT`` s    
339    trylock algorithms.                            
340    They need to be able to satisfy the entire     
341    single iomap mapping.                          
342    They need to avoid reading or writing metad    
343    They need to avoid blocking memory allocati    
344    They need to avoid waiting on transaction r    
345    modifications to take place.                   
346    They probably should not be allocating new     
347    And so on.                                     
348    If there is any doubt in the filesystem dev    
349    whether any specific ``IOMAP_NOWAIT`` opera    
350    then they should return ``-EAGAIN`` as earl    
351    start the operation and force the submittin    
352    ``IOMAP_NOWAIT`` is often set on behalf of     
353    ``RWF_NOWAIT``.                                
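
The ``IOMAP_NOWAIT`` guidance above can be pictured as "try, and bail out with ``-EAGAIN`` at the first sign of blocking".
The sketch below is a userspace approximation only: ``MOCK_NOWAIT``, ``struct mock_inode`` and its fields are hypothetical, and a C11 ``atomic_flag`` stands in for whatever mapping lock a real filesystem uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NOWAIT 0x1u /* hypothetical stand-in for IOMAP_NOWAIT */

struct mock_inode {
    atomic_flag map_lock;  /* stand-in for the filesystem mapping lock */
    bool metadata_cached;  /* false: sampling would have to read metadata */
    uint64_t mapped_bytes; /* bytes [0, mapped_bytes) form one mapping */
};

/* Returns 0 if the request may proceed, -EAGAIN if it might block. */
static int mock_begin(struct mock_inode *ino, uint64_t pos, uint64_t length,
                      unsigned flags)
{
    int ret = 0;

    assert(ino != NULL);
    if (flags & MOCK_NOWAIT) {
        /* Trylock: give up immediately if someone else holds the lock. */
        if (atomic_flag_test_and_set(&ino->map_lock))
            return -EAGAIN;
    } else {
        /* A blocking caller is allowed to wait for the lock. */
        while (atomic_flag_test_and_set(&ino->map_lock))
            ;
    }

    if (flags & MOCK_NOWAIT) {
        if (!ino->metadata_cached)
            ret = -EAGAIN; /* would have to read metadata */
        else if (pos + length > ino->mapped_bytes)
            ret = -EAGAIN; /* one mapping cannot satisfy the whole range */
    }

    atomic_flag_clear(&ino->map_lock);
    return ret;
}
```

Returning ``-EAGAIN`` before any state is modified keeps the fallback simple: the submitter can retry the same request from a context that is allowed to block.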
354                                                   
355 If it is necessary to read existing file conte    
356 <https://lore.kernel.org/all/20191008071527.293    
357 device or address range on a device, the files    
358 information via ``srcmap``.                       
359 Only pagecache and fsdax operations support re    
360 writing to another.                               
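
One way to picture the role of ``srcmap`` is a write to a shared extent, where existing contents are read from the old location and new contents go to a freshly allocated one.
The sketch below is a simplified userspace model: the ``mock_*`` types, the ``in_use`` marker, and the caller-supplied ``new_addr`` are assumptions for illustration, not the kernel's conventions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record; `shared` means other files reference it too. */
struct mock_extent {
    uint64_t offset;
    uint64_t length;
    uint64_t addr;
    bool shared;
};

struct mock_iomap {
    uint64_t offset;
    uint64_t length;
    uint64_t addr;
    bool in_use; /* hypothetical marker: false means "not filled in" */
};

/*
 * Prepare a write over extent e.
 * For a shared extent, writes go to new_addr (assumed to be already
 * allocated by the caller) and the old location is reported via srcmap.
 */
static int mock_begin_write(const struct mock_extent *e, uint64_t new_addr,
                            struct mock_iomap *iomap,
                            struct mock_iomap *srcmap)
{
    assert(e != NULL && iomap != NULL && srcmap != NULL);

    *srcmap = (struct mock_iomap){ 0 };
    iomap->offset = e->offset;
    iomap->length = e->length;
    iomap->in_use = true;

    if (e->shared) {
        /* Read existing contents from the shared blocks... */
        srcmap->offset = e->offset;
        srcmap->length = e->length;
        srcmap->addr = e->addr;
        srcmap->in_use = true;
        /* ...and write to the private copy. */
        iomap->addr = new_addr;
    } else {
        /* Reads and writes use the same location. */
        iomap->addr = e->addr;
    }
    return 0;
}

/* Pick the mapping that existing file contents should be read from. */
static const struct mock_iomap *
mock_read_source(const struct mock_iomap *iomap,
                 const struct mock_iomap *srcmap)
{
    return srcmap->in_use ? srcmap : iomap;
}
```

Keeping the read source and the write destination in two separate mappings lets the upper layer fill a partially written block from the old location before the new data lands.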
361                                                   
362 ``->iomap_end``                                   
363 ~~~~~~~~~~~~~~~                                   
364                                                   
365 After the operation completes, the ``->iomap_e    
366 is called to signal that iomap is finished wit    
367 Typically, implementations will use this funct    
368 context that were set up in ``->iomap_begin``.    
369 For example, a write might wish to commit the     
370 that were operated upon and unreserve any spac    
371 upon.                                             
372 ``written`` might be zero if no bytes were tou    
373 ``flags`` will contain the same value passed t    
374 iomap ops for reads are not likely to need to     
375                                                   
376 Both functions should return a negative errno     
377 success.                                          
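
The commit-and-unreserve example above could look roughly like the following userspace sketch.
All names (``mock_fs``, ``MOCK_WRITE``, ``MOCK_F_NEW``, the byte counters) are hypothetical, and the accounting is deliberately simplistic: a fresh delayed-allocation reservation is split into the part that was written and the part that is handed back.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_WRITE 0x1u /* hypothetical operation flag: this is a write */
#define MOCK_F_NEW 0x1u /* hypothetical mapping flag: newly reserved */

enum mock_type { MOCK_HOLE, MOCK_DELALLOC, MOCK_MAPPED };

struct mock_iomap {
    enum mock_type type;
    unsigned flags;
};

/* Hypothetical space accounting, in bytes. */
struct mock_fs {
    uint64_t reserved;   /* outstanding delayed-allocation reservations */
    uint64_t committed;  /* reservations backed by written data */
    uint64_t free_space; /* space available for new reservations */
};

/*
 * Called once iomap is finished with the mapping.
 * `written` is the number of bytes operated upon, or a negative errno.
 */
static int mock_iomap_end(struct mock_fs *fs, uint64_t pos, uint64_t length,
                          int64_t written, unsigned flags,
                          const struct mock_iomap *iomap)
{
    uint64_t done;

    (void)pos; /* a real filesystem would use pos to find the reservation */
    assert(fs != NULL && iomap != NULL);

    /* Reads set nothing up in this sketch, so there is nothing to undo. */
    if (!(flags & MOCK_WRITE))
        return 0;
    /* Only reservations created for this operation need settling. */
    if (iomap->type != MOCK_DELALLOC || !(iomap->flags & MOCK_F_NEW))
        return 0;

    done = written > 0 ? (uint64_t)written : 0;
    if (done > length || length > fs->reserved)
        return -EINVAL;

    fs->reserved -= length;          /* the reservation is now settled */
    fs->committed += done;           /* keep what was actually written */
    fs->free_space += length - done; /* hand back the untouched remainder */
    return 0;
}
```

A failed write (``written`` zero or negative) returns the entire reservation, which matches the rule that a fresh reservation must be deleted when the write does not happen.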
378                                                   
379 Preparing for File Operations                     
380 =============================                     
381                                                   
382 iomap only handles mapping and I/O.               
383 Filesystems must still call out to the VFS to     
384 and file state before initiating an I/O operat    
385 It does not handle obtaining filesystem freeze    
386 timestamps, stripping privileges, or access co    
387                                                   
388 Locking Hierarchy                                 
389 =================                                 
390                                                   
391 iomap requires that filesystems supply their o    
392 There are three categories of synchronization     
393 iomap is concerned:                               
394                                                   
395  * The **upper** level primitive is provided b    
396    coordinate access to different iomap operat    
397    The exact primitive is specific to the file    
398    but is often a VFS inode, pagecache invalid    
399    For example, a filesystem might take ``i_rw    
400    ``iomap_file_buffered_write`` and ``iomap_f    
401    these two file operations from clobbering e    
402    Pagecache writeback may lock a folio to pre    
403    accessing the folio until writeback is unde    
404                                                   
405  * The **lower** level primitive is taken by
406    ``->iomap_begin`` and ``->iomap_end`` fun
407    access to the file space mapping informat
408    The fields of the iomap object should be
409    this primitive.
410    The upper level synchronization primitive
411    while acquiring the lower level synchroni
412    For example, XFS takes ``ILOCK_EXCL`` and
413    while sampling mappings.
414    Filesystems with immutable mapping inform
415    synchronization here.
416
417  * The **operation** primitive is taken by a
418    coordinate access to its own internal dat
419    The upper level synchronization primitive
420    while acquiring this primitive.
421    The lower level primitive is not held whi
422    primitive.
423    For example, pagecache write operations w
424    then grab and lock a folio to copy new co
425    It may also lock an internal folio state
426                                                   
427 The exact locking requirements are specific to    
428 certain operations, some of these locks can be    
429 All further mentions of locking are *recommend    
430 Each filesystem author must figure out the loc    
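
The ordering between the three categories can be checked mechanically.
The sketch below is a toy lock-order tracker, not a locking implementation: the ``mock_*`` names are hypothetical, and the rule that the upper primitive is taken first is an assumption of the example that follows from the hierarchy described above.
It models a buffered write as upper, then lower while sampling the mapping, then the operation primitive with the lower one already dropped.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical identifiers for the three categories described above. */
enum mock_level {
    MOCK_UPPER = 1u << 0,     /* e.g. an inode-wide lock */
    MOCK_LOWER = 1u << 1,     /* filesystem mapping lock */
    MOCK_OPERATION = 1u << 2, /* e.g. a folio lock taken by iomap */
};

struct mock_lockstate {
    unsigned held; /* bitmask of levels currently held */
};

/* Record an acquisition, or return -EDEADLK if it breaks the ordering. */
static int mock_acquire(struct mock_lockstate *s, enum mock_level level)
{
    assert(s != NULL);
    if (s->held & level)
        return -EDEADLK;

    switch (level) {
    case MOCK_UPPER:
        /* Assumed: the upper primitive is always taken first. */
        if (s->held & (MOCK_LOWER | MOCK_OPERATION))
            return -EDEADLK;
        break;
    case MOCK_LOWER:
        if (s->held & MOCK_OPERATION)
            return -EDEADLK;
        break;
    case MOCK_OPERATION:
        /* The lower primitive is not held while taking this one. */
        if (s->held & MOCK_LOWER)
            return -EDEADLK;
        break;
    }
    s->held |= level;
    return 0;
}

static void mock_release(struct mock_lockstate *s, enum mock_level level)
{
    s->held &= ~(unsigned)level;
}

/* Model of a buffered write following the hierarchy. */
static int mock_buffered_write(struct mock_lockstate *s)
{
    int ret = mock_acquire(s, MOCK_UPPER);

    if (ret)
        return ret;
    ret = mock_acquire(s, MOCK_LOWER);
    if (!ret) {
        /* Sample the mapping here, then drop the lower primitive. */
        mock_release(s, MOCK_LOWER);
        ret = mock_acquire(s, MOCK_OPERATION);
    }
    if (!ret) {
        /* Copy new contents into the locked folio here. */
        mock_release(s, MOCK_OPERATION);
    }
    mock_release(s, MOCK_UPPER);
    return ret;
}
```

Because the upper primitive is optional for some operations, the tracker allows taking the lower or operation level with nothing else held.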
431                                                   
432 Bugs and Limitations                              
433 ====================                              
434                                                   
435  * No support for fscrypt.                        
436  * No support for compression.                    
437  * No support for fsverity yet.                   
438  * Strong assumptions that IO should work the     
439  * Does iomap *actually* work for non-regular     
440                                                   
441 Patches welcome!                                  
                                                      
