TOMOYO Linux Cross Reference
Linux/Documentation/filesystems/zonefs.rst


Differences between /Documentation/filesystems/zonefs.rst (Version linux-6.12-rc7) and /Documentation/filesystems/zonefs.rst (Version linux-5.6.19)


  1 .. SPDX-License-Identifier: GPL-2.0               
  2                                                   
  3 ==============================================    
  4 ZoneFS - Zone filesystem for Zoned block devic    
  5 ==============================================    
  6                                                   
  7 Introduction                                      
  8 ============                                      
  9                                                   
 10 zonefs is a very simple file system exposing e    
 11 as a file. Unlike a regular POSIX-compliant fi    
 12 device support (e.g. f2fs), zonefs does not hi    
 13 constraint of zoned block devices to the user.    
 14 write zones of the device must be written sequ    
 15 of the file (append only writes).                 
 16                                                   
 17 As such, zonefs is in essence closer to a raw     
 18 than to a full-featured POSIX file system. The    
 19 the implementation of zoned block device suppo    
 20 raw block device file accesses with a richer f    
 21 direct block device file ioctls which may be m    
 22 example of this approach is the implementation    
 23 tree structures (such as used in RocksDB and L    
 24 by allowing SSTables to be stored in a zone fi    
 25 system rather than as a range of sectors of th    
 26 of the higher level construct "one file is one    
 27 amount of changes needed in the application as    
 28 different application programming languages.      
 29                                                   
 30 Zoned block devices                               
 31 -------------------                               
 32                                                   
 33 Zoned storage devices belong to a class of sto    
 34 space that is divided into zones. A zone is a     
 35 zones are contiguous (there are no LBA gaps).     
 36                                                   
 37 * Conventional zones: there are no access cons    
 38   conventional zones. Any read or write access    
 39   regular block device.                           
 40 * Sequential zones: these zones accept random     
 41   sequentially. Each sequential zone has a wri    
 42   device that keeps track of the mandatory sta    
 43   to the device. As a result of this write con    
 44   cannot be overwritten. Sequential zones must    
 45   command (zone reset) before rewriting.          
 46                                                   
 47 Zoned storage devices can be implemented using    
 48 technologies. The most common form of zoned st    
 49 Block Commands (ZBC) and Zoned ATA Commands (Z    
 50 Magnetic Recording (SMR) HDDs.                    
 51                                                   
 52 Solid State Disks (SSD) storage devices can al    
 53 to, for instance, reduce internal write amplif    
 54 The NVMe Zoned NameSpace (ZNS) is a technical     
 55 committee aiming at adding a zoned storage int    
 56                                                   
 57 Zonefs Overview                                   
 58 ===============                                   
 59                                                   
 60 Zonefs exposes the zones of a zoned block devi    
 61 representing zones are grouped by zone type, w    
 62 by sub-directories. This file structure is bui    
 63 provided by the device and so does not require    
 64 structure.                                        
 65                                                   
 66 On-disk metadata                                  
 67 ----------------                                  
 68                                                   
 69 zonefs on-disk metadata is reduced to an immut    
 70 persistently stores a magic number and optiona    
 71 mount, zonefs uses blkdev_report_zones() to ob    
 72 and populates the mount point with a static fi    
 73 information. File sizes come from the device z    
 74 position managed by the device itself.            
 75                                                   
 76 The super block is always written on disk at s    
 77 device storing the super block is never expose    
 78 the zone containing the super block is a seque    
 79 tool always "finishes" the zone, that is, it t    
 80 state to make it read-only, preventing any dat    
 81                                                   
 82 Zone type sub-directories                         
 83 -------------------------                         
 84                                                   
 85 Files representing zones of the same type are     
 86 sub-directory automatically created on mount.     
 87                                                   
 88 For conventional zones, the sub-directory "cnv    
 89 however created if and only if the device has     
 90 the device only has a single conventional zone    
 91 be exposed as a file as it will be used to sto    
 92 such devices, the "cnv" sub-directory will not    
 93                                                   
 94 For sequential write zones, the sub-directory     
 95                                                   
 96 These two directories are the only directories    
 97 cannot create other directories and cannot ren    
 98 "seq" sub-directories.                            
 99                                                   
100 The size of the directories indicated by the s    
101 obtained with the stat() or fstat() system cal    
102 existing under the directory.                     
103                                                   
104 Zone files                                        
105 ----------                                        
106                                                   
107 Zone files are named using the number of the z    
108 of zones of a particular type. That is, both t    
109 contain files named "0", "1", "2", ... The fil    
110 increasing zone start sector on the device.       
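
The naming rule above can be illustrated with a minimal Python sketch. The `Zone` tuple and the `zone_file_paths()` helper are hypothetical names made up for this illustration and are not part of zonefs. The sketch does not model the zone that stores the super block, nor the conventional zone aggregation format option.

```python
from collections import namedtuple

# Hypothetical zone description used only for this illustration.
Zone = namedtuple("Zone", ["start_sector", "kind"])  # kind is "cnv" or "seq"


def zone_file_paths(zones):
    """Return (start_sector, "<kind>/<number>") pairs.

    Each zone type is numbered separately, starting from 0, in order of
    increasing zone start sector.
    """
    counters = {"cnv": 0, "seq": 0}
    paths = []
    for zone in sorted(zones, key=lambda z: z.start_sector):
        paths.append((zone.start_sector, f"{zone.kind}/{counters[zone.kind]}"))
        counters[zone.kind] += 1
    return paths


# Example layout (assumed values): two conventional zones, then two
# sequential zones, each 524288 sectors long.
example_zones = [
    Zone(1572864, "seq"),
    Zone(0, "cnv"),
    Zone(1048576, "seq"),
    Zone(524288, "cnv"),
]

for start, path in zone_file_paths(example_zones):
    print(start, path)
```

Both sub-directories restart their numbering at 0, so a file name alone does not identify a zone. The pair of sub-directory and file name does.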
111                                                   
112 All read and write operations to zone files ar    
113 maximum size, that is, beyond the zone capacit    
114 capacity is failed with the -EFBIG error.         
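
As a rough user-space illustration of this bound, the sketch below rejects any access that would extend past the zone capacity. The function name `check_zone_file_range()` is an assumption made for this example. It only models the rule stated above and is not the kernel implementation.

```python
import errno
import os


def check_zone_file_range(offset, length, zone_capacity):
    """Raise OSError(EFBIG) if [offset, offset + length) exceeds zone_capacity.

    All values are in bytes. This mirrors how a failed system call would
    surface in Python: as an OSError whose errno is EFBIG.
    """
    if offset + length > zone_capacity:
        raise OSError(errno.EFBIG, os.strerror(errno.EFBIG))


capacity = 268435456  # 256 MiB, the zone capacity used in the examples section

check_zone_file_range(0, 4096, capacity)  # within bounds, no error

try:
    check_zone_file_range(capacity - 4096, 8192, capacity)
except OSError as err:
    print("rejected:", errno.errorcode[err.errno])
```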
115                                                   
116 Creating, deleting, renaming or modifying any     
117 sub-directories is not allowed.                   
118                                                   
119 The number of blocks of a file as reported by     
120 capacity of the zone file, or in other words,     
121                                                   
122 Conventional zone files                           
123 -----------------------                           
124                                                   
125 The size of conventional zone files is fixed t    
126 represent. Conventional zone files cannot be t    
127                                                   
128 These files can be randomly read and written u    
129 buffered I/Os, direct I/Os, memory mapped I/Os    
130 constraint for these files beyond the file siz    
131                                                   
132 Sequential zone files                             
133 ---------------------                             
134                                                   
135 The size of sequential zone files grouped in t    
136 the file's zone write pointer position relativ    
137                                                   
138 Sequential zone files can only be written sequ    
139 end, that is, write operations can only be app    
140 attempt at accepting random writes and will fa    
141 start offset not corresponding to the end of t    
142 write issued and still in-flight (for asynchro    
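
The append-only rule can be pictured with a toy model. `SeqZoneFileModel` is a hypothetical class written for this illustration only. It raises a plain `ValueError` as a stand-in for the failure and does not model the error code returned by the kernel or asynchronous writes still in flight.

```python
class SeqZoneFileModel:
    """Toy user-space model of a sequential zone file (not kernel code)."""

    def __init__(self):
        # The file size tracks the zone write pointer position relative
        # to the start of the zone.
        self.size = 0

    def write(self, offset, data):
        # A write is accepted only if it starts exactly at the end of the file.
        if offset != self.size:
            raise ValueError(
                f"write at offset {offset} does not start at end of file ({self.size})"
            )
        self.size += len(data)
        return len(data)


zone_file = SeqZoneFileModel()
zone_file.write(0, b"\0" * 4096)        # append at offset 0: accepted
zone_file.write(4096, b"\0" * 4096)     # append at the new end: accepted

try:
    zone_file.write(0, b"\0" * 4096)    # overwrite attempt: rejected
except ValueError as err:
    print("rejected:", err)
```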
143                                                   
144 Since dirty page writeback by the page cache d    
145 write pattern, zonefs prevents buffered writes    
146 on sequential files. Only direct I/O writes ar    
147 zonefs relies on the sequential delivery of wr    
148 implemented by the block layer elevator. An el    
149 write feature for zoned block device (ELEVATOR    
150 must be used. This type of elevator (e.g. mq-d    
151 for zoned block devices on device initializati    
152                                                   
153 There are no restrictions on the type of I/O u    
154 sequential zone files. Buffered I/Os, direct I    
155 all accepted.                                     
156                                                   
157 Truncating sequential zone files is allowed on    
158 zone is reset to rewind the file zone write po    
159 the zone, or up to the zone capacity, in which    
160 transitioned to the FULL state (finish zone op    
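
A minimal sketch of the two truncation cases follows. The function name and the returned labels "reset" and "finish" are hypothetical names chosen for this illustration. Rejecting every other size with `ValueError` is an assumption of the sketch and does not model the error code returned by the kernel.

```python
def truncate_action(new_size, zone_capacity):
    """Return the zone operation implied by truncating a sequential zone file.

    Sizes are in bytes. Only two target sizes are modeled:
    0 (zone reset) and the zone capacity (zone finish).
    """
    if new_size == 0:
        return "reset"    # write pointer rewinds to the start of the zone
    if new_size == zone_capacity:
        return "finish"   # zone transitions to the FULL state
    raise ValueError(f"unsupported truncate size {new_size}")


capacity = 268435456  # zone capacity used in the examples section
print(truncate_action(0, capacity))
print(truncate_action(capacity, capacity))
```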
161                                                   
162 Format options                                    
163 --------------                                    
164                                                   
165 Several optional features of zonefs can be ena    
166                                                   
167 * Conventional zone aggregation: ranges of con    
168   aggregated into a single larger file instead    
169 * File ownership: The owner UID and GID of zon    
170   but can be changed to any valid UID/GID.        
171 * File access permissions: the default 640 acc    
172                                                   
173 IO error handling                                 
174 -----------------                                 
175                                                   
176 Zoned block devices may fail I/O requests for     
177 devices, e.g. due to bad sectors. However, in     
178 failure pattern, the standards governing zoned    
179 additional conditions that result in I/O error    
180                                                   
181 * A zone may transition to the read-only condi    
182   While the data already written in the zone i    
183   no longer be written. No user action on the     
184   read/write access) can change the zone condi    
185   state. While the reasons for the device to t    
186   state are not defined by the standards, a ty    
187   would be a defective write head on an HDD (a    
188   changed to read-only).                          
189                                                   
190 * A zone may transition to the offline conditi    
191   An offline zone cannot be read nor written.     
192   offline zone back to an operational good sta    
193   transitions, the reasons for a drive to tran    
194   condition are undefined. A typical cause wou    
195   on an HDD causing all zones on the platter u    
196   inaccessible.                                   
197                                                   
198 * Unaligned write errors: These errors result     
199   requests with a start sector that does not c    
200   position when the write request is executed     
201   enforces sequential file write for sequentia    
202   may still happen in the case of a partial fa    
203   operation split into multiple BIOs/requests     
204   If one of the write requests within the set o
205   issued to the device fails, all write reques    
206   become unaligned and fail.                      
207                                                   
208 * Delayed write errors: similarly to regular b    
209   write cache is enabled, write errors may occ    
210   completed writes when the device write cache    
211   Similarly to the previous immediate unaligne    
212   errors can propagate through a stream of cac    
213   causing all data to be dropped after the sec    
214                                                   
215 All I/O errors detected by zonefs are notified    
216 return for the system call that triggered or d    
217 actions taken by zonefs in response to I/O err    
218 vs write) and on the reason for the error (bad    
219 condition change).                                
220                                                   
221 * For read I/O errors, zonefs does not execute    
222   but only if the file zone is still in a good    
223   inconsistency between the file inode size an    
224   If a problem is detected, I/O error recovery    
225                                                   
226 * For write I/O errors, zonefs I/O error recov    
227                                                   
228 * A zone condition change to read-only or offl    
229   I/O error recovery.                             
230                                                   
231 Zonefs minimal I/O error recovery may change a    
232 permissions.                                      
233                                                   
234 * File size changes:                              
235   Immediate or delayed write errors in a seque    
236   inode size to be inconsistent with the amoun    
237   the file zone. For instance, the partial fai    
238   operation will cause the zone write pointer     
239   the entire write operation will be reported     
240   case, the file inode size must be advanced t    
241   change and eventually allow the user to rest    
242   file.                                           
243   A file size may also be reduced to reflect a    
244   fsync(): in this case, the amount of data ef    
245   be less than originally indicated by the fil    
246   error, zonefs always fixes the file inode si    
247   persistently stored in the file zone.           
248                                                   
249 * Access permission changes:                      
250   A zone condition change to read-only is indi    
251   access permissions to render the file read-o    
252   file attributes and data modification. For o    
253   (read and write) to the file are disabled.      
254                                                   
255 Further action taken by zonefs I/O error recov    
256 with the "errors=xxx" mount option. The table     
257 zonefs I/O error processing depending on the m    
258 conditions::                                      
259                                                   
260     +--------------+-----------+--------------    
261     |              |           |            Po    
262     | "errors=xxx" |  device   |                  
263     |    mount     |   zone    | file             
264     |    option    | condition | size     read    
265     +--------------+-----------+--------------    
266     |              | good      | fixed    yes     
267     | remount-ro   | read-only | as is    yes     
268     | (default)    | offline   |   0      no      
269     +--------------+-----------+--------------    
270     |              | good      | fixed    yes     
271     | zone-ro      | read-only | as is    yes     
272     |              | offline   |   0      no      
273     +--------------+-----------+--------------    
274     |              | good      |   0      no      
275     | zone-offline | read-only |   0      no      
276     |              | offline   |   0      no      
277     +--------------+-----------+--------------    
278     |              | good      | fixed    yes     
279     | repair       | read-only | as is    yes     
280     |              | offline   |   0      no      
281     +--------------+-----------+--------------    
282                                                   
283 Further notes:                                    
284                                                   
285 * The "errors=remount-ro" mount option is the     
286   error processing if no errors mount option i    
287 * With the "errors=remount-ro" mount option, t    
288   permissions to read-only applies to all file    
289   read-only.                                      
290 * Access permission and file size changes due     
291   to the offline condition are permanent. Remo    
292   with mkfs.zonefs (mkzonefs) will not change     
293   state.                                          
294 * File access permission changes to read-only     
295   zones to the read-only condition are permane    
296   the device will not re-enable file write acc    
297 * File access permission changes implied by th    
298   zone-offline mount options are temporary for    
299   Unmounting and remounting the file system wi    
300   (format time values) access rights to the fi    
301 * The repair mount option triggers only the mi    
302   actions, that is, file size fixes for zones     
303   indicated as being read-only or offline by t    
304   the zone file access permissions as noted in    
305                                                   
306 Mount options                                     
307 -------------                                     
308                                                   
309 zonefs defines several mount options:             
310 * errors=<behavior>                               
311 * explicit-open                                   
312                                                   
313 "errors=<behavior>" option                        
314 ~~~~~~~~~~~~~~~~~~~~~~~~~~                        
315                                                   
316 The "errors=<behavior>" mount option al
317 behavior in response to I/O errors, inode size    
318 condition changes. The defined behaviors are a    
319                                                   
320 * remount-ro (default)                            
321 * zone-ro                                         
322 * zone-offline                                    
323 * repair                                          
324                                                   
325 The run-time I/O error actions defined for eac    
326 previous section. Mount time I/O errors will c    
327 The handling of read-only zones also differs b    
328 If a read-only zone is found at mount time, th    
329 same manner as offline zones, that is, all acc    
330 file size set to 0. This is necessary as the w    
331 is defined as invalid by the ZBC and ZAC stand
332 discover the amount of data that has been writ    
333 read-only zone discovered at run-time, as indi    
334 The size of the zone file is left unchanged fr    
335                                                   
336 "explicit-open" option                            
337 ~~~~~~~~~~~~~~~~~~~~~~                            
338                                                   
339 A zoned block device (e.g. an NVMe Zoned Names    
340 the number of zones that can be active, that i    
341 implicit open, explicit open or closed conditi    
342 translates into a risk for applications to see    
343 limit being exceeded if the zone of a file is     
344 request is issued by the user.                    
345                                                   
346 To avoid these potential errors, the "explicit    
347 to be made active using an open zone command w    
348 for the first time. If the zone open command s    
349 guaranteed that write requests can be processe    
350 "explicit-open" mount option will result in a     
351 to the device on the last close() of a zone fi    
352 empty.                                            
353                                                   
354 Runtime sysfs attributes                          
355 ------------------------                          
356                                                   
357 zonefs defines several sysfs attributes for mo    
358 are user readable and can be found in the dire    
359 where <dev> is the name of the mounted zoned b    
360                                                   
361 The attributes defined are as follows.            
362                                                   
363 * **max_wro_seq_files**:  This attribute repor    
364   sequential zone files that can be open for w    
365   to the maximum number of explicitly or impli    
366   supports.  A value of 0 means that the devic    
367   (any file) can be open for writing and writt    
368   state of other zones.  When the *explicit-op    
369   will fail any open() system call requesting     
370   writing when the number of sequential zone f    
371   reached the *max_wro_seq_files* limit.          
372 * **nr_wro_seq_files**:  This attribute report    
373   zone files open for writing.  When the "expl    
374   this number can never exceed *max_wro_seq_fi    
375   mount option is not used, the reported numbe    
376   *max_wro_seq_files*.  In such a case, it is th
377   application to not write simultaneously more    
378   sequential zone files.  Failure to do so can    
379 * **max_active_seq_files**:  This attribute re    
380   sequential zone files that are in an active     
381   files that are partially written (not empty     
382   is explicitly open (which happens only if th    
383   used).  This number is always equal to the m    
384   the device supports.  A value of 0 means tha    
385   on the number of sequential zone files that     
386 * **nr_active_seq_files**:  This attribute re
387   sequential zone files that are active. If *m    
388   then the value of *nr_active_seq_files* can     
389   *max_active_seq_files*, regardless of the use
390   option.                                         
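
As a sketch of how an application might consume these attributes, the Python below reads the four attribute files from a directory supplied by the caller and applies the open-for-write limit described above. The helper names `read_zonefs_attrs()` and `can_open_for_write()` are assumptions made for this example. The sketch also assumes each attribute file holds a single decimal integer.

```python
import os

# Attribute names as listed above.
ZONEFS_ATTRS = (
    "max_wro_seq_files",
    "nr_wro_seq_files",
    "max_active_seq_files",
    "nr_active_seq_files",
)


def read_zonefs_attrs(attr_dir):
    """Return {attribute name: integer value} read from files in attr_dir."""
    values = {}
    for name in ZONEFS_ATTRS:
        with open(os.path.join(attr_dir, name)) as attr_file:
            values[name] = int(attr_file.read().strip())
    return values


def can_open_for_write(attrs):
    """Check the write-open limit; a max_wro_seq_files of 0 means no limit."""
    limit = attrs["max_wro_seq_files"]
    return limit == 0 or attrs["nr_wro_seq_files"] < limit
```

The caller passes the sysfs directory of the mounted device as `attr_dir`. Note that the check is only advisory: another process may open a file for writing between the read of the attributes and the application's own open() call.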
391                                                   
392 Zonefs User Space Tools                           
393 =======================                           
394                                                   
395 The mkzonefs tool is used to format zoned bloc    
396 This tool is available on GitHub at:
397                                                   
398 https://github.com/damien-lemoal/zonefs-tools     
399                                                   
400 zonefs-tools also includes a test suite which     
401 block device, including null_blk block device     
402                                                   
403 Examples                                          
404 --------                                          
405                                                   
406 The following formats a 15TB host-managed SMR     
407 with the conventional zones aggregation featur    
408                                                   
409     # mkzonefs -o aggr_cnv /dev/sdX               
410     # mount -t zonefs /dev/sdX /mnt               
411     # ls -l /mnt/                                 
412     total 0                                       
413     dr-xr-xr-x 2 root root     1 Nov 25 13:23     
414     dr-xr-xr-x 2 root root 55356 Nov 25 13:23     
415                                                   
416 The size of the zone files sub-directories ind    
417 existing for each type of zones. In this examp    
418 conventional zone file (all conventional zones    
419 file)::                                           
420                                                   
421     # ls -l /mnt/cnv                              
422     total 137101312                               
423     -rw-r----- 1 root root 140391743488 Nov 25    
424                                                   
425 This aggregated conventional zone file can be     
426                                                   
427     # mkfs.ext4 /mnt/cnv/0                        
428     # mount -o loop /mnt/cnv/0 /data              
429                                                   
430 The "seq" sub-directory grouping files for seq    
431 example 55356 zones::                             
432                                                   
433     # ls -lv /mnt/seq                             
434     total 14511243264                             
435     -rw-r----- 1 root root 0 Nov 25 13:23 0       
436     -rw-r----- 1 root root 0 Nov 25 13:23 1       
437     -rw-r----- 1 root root 0 Nov 25 13:23 2       
438     ...                                           
439     -rw-r----- 1 root root 0 Nov 25 13:23 5535    
440     -rw-r----- 1 root root 0 Nov 25 13:23 5535    
441                                                   
442 For sequential write zone files, the file size    
443 the end of the file, similarly to any regular     
444                                                   
445     # dd if=/dev/zero of=/mnt/seq/0 bs=4096 co    
446     1+0 records in                                
447     1+0 records out                               
448     4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000    
449                                                   
450     # ls -l /mnt/seq/0                            
451     -rw-r----- 1 root root 4096 Nov 25 13:23 /    
452                                                   
453 The written file can be truncated to the zone     
454 write operation::                                 
455                                                   
456     # truncate -s 268435456 /mnt/seq/0            
457     # ls -l /mnt/seq/0                            
458     -rw-r----- 1 root root 268435456 Nov 25 13    
459                                                   
460 Truncation to 0 size allows freeing the file z    
461 append-writes to the file::                       
462                                                   
463     # truncate -s 0 /mnt/seq/0                    
464     # ls -l /mnt/seq/0                            
465     -rw-r----- 1 root root 0 Nov 25 13:49 /mnt    
466                                                   
467 Since files are statically mapped to zones on     
468 of a file as reported by stat() and fstat() in    
469 zone::                                            
470                                                   
471     # stat /mnt/seq/0                             
472     File: /mnt/seq/0                              
473     Size: 0             Blocks: 524288     IO     
474     Device: 870h/2160d  Inode: 50431       Lin    
475     Access: (0640/-rw-r-----)  Uid: (    0/       
476     Access: 2019-11-25 13:23:57.048971997 +090    
477     Modify: 2019-11-25 13:52:25.553805765 +090    
478     Change: 2019-11-25 13:52:25.553805765 +090    
479     Birth: -                                      
480                                                   
481 The number of blocks of the file ("Blocks") in    
482 maximum file size of 524288 * 512 B = 256 MiB,
483 capacity in this example. Of note is that the     
484 indicates the minimum I/O size for writes and     
485 physical sector size.                             
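
The block count arithmetic above can be reproduced with a short Python sketch. The helper names are assumptions made for this illustration. The sketch relies on `st_blocks` being expressed in 512-byte units, as in the stat output shown above.

```python
import os

SECTOR_SIZE = 512  # st_blocks is counted in 512-byte units


def max_file_size_from_blocks(st_blocks):
    """Convert a block count reported by stat() into a size in bytes."""
    return st_blocks * SECTOR_SIZE


def zone_file_max_size(path):
    """Maximum size in bytes of the zone file at path, derived from stat()."""
    return max_file_size_from_blocks(os.stat(path).st_blocks)


# Values from the stat output above: Blocks = 524288.
print(max_file_size_from_blocks(524288))            # 268435456
print(max_file_size_from_blocks(524288) // 2**20)   # 256 (MiB)
```

The result, 268435456 bytes, matches the size used in the truncate example above.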
                                                      
