TOMOYO Linux Cross Reference
Linux/Documentation/filesystems/ext4/journal.rst

Diff markup

Differences between /Documentation/filesystems/ext4/journal.rst (Version linux-6.12-rc7) and /Documentation/filesystems/ext4/journal.rst (Version linux-4.20.17)


  1 .. SPDX-License-Identifier: GPL-2.0                 1 .. SPDX-License-Identifier: GPL-2.0
  2                                                     2 
  3 Journal (jbd2)                                      3 Journal (jbd2)
  4 --------------                                      4 --------------
  5                                                     5 
  6 Introduced in ext3, the ext4 filesystem employ      6 Introduced in ext3, the ext4 filesystem employs a journal to protect the
  7 filesystem against metadata inconsistencies in !!   7 filesystem against corruption in the case of a system crash. A small
  8 to 10,240,000 file system blocks (see man mke2 !!   8 continuous region of disk (default 128MiB) is reserved inside the
  9 size limits) can be reserved inside the filesy !!   9 filesystem as a place to land “important” data writes on-disk as quickly
 10 “important” data writes on-disk as quickly !!  10 as possible. Once the important data transaction is fully written to the
 11 data transaction is fully written to the disk  !!  11 disk and flushed from the disk write cache, a record of the data being
 12 cache, a record of the data being committed is !!  12 committed is also written to the journal. At some later point in time,
 13 some later point in time, the journal code wri !!  13 the journal code writes the transactions to their final locations on
 14 final locations on disk (this could involve a  !!  14 disk (this could involve a lot of seeking or a lot of small
 15 read-write-erases) before erasing the commit r     15 read-write-erases) before erasing the commit record. Should the system
 16 crash during the second slow write, the journa     16 crash during the second slow write, the journal can be replayed all the
 17 way to the latest commit record, guaranteeing      17 way to the latest commit record, guaranteeing the atomicity of whatever
 18 gets written through the journal to the disk.      18 gets written through the journal to the disk. The effect of this is to
 19 guarantee that the filesystem does not become      19 guarantee that the filesystem does not become stuck midway through a
 20 metadata update.                                   20 metadata update.
 21                                                    21 
 22 For performance reasons, ext4 by default only      22 For performance reasons, ext4 by default only writes filesystem metadata
 23 through the journal. This means that file data     23 through the journal. This means that file data blocks are /not/
 24 guaranteed to be in any consistent state after     24 guaranteed to be in any consistent state after a crash. If this default
 25 guarantee level (``data=ordered``) is not sati     25 guarantee level (``data=ordered``) is not satisfactory, there is a mount
 26 option to control journal behavior. If ``data=     26 option to control journal behavior. If ``data=journal``, all data and
 27 metadata are written to disk through the journ     27 metadata are written to disk through the journal. This is slower but
 28 safest. If ``data=writeback``, dirty data bloc     28 safest. If ``data=writeback``, dirty data blocks are not flushed to the
 29 disk before the metadata are written to disk t     29 disk before the metadata are written to disk through the journal.
 30                                                    30 
 31 In case of ``data=ordered`` mode, Ext4 also su << 
 32 help reduce commit latency significantly. The  << 
 33 mode works by logging metadata blocks to the j << 
 34 mode, Ext4 only stores the minimal delta neede << 
 35 affected metadata in fast commit space that is << 
 36 Once the fast commit area fills in or if fast  << 
 37 or if JBD2 commit timer goes off, Ext4 perform << 
 38 A full commit invalidates all the fast commits << 
 39 it and thus it makes the fast commit area empt << 
 40 commits. This feature needs to be enabled at m << 
 41                                                << 
 42 The journal inode is typically inode 8. The fi     31 The journal inode is typically inode 8. The first 68 bytes of the
 43 journal inode are replicated in the ext4 super     32 journal inode are replicated in the ext4 superblock. The journal itself
 44 is normal (but hidden) file within the filesys     33 is normal (but hidden) file within the filesystem. The file usually
 45 consumes an entire block group, though mke2fs      34 consumes an entire block group, though mke2fs tries to put it in the
 46 middle of the disk.                                35 middle of the disk.
 47                                                    36 
 48 All fields in jbd2 are written to disk in big-     37 All fields in jbd2 are written to disk in big-endian order. This is the
 49 opposite of ext4.                                  38 opposite of ext4.
 50                                                    39 
 51 NOTE: Both ext4 and ocfs2 use jbd2.                40 NOTE: Both ext4 and ocfs2 use jbd2.
 52                                                    41 
 53 The maximum size of a journal embedded in an e     42 The maximum size of a journal embedded in an ext4 filesystem is 2^32
 54 blocks. jbd2 itself does not seem to care.         43 blocks. jbd2 itself does not seem to care.
 55                                                    44 
 56 Layout                                             45 Layout
 57 ~~~~~~                                             46 ~~~~~~
 58                                                    47 
 59 Generally speaking, the journal has this forma     48 Generally speaking, the journal has this format:
 60                                                    49 
 61 .. list-table::                                    50 .. list-table::
 62    :widths: 16 48 16                               51    :widths: 16 48 16
 63    :header-rows: 1                                 52    :header-rows: 1
 64                                                    53 
 65    * - Superblock                                  54    * - Superblock
 66      - descriptor_block (data_blocks or revoca !!  55      - descriptor\_block (data\_blocks or revocation\_block) [more data or
 67        revocations] commmit_block              !!  56        revocations] commmit\_block
 68      - [more transactions...]                      57      - [more transactions...]
 69    * -                                             58    * - 
 70      - One transaction                             59      - One transaction
 71      -                                             60      -
 72                                                    61 
 73 Notice that a transaction begins with either a     62 Notice that a transaction begins with either a descriptor and some data,
 74 or a block revocation list. A finished transac     63 or a block revocation list. A finished transaction always ends with a
 75 commit. If there is no commit record (or the c     64 commit. If there is no commit record (or the checksums don't match), the
 76 transaction will be discarded during replay.       65 transaction will be discarded during replay.
 77                                                    66 
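The replay rule described above (a transaction counts only if it ends with a commit record) can be sketched roughly as follows. This is a simplified illustration, not the jbd2 recovery code: the input format, a list of ``(h_blocktype, h_sequence)`` pairs in log order, and the function name are assumptions made for the example, and checksum validation is left out entirely. The block type values 1, 2 and 5 are the ones listed in the block type table of this document.

```python
# Block type values as listed in the journal block type table.
DESCRIPTOR = 1
COMMIT = 2
REVOKE = 5


def committed_transactions(blocks):
    """Return the transaction IDs that end with a commit block.

    `blocks` is a hypothetical, already-parsed view of the log: an iterable
    of (h_blocktype, h_sequence) pairs in on-disk order. Data blocks are
    omitted because they carry no journal header.
    """
    committed = []
    open_sequence = None  # transaction currently being read, if any
    for blocktype, sequence in blocks:
        if blocktype in (DESCRIPTOR, REVOKE):
            if open_sequence is None:
                open_sequence = sequence
            elif sequence != open_sequence:
                # The previous transaction never committed: stop scanning.
                break
        elif blocktype == COMMIT:
            if sequence != open_sequence:
                break
            committed.append(sequence)
            open_sequence = None
        else:
            # Superblocks or unknown types are not expected inside the log.
            break
    return committed
```

A transaction that was still being written at crash time has no commit block, so it never reaches the returned list and would be discarded.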
 78 External Journal                                   67 External Journal
 79 ~~~~~~~~~~~~~~~~                                   68 ~~~~~~~~~~~~~~~~
 80                                                    69 
 81 Optionally, an ext4 filesystem can be created      70 Optionally, an ext4 filesystem can be created with an external journal
 82 device (as opposed to an internal journal, whi     71 device (as opposed to an internal journal, which uses a reserved inode).
 83 In this case, on the filesystem device, ``s_jo     72 In this case, on the filesystem device, ``s_journal_inum`` should be
 84 zero and ``s_journal_uuid`` should be set. On      73 zero and ``s_journal_uuid`` should be set. On the journal device there
 85 will be an ext4 super block in the usual place     74 will be an ext4 super block in the usual place, with a matching UUID.
 86 The journal superblock will be in the next ful     75 The journal superblock will be in the next full block after the
 87 superblock.                                        76 superblock.
 88                                                    77 
 89 .. list-table::                                    78 .. list-table::
 90    :widths: 12 12 12 32 12                         79    :widths: 12 12 12 32 12
 91    :header-rows: 1                                 80    :header-rows: 1
 92                                                    81 
 93    * - 1024 bytes of padding                       82    * - 1024 bytes of padding
 94      - ext4 Superblock                             83      - ext4 Superblock
 95      - Journal Superblock                          84      - Journal Superblock
 96      - descriptor_block (data_blocks or revoca !!  85      - descriptor\_block (data\_blocks or revocation\_block) [more data or
 97        revocations] commmit_block              !!  86        revocations] commmit\_block
 98      - [more transactions...]                      87      - [more transactions...]
 99    * -                                             88    * - 
100      -                                             89      -
101      -                                             90      -
102      - One transaction                             91      - One transaction
103      -                                             92      -
104                                                    93 
105 Block Header                                       94 Block Header
106 ~~~~~~~~~~~~                                       95 ~~~~~~~~~~~~
107                                                    96 
108 Every block in the journal starts with a commo     97 Every block in the journal starts with a common 12-byte header
109 ``struct journal_header_s``:                       98 ``struct journal_header_s``:
110                                                    99 
111 .. list-table::                                   100 .. list-table::
112    :widths: 8 8 24 40                             101    :widths: 8 8 24 40
113    :header-rows: 1                                102    :header-rows: 1
114                                                   103 
115    * - Offset                                     104    * - Offset
116      - Type                                       105      - Type
117      - Name                                       106      - Name
118      - Description                                107      - Description
119    * - 0x0                                        108    * - 0x0
120      - __be32                                  !! 109      - \_\_be32
121      - h_magic                                 !! 110      - h\_magic
122      - jbd2 magic number, 0xC03B3998.             111      - jbd2 magic number, 0xC03B3998.
123    * - 0x4                                        112    * - 0x4
124      - __be32                                  !! 113      - \_\_be32
125      - h_blocktype                             !! 114      - h\_blocktype
126      - Description of what this block contains    115      - Description of what this block contains. See the jbd2_blocktype_ table
127        below.                                     116        below.
128    * - 0x8                                        117    * - 0x8
129      - __be32                                  !! 118      - \_\_be32
130      - h_sequence                              !! 119      - h\_sequence
131      - The transaction ID that goes with this     120      - The transaction ID that goes with this block.
132                                                   121 
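As a minimal sketch of how the 12-byte header in the table above could be decoded, the following uses Python's ``struct`` module with big-endian fields, as the document states for jbd2. The function name and the returned dictionary are illustrative choices, not part of any kernel or e2fsprogs interface.

```python
import struct

JBD2_MAGIC = 0xC03B3998

# Three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
HEADER = struct.Struct(">III")


def parse_journal_header(block):
    """Decode the common header at the start of a journal block.

    Returns None when the magic number does not match, which is what a
    plain data block would normally look like.
    """
    if len(block) < HEADER.size:
        raise ValueError("block shorter than the 12-byte journal header")
    magic, blocktype, sequence = HEADER.unpack_from(block, 0)
    if magic != JBD2_MAGIC:
        return None
    return {"h_blocktype": blocktype, "h_sequence": sequence}
```

For example, a block whose first twelve bytes encode the magic, block type 2 and sequence 41 would be reported as a commit record for transaction 41.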
133 .. _jbd2_blocktype:                               122 .. _jbd2_blocktype:
134                                                   123 
135 The journal block type can be any one of:         124 The journal block type can be any one of:
136                                                   125 
137 .. list-table::                                   126 .. list-table::
138    :widths: 16 64                                 127    :widths: 16 64
139    :header-rows: 1                                128    :header-rows: 1
140                                                   129 
141    * - Value                                      130    * - Value
142      - Description                                131      - Description
143    * - 1                                          132    * - 1
144      - Descriptor. This block precedes a serie    133      - Descriptor. This block precedes a series of data blocks that were
145        written through the journal during a tr    134        written through the journal during a transaction.
146    * - 2                                          135    * - 2
147      - Block commit record. This block signifi    136      - Block commit record. This block signifies the completion of a
148        transaction.                               137        transaction.
149    * - 3                                          138    * - 3
150      - Journal superblock, v1.                    139      - Journal superblock, v1.
151    * - 4                                          140    * - 4
152      - Journal superblock, v2.                    141      - Journal superblock, v2.
153    * - 5                                          142    * - 5
154      - Block revocation records. This speeds u    143      - Block revocation records. This speeds up recovery by enabling the
155        journal to skip writing blocks that wer    144        journal to skip writing blocks that were subsequently rewritten.
156                                                   145 
157 Super Block                                       146 Super Block
158 ~~~~~~~~~~~                                       147 ~~~~~~~~~~~
159                                                   148 
160 The super block for the journal is much simple    149 The super block for the journal is much simpler as compared to ext4's.
161 The key data kept within are size of the journ    150 The key data kept within are size of the journal, and where to find the
162 start of the log of transactions.                 151 start of the log of transactions.
163                                                   152 
164 The journal superblock is recorded as ``struct    153 The journal superblock is recorded as ``struct journal_superblock_s``,
165 which is 1024 bytes long:                         154 which is 1024 bytes long:
166                                                   155 
167 .. list-table::                                   156 .. list-table::
168    :widths: 8 8 24 40                             157    :widths: 8 8 24 40
169    :header-rows: 1                                158    :header-rows: 1
170                                                   159 
171    * - Offset                                     160    * - Offset
172      - Type                                       161      - Type
173      - Name                                       162      - Name
174      - Description                                163      - Description
175    * -                                            164    * -
176      -                                            165      -
177      -                                            166      -
178      - Static information describing the journ    167      - Static information describing the journal.
179    * - 0x0                                        168    * - 0x0
180      - journal_header_t (12 bytes)             !! 169      - journal\_header\_t (12 bytes)
181      - s_header                                !! 170      - s\_header
182      - Common header identifying this as a sup    171      - Common header identifying this as a superblock.
183    * - 0xC                                        172    * - 0xC
184      - __be32                                  !! 173      - \_\_be32
185      - s_blocksize                             !! 174      - s\_blocksize
186      - Journal device block size.                 175      - Journal device block size.
187    * - 0x10                                       176    * - 0x10
188      - __be32                                  !! 177      - \_\_be32
189      - s_maxlen                                !! 178      - s\_maxlen
190      - Total number of blocks in this journal.    179      - Total number of blocks in this journal.
191    * - 0x14                                       180    * - 0x14
192      - __be32                                  !! 181      - \_\_be32
193      - s_first                                 !! 182      - s\_first
194      - First block of log information.            183      - First block of log information.
195    * -                                            184    * -
196      -                                            185      -
197      -                                            186      -
198      - Dynamic information describing the curr    187      - Dynamic information describing the current state of the log.
199    * - 0x18                                       188    * - 0x18
200      - __be32                                  !! 189      - \_\_be32
201      - s_sequence                              !! 190      - s\_sequence
202      - First commit ID expected in log.           191      - First commit ID expected in log.
203    * - 0x1C                                       192    * - 0x1C
204      - __be32                                  !! 193      - \_\_be32
205      - s_start                                 !! 194      - s\_start
206      - Block number of the start of log. Contr    195      - Block number of the start of log. Contrary to the comments, this field
207        being zero does not imply that the jour    196        being zero does not imply that the journal is clean!
208    * - 0x20                                       197    * - 0x20
209      - __be32                                  !! 198      - \_\_be32
210      - s_errno                                 !! 199      - s\_errno
211      - Error value, as set by jbd2_journal_abo !! 200      - Error value, as set by jbd2\_journal\_abort().
212    * -                                            201    * -
213      -                                            202      -
214      -                                            203      -
215      - The remaining fields are only valid in     204      - The remaining fields are only valid in a v2 superblock.
216    * - 0x24                                       205    * - 0x24
217      - __be32                                  !! 206      - \_\_be32
218      - s_feature_compat;                       !! 207      - s\_feature\_compat;
219      - Compatible feature set. See the table j    208      - Compatible feature set. See the table jbd2_compat_ below.
220    * - 0x28                                       209    * - 0x28
221      - __be32                                  !! 210      - \_\_be32
222      - s_feature_incompat                      !! 211      - s\_feature\_incompat
223      - Incompatible feature set. See the table    212      - Incompatible feature set. See the table jbd2_incompat_ below.
224    * - 0x2C                                       213    * - 0x2C
225      - __be32                                  !! 214      - \_\_be32
226      - s_feature_ro_compat                     !! 215      - s\_feature\_ro\_compat
227      - Read-only compatible feature set. There    216      - Read-only compatible feature set. There aren't any of these currently.
228    * - 0x30                                       217    * - 0x30
229      - __u8                                    !! 218      - \_\_u8
230      - s_uuid[16]                              !! 219      - s\_uuid[16]
231      - 128-bit uuid for journal. This is compa    220      - 128-bit uuid for journal. This is compared against the copy in the ext4
232        super block at mount time.                 221        super block at mount time.
233    * - 0x40                                       222    * - 0x40
234      - __be32                                  !! 223      - \_\_be32
235      - s_nr_users                              !! 224      - s\_nr\_users
236      - Number of file systems sharing this jou    225      - Number of file systems sharing this journal.
237    * - 0x44                                       226    * - 0x44
238      - __be32                                  !! 227      - \_\_be32
239      - s_dynsuper                              !! 228      - s\_dynsuper
240      - Location of dynamic super block copy. (    229      - Location of dynamic super block copy. (Not used?)
241    * - 0x48                                       230    * - 0x48
242      - __be32                                  !! 231      - \_\_be32
243      - s_max_transaction                       !! 232      - s\_max\_transaction
244      - Limit of journal blocks per transaction    233      - Limit of journal blocks per transaction. (Not used?)
245    * - 0x4C                                       234    * - 0x4C
246      - __be32                                  !! 235      - \_\_be32
247      - s_max_trans_data                        !! 236      - s\_max\_trans\_data
248      - Limit of data blocks per transaction. (    237      - Limit of data blocks per transaction. (Not used?)
249    * - 0x50                                       238    * - 0x50
250      - __u8                                    !! 239      - \_\_u8
251      - s_checksum_type                         !! 240      - s\_checksum\_type
252      - Checksum algorithm used for the journal    241      - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
253        more info.                                 242        more info.
254    * - 0x51                                       243    * - 0x51
255      - __u8[3]                                 !! 244      - \_\_u8[3]
256      - s_padding2                              !! 245      - s\_padding2
257      -                                            246      -
258    * - 0x54                                       247    * - 0x54
259      - __be32                                  !! 248      - \_\_u32
260      - s_num_fc_blocks                         !! 249      - s\_padding[42]
261      - Number of fast commit blocks in the jou << 
262    * - 0x58                                    << 
263      - __be32                                  << 
264      - s_head                                  << 
265      - Block number of the head (first unused  << 
266        up-to-date when the journal is empty.   << 
267    * - 0x5C                                    << 
268      - __u32                                   << 
269      - s_padding[40]                           << 
270      -                                            250      -
271    * - 0xFC                                       251    * - 0xFC
272      - __be32                                  !! 252      - \_\_be32
273      - s_checksum                              !! 253      - s\_checksum
274      - Checksum of the entire superblock, with    254      - Checksum of the entire superblock, with this field set to zero.
275    * - 0x100                                      255    * - 0x100
276      - __u8                                    !! 256      - \_\_u8
277      - s_users[16*48]                          !! 257      - s\_users[16\*48]
278      - ids of all file systems sharing the log    258      - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
279        shared external journals, but I imagine    259        shared external journals, but I imagine Lustre (or ocfs2?), which use
280        the jbd2 code, might.                      260        the jbd2 code, might.
281                                                   261 
282 .. _jbd2_compat:                                  262 .. _jbd2_compat:
283                                                   263 
284 The journal compat features are any combinatio    264 The journal compat features are any combination of the following:
285                                                   265 
286 .. list-table::                                   266 .. list-table::
287    :widths: 16 64                                 267    :widths: 16 64
288    :header-rows: 1                                268    :header-rows: 1
289                                                   269 
290    * - Value                                      270    * - Value
291      - Description                                271      - Description
292    * - 0x1                                        272    * - 0x1
293      - Journal maintains checksums on the data    273      - Journal maintains checksums on the data blocks.
294        (JBD2_FEATURE_COMPAT_CHECKSUM)          !! 274        (JBD2\_FEATURE\_COMPAT\_CHECKSUM)
295                                                   275 
296 .. _jbd2_incompat:                                276 .. _jbd2_incompat:
297                                                   277 
298 The journal incompat features are any combinat    278 The journal incompat features are any combination of the following:
299                                                   279 
300 .. list-table::                                   280 .. list-table::
301    :widths: 16 64                                 281    :widths: 16 64
302    :header-rows: 1                                282    :header-rows: 1
303                                                   283 
304    * - Value                                      284    * - Value
305      - Description                                285      - Description
306    * - 0x1                                        286    * - 0x1
307      - Journal has block revocation records. ( !! 287      - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
308    * - 0x2                                        288    * - 0x2
309      - Journal can deal with 64-bit block numb    289      - Journal can deal with 64-bit block numbers.
310        (JBD2_FEATURE_INCOMPAT_64BIT)           !! 290        (JBD2\_FEATURE\_INCOMPAT\_64BIT)
311    * - 0x4                                        291    * - 0x4
312      - Journal commits asynchronously. (JBD2_F !! 292      - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
313    * - 0x8                                        293    * - 0x8
314      - This journal uses v2 of the checksum on    294      - This journal uses v2 of the checksum on-disk format. Each journal
315        metadata block gets its own checksum, a    295        metadata block gets its own checksum, and the block tags in the
316        descriptor table contain checksums for     296        descriptor table contain checksums for each of the data blocks in the
317        journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2 !! 297        journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
318    * - 0x10                                       298    * - 0x10
319      - This journal uses v3 of the checksum on    299      - This journal uses v3 of the checksum on-disk format. This is the same as
320        v2, but the journal block tag size is f    300        v2, but the journal block tag size is fixed regardless of the size of
321        block numbers. (JBD2_FEATURE_INCOMPAT_C !! 301        block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
322    * - 0x20                                    << 
323      - Journal has fast commit blocks. (JBD2_F << 
324                                                   302 
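Since ``s_feature_incompat`` is a bitmask, the table above maps naturally onto a small lookup. The sketch below is illustrative only: the short names for 0x1 through 0x10 are the suffixes of the macros listed in the table, while the 0x20 entry appears only in the newer version of the document and is given a descriptive label here rather than a macro name. The function name is an assumption.

```python
# Bit value -> short label, following the incompat feature table.
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
    0x20: "fast commit blocks",  # descriptive label, newer version only
}

KNOWN_MASK = 0
for _bit in INCOMPAT_FLAGS:
    KNOWN_MASK |= _bit


def decode_incompat(value):
    """Split an s_feature_incompat value into known labels and leftover bits."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    unknown = value & ~KNOWN_MASK
    return names, unknown
```

A reader that finds any leftover bits set would normally refuse to use the journal, since incompat features are by definition ones it must understand.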
325 .. _jbd2_checksum_type:                           303 .. _jbd2_checksum_type:
326                                                   304 
327 Journal checksum type codes are one of the fol    305 Journal checksum type codes are one of the following.  crc32 or crc32c are the
328 most likely choices.                              306 most likely choices.
329                                                   307 
330 .. list-table::                                   308 .. list-table::
331    :widths: 16 64                                 309    :widths: 16 64
332    :header-rows: 1                                310    :header-rows: 1
333                                                   311 
334    * - Value                                      312    * - Value
335      - Description                                313      - Description
336    * - 1                                          314    * - 1
337      - CRC32                                      315      - CRC32
338    * - 2                                          316    * - 2
339      - MD5                                        317      - MD5
340    * - 3                                          318    * - 3
341      - SHA1                                       319      - SHA1
342    * - 4                                          320    * - 4
343      - CRC32C                                     321      - CRC32C
344                                                   322 
345 Descriptor Block                                  323 Descriptor Block
346 ~~~~~~~~~~~~~~~~                                  324 ~~~~~~~~~~~~~~~~
347                                                   325 
348 The descriptor block contains an array of jour    326 The descriptor block contains an array of journal block tags that
349 describe the final locations of the data block    327 describe the final locations of the data blocks that follow in the
350 journal. Descriptor blocks are open-coded inst    328 journal. Descriptor blocks are open-coded instead of being completely
351 described by a data structure, but here is the    329 described by a data structure, but here is the block structure anyway.
352 Descriptor blocks consume at least 36 bytes, b    330 Descriptor blocks consume at least 36 bytes, but use a full block:
353                                                   331 
354 .. list-table::                                   332 .. list-table::
355    :widths: 8 8 24 40                             333    :widths: 8 8 24 40
356    :header-rows: 1                                334    :header-rows: 1
357                                                   335 
358    * - Offset                                     336    * - Offset
359      - Type                                       337      - Type
360      - Name                                       338      - Name
361      - Descriptor                                 339      - Descriptor
362    * - 0x0                                        340    * - 0x0
363      - journal_header_t                        !! 341      - journal\_header\_t
364      - (open coded)                               342      - (open coded)
365      - Common block header.                       343      - Common block header.
366    * - 0xC                                        344    * - 0xC
367      - struct journal_block_tag_s              !! 345      - struct journal\_block\_tag\_s
368      - open coded array[]                         346      - open coded array[]
369      - Enough tags either to fill up the block    347      - Enough tags either to fill up the block or to describe all the data
370        blocks that follow this descriptor bloc    348        blocks that follow this descriptor block.
371                                                   349 
372 Journal block tags have any of the following f    350 Journal block tags have any of the following formats, depending on which
373 journal feature and block tag flags are set.      351 journal feature and block tag flags are set.
374                                                   352 
375 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the j !! 353 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
376 defined as ``struct journal_block_tag3_s``, wh    354 defined as ``struct journal_block_tag3_s``, which looks like the
377 following. The size is 16 or 32 bytes.            355 following. The size is 16 or 32 bytes.
378                                                   356 
379 .. list-table::                                   357 .. list-table::
380    :widths: 8 8 24 40                             358    :widths: 8 8 24 40
381    :header-rows: 1                                359    :header-rows: 1
382                                                   360 
383    * - Offset                                     361    * - Offset
384      - Type                                       362      - Type
385      - Name                                       363      - Name
386      - Descriptor                                 364      - Descriptor
387    * - 0x0                                        365    * - 0x0
388      - __be32                                  !! 366      - \_\_be32
389      - t_blocknr                               !! 367      - t\_blocknr
390      - Lower 32-bits of the location of where     368      - Lower 32-bits of the location of where the corresponding data block
391        should end up on disk.                     369        should end up on disk.
392    * - 0x4                                        370    * - 0x4
393      - __be32                                  !! 371      - \_\_be32
394      - t_flags                                 !! 372      - t\_flags
395      - Flags that go with the descriptor. See     373      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
396        more info.                                 374        more info.
397    * - 0x8                                        375    * - 0x8
398      - __be32                                  !! 376      - \_\_be32
399      - t_blocknr_high                          !! 377      - t\_blocknr\_high
400      - Upper 32-bits of the location of where     378      - Upper 32-bits of the location of where the corresponding data block
401        should end up on disk. This is zero if  !! 379        should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
402        not enabled.                               380        not enabled.
403    * - 0xC                                        381    * - 0xC
404      - __be32                                  !! 382      - \_\_be32
405      - t_checksum                              !! 383      - t\_checksum
406      - Checksum of the journal UUID, the seque    384      - Checksum of the journal UUID, the sequence number, and the data block.
407    * -                                            385    * -
408      -                                            386      -
409      -                                            387      -
410      - This field appears to be open coded. It    388      - This field appears to be open coded. It always comes at the end of the
411        tag, after t_checksum. This field is no    389        tag, after t_checksum. This field is not present if the "same UUID" flag
412        is set.                                    390        is set.
413    * - 0x10                                       391    * - 0x10
414      - char                                       392      - char
415      - uuid[16]                                   393      - uuid[16]
416      - A UUID to go with this tag. This field     394      - A UUID to go with this tag. This field appears to be copied from the
417        ``j_uuid`` field in ``struct journal_s`    395        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
418        field.                                     396        field.
419                                                   397 
420 .. _jbd2_tag_flags:                               398 .. _jbd2_tag_flags:
421                                                   399 
422 The journal tag flags are any combination of t    400 The journal tag flags are any combination of the following:
423                                                   401 
424 .. list-table::                                   402 .. list-table::
425    :widths: 16 64                                 403    :widths: 16 64
426    :header-rows: 1                                404    :header-rows: 1
427                                                   405 
428    * - Value                                      406    * - Value
429      - Description                                407      - Description
430    * - 0x1                                        408    * - 0x1
431      - On-disk block is escaped. The first fou    409      - On-disk block is escaped. The first four bytes of the data block just
432        happened to match the jbd2 magic number    410        happened to match the jbd2 magic number.
433    * - 0x2                                        411    * - 0x2
434      - This block has the same UUID as previou    412      - This block has the same UUID as previous, therefore the UUID field is
435        omitted.                                   413        omitted.
436    * - 0x4                                        414    * - 0x4
437      - The data block was deleted by the trans    415      - The data block was deleted by the transaction. (Not used?)
438    * - 0x8                                        416    * - 0x8
439      - This is the last tag in this descriptor    417      - This is the last tag in this descriptor block.
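
The tag3 layout and the flag bits above can be exercised with a short decoder. This is a minimal sketch, not kernel code: the function and constant names are made up for illustration, the UUID is assumed to follow ``t_checksum`` directly (as the text says), and the buffer is assumed to start at the first tag rather than at the block header.

```python
import struct

# Flag values from the tag flags table; the constant names are illustrative.
FLAG_ESCAPED = 0x1
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8


def parse_tag3(buf, offset=0):
    """Decode one tag in the CSUM_V3 layout; return (tag, next_offset)."""
    # Four big-endian 32-bit fields: t_blocknr, t_flags, t_blocknr_high, t_checksum.
    lo, flags, hi, checksum = struct.unpack_from(">IIII", buf, offset)
    offset += 16
    uuid = None
    if not flags & FLAG_SAME_UUID:
        # The 16-byte UUID is only present when the "same UUID" flag is clear.
        uuid = bytes(buf[offset:offset + 16])
        offset += 16
    tag = {
        "blocknr": (hi << 32) | lo,
        "flags": flags,
        "checksum": checksum,
        "uuid": uuid,
        "escaped": bool(flags & FLAG_ESCAPED),
        "last": bool(flags & FLAG_LAST_TAG),
    }
    return tag, offset


def walk_tags(buf, offset=0):
    """Collect tags until one carries the "last tag" flag."""
    tags = []
    while True:
        tag, offset = parse_tag3(buf, offset)
        tags.append(tag)
        if tag["last"]:
            return tags
```

A tag without the UUID consumes 16 bytes and one with it consumes 32, which matches the two sizes quoted for this format.
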
440                                                   418 
441 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, t !! 419 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
442 is defined as ``struct journal_block_tag_s``,     420 is defined as ``struct journal_block_tag_s``, which looks like the
443 following. The size is 8, 12, 24, or 28 bytes:    421 following. The size is 8, 12, 24, or 28 bytes:
444                                                   422 
445 .. list-table::                                   423 .. list-table::
446    :widths: 8 8 24 40                             424    :widths: 8 8 24 40
447    :header-rows: 1                                425    :header-rows: 1
448                                                   426 
449    * - Offset                                     427    * - Offset
450      - Type                                       428      - Type
451      - Name                                       429      - Name
452      - Descriptor                                 430      - Descriptor
453    * - 0x0                                        431    * - 0x0
454      - __be32                                  !! 432      - \_\_be32
455      - t_blocknr                               !! 433      - t\_blocknr
456      - Lower 32-bits of the location of where     434      - Lower 32-bits of the location of where the corresponding data block
457        should end up on disk.                     435        should end up on disk.
458    * - 0x4                                        436    * - 0x4
459      - __be16                                  !! 437      - \_\_be16
460      - t_checksum                              !! 438      - t\_checksum
461      - Checksum of the journal UUID, the seque    439      - Checksum of the journal UUID, the sequence number, and the data block.
462        Note that only the lower 16 bits are st    440        Note that only the lower 16 bits are stored.
463    * - 0x6                                        441    * - 0x6
464      - __be16                                  !! 442      - \_\_be16
465      - t_flags                                 !! 443      - t\_flags
466      - Flags that go with the descriptor. See     444      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
467        more info.                                 445        more info.
468    * -                                            446    * -
469      -                                            447      -
470      -                                            448      -
471      - This next field is only present if the     449      - This next field is only present if the super block indicates support for
472        64-bit block numbers.                      450        64-bit block numbers.
473    * - 0x8                                        451    * - 0x8
474      - __be32                                  !! 452      - \_\_be32
475      - t_blocknr_high                          !! 453      - t\_blocknr\_high
476      - Upper 32-bits of the location of where     454      - Upper 32-bits of the location of where the corresponding data block
477        should end up on disk.                     455        should end up on disk.
478    * -                                            456    * -
479      -                                            457      -
480      -                                            458      -
481      - This field appears to be open coded. It    459      - This field appears to be open coded. It always comes at the end of the
482        tag, after t_flags or t_blocknr_high. T    460        tag, after t_flags or t_blocknr_high. This field is not present if the
483        "same UUID" flag is set.                   461        "same UUID" flag is set.
484    * - 0x8 or 0xC                                 462    * - 0x8 or 0xC
485      - char                                       463      - char
486      - uuid[16]                                   464      - uuid[16]
487      - A UUID to go with this tag. This field     465      - A UUID to go with this tag. This field appears to be copied from the
488        ``j_uuid`` field in ``struct journal_s`    466        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
489        field.                                     467        field.
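
The tag sizes quoted for the two formats follow from which optional fields are present. The helper below is a sketch that reproduces only the sizes stated in the tables of this document; the function name and its boolean parameters are hypothetical, and it does not model any feature the tables do not mention.

```python
def tag_bytes(csum_v3, has_64bit, same_uuid):
    """Return the on-disk size of one journal block tag, per the tables above."""
    if csum_v3:
        # tag3: t_blocknr, t_flags, t_blocknr_high, t_checksum (4 bytes each).
        size = 16
    else:
        # tag: t_blocknr (4) + t_checksum (2) + t_flags (2),
        # plus t_blocknr_high (4) when 64-bit block numbers are in use.
        size = 12 if has_64bit else 8
    if not same_uuid:
        # The trailing UUID is omitted when the "same UUID" flag is set.
        size += 16
    return size
```

Enumerating the inputs yields 16 or 32 bytes with CSUM_V3, and 8, 12, 24, or 28 bytes without it.
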
490                                                   468 
491 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or            !! 469 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
492 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end !! 470 JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
493 ``struct jbd2_journal_block_tail``, which look    471 ``struct jbd2_journal_block_tail``, which looks like this:
494                                                   472 
495 .. list-table::                                   473 .. list-table::
496    :widths: 8 8 24 40                             474    :widths: 8 8 24 40
497    :header-rows: 1                                475    :header-rows: 1
498                                                   476 
499    * - Offset                                     477    * - Offset
500      - Type                                       478      - Type
501      - Name                                       479      - Name
502      - Descriptor                                 480      - Descriptor
503    * - 0x0                                        481    * - 0x0
504      - __be32                                  !! 482      - \_\_be32
505      - t_checksum                              !! 483      - t\_checksum
506      - Checksum of the journal UUID + the desc    484      - Checksum of the journal UUID + the descriptor block, with this field set
507        to zero.                                   485        to zero.
508                                                   486 
509 Data Block                                        487 Data Block
510 ~~~~~~~~~~                                        488 ~~~~~~~~~~
511                                                   489 
512 In general, the data blocks being written to d    490 In general, the data blocks being written to disk through the journal
513 are written verbatim into the journal file aft    491 are written verbatim into the journal file after the descriptor block.
514 However, if the first four bytes of the block     492 However, if the first four bytes of the block match the jbd2 magic
515 number then those four bytes are replaced with    493 number then those four bytes are replaced with zeroes and the “escaped”
516 flag is set in the descriptor block tag.          494 flag is set in the descriptor block tag.
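
The escaping rule can be sketched as a pair of helpers. This is an illustration only: the function names are invented here, and the magic value 0xC03B3998 is the jbd2 magic number stored big-endian in the common block header.

```python
import struct

JBD2_MAGIC = 0xC03B3998
MAGIC_BYTES = struct.pack(">I", JBD2_MAGIC)


def escape_block(data):
    """Return (journal_copy, escaped) for a data block about to be logged."""
    if data[:4] == MAGIC_BYTES:
        # Zero the first four bytes so the block cannot be mistaken for a
        # journal metadata block; the tag must carry the "escaped" flag.
        return b"\x00\x00\x00\x00" + data[4:], True
    return data, False


def unescape_block(data, escaped):
    """Undo escape_block() during replay, using the flag from the tag."""
    if escaped:
        return MAGIC_BYTES + data[4:]
    return data
```

Because the replaced bytes are known to be the magic number, the flag alone is enough to restore the original block on replay.
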
517                                                   495 
518 Revocation Block                                  496 Revocation Block
519 ~~~~~~~~~~~~~~~~                                  497 ~~~~~~~~~~~~~~~~
520                                                   498 
521 A revocation block is used to prevent replay o    499 A revocation block is used to prevent replay of a block in an earlier
522 transaction. This is used to mark blocks that     500 transaction. This is used to mark blocks that were journalled at one
523 time but are no longer journalled. Typically t    501 time but are no longer journalled. Typically this happens if a metadata
524 block is freed and re-allocated as a file data    502 block is freed and re-allocated as a file data block; in this case, a
525 journal replay after the file block was writte    503 journal replay after the file block was written to disk will cause
526 corruption.                                       504 corruption.
527                                                   505 
528 **NOTE**: This mechanism is NOT used to expres    506 **NOTE**: This mechanism is NOT used to express “this journal block is
529 superseded by this other journal block”, as     507 superseded by this other journal block”, as the author (djwong)
530 mistakenly thought. Any block being added to a    508 mistakenly thought. Any block being added to a transaction will cause
531 the removal of all existing revocation records    509 the removal of all existing revocation records for that block.
532                                                   510 
533 Revocation blocks are described in                511 Revocation blocks are described in
534 ``struct jbd2_journal_revoke_header_s``, are a    512 ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
535 length, but use a full block:                     513 length, but use a full block:
536                                                   514 
537 .. list-table::                                   515 .. list-table::
538    :widths: 8 8 24 40                             516    :widths: 8 8 24 40
539    :header-rows: 1                                517    :header-rows: 1
540                                                   518 
541    * - Offset                                     519    * - Offset
542      - Type                                       520      - Type
543      - Name                                       521      - Name
544      - Description                                522      - Description
545    * - 0x0                                        523    * - 0x0
546      - journal_header_t                        !! 524      - journal\_header\_t
547      - r_header                                !! 525      - r\_header
548      - Common block header.                       526      - Common block header.
549    * - 0xC                                        527    * - 0xC
550      - __be32                                  !! 528      - \_\_be32
551      - r_count                                 !! 529      - r\_count
552      - Number of bytes used in this block.        530      - Number of bytes used in this block.
553    * - 0x10                                       531    * - 0x10
554      - __be32 or __be64                        !! 532      - \_\_be32 or \_\_be64
555      - blocks[0]                                  533      - blocks[0]
556      - Blocks to revoke.                          534      - Blocks to revoke.
557                                                   535 
558 After r_count is a linear array of block numbe !! 536 After r\_count is a linear array of block numbers that are effectively
559 revoked by this transaction. The size of each     537 revoked by this transaction. The size of each block number is 8 bytes if
560 the superblock advertises 64-bit block number     538 the superblock advertises 64-bit block number support, or 4 bytes
561 otherwise.                                        539 otherwise.
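
Reading the revoked block numbers out of a revocation block could look like the sketch below. It assumes that ``r_count`` counts bytes from the start of the block, header included, and it ignores the checksum tail; the function name and the ``has_64bit`` parameter are hypothetical.

```python
import struct


def parse_revoke_block(block, has_64bit):
    """Return (sequence, [revoked block numbers]) for one revocation block."""
    # Common header (h_magic, h_blocktype, h_sequence) followed by r_count.
    magic, blocktype, sequence, r_count = struct.unpack_from(">IIII", block, 0)
    entry_fmt, entry_size = (">Q", 8) if has_64bit else (">I", 4)
    revoked = []
    offset = 0x10  # the array starts right after r_count
    while offset + entry_size <= r_count:
        (blocknr,) = struct.unpack_from(entry_fmt, block, offset)
        revoked.append(blocknr)
        offset += entry_size
    return sequence, revoked
```

The same bytes decode differently depending on the 64-bit feature, so the caller has to consult the journal superblock before parsing.
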
562                                                   540 
563 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or            !! 541 If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
564 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end !! 542 JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
565 block is a ``struct jbd2_journal_revoke_tail``    543 block is a ``struct jbd2_journal_revoke_tail``, which has this format:
566                                                   544 
567 .. list-table::                                   545 .. list-table::
568    :widths: 8 8 24 40                             546    :widths: 8 8 24 40
569    :header-rows: 1                                547    :header-rows: 1
570                                                   548 
571    * - Offset                                     549    * - Offset
572      - Type                                       550      - Type
573      - Name                                       551      - Name
574      - Description                                552      - Description
575    * - 0x0                                        553    * - 0x0
576      - __be32                                  !! 554      - \_\_be32
577      - r_checksum                              !! 555      - r\_checksum
578      - Checksum of the journal UUID + revocati    556      - Checksum of the journal UUID + revocation block
579                                                   557 
580 Commit Block                                      558 Commit Block
581 ~~~~~~~~~~~~                                      559 ~~~~~~~~~~~~
582                                                   560 
583 The commit block is a sentry that indicates th    561 The commit block is a sentry that indicates that a transaction has been
584 completely written to the journal. Once this c    562 completely written to the journal. Once this commit block reaches the
585 journal, the data stored with this transaction    563 journal, the data stored with this transaction can be written to their
586 final locations on disk.                          564 final locations on disk.
587                                                   565 
588 The commit block is described by ``struct comm    566 The commit block is described by ``struct commit_header``, which is 32
589 bytes long (but uses a full block):               567 bytes long (but uses a full block):
590                                                   568 
591 .. list-table::                                   569 .. list-table::
592    :widths: 8 8 24 40                             570    :widths: 8 8 24 40
593    :header-rows: 1                                571    :header-rows: 1
594                                                   572 
595    * - Offset                                     573    * - Offset
596      - Type                                       574      - Type
597      - Name                                       575      - Name
598      - Descriptor                                 576      - Descriptor
599    * - 0x0                                        577    * - 0x0
600      - journal_header_s                        !! 578      - journal\_header\_s
601      - (open coded)                               579      - (open coded)
602      - Common block header.                       580      - Common block header.
603    * - 0xC                                        581    * - 0xC
604      - unsigned char                              582      - unsigned char
605      - h_chksum_type                           !! 583      - h\_chksum\_type
606      - The type of checksum to use to verify t    584      - The type of checksum to use to verify the integrity of the data blocks
607        in the transaction. See jbd2_checksum_t    585        in the transaction. See jbd2_checksum_type_ for more info.
608    * - 0xD                                        586    * - 0xD
609      - unsigned char                              587      - unsigned char
610      - h_chksum_size                           !! 588      - h\_chksum\_size
611      - The number of bytes used by the checksu    589      - The number of bytes used by the checksum. Most likely 4.
612    * - 0xE                                        590    * - 0xE
613      - unsigned char                              591      - unsigned char
614      - h_padding[2]                            !! 592      - h\_padding[2]
615      -                                            593      -
616    * - 0x10                                       594    * - 0x10
617      - __be32                                  !! 595      - \_\_be32
618      - h_chksum[JBD2_CHECKSUM_BYTES]           !! 596      - h\_chksum[JBD2\_CHECKSUM\_BYTES]
619      - 32 bytes of space to store checksums. I    597      - 32 bytes of space to store checksums. If
620        JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_F !! 598        JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
621        are set, the first ``__be32`` is the ch    599        are set, the first ``__be32`` is the checksum of the journal UUID and
622        the entire commit block, with this fiel    600        the entire commit block, with this field zeroed. If
623        JBD2_FEATURE_COMPAT_CHECKSUM is set, th !! 601        JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
624        crc32 of all the blocks already written    602        crc32 of all the blocks already written to the transaction.
625    * - 0x30                                       603    * - 0x30
626      - __be64                                  !! 604      - \_\_be64
627      - h_commit_sec                            !! 605      - h\_commit\_sec
628      - The time that the transaction was commi    606      - The time that the transaction was committed, in seconds since the epoch.
629    * - 0x38                                       607    * - 0x38
630      - __be32                                  !! 608      - \_\_be32
631      - h_commit_nsec                           !! 609      - h\_commit\_nsec
632      - Nanoseconds component of the above time    610      - Nanoseconds component of the above timestamp.
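
The commit block table maps directly onto a fixed big-endian layout. The sketch below decodes it with Python's ``struct`` module; the format string and function name are illustrative, not part of jbd2. Note that the fields listed in the table span 0x3C (60) bytes when laid out at the given offsets.

```python
import struct

# h_magic, h_blocktype, h_sequence | h_chksum_type, h_chksum_size,
# h_padding[2] | h_chksum (32 bytes) | h_commit_sec, h_commit_nsec
COMMIT_FMT = ">III BB2x 32s QI"


def parse_commit_block(block):
    """Decode the leading fields of a commit block into a dict."""
    (magic, blocktype, sequence, chksum_type, chksum_size,
     chksum, commit_sec, commit_nsec) = struct.unpack_from(COMMIT_FMT, block, 0)
    # Only the first __be32 of the checksum area is described in the table.
    (first_chksum,) = struct.unpack_from(">I", chksum, 0)
    return {
        "sequence": sequence,
        "chksum_type": chksum_type,
        "chksum_size": chksum_size,
        "chksum": first_chksum,
        "commit_sec": commit_sec,
        "commit_nsec": commit_nsec,
    }
```
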
633                                                   611 
634 Fast commits                                   << 
635 ~~~~~~~~~~~~                                   << 
636                                                << 
637 Fast commit area is organized as a log of tag  << 
638 a ``struct ext4_fc_tl`` in the beginning which << 
639 of the entire field. It is followed by variabl << 
640 Here is the list of supported tags and their m << 
641                                                << 
642 .. list-table::                                << 
643    :widths: 8 20 20 32                         << 
644    :header-rows: 1                             << 
645                                                << 
646    * - Tag                                     << 
647      - Meaning                                 << 
648      - Value struct                            << 
649      - Description                             << 
650    * - EXT4_FC_TAG_HEAD                        << 
651      - Fast commit area header                 << 
652      - ``struct ext4_fc_head``                 << 
653      - Stores the TID of the transaction after << 
654        be applied.                             << 
655    * - EXT4_FC_TAG_ADD_RANGE                   << 
656      - Add extent to inode                     << 
657      - ``struct ext4_fc_add_range``            << 
658      - Stores the inode number and extent to b << 
659    * - EXT4_FC_TAG_DEL_RANGE                   << 
660      - Remove logical offsets to inode         << 
661      - ``struct ext4_fc_del_range``            << 
662      - Stores the inode number and the logical << 
663        removed                                 << 
664    * - EXT4_FC_TAG_CREAT                       << 
665      - Create directory entry for a newly crea << 
666      - ``struct ext4_fc_dentry_info``          << 
667      - Stores the parent inode number, inode n << 
668        newly created file                      << 
669    * - EXT4_FC_TAG_LINK                        << 
670      - Link a directory entry to an inode      << 
671      - ``struct ext4_fc_dentry_info``          << 
672      - Stores the parent inode number, inode n << 
673    * - EXT4_FC_TAG_UNLINK                      << 
674      - Unlink a directory entry of an inode    << 
675      - ``struct ext4_fc_dentry_info``          << 
676      - Stores the parent inode number, inode n << 
677                                                << 
678    * - EXT4_FC_TAG_PAD                         << 
679      - Padding (unused area)                   << 
680      - None                                    << 
681      - Unused bytes in the fast commit area.   << 
682                                                << 
683    * - EXT4_FC_TAG_TAIL                        << 
684      - Mark the end of a fast commit           << 
685      - ``struct ext4_fc_tail``                 << 
686      - Stores the TID of the commit, CRC of th << 
687        represents the end of                   << 
688                                                << 
689 Fast Commit Replay Idempotence                 << 
690 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                 << 
691                                                << 
692 Fast commit tags are idempotent in nature pro  << 
693 certain rules. The guiding principle that the  << 
694 committing is that it stores the result of a p << 
695 storing the procedure.                         << 
696                                                << 
697 Let's consider this rename operation: 'mv /a / << 
698 was associated with inode 10. During fast comm << 
699 operation as a procedure "rename a to b", we s << 
700 state as a "series" of outcomes:               << 
701                                                << 
702 - Link dirent b to inode 10                    << 
703 - Unlink dirent a                              << 
704 - Inode 10 with valid refcount                 << 
705                                                << 
706 Now when recovery code runs, it needs "enforce << 
707 system. This is what guarantees idempotence of << 
708                                                << 
709 Let's take an example of a procedure that is n << 
710 commits make it idempotent. Consider following << 
711                                                << 
712 1) rm A                                        << 
713 2) mv B A                                      << 
714 3) read A                                      << 
715                                                << 
716 If we store this sequence of operations as is  << 
717 Let's say while in replay, we crash after (2). << 
718 file A (which was actually created as a result << 
719 deleted. Thus, file named A would be absent wh << 
720 sequence of operations is not idempotent. Howe << 
721 of storing the procedure fast commits store th << 
722 the fast commit log for above procedure would  << 
723                                                << 
724 (Let's assume dirent A was linked to inode 10  << 
725 inode 11 before the replay)                    << 
726                                                << 
727 1) Unlink A                                    << 
728 2) Link A to inode 11                          << 
729 3) Unlink B                                    << 
730 4) Inode 11                                    << 
731                                                << 
732 If we crash after (3) we will have file A link << 
733 replay, we will remove file A (inode 11). But  << 
734 it point to inode 11. We won't find B, so we'l << 
735 point, the refcount for inode 11 is not reliab << 
736 replay of last inode 11 tag. Thus, by converti << 
737 into a series of idempotent outcomes, fast com << 
738 the replay.                                    << 
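
The idempotence argument can be tried out on a toy model. The sketch below is not the ext4 replay code: it represents a directory as a dict from name to inode number, uses made-up tag tuples in place of the real fast commit tags, and treats the inode tag as "set the recorded link count" rather than anything the kernel actually stores.

```python
def replay(dirents, refcounts, log):
    """Apply a log of outcome tags; every tag states a result, not a step."""
    for tag in log:
        kind = tag[0]
        if kind == "unlink":
            # Removing an already-missing dirent is a no-op.
            dirents.pop(tag[1], None)
        elif kind == "link":
            # Overwrite whatever the name currently points to.
            dirents[tag[1]] = tag[2]
        elif kind == "inode":
            # Store the recorded state instead of incrementing a counter.
            refcounts[tag[1]] = tag[2]
    return dirents, refcounts


# Outcome log for "rm A; mv B A", with A on inode 10 and B on inode 11.
LOG = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]
```

Replaying ``LOG`` from the original state, from a state where a previous replay stopped after the third tag, or a second time over its own result all end with ``A`` pointing at inode 11 and no ``B``.
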
739                                                << 
740 Journal Checkpoint                             << 
741 ~~~~~~~~~~~~~~~~~~                             << 
742                                                << 
743 Checkpointing the journal ensures all transact << 
744 are submitted to the disk. In-progress transac << 
745 in the checkpoint. Checkpointing is used inter << 
746 the filesystem including journal recovery, fil << 
747 the journal_t structure.                       << 
748                                                << 
749 A journal checkpoint can be triggered from use << 
750 EXT4_IOC_CHECKPOINT. This ioctl takes a single << 
751 Currently, three flags are supported. First, E << 
752 can be used to verify input to the ioctl. It r << 
753 invalid input, otherwise it returns success wi << 
754 any checkpointing. This can be used to check w << 
755 system and to verify there are no issues with  << 
756 other two flags are EXT4_IOC_CHECKPOINT_FLAG_D << 
757 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags  << 
758 discarded or zero-filled, respectively, after  << 
759 complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and << 
760 cannot both be set. The ioctl may be useful wh << 
761 complying with content deletion SLOs.          << 
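
The rule that the discard and zero-out flags are mutually exclusive can be checked before issuing the ioctl. The sketch below only validates a flag word; the numeric values 0x1 and 0x2 are assumptions made for illustration and should be taken from the kernel headers in real code, and the function name is hypothetical.

```python
# Assumed values for illustration; consult the kernel headers for the real ones.
EXT4_IOC_CHECKPOINT_FLAG_DISCARD = 0x1
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT = 0x2


def checkpoint_flags_ok(flags):
    """Return False when both discard and zero-out are requested together."""
    both = EXT4_IOC_CHECKPOINT_FLAG_DISCARD | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
    return (flags & both) != both
```
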
                                                      
