.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In case of ``data=ordered`` mode, Ext4 also supports fast commits which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, Ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills in or if fast commit is not possible
or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus it makes the fast commit area empty for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem.
The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
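
As a rough illustration of the replay rule above, the following Python sketch
groups a flat list of journal blocks into transactions and keeps only those
that end with a commit block. The string labels and the function name are
hypothetical stand-ins chosen for this example; real replay code also checks
sequence numbers and checksums, which this sketch leaves out.

```python
def committed_transactions(blocks):
    """Group block labels into transactions; drop any without a commit.

    `blocks` is a list of labels ("descriptor", "data", "revoke", "commit")
    in journal order. The labels are illustrative, not jbd2 identifiers.
    """
    done = []
    current = []
    for kind in blocks:
        current.append(kind)
        if kind == "commit":
            # A commit block closes the transaction collected so far.
            done.append(current)
            current = []
    # Anything left in `current` never reached a commit block,
    # so it is discarded rather than replayed.
    return done


log = [
    "descriptor", "data", "data", "commit",
    "revoke", "commit",
    "descriptor", "data",  # crash happened before this commit was written
]
print(committed_transactions(log))
```

Only the first two transactions are returned; the trailing descriptor and data
block are dropped because no commit block follows them.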

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
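
The header layout above can be decoded with a few lines of Python. This is a
minimal sketch, not kernel code: the function name, the dictionary keys, and
the short type labels are made up for the example, while the offsets, the
magic number, and the type values come from the two tables above.

```python
import struct

JBD2_MAGIC = 0xC03B3998

# Block type values from the table above; the labels are informal.
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}


def parse_block_header(block):
    """Decode the 12-byte common header; return None if the magic is absent."""
    if len(block) < 12:
        return None
    # ">III" = three big-endian 32-bit unsigned integers
    # (h_magic, h_blocktype, h_sequence).
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return {
        "blocktype": blocktype,
        "type_name": BLOCK_TYPES.get(blocktype, "unknown"),
        "sequence": sequence,
    }


# A synthetic 1024-byte commit block belonging to transaction 7.
sample = struct.pack(">III", JBD2_MAGIC, 2, 7) + bytes(1012)
print(parse_block_header(sample))
```

A block whose first four bytes are not the magic number is not a journal
metadata block, so the sketch returns ``None`` for it.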

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal, and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused block) of the journal, only
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal.
       (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor.
       See the table jbd2_tag_flags_ for more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.
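
To make the tag layout concrete, here is a hedged Python sketch that walks the
tag array of a descriptor block in the ``journal_block_tag3_s`` format (that
is, assuming JBD2_FEATURE_INCOMPAT_CSUM_V3 is set). The function and constant
names are invented for this example. It relies on the "last tag" flag to stop,
and it does not verify any checksum.

```python
import struct

# Tag flag values from the jbd2_tag_flags table.
FLAG_ESCAPE = 0x1
FLAG_SAME_UUID = 0x2
FLAG_LAST_TAG = 0x8

HEADER_SIZE = 12  # common block header
TAG3_SIZE = 16    # t_blocknr, t_flags, t_blocknr_high, t_checksum
UUID_SIZE = 16    # present only when FLAG_SAME_UUID is clear


def parse_tag3_array(block):
    """Return the tags of a v3-format descriptor block as a list of dicts."""
    tags = []
    offset = HEADER_SIZE
    while offset + TAG3_SIZE <= len(block):
        # All four fields are big-endian 32-bit integers.
        blocknr, flags, blocknr_high, checksum = struct.unpack_from(
            ">IIII", block, offset
        )
        offset += TAG3_SIZE
        if not flags & FLAG_SAME_UUID:
            offset += UUID_SIZE  # skip the trailing uuid[16]
        tags.append({
            "blocknr": (blocknr_high << 32) | blocknr,
            "escaped": bool(flags & FLAG_ESCAPE),
            "checksum": checksum,
        })
        if flags & FLAG_LAST_TAG:
            break
    return tags


# Synthetic descriptor block: one 32-byte tag (with UUID), one 16-byte tag.
sample = (
    struct.pack(">III", 0xC03B3998, 1, 7)
    + struct.pack(">IIII", 100, 0, 0, 0xAAAA) + bytes(UUID_SIZE)
    + struct.pack(">IIII", 5, FLAG_SAME_UUID | FLAG_LAST_TAG, 1, 0xBBBB)
).ljust(1024, b"\x00")
print(parse_tag3_array(sample))
```

The first tag occupies 32 bytes because it carries a UUID; the second sets the
"same UUID" flag and therefore occupies only 16, matching the two sizes given
for this format.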

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.
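
The effect of revocation on replay can be sketched as a two-pass procedure.
The Python below is an illustration only: the tuple layout and function name
are hypothetical, sequence-number wraparound is ignored, and the comparison
used (a journalled copy is skipped when its transaction is not newer than the
latest revocation of that block) is an assumption of this sketch rather than
something stated in this document.

```python
def blocks_to_replay(transactions):
    """Decide which journalled block copies a replay would write back.

    `transactions` is a list of (sequence, journalled_blocks, revoked_blocks)
    tuples in log order; all three fields are made up for this example.
    """
    # Pass 1: remember the newest transaction that revoked each block.
    revoked_at = {}
    for seq, _journalled, revoked in transactions:
        for blk in revoked:
            revoked_at[blk] = max(seq, revoked_at.get(blk, seq))

    # Pass 2: skip copies from transactions at or before the revocation.
    replay = []
    for seq, journalled, _revoked in transactions:
        for blk in journalled:
            if blk in revoked_at and seq <= revoked_at[blk]:
                continue
            replay.append((seq, blk))
    return replay


log = [
    (1, [50, 60], []),  # block 50 journalled as metadata
    (2, [], [50]),      # block 50 freed and revoked
    (3, [50], []),      # block 50 journalled again later
]
print(blocks_to_replay(log))
```

In this example the copy of block 50 from transaction 1 is skipped, while
block 60 and the later copy of block 50 from transaction 3 are still replayed.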

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 32
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
588 - h\_chksum\_size 611 - The number of bytes used by the checksu 589 - The number of bytes used by the checksum. Most likely 4. 612 * - 0xE 590 * - 0xE 613 - unsigned char 591 - unsigned char 614 - h_padding[2] !! 592 - h\_padding[2] 615 - 593 - 616 * - 0x10 594 * - 0x10 617 - __be32 !! 595 - \_\_be32 618 - h_chksum[JBD2_CHECKSUM_BYTES] !! 596 - h\_chksum[JBD2\_CHECKSUM\_BYTES] 619 - 32 bytes of space to store checksums. I 597 - 32 bytes of space to store checksums. If 620 JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_F !! 598 JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 621 are set, the first ``__be32`` is the ch 599 are set, the first ``__be32`` is the checksum of the journal UUID and 622 the entire commit block, with this fiel 600 the entire commit block, with this field zeroed. If 623 JBD2_FEATURE_COMPAT_CHECKSUM is set, th !! 601 JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the 624 crc32 of all the blocks already written 602 crc32 of all the blocks already written to the transaction. 625 * - 0x30 603 * - 0x30 626 - __be64 !! 604 - \_\_be64 627 - h_commit_sec !! 605 - h\_commit\_sec 628 - The time that the transaction was commi 606 - The time that the transaction was committed, in seconds since the epoch. 629 * - 0x38 607 * - 0x38 630 - __be32 !! 608 - \_\_be32 631 - h_commit_nsec !! 609 - h\_commit\_nsec 632 - Nanoseconds component of the above time 610 - Nanoseconds component of the above timestamp. 633 611 634 Fast commits << 635 ~~~~~~~~~~~~ << 636 << 637 Fast commit area is organized as a log of tag << 638 a ``struct ext4_fc_tl`` in the beginning which << 639 of the entire field. It is followed by variabl << 640 Here is the list of supported tags and their m << 641 << 642 .. 
list-table:: << 643 :widths: 8 20 20 32 << 644 :header-rows: 1 << 645 << 646 * - Tag << 647 - Meaning << 648 - Value struct << 649 - Description << 650 * - EXT4_FC_TAG_HEAD << 651 - Fast commit area header << 652 - ``struct ext4_fc_head`` << 653 - Stores the TID of the transaction after << 654 be applied. << 655 * - EXT4_FC_TAG_ADD_RANGE << 656 - Add extent to inode << 657 - ``struct ext4_fc_add_range`` << 658 - Stores the inode number and extent to b << 659 * - EXT4_FC_TAG_DEL_RANGE << 660 - Remove logical offsets to inode << 661 - ``struct ext4_fc_del_range`` << 662 - Stores the inode number and the logical << 663 removed << 664 * - EXT4_FC_TAG_CREAT << 665 - Create directory entry for a newly crea << 666 - ``struct ext4_fc_dentry_info`` << 667 - Stores the parent inode number, inode n << 668 newly created file << 669 * - EXT4_FC_TAG_LINK << 670 - Link a directory entry to an inode << 671 - ``struct ext4_fc_dentry_info`` << 672 - Stores the parent inode number, inode n << 673 * - EXT4_FC_TAG_UNLINK << 674 - Unlink a directory entry of an inode << 675 - ``struct ext4_fc_dentry_info`` << 676 - Stores the parent inode number, inode n << 677 << 678 * - EXT4_FC_TAG_PAD << 679 - Padding (unused area) << 680 - None << 681 - Unused bytes in the fast commit area. << 682 << 683 * - EXT4_FC_TAG_TAIL << 684 - Mark the end of a fast commit << 685 - ``struct ext4_fc_tail`` << 686 - Stores the TID of the commit, CRC of th << 687 represents the end of << 688 << 689 Fast Commit Replay Idempotence << 690 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ << 691 << 692 Fast commits tags are idempotent in nature pro << 693 certain rules. The guiding principle that the << 694 committing is that it stores the result of a p << 695 storing the procedure. << 696 << 697 Let's consider this rename operation: 'mv /a / << 698 was associated with inode 10. 
During fast comm << 699 operation as a procedure "rename a to b", we s << 700 state as a "series" of outcomes: << 701 << 702 - Link dirent b to inode 10 << 703 - Unlink dirent a << 704 - Inode 10 with valid refcount << 705 << 706 Now when recovery code runs, it needs "enforce << 707 system. This is what guarantees idempotence of << 708 << 709 Let's take an example of a procedure that is n << 710 commits make it idempotent. Consider following << 711 << 712 1) rm A << 713 2) mv B A << 714 3) read A << 715 << 716 If we store this sequence of operations as is << 717 Let's say while in replay, we crash after (2). << 718 file A (which was actually created as a result << 719 deleted. Thus, file named A would be absent wh << 720 sequence of operations is not idempotent. Howe << 721 of storing the procedure fast commits store th << 722 the fast commit log for above procedure would << 723 << 724 (Let's assume dirent A was linked to inode 10 << 725 inode 11 before the replay) << 726 << 727 1) Unlink A << 728 2) Link A to inode 11 << 729 3) Unlink B << 730 4) Inode 11 << 731 << 732 If we crash after (3) we will have file A link << 733 replay, we will remove file A (inode 11). But << 734 it point to inode 11. We won't find B, so we'l << 735 point, the refcount for inode 11 is not reliab << 736 replay of last inode 11 tag. Thus, by converti << 737 into a series of idempotent outcomes, fast com << 738 the replay. << 739 << 740 Journal Checkpoint << 741 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ << 742 << 743 Checkpointing the journal ensures all transact << 744 are submitted to the disk. In-progress transac << 745 in the checkpoint. Checkpointing is used inter << 746 the filesystem including journal recovery, fil << 747 the journal_t structure. << 748 << 749 A journal checkpoint can be triggered from use << 750 EXT4_IOC_CHECKPOINT. This ioctl takes a single << 751 Currently, three flags are supported. First, E << 752 can be used to verify input to the ioctl. 
It r << 753 invalid input, otherwise it returns success wi << 754 any checkpointing. This can be used to check w << 755 system and to verify there are no issues with << 756 other two flags are EXT4_IOC_CHECKPOINT_FLAG_D << 757 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags << 758 discarded or zero-filled, respectively, after << 759 complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and << 760 cannot both be set. The ioctl may be useful wh << 761 complying with content deletion SLOs. <<
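
The flag rules for the checkpoint ioctl can be illustrated with a short
userspace sketch in C. It is only a sketch under stated assumptions: the
request number, the numeric flag values, and the 32-bit width of the flag word
are recalled from the kernel's ext4 headers and are not given in this
document, so verify them against the headers of the kernel you target. The
helper names ``checkpoint_flags_valid`` and ``ext4_checkpoint`` are
hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * ASSUMPTION: request number and flag values as believed to be defined in
 * the kernel's ext4 headers. They are not stated in this document; check
 * them against your own kernel headers before relying on them.
 */
#ifndef EXT4_IOC_CHECKPOINT
#define EXT4_IOC_CHECKPOINT _IOW('f', 43, uint32_t)
#endif
#ifndef EXT4_IOC_CHECKPOINT_FLAG_DISCARD
#define EXT4_IOC_CHECKPOINT_FLAG_DISCARD 0x1u
#endif
#ifndef EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
#define EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT 0x2u
#endif
#ifndef EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
#define EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN 0x4u
#endif

/* All flags this sketch knows about. */
#define CHECKPOINT_FLAGS_KNOWN                                         \
	(EXT4_IOC_CHECKPOINT_FLAG_DISCARD |                            \
	 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT |                            \
	 EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)

/*
 * Hypothetical helper: apply the documented rules in userspace.
 * Unknown bits are rejected, and DISCARD and ZEROOUT are mutually exclusive.
 */
bool checkpoint_flags_valid(uint32_t flags)
{
	if (flags & ~CHECKPOINT_FLAGS_KNOWN)
		return false;
	if ((flags & EXT4_IOC_CHECKPOINT_FLAG_DISCARD) &&
	    (flags & EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT))
		return false;
	return true;
}

/*
 * Hypothetical wrapper: fd is an open descriptor on the ext4 filesystem
 * whose journal should be checkpointed. Returns 0 on success, or -1 with
 * errno set. Invalid flag combinations fail locally with EINVAL and never
 * reach the kernel.
 */
int ext4_checkpoint(int fd, uint32_t flags)
{
	if (!checkpoint_flags_valid(flags)) {
		errno = EINVAL;
		return -1;
	}
	return ioctl(fd, EXT4_IOC_CHECKPOINT, &flags);
}
```

A caller could first invoke ``ext4_checkpoint(fd, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)``
to probe whether the ioctl is available, then repeat the call with the flags
it actually wants, for example ``EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT``.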