.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.
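
The replay guarantee described above can be illustrated with a toy model.
The sketch below is not jbd2 code: the record tuples, the ``replay``
function and the dictionary standing in for the disk are hypothetical names
chosen for illustration. It only shows why a transaction that never received
a commit record leaves the final on-disk locations untouched.

```python
def replay(journal, disk):
    """Apply committed transactions from a toy journal to a toy disk.

    journal: list of ("data", blocknr, payload) and ("commit",) tuples
             (a made-up in-memory format, not the jbd2 on-disk layout).
    disk:    dict mapping block number -> payload, updated in place.
    """
    pending = {}  # blocks logged since the last commit record
    for record in journal:
        if record[0] == "data":
            _, blocknr, payload = record
            pending[blocknr] = payload
        elif record[0] == "commit":
            # The transaction is complete: copy all of its blocks to
            # their final locations.
            disk.update(pending)
            pending = {}
    # Anything still pending never got a commit record (for example,
    # the system crashed mid-transaction), so it is discarded.
    return disk


journal = [
    ("data", 10, b"inode table v2"),
    ("data", 11, b"bitmap v2"),
    ("commit",),
    ("data", 10, b"inode table v3"),  # no commit record follows
]
disk = replay(journal, {10: b"inode table v1", 11: b"bitmap v1"})
```

In this model the second update to block 10 is dropped because its
transaction was never committed, so the disk ends up holding the "v2"
contents for both blocks.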

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused block) of the journal, only
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal.
Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
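
As a rough illustration of how the two tag layouts and the tag flags fit
together, here is a minimal Python sketch that decodes a single tag from a
descriptor block buffer. It follows the tables above literally and is not
derived from the kernel sources, so feature-specific quirks may not be
reflected. The function name ``parse_tag``, its parameters and the ``FLAG_*``
constant names are assumptions made for this example.

```python
import struct

# Illustrative names for the tag flag values listed in the table above.
FLAG_ESCAPED = 0x1
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8


def parse_tag(buf, offset, csum_v3, has_64bit):
    """Decode one block tag at `offset`; return (tag, next_offset)."""
    if csum_v3:
        # t_blocknr, t_flags, t_blocknr_high, t_checksum: four __be32.
        lo, flags, hi, checksum = struct.unpack_from(">IIII", buf, offset)
        size = 16
    else:
        # t_blocknr (__be32), t_checksum (__be16), t_flags (__be16).
        lo, checksum, flags = struct.unpack_from(">IHH", buf, offset)
        size = 8
        hi = 0
        if has_64bit:
            # t_blocknr_high is only present with 64-bit block numbers.
            (hi,) = struct.unpack_from(">I", buf, offset + size)
            size += 4
    uuid = None
    if not flags & FLAG_SAME_UUID:
        # The open-coded UUID trails the fixed part of the tag.
        uuid = bytes(buf[offset + size:offset + size + 16])
        size += 16
    tag = {
        "blocknr": (hi << 32) | lo,
        "flags": flags,
        "checksum": checksum,
        "uuid": uuid,
    }
    return tag, offset + size


# A 12-byte tag: no CSUM_V3, 64-bit block numbers, same UUID, last tag.
raw = struct.pack(">IHHI", 0x1234, 0xBEEF,
                  FLAG_SAME_UUID | FLAG_LAST_TAG, 0x1)
tag, next_offset = parse_tag(raw, 0, csum_v3=False, has_64bit=True)
```

A caller would keep invoking ``parse_tag`` with the returned offset until it
sees a tag with the "last tag" flag set. The sizes the sketch produces
(8, 12, 24 or 28 bytes without CSUM_V3; 16 or 32 bytes with it) match the
sizes quoted above.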

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled.
Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction.
The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 32
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV)
records. Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag
and the length of the entire field, followed by a variable-length,
tag-specific value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to be added in this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove a logical offset range from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.
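
The rename example above can be sketched as a toy model. This is only an
illustration of the "store outcomes, not procedures" idea: the dictionary
layout, the tag names ``link``, ``unlink`` and ``inode``, and the ``replay``
helper are all hypothetical and do not correspond to the kernel's on-disk
structures or recovery code.

```python
# Toy model of fast commit replay; not the kernel's data structures.
def replay(fs, log):
    """Enforce each logged outcome on fs, tolerating already-applied steps."""
    for tag, *args in log:
        if tag == "link":
            name, ino = args
            fs["dirents"][name] = ino        # create or overwrite the dirent
        elif tag == "unlink":
            (name,) = args
            fs["dirents"].pop(name, None)    # a missing dirent is simply skipped
        elif tag == "inode":
            ino, nlink = args
            fs["nlink"][ino] = nlink         # set the final refcount outright
    return fs


def initial_state():
    # State before 'mv /a /b': dirent a is associated with inode 10.
    return {"dirents": {"a": 10}, "nlink": {10: 1}}


# Outcomes logged for 'mv /a /b' (hypothetical tag names).
log = [("link", "b", 10), ("unlink", "a"), ("inode", 10, 1)]

once = replay(initial_state(), log)

# Simulate a crash after the first two steps, then a full second replay.
partial = replay(initial_state(), log[:2])
twice = replay(partial, log)

print(once == twice)
```

Because every step sets a final state rather than applying a change relative
to the current state, replaying the log over a partially replayed file system
yields the same result as a single clean replay.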

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say while in replay, we crash after (2). During the second
replay, file A (which was actually created as a result of the "mv B A"
operation) would get deleted. Thus, the file named A would be absent when we
try to read A. So, this sequence of operations is not idempotent. However, as
mentioned above, instead of storing the procedure, fast commits store the
outcome of each procedure. Thus the fast commit log for the above procedure
would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3) we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11.
We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensure idempotence during
the replay.

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing.
This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
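
The flag rules described above can be summarised in a small userspace-side
sketch. It does not issue the ioctl; it only models which flag combinations
the text says are accepted. The numeric flag values, the use of ``EINVAL``
for rejected input, and the helper name ``check_checkpoint_flags`` are
assumptions made for illustration and should be checked against the ext4
UAPI header of the target kernel.

```python
import errno

# Assumed flag values for illustration only; confirm them against the
# ext4 UAPI header on the target kernel before relying on them.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
FLAG_VALID = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def check_checkpoint_flags(flags):
    """Return (error, would_checkpoint) for a flags word.

    error is 0 on success or an errno value (EINVAL is an assumption);
    would_checkpoint says whether a checkpoint would actually be performed.
    """
    if flags & ~FLAG_VALID:
        return errno.EINVAL, False            # unknown flag bits
    if (flags & FLAG_DISCARD) and (flags & FLAG_ZEROOUT):
        return errno.EINVAL, False            # mutually exclusive flags
    if flags & FLAG_DRY_RUN:
        return 0, False                       # input is valid, nothing is done
    return 0, True


print(check_checkpoint_flags(FLAG_DRY_RUN | FLAG_DISCARD))
```

A caller could run a check like this before invoking the ioctl, or simply
issue the ioctl with the dry-run flag set and let the kernel validate the
input.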