.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.
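
The replay guarantee described above can be illustrated with a toy model.
This is only a sketch under stated assumptions, not the jbd2 implementation:
the record tuples, the ``replay`` function, and the dictionary standing in
for the disk are all hypothetical, and the real on-disk format is block-based
and checksummed.

```python
def replay(journal, disk):
    """Apply only committed transactions from a toy journal to a toy disk.

    `journal` is a list of hypothetical records in log order:
      ("write", tid, blocknr, data)  -- a block logged by transaction `tid`
      ("commit", tid)                -- the commit record for `tid`
    `disk` is a dict mapping block numbers to data.
    """
    pending = {}  # tid -> {blocknr: data} logged but not yet committed
    for record in journal:
        kind, tid = record[0], record[1]
        if kind == "write":
            _, _, blocknr, data = record
            pending.setdefault(tid, {})[blocknr] = data
        elif kind == "commit":
            # A commit record makes the whole transaction visible at once.
            disk.update(pending.pop(tid, {}))
    # Anything still in `pending` never reached its commit record and is
    # dropped, so a crash cannot leave a half-applied transaction behind.
    return disk
```

In this model a transaction whose commit record is missing leaves no trace on
the toy disk, which is the atomicity property the journal is meant to provide.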

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4, which stores its fields in little-endian order.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused block) of the journal, only
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log.
       e2fsprogs/Linux don't allow shared external journals, but I imagine
       Lustre (or ocfs2?), which use the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format.
       Each journal metadata block gets its own checksum, and the block tags
       in the descriptor table contain checksums for each of the data blocks
       in the journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal.
Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
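
The two tag layouts above can be decoded with a short routine. The following
is a minimal sketch based only on the tables in this section, not code taken
from the kernel or e2fsprogs: ``parse_tag`` and ``FLAG_SAME_UUID`` are names
made up for the example, while the feature bit values come from the incompat
feature table.

```python
import struct

# Feature bits from the incompat feature table.
JBD2_FEATURE_INCOMPAT_64BIT = 0x2
JBD2_FEATURE_INCOMPAT_CSUM_V3 = 0x10
# "Same UUID" value from the tag flags table; the name is local to this sketch.
FLAG_SAME_UUID = 0x2


def parse_tag(buf, offset, incompat):
    """Decode one journal block tag; return (fields, tag size in bytes)."""
    if incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3:
        # struct journal_block_tag3_s: four big-endian 32-bit fields.
        low, flags, high, csum = struct.unpack_from(">IIII", buf, offset)
        size = 16
    else:
        # struct journal_block_tag_s: 32-bit block number, then a 16-bit
        # checksum and 16-bit flags.
        low, csum, flags = struct.unpack_from(">IHH", buf, offset)
        size, high = 8, 0
        if incompat & JBD2_FEATURE_INCOMPAT_64BIT:
            # t_blocknr_high is only present with 64-bit block numbers.
            (high,) = struct.unpack_from(">I", buf, offset + 8)
            size = 12
    uuid = None
    if not flags & FLAG_SAME_UUID:
        # The open-coded 16-byte UUID trails the fixed part of the tag.
        uuid = bytes(buf[offset + size:offset + size + 16])
        size += 16
    fields = {"blocknr": (high << 32) | low, "flags": flags,
              "checksum": csum, "uuid": uuid}
    return fields, size
```

The returned size is 16 or 32 bytes for v3 tags and 8, 12, 24, or 28 bytes
otherwise, matching the sizes stated above, so a caller can advance through
the tag array by adding it to ``offset``.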

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled.
Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction.
The size of each 554 revoked by this transaction. The size of each block number is 8 bytes if 560 the superblock advertises 64-bit block number 555 the superblock advertises 64-bit block number support, or 4 bytes 561 otherwise. 556 otherwise. 562 557 563 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or 558 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or 564 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end 559 JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation 565 block is a ``struct jbd2_journal_revoke_tail`` 560 block is a ``struct jbd2_journal_revoke_tail``, which has this format: 566 561 567 .. list-table:: 562 .. list-table:: 568 :widths: 8 8 24 40 563 :widths: 8 8 24 40 569 :header-rows: 1 564 :header-rows: 1 570 565 571 * - Offset 566 * - Offset 572 - Type 567 - Type 573 - Name 568 - Name 574 - Description 569 - Description 575 * - 0x0 570 * - 0x0 576 - __be32 571 - __be32 577 - r_checksum 572 - r_checksum 578 - Checksum of the journal UUID + revocati 573 - Checksum of the journal UUID + revocation block 579 574 580 Commit Block 575 Commit Block 581 ~~~~~~~~~~~~ 576 ~~~~~~~~~~~~ 582 577 583 The commit block is a sentry that indicates th 578 The commit block is a sentry that indicates that a transaction has been 584 completely written to the journal. Once this c 579 completely written to the journal. Once this commit block reaches the 585 journal, the data stored with this transaction 580 journal, the data stored with this transaction can be written to their 586 final locations on disk. 581 final locations on disk. 587 582 588 The commit block is described by ``struct comm 583 The commit block is described by ``struct commit_header``, which is 32 589 bytes long (but uses a full block): 584 bytes long (but uses a full block): 590 585 591 .. list-table:: 586 .. 
.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV)
records. Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag
and the length of the entire field. It is followed by a variable-length,
tag-specific value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to be added in this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offset range from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag ends

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

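The rename example above can be sketched as a small simulation. This is only
an illustrative model, not the kernel's replay code: the in-memory ``fs``
dictionary, the tag names ``link``, ``unlink`` and ``inode``, and the helper
functions are all hypothetical. Each tag records an outcome to enforce, so
replaying the log again after an interrupted replay converges on the same
state.

```python
import copy

# Hypothetical in-memory model: "dirents" maps names to inode numbers and
# "nlink" maps inode numbers to link counts. Not an ext4 data structure.
INITIAL_FS = {"dirents": {"a": 10}, "nlink": {10: 1}}

# Outcomes recorded for "mv /a /b" (tag names are illustrative only).
RENAME_LOG = [
    ("link", "b", 10),   # Link dirent b to inode 10
    ("unlink", "a"),     # Unlink dirent a
    ("inode", 10, 1),    # Inode 10 with valid refcount
]


def replay(fs, log, stop_after=None):
    """Enforce each outcome on fs; stop_after mimics a crash mid-replay."""
    for index, tag in enumerate(log):
        if stop_after is not None and index >= stop_after:
            break
        kind = tag[0]
        if kind == "link":
            _, name, ino = tag
            fs["dirents"][name] = ino        # overwrite is harmless
        elif kind == "unlink":
            fs["dirents"].pop(tag[1], None)  # already gone: skip the step
        elif kind == "inode":
            _, ino, nlink = tag
            fs["nlink"][ino] = nlink         # final refcount wins
    return fs


def replay_with_crash(log, crash_after):
    """Replay partially, then replay the whole log again from the start."""
    fs = replay(copy.deepcopy(INITIAL_FS), log, stop_after=crash_after)
    return replay(fs, log)
```

Whatever value is passed as ``crash_after``, ``replay_with_crash`` ends in
the same state as a single uninterrupted ``replay``, which is the property
the recovery rules are meant to provide.
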
Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say while in replay, we crash after (2). During the second
replay, file A (which was actually created as a result of the "mv B A"
operation) would get deleted. Thus, the file named A would be absent when we
try to read A. So, this sequence of operations is not idempotent. However, as
mentioned above, instead of storing the procedure, fast commits store the
outcome of each procedure. Thus the fast commit log for the above procedure
would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3), we will have file A linked to inode 11. During the
second replay, we will remove file A (inode 11). But we will create it back
and make it point to inode 11.
We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of the last inode 11 tag. Thus, by converting a non-idempotent
procedure into a series of idempotent outcomes, fast commits ensure
idempotence during the replay.

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input, otherwise it returns success without performing
any checkpointing.
This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
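
The flag rules described above can be summarized in a short sketch. It
models only the argument checking, not the ioctl itself: the function
``plan_checkpoint`` and its return values are hypothetical, and the numeric
flag values are assumptions that should be checked against the ext4 uapi
header before being used for anything real.

```python
# Assumed flag values; verify against the kernel's ext4 uapi header.
EXT4_IOC_CHECKPOINT_FLAG_DISCARD = 0x1
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT = 0x2
EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN = 0x4

VALID_FLAGS = (
    EXT4_IOC_CHECKPOINT_FLAG_DISCARD
    | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
    | EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
)


def plan_checkpoint(flags):
    """Return (action, erase_mode) for a flags word, or raise ValueError.

    action is "dry-run" or "checkpoint"; erase_mode is None, "discard"
    or "zeroout". This is a model of the documented rules only.
    """
    if flags & ~VALID_FLAGS:
        raise ValueError("unknown flag bits set")

    discard = bool(flags & EXT4_IOC_CHECKPOINT_FLAG_DISCARD)
    zeroout = bool(flags & EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT)
    if discard and zeroout:
        raise ValueError("DISCARD and ZEROOUT cannot both be set")

    # A dry run validates the input and then stops without checkpointing.
    if flags & EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN:
        return ("dry-run", None)

    if discard:
        return ("checkpoint", "discard")
    if zeroout:
        return ("checkpoint", "zeroout")
    return ("checkpoint", None)
```

For example, ``plan_checkpoint(EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN |
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT)`` reports a dry run, while combining
DISCARD with ZEROOUT is rejected. Conditions outside the flag word, such as
whether the filesystem has a journal, are not modeled here.
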