.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employ
filesystem against metadata inconsistencies in
to 10,240,000 file system blocks (see man mke2
size limits) can be reserved inside the filesy
“important” data writes on-disk as quickly
data transaction is fully written to the disk
cache, a record of the data being committed is
some later point in time, the journal code wri
final locations on disk (this could involve a
read-write-erases) before erasing the commit r
crash during the second slow write, the journa
way to the latest commit record, guaranteeing
gets written through the journal to the disk.
guarantee that the filesystem does not become
metadata update.

For performance reasons, ext4 by default only
through the journal. This means that file data
guaranteed to be in any consistent state after
guarantee level (``data=ordered``) is not sati
option to control journal behavior. If ``data=
metadata are written to disk through the journ
safest. If ``data=writeback``, dirty data bloc
disk before the metadata are written to disk t

In case of ``data=ordered`` mode, Ext4 also su
help reduce commit latency significantly. The
mode works by logging metadata blocks to the j
mode, Ext4 only stores the minimal delta neede
affected metadata in fast commit space that is
Once the fast commit area fills in or if fast
or if JBD2 commit timer goes off, Ext4 perform
A full commit invalidates all the fast commits
it and thus it makes the fast commit area empt
commits. This feature needs to be enabled at m

The journal inode is typically inode 8. The fi
journal inode are replicated in the ext4 super
is a normal (but hidden) file within the filesys
consumes an entire block group, though mke2fs
middle of the disk.

All fields in jbd2 are written to disk in big-
opposite of ext4.
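
As a small illustration of the byte-order point above, the following
Python sketch decodes the same four bytes once in big-endian order (as
jbd2 stores its fields) and once in little-endian order (as ext4 does).
The bytes are the jbd2 magic number 0xC03B3998 listed in the Block
Header section; the helper names are hypothetical and exist only for
this example.

.. code-block:: python

   import struct

   # The jbd2 magic number as it would appear on disk, byte by byte.
   JBD2_MAGIC_BYTES = bytes([0xC0, 0x3B, 0x39, 0x98])

   def read_be32(buf, offset=0):
       """Decode a 32-bit unsigned integer stored big-endian (jbd2 order)."""
       return struct.unpack_from(">I", buf, offset)[0]

   def read_le32(buf, offset=0):
       """Decode a 32-bit unsigned integer stored little-endian (ext4 order)."""
       return struct.unpack_from("<I", buf, offset)[0]

Reading a jbd2 field with the little-endian helper yields a byte-swapped
value, which is why a parser has to pick the decoder per structure
rather than per filesystem image.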

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an e
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this forma

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revoca
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a
or a block revocation list. A finished transac
commit. If there is no commit record (or the c
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created
device (as opposed to an internal journal, whi
In this case, on the filesystem device, ``s_jo
zero and ``s_journal_uuid`` should be set. On
will be an ext4 super block in the usual place
The journal superblock will be in the next ful
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revoca
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a commo
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a serie
       written through the journal during a tr
   * - 2
     - Block commit record. This block signifi
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds u
       journal to skip writing blocks that wer

Super Block
~~~~~~~~~~~

The super block for the journal is much simple
The key data kept within are size of the journ
start of the log of transactions.

The journal superblock is recorded as ``struct
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journ
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a sup
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the curr
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contr
       being zero does not imply that the jour
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abo
   * -
     -
     -
     - The remaining fields are only valid in
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table j
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compa
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this jou
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the jou
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log
       shared external journals, but I imagine
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combinatio

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combinat

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (
   * - 0x2
     - Journal can deal with 64-bit block numb
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_F
   * - 0x8
     - This journal uses v2 of the checksum on
       metadata block gets its own checksum, a
       descriptor table contain checksums for
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2
   * - 0x10
     - This journal uses v3 of the checksum on
       v2, but the journal block tag size is f
       block numbers. (JBD2_FEATURE_INCOMPAT_C
   * - 0x20
     - Journal has fast commit blocks. (JBD2_F

.. _jbd2_checksum_type:

Journal checksum type codes are one of the fol
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of jour
describe the final locations of the data block
journal. Descriptor blocks are open-coded inst
described by a data structure, but here is the
Descriptor blocks consume at least 36 bytes, b

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block
       blocks that follow this descriptor bloc

Journal block tags have any of the following f
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the j
defined as ``struct journal_block_tag3_s``, wh
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where
       should end up on disk. This is zero if
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the seque
   * -
     -
     -
     - This field appears to be open coded. It
       tag, after t_checksum. This field is no
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field
       ``j_uuid`` field in ``struct journal_s``
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of t

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first fou
       happened to match the jbd2 magic number
   * - 0x2
     - This block has the same UUID as previou
       omitted.
   * - 0x4
     - The data block was deleted by the trans
   * - 0x8
     - This is the last tag in this descriptor

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, t
is defined as ``struct journal_block_tag_s``,
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the seque
       Note that only the lower 16 bits are st
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See
       more info.
   * -
     -
     -
     - This next field is only present if the
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It
       tag, after t_flags or t_blocknr_high. T
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field
       ``j_uuid`` field in ``struct journal_s``
       field.
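
The tag sizes quoted above (16 or 32 bytes with
JBD2_FEATURE_INCOMPAT_CSUM_V3, and 8, 12, 24 or 28 bytes without it)
follow from the two tables and the "same UUID" flag. The Python sketch
below derives them from the documented field layout only; it is not the
kernel's own size computation, and the function and constant names are
chosen for this example.

.. code-block:: python

   import struct

   # "Same UUID" tag flag, value 0x2 in the tag flags table above.
   # The constant name is an assumption made for this sketch.
   TAG_FLAG_SAME_UUID = 0x2

   def tag_size(csum_v3, has_64bit, flags):
       """Return the on-disk size of one block tag, per the tables above."""
       # The 16-byte UUID is omitted when the "same UUID" flag is set.
       uuid_len = 0 if flags & TAG_FLAG_SAME_UUID else 16
       if csum_v3:
           # t_blocknr, t_flags, t_blocknr_high, t_checksum: 4 bytes each.
           return 16 + uuid_len
       # t_blocknr (4) + t_checksum (2) + t_flags (2).
       size = 8
       if has_64bit:
           size += 4  # t_blocknr_high is only present with 64-bit support
       return size + uuid_len

   def parse_tag3(buf, offset=0):
       """Decode the fixed 16-byte part of a CSUM_V3 tag (big-endian)."""
       lo, flags, hi, csum = struct.unpack_from(">IIII", buf, offset)
       return {"blocknr": (hi << 32) | lo, "flags": flags, "checksum": csum}

A reader walking a descriptor block would call ``tag_size()`` with each
tag's own flags to find where the next tag starts.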

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end
``struct jbd2_journal_block_tail``, which look

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the desc
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to d
are written verbatim into the journal file aft
However, if the first four bytes of the block
number then those four bytes are replaced with
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay o
transaction. This is used to mark blocks that
time but are no longer journalled. Typically t
block is freed and re-allocated as a file data
journal replay after the file block was writte
corruption.

**NOTE**: This mechanism is NOT used to expres
superseded by this other journal block”, as
mistakenly thought. Any block being added to a
the removal of all existing revocation records

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are a
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbe
revoked by this transaction. The size of each
the superblock advertises 64-bit block number
otherwise.

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end
block is a ``struct jbd2_journal_revoke_tail``

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocati

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates th
completely written to the journal. Once this c
journal, the data stored with this transaction
final locations on disk.

The commit block is described by ``struct comm
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify t
       in the transaction. See jbd2_checksum_t
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksu
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. I
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_F
       are set, the first ``__be32`` is the ch
       the entire commit block, with this fiel
       JBD2_FEATURE_COMPAT_CHECKSUM is set, th
       crc32 of all the blocks already written
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was commi
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above time

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag
a ``struct ext4_fc_tl`` in the beginning which
of the entire field. It is followed by variabl
Here is the list of supported tags and their m

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to b
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets to inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly crea
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode n
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode n
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode n
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit, CRC of th
       represents the end of

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature pro
certain rules. The guiding principle that the
committing is that it stores the result of a p
storing the procedure.

Let's consider this rename operation: 'mv /a /
was associated with inode 10. During fast comm
operation as a procedure "rename a to b", we s
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce
system. This is what guarantees idempotence of

Let's take an example of a procedure that is n
commits make it idempotent. Consider the following

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is
Let's say while in replay, we crash after (2).
file A (which was actually created as a result
deleted. Thus, file named A would be absent wh
sequence of operations is not idempotent. Howe
of storing the procedure fast commits store th
the fast commit log for above procedure would

(Let's assume dirent A was linked to inode 10
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3) we will have file A link
replay, we will remove file A (inode 11). But
it point to inode 11. We won't find B, so we'l
point, the refcount for inode 11 is not reliab
replay of last inode 11 tag. Thus, by converti
into a series of idempotent outcomes, fast com
the replay.

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transact
are submitted to the disk. In-progress transac
in the checkpoint. Checkpointing is used inter
the filesystem including journal recovery, fil
the journal_t structure.

A journal checkpoint can be triggered from use
EXT4_IOC_CHECKPOINT. This ioctl takes a single
Currently, three flags are supported. First, E
can be used to verify input to the ioctl. It r
invalid input, otherwise it returns success wi
any checkpointing. This can be used to check w
system and to verify there are no issues with
other two flags are EXT4_IOC_CHECKPOINT_FLAG_D
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags
discarded or zero-filled, respectively, after
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
cannot both be set. The ioctl may be useful wh
complying with content deletion SLOs.
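
The flag rules described above can be sketched in Python as follows.
The numeric flag values and the ioctl request number are assumptions
made for this example (the request number assumes the common Linux
encoding of ``_IOW('f', 43, __u32)``); none of them are given in this
document, so check them against the kernel's ext4 UAPI header before
relying on them. The helper names are hypothetical.

.. code-block:: python

   import struct

   # Assumed flag values; verify against the kernel UAPI header.
   EXT4_IOC_CHECKPOINT_FLAG_DISCARD = 0x1
   EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT = 0x2
   EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN = 0x4
   VALID_FLAGS = (EXT4_IOC_CHECKPOINT_FLAG_DISCARD
                  | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
                  | EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)

   # Assumed request number: write direction, 4-byte argument, type 'f', nr 43.
   EXT4_IOC_CHECKPOINT = (1 << 30) | (4 << 16) | (ord("f") << 8) | 43

   def check_flags(flags):
       """Mirror the documented rules before the ioctl is issued."""
       if flags & ~VALID_FLAGS:
           raise ValueError("unknown checkpoint flag")
       both = (EXT4_IOC_CHECKPOINT_FLAG_DISCARD
               | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT)
       if flags & both == both:
           raise ValueError("DISCARD and ZEROOUT cannot both be set")
       return flags

   def checkpoint(fd, flags):
       """Issue the ioctl on an open file descriptor of the filesystem."""
       import fcntl  # Unix-only, imported here so the checks stay portable
       arg = struct.pack("=I", check_flags(flags))
       fcntl.ioctl(fd, EXT4_IOC_CHECKPOINT, arg)

Calling ``checkpoint()`` with only the dry-run flag is a low-risk way to
find out whether the running kernel accepts the ioctl, since the text
above states that a dry run performs no checkpointing.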