.. SPDX-License-Identifier: GPL-2.0
.. _xfs_self_describing_metadata:

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of
the structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes, however
it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced
writes weren't the ultimate cause of a corruption event. It may take a few
hours to a few days to perform such forensic analysis, so at this scale root
cause analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much
metadata to analyse and so that analysis blows out towards weeks/months of
forensic work. Most of the analysis work is slow and tedious, so as the amount
of analysis goes up, the more likely it is that the cause will be lost in the
noise.
Hence the primary concern for supporting PB scale filesystems is minimising
the time and effort required for basic forensic analysis of the filesystem
structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what
it is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes,
it is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and
hence not detected by automated verification checks) but are not correct.
Finding and understanding how things like cross linked block lists (e.g.
sibling pointers in a btree end up with loops in them) are the key to
understanding what went wrong, but it is impossible to tell what order the
blocks were linked into each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable. Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!
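
As an illustration of why a well known magic number is so useful, block type
identification reduces to a trivial lookup. A minimal sketch (the magic values
are the real on-disk ASCII magics for the superblock, AGF and AGI; the helper
function itself is hypothetical, not kernel code):

```c
#include <stdint.h>

#define XFS_SB_MAGIC	0x58465342	/* "XFSB": superblock */
#define XFS_AGF_MAGIC	0x58414746	/* "XAGF": free space header */
#define XFS_AGI_MAGIC	0x58414749	/* "XAGI": inode header */

/* Hypothetical helper: classify a block purely from its magic number. */
static const char *
xfs_blocktype_name(uint32_t magic)
{
	switch (magic) {
	case XFS_SB_MAGIC:	return "superblock";
	case XFS_AGF_MAGIC:	return "AGF";
	case XFS_AGI_MAGIC:	return "AGI";
	default:		return "unknown";
	}
}
```

With only a magic number we can identify the type, but not whether the block
is in the right place - that is what the location and owner fields add.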

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of
the metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it
has not been changed as a result of external influences. Hence we need some
form of integrity check, and this is done by adding CRC32c validation to the
metadata block. If we can verify the block contains the metadata it was
intended to contain, a large amount of the manual verification work can be
skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it
is fast. So while CRC32c is not the strongest of possible integrity checks
that could be used, it is more than sufficient for our needs and has
relatively little overhead. Adding support for larger integrity fields and/or
algorithms doesn't really provide any extra value over CRC32c, but it does add
a lot of complexity and so there is no provision for changing the integrity
checking mechanism.

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing
to look at any other metadata.
This means it needs to contain location
information. Just adding a block number to the metadata is not sufficient to
protect against mis-directed writes - a write might be misdirected to the
wrong LUN and so be written to the "correct block" of the wrong filesystem.
Hence location information must contain a filesystem identifier as well as a
block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner of
the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the
entire filesystem to find the owner of the block. Worse, the corruption could
mean that no owner can be found (i.e. it's an orphan block), and so without an
owner field in the metadata we have no idea of the scope of the corruption. If
we have an owner field in the metadata object, we can immediately do top down
validation to determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we
are looking at. The owner information can also identify misplaced writes (e.g.
a freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing
forensic analysis is how recently the block was modified.
Correlation of a
set of corrupted metadata blocks based on modification times is important as
it can indicate whether the corruptions are related, whether there have been
multiple corruption events that lead to the eventual failure, and even whether
there are corruptions present that the run-time verification is not detecting.

For example, we can determine whether a metadata object is supposed to be free
space or still allocated if it is still referenced by its owner by looking at
when the free space btree block that contains the block was last written
compared to when the metadata object itself was last written. If the free
space block is more recent than the object and the object's owner, then there
is a very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block gets the Log Sequence
Number (LSN) of the most recent transaction it was modified on written into
it. This number will always increase over the life of the filesystem, and the
only thing that resets it is running xfs_repair on the filesystem. Further, by
use of the LSN we can tell if the corrupted metadata all belonged to the same
log checkpoint and hence have some idea of how much modification occurred
between the first and last instance of corrupt metadata on disk and, further,
how much modification occurred between the corruption being written and when
it was detected.

Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

- immediately after a successful read from disk
- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it
says it is and that the metadata fields are within bounds and internally
consistent.
As such, we cannot catch all types of corruption that can occur
within a block as there may be certain limitations that operational state
enforces on the metadata, or there may be corruption of interblock
relationships (e.g. corrupted sibling pointer lists). Hence we still need
stateful checking in the main code body, but in general most of the per-field
validation is handled by the verifiers.

For read verification, the caller needs to specify the expected type of
metadata that it should see, and the IO completion process verifies that the
metadata object matches what was expected. If the verification process fails,
then it marks the object being read as EFSCORRUPTED. The caller needs to catch
this error (same as for IO errors), and if it needs to take special action due
to a verification error it can do so by catching the EFSCORRUPTED error value.
If we need more discrimination of error type at higher levels, we can define
new error numbers for different errors as necessary.

The first step in read verification is checking the magic number and
determining whether CRC validation is necessary. If it is, the CRC32c is
calculated and compared against the value stored in the object itself. Once
this is validated, further checks are made against the location information,
followed by extensive object specific metadata validation. If any of these
checks fail, then the buffer is considered corrupt and the EFSCORRUPTED error
is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert
it into the object. Once this is done the write IO is allowed to continue.
If any error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
            __be32  magic;          /* magic number */
            __be32  crc;            /* CRC, not logged */
            uuid_t  uuid;           /* filesystem identifier */
            __be64  owner;          /* parent object */
            __be64  blkno;          /* location on disk */
            __be64  lsn;            /* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
  number for location. The two of these combined provide the same
  information as @owner and @blkno in the above structure, but using 8
  bytes less space on disk.

- directory/attribute node blocks have a 16 bit magic number, and the
  header that contains the magic number has other information in it as
  well. Hence the additional metadata headers change the overall format
  of the metadata.
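
The verifier examples below delegate checksumming to xfs_verify_cksum() and
xfs_update_cksum(). For reference, the underlying CRC32c algorithm can be
sketched as a minimal bitwise loop over the reflected Castagnoli polynomial -
illustrative only, as the kernel uses an optimised (often hardware
accelerated) implementation and additionally masks out the crc field itself
during the calculation:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal bitwise CRC32c (Castagnoli): init and final XOR of all ones,
 * reflected polynomial 0x82F63B78.  Illustrative, not kernel code. */
static uint32_t
crc32c_sketch(const void *buf, size_t len)
{
	const uint8_t	*p = buf;
	uint32_t	crc = 0xFFFFFFFF;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
	}
	return crc ^ 0xFFFFFFFF;
}
```

The standard check value holds for this sketch: CRC32c of the ASCII string
"123456789" is 0xe3069283.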

A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF     offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount *mp = bp->b_mount;

            if ((xfs_sb_version_hascrc(&mp->m_sb) &&
                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
                                   XFS_FOO_CRC_OFF)) ||
                !xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW,
                                         mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
            }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs
enabled by checking the superblock for the feature bit, and then if the CRC
verifies OK (or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            if (xfs_sb_version_hascrc(&mp->m_sb)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            }

            /* object specific verification checks here */

            return true;
    }

Note that the filesystem identifier, location and owner can only be checked
when the filesystem has CRCs enabled - those fields only exist in the self
describing format.

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            /* object specific verification checks here */

            return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order to the read verifiers. A typical write verifier::

    static void
    xfs_foo_write_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_buf_log_item *bip = bp->b_fspriv;

            if (!xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW,
                                         mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
                    return;
            }

            if (!xfs_sb_version_hascrc(&mp->m_sb))
                    return;

            if (bip) {
                    struct xfs_ondisk_hdr *hdr = bp->b_addr;

                    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
            }
            xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform
basic identification of the buffer - that they contain inodes or dquots, and
that there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to
the buffer.
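
The split described above - buffer level magic checks, per-object everything
else - can be sketched as follows. The structure is a simplified stand-in
rather than struct xfs_dinode (0x494E is the real "IN" inode magic, but endian
conversion and all other inode fields are omitted):

```c
#include <stdint.h>

#define XFS_DINODE_MAGIC	0x494E	/* "IN" */

/* Simplified stand-in for an on-disk inode: just the magic, padded to a
 * plausible 256 byte inode size.  Endian handling omitted for brevity. */
struct fake_dinode {
	uint16_t	di_magic;
	uint8_t		di_pad[254];
};

/* Buffer level check: every object packed into the buffer has a magic
 * number in the expected spot.  CRC and field validation would happen
 * later, per object, as each inode is read from the buffer. */
static int
inode_buf_magic_ok(const struct fake_dinode *dip, int count)
{
	for (int i = 0; i < count; i++)
		if (dip[i].di_magic != XFS_DINODE_MAGIC)
			return 0;
	return 1;
}
```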

The structure of the verifiers and the identifier checks is very similar to
the buffer code described above. The only difference is where they are called.
For example, inode read verification is done in xfs_inode_from_disk() when the
inode is first read out of the buffer and the struct xfs_inode is
instantiated. The inode is already extensively verified during writeback in
xfs_iflush_int, so the only addition here is to add the LSN and CRC to the
inode as it is copied back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None
of the unlinked list modifications check or update CRCs, neither during unlink
nor log recovery. So, it's gone unnoticed until now. This won't matter
immediately - repair will probably complain about it - but it needs to be
fixed.