.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. An SSD of course has no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon.
When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.
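
The relationship between an inode and its block group can be made
concrete with a little arithmetic. The following is a minimal sketch,
not kernel code: it maps an inode number and a block number to their
block groups and derives a starting "goal" block for the inode's data.
The ``Geometry`` class, the function names, and the default values
(32768 blocks and 8192 inodes per group, first data block 0) are
assumptions chosen for illustration; real values come from the
superblock of the filesystem in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    """Hypothetical filesystem geometry; real values live in the superblock."""
    blocks_per_group: int = 32768  # assumed: 4 KiB blocks
    inodes_per_group: int = 8192   # assumed: depends on mkfs options
    first_data_block: int = 0      # assumed: 0 for block sizes above 1 KiB


def inode_group(ino: int, geo: Geometry) -> int:
    """Block group that holds inode `ino` (inode numbers start at 1)."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    return (ino - 1) // geo.inodes_per_group


def block_group(block: int, geo: Geometry) -> int:
    """Block group that holds filesystem block `block`."""
    if block < geo.first_data_block:
        raise ValueError("block lies before the first data block")
    return (block - geo.first_data_block) // geo.blocks_per_group


def goal_block(ino: int, geo: Geometry) -> int:
    """First block of the inode's own group: a simple starting goal."""
    return geo.first_data_block + inode_group(ino, geo) * geo.blocks_per_group


def is_colocated(ino: int, block: int, geo: Geometry) -> bool:
    """True if the data block sits in the same group as the inode."""
    return inode_group(ino, geo) == block_group(block, geo)


if __name__ == "__main__":
    geo = Geometry()
    print(inode_group(8193, geo), goal_block(8193, geo))
    print(is_colocated(8193, 40000, geo), is_colocated(8193, 70000, geo))
```

A real allocator starts from a goal like this and searches outward when
the preferred group is full; the sketch only shows how "same block
group" is decided.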

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, therefore it
is useful to try to keep them all together.

The fifth trick is that the disk volume is cut up into 128MiB block
groups (with the default 4KiB block size); these mini-containers are
used as outlined above to try to maintain data locality. However, there
is a deliberate quirk: when a directory is created in the root
directory, the inode allocator scans the block groups and puts that
directory into the least heavily loaded block group that it can find.
This encourages directories to spread out over a disk; as the top-level
directory/file blobs fill up one block group, the allocators simply move
on to the next block group. Allegedly this scheme evens out the loading
on the block groups, though the author suspects that the directories
which are so unlucky as to land towards the end of a spinning drive get
a raw deal performance-wise.

Of course if all of these mechanisms fail, one can always use e4defrag
to defragment files.
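
To make the block group arithmetic from the fifth trick concrete, the
sketch below derives the group size from the block size (one bitmap
block tracks 8 * block_size blocks) and picks a "least heavily loaded"
group for a new top-level directory. The selection rule shown here,
most free blocks with free inodes as a tie-breaker, is a deliberately
simplified stand-in and an assumption of this example; the kernel's
actual heuristic weighs more factors. ``GroupStats`` and all function
names are hypothetical.

```python
from typing import NamedTuple, Sequence


def blocks_per_group(block_size: int) -> int:
    """One block-bitmap block has 8 * block_size bits, one per block."""
    return 8 * block_size


def group_size_bytes(block_size: int) -> int:
    """Size of a full block group in bytes."""
    return blocks_per_group(block_size) * block_size


class GroupStats(NamedTuple):
    """Hypothetical per-group counters, as a group descriptor might report."""
    free_inodes: int
    free_blocks: int


def least_loaded_group(groups: Sequence[GroupStats]) -> int:
    """Index of the group with the most free blocks (then free inodes).

    Simplified stand-in for the real policy; ties go to the lowest index.
    """
    if not groups:
        raise ValueError("no block groups to choose from")
    return max(
        range(len(groups)),
        key=lambda i: (groups[i].free_blocks, groups[i].free_inodes),
    )


if __name__ == "__main__":
    print(group_size_bytes(4096) // (1024 * 1024), "MiB per group")
    stats = [
        GroupStats(free_inodes=100, free_blocks=500),
        GroupStats(free_inodes=8000, free_blocks=30000),
        GroupStats(free_inodes=8000, free_blocks=20000),
    ]
    print("new top-level directory goes to group", least_loaded_group(stats))
```

With 4KiB blocks this gives 32768 blocks per group, which is where the
128MiB figure comes from; a 1KiB block size would yield 8MiB groups
instead.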