
=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at Documentation/admin-guide/mm/hugetlbpage.rst are
typically preallocated for application use. These huge pages are instantiated
in a task's address space at page fault time if the VMA indicates huge pages
are to be used. If no huge page exists at page fault time, the task is sent
a SIGBUS and often dies an unhappy death. Shortly after huge page support
was added, it was determined that it would be better to detect a shortage
of huge pages at mmap() time. The idea is that if there were not enough
huge pages to cover the mapping, the mmap() would fail. This was first
done with a simple check in the code at mmap() time to determine if there
were enough free huge pages to cover the mapping. Like most things in the
kernel, the code has evolved over time. However, the basic idea was to
'reserve' huge pages at mmap() time to ensure that huge pages would be
available for page faults in that mapping. The description below attempts to
describe how huge page reserve processing is done in the v4.10 kernel.


Audience
========
This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.


The Data Structures
===================

resv_huge_pages
    This is a global (per-hstate) count of reserved huge pages. Reserved
    huge pages are only available to the task which reserved them.
    Therefore, the number of huge pages generally available is computed
    as (``free_huge_pages - resv_huge_pages``).
Reserve Map
    A reserve map is described by the structure::

        struct resv_map {
            struct kref refs;
            spinlock_t lock;
            struct list_head regions;
            long adds_in_progress;
            struct list_head region_cache;
            long region_cache_count;
        };

    There is one reserve map for each huge page mapping in the system.
    The regions list within the resv_map describes the regions within
    the mapping. A region is described as::

        struct file_region {
            struct list_head link;
            long from;
            long to;
        };

    The 'from' and 'to' fields of the file region structure are huge page
    indices into the mapping. Depending on the type of mapping, a
    region in the resv_map may indicate reservations exist for the
    range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
    These are stored in the bottom bits of the reservation map pointer.

    ``#define HPAGE_RESV_OWNER    (1UL << 0)``
        Indicates this task is the owner of the reservations
        associated with the mapping.
    ``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
        Indicates that the task which originally mapped this range (and
        created the reserves) has unmapped a page from this task (the
        child) due to a failed COW.
Page Flags
    The PagePrivate page flag is used to indicate that a huge page
    reservation must be restored when the huge page is freed. More
    details will be discussed in the "Freeing huge pages" section.


Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared. If private,
it is typically only available to a single address space (task). If shared,
it can be mapped into multiple address spaces (tasks). The location and
semantics of the reservation map are significantly different for the two types
of mappings. Location differences are:

- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data. This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
- For shared mappings, the reservation map hangs off the inode. Specifically,
  inode->i_mapping->private_data.
  Since shared mappings are always backed by files in the hugetlbfs
  filesystem, the hugetlbfs code ensures each inode contains a reservation
  map. As a result, the reservation map is allocated when the inode is
  created.


Creating Reservations
=====================
Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

    int hugetlb_reserve_pages(struct inode *inode,
                              long from, long to,
                              struct vm_area_struct *vma,
                              vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check if the NORESERVE
flag was specified in either the shmget() or mmap() call. If NORESERVE
was specified, then this routine returns immediately as no reservations
are desired.

The arguments 'from' and 'to' are huge page indices into the mapping or
underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
the length of the segment/mapping. For mmap(), the offset argument could
be used to specify the offset into the underlying file. In such a case,
the 'from' and 'to' arguments have been adjusted by this offset.

One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page. As reservations are
  consumed, the reservation map is not modified.
- For private mappings, the lack of an entry in the reservation map indicates
  a reservation exists for the corresponding page. As reservations are
  consumed, entries are added to the reservation map. Therefore, the
  reservation map can also be used to determine which reservations have
  been consumed.
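
The opposite meaning of a map entry for the two mapping types can be restated
as a small user-space sketch. This is not kernel code: the enum and the
function ``model_reservation_exists()`` are hypothetical names introduced only
to illustrate the two rules in the list above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model, not kernel code. */
enum model_map_type { MODEL_SHARED, MODEL_PRIVATE };

/*
 * Given whether the reservation map holds an entry covering a page,
 * report whether a reservation exists for that page.
 *
 * Shared:  an entry means a reservation exists (or did exist).
 * Private: an entry means the reservation was already consumed, so
 *          only the absence of an entry means one still exists.
 */
bool model_reservation_exists(enum model_map_type type, bool entry_present)
{
    assert(type == MODEL_SHARED || type == MODEL_PRIVATE);

    if (type == MODEL_SHARED)
        return entry_present;
    return !entry_present;
}
```

The same boolean input therefore gives opposite answers depending on the
mapping type, which is why callers normally go through helper routines that
hide this detail.
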

For private mappings, hugetlb_reserve_pages() creates the reservation map and
hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set
to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment. For private mappings, this is
always the value (to - from). However, for shared mappings it is possible that
some reservations may already exist within the range [from, to). See the
section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool. If so, the subpool is consulted
to ensure there is sufficient space for the mapping. It is possible that the
subpool has set aside reservations that can be used for the mapping. See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known. The routine hugetlb_acct_memory() is called to check
for and take the requested number of reservations. hugetlb_acct_memory()
calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation. If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::

    if (resv_needed <= (free_huge_pages - resv_huge_pages))
        resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.

If there were enough free huge pages and the global count resv_huge_pages
was adjusted, then the reservation map associated with the mapping is
modified to reflect the reservations.
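
The counter check can be sketched in user space as follows. The sketch assumes
the number of generally available pages is ``free_huge_pages -
resv_huge_pages``, as stated in the description of resv_huge_pages. The
structure ``model_hstate`` and the function ``model_take_reservations()`` are
hypothetical; surplus pages, overcommit and locking are deliberately left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model of the per-hstate counters. */
struct model_hstate {
    long free_huge_pages;   /* pages sitting on the free lists */
    long resv_huge_pages;   /* pages already promised to mappings */
};

/*
 * Try to take 'resv_needed' new reservations.  The kernel performs the
 * equivalent check with hugetlb_lock held; locking is omitted here.
 * Returns 0 on success, -ENOMEM if too few unreserved pages remain.
 */
int model_take_reservations(struct model_hstate *h, long resv_needed)
{
    long available = h->free_huge_pages - h->resv_huge_pages;

    assert(resv_needed >= 0);
    if (resv_needed > available)
        return -ENOMEM;

    h->resv_huge_pages += resv_needed;
    return 0;
}
```

With 10 free pages of which 4 are already reserved, a request for 6
reservations succeeds and leaves no unreserved pages, so a following request
for a single reservation fails.
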
In the case of a shared mapping, a file_region will exist that includes the
range 'from' - 'to'. For private mappings, no modifications are made to the
reservation map as lack of an entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_hugetlb_folio()::

    struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
                                      unsigned long addr, int avoid_reserve)

alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition,
alloc_hugetlb_folio takes the argument avoid_reserve which indicates reserves
should not be used even if it appears they have been set aside for the
specified address. The avoid_reserve argument is most often used in the case
of Copy on Write and Page Migration where additional copies of an existing
page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping (vma). See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does.
The value returned from vma_needs_reservation() is generally
0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
If the subpool contains reservations, one can be used for this allocation.
However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to
  alloc_hugetlb_folio().
- chg, even though this argument is of type long only the values 0 or 1 are
  passed to dequeue_huge_page_vma. If the value is 0, it indicates a
  reservation exists (see the section "Memory Policy and Reservations" for
  possible issues). If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page. If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list. If there was a reservation
associated with the page, the following adjustments are made::

    SetPagePrivate(page);   /* Indicates allocating this page consumed
                             * a reservation, and if an error is
                             * encountered such that the page must be
                             * freed, the reservation will be restored. */
    resv_huge_pages--;      /* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator. This
brings up the issue of surplus huge pages and overcommit which is beyond
the scope of reservations.
Even if a surplus page is allocated, the same reservation based adjustments
as above will be made: SetPagePrivate(page) and resv_huge_pages--.

After obtaining a new hugetlb folio, (folio)->_hugetlb_subpool is set to the
value of the subpool associated with the page if it exists. This will be used
for subpool accounting when the folio is freed.

The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation. In general, this involves
ensuring the page is represented within a file_region structure of the region
map. For shared mappings where the reservation was present, an entry
in the reserve map already existed so no change is made. However, if there
was no reservation in a shared mapping or this was a private mapping a new
entry must be created.

It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the
call to vma_commit_reservation() after the folio was allocated. This would
be possible if hugetlb_reserve_pages was called for the same page in a shared
mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the
return value from vma_needs_reservation and vma_commit_reservation. If such
a race is detected, the subpool and global reserve counts are adjusted to
compensate. See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.


Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task. Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous
reverse mapping.
In both cases, the PagePrivate flag is cleared. Therefore, when a huge page
that has been instantiated is freed no adjustment is made to the global
reservation count (resv_huge_pages).


Freeing Huge Pages
==================

Huge pages are freed by free_huge_folio(). It is only passed a pointer
to the folio as it is called from the generic MM code. When a huge page
is freed, reservation accounting may need to be performed. This would
be the case if the page was associated with a subpool that contained
reserves, or the page is being freed on an error path where a global
reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page. If this
routine returns a value of 0 (which does not equal the value passed 1) it
indicates reserves are associated with the subpool, and this newly freed page
must be used to keep the number of subpool reserves above the minimum size.
Therefore, the global resv_huge_pages counter is incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented.

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size. The hstate
tracks all huge pages of the specified size. A subpool represents a subset
of pages within a hstate that is associated with a mounted hugetlbfs
filesystem.

When a hugetlbfs filesystem is mounted a min_size option can be specified
which indicates the minimum number of huge pages required by the filesystem.
If this option is specified, the number of huge pages corresponding to
min_size is reserved for use by the filesystem. This number is tracked in
the min_hpages field of a struct hugepage_subpool. At mount time,
hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
huge pages. If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are
obtained from or released back to a subpool. They perform all subpool
accounting, and track any reservations associated with the subpool.
hugepage_subpool_get/put_pages are passed the number of huge pages by which
to adjust the subpool 'used page' count (up for get, down for put). Normally,
they return the same value that was passed or an error if not enough pages
exist in the subpool.

However, if reserves are associated with the subpool a return value less
than the passed value may be returned. This return value indicates the
number of additional global pool adjustments which must be made. For example,
suppose a subpool contains 3 reserved huge pages and someone asks for 5.
The 3 reserved pages associated with the subpool can be used to satisfy part
of the request. But, 2 pages must be obtained from the global pools. To
relay this information to the caller, the value 2 is returned. The caller
is then responsible for attempting to obtain the additional two pages from
the global pools.


COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings. In this case,
two tasks can be pointing at the same previously allocated page. One task
attempts to write to the page, so a new page must be allocated so that each
task points to its own page.

When the page was originally allocated, the reservation for that page was
consumed. When an attempt to allocate a new page is made as a result of
COW, it is possible that no huge pages are free and the allocation
will fail.

When the private mapping was originally created, the owner of the mapping
was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
map of the owner. Since the owner created the mapping, the owner owns all
the reservations associated with the mapping. Therefore, when a write fault
occurs and there is no page available, different action is taken for the owner
and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and
the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the
original reservation. To accomplish this, the page is unmapped from the
non-owning task. In this way, the only reference is from the owning task.
In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
of the non-owning task. The non-owning task may receive a SIGBUS if it later
faults on a non-present page. But, the original owner of the
mapping/reservation will behave as expected.


.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map. Typically, these routines are not called directly. Rather,
a reservation map helper routine is called which calls one of these low level
routines. These low level routines are fairly well documented in the source
code (mm/hugetlb.c).
These routines are::

    long region_chg(struct resv_map *resv, long f, long t);
    long region_add(struct resv_map *resv, long f, long t);
    void region_abort(struct resv_map *resv, long f, long t);
    long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2)
  a) If the operation can succeed, region_add() is called to actually modify
     the reservation map for the same range [f, t) previously passed to
     region_chg().
  b) If the operation can not succeed, region_abort() is called for the same
     range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same
range. region_chg() is responsible for pre-allocating any data structures
necessary to ensure the subsequent operations (specifically region_add())
will succeed.

As mentioned above, region_chg() determines the number of pages in the range
which are NOT currently represented in the map. This number is returned to
the caller. region_add() returns the number of pages in the range added to
the map. In most cases, the return value of region_add() is the same as the
return value of region_chg(). However, in the case of shared mappings it is
possible for changes to the reservation map to be made between the calls to
region_chg() and region_add(). In this case, the return value of region_add()
will not match the return value of region_chg().
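
The two step pattern and the return value comparison can be illustrated with
a user-space sketch. It is only a model under stated assumptions: the map is a
fixed array of flags, one per huge page index, rather than a list of
file_region structures, and every ``model_`` name and ``MODEL_MAP_PAGES`` is
hypothetical. Pre-allocation, locking and the abort step are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAP_PAGES 64  /* hypothetical fixed mapping size */

/* Hypothetical model: one flag per huge page index. */
struct model_resv_map {
    bool present[MODEL_MAP_PAGES];
};

/* Step 1: count pages in [f, t) NOT yet represented in the map. */
long model_region_chg(const struct model_resv_map *m, long f, long t)
{
    long chg = 0;

    assert(0 <= f && f <= t && t <= MODEL_MAP_PAGES);
    for (long i = f; i < t; i++)
        if (!m->present[i])
            chg++;
    return chg;
}

/* Step 2a: add [f, t) to the map, returning the pages actually added. */
long model_region_add(struct model_resv_map *m, long f, long t)
{
    long added = 0;

    assert(0 <= f && f <= t && t <= MODEL_MAP_PAGES);
    for (long i = f; i < t; i++) {
        if (!m->present[i]) {
            m->present[i] = true;
            added++;
        }
    }
    return added;
}

/*
 * Reservations the caller took after step 1 but did not need because
 * part of the range was added by someone else before step 2a.
 */
long model_excess_reservations(long chg, long added)
{
    return chg - added;
}
```

If the model's step 1 reports 4 pages for [0, 4) and another caller adds page
2 before step 2a runs, step 2a adds only 3 pages and the difference of 1 is
what the caller would have to give back.
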
It is likely that in such cases global counts and subpool accounting will be
incorrect and in need of adjustment. It is the responsibility of the caller
to check for this condition and make the appropriate adjustments.

The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed. Before freeing the reservation
  map, all the individual file_region structures must be freed. In this case
  region_del is passed the range [0, LONG_MAX).
- When a hugetlbfs file is being truncated. In this case, all allocated pages
  after the new file size must be freed. In addition, any file_region entries
  in the reservation map past the new end of file must be deleted. In this
  case, region_del is passed the range [new_end_of_file, LONG_MAX).
- When a hole is being punched in a hugetlbfs file. In this case, huge pages
  are removed from the middle of the file one at a time. As the pages are
  removed, region_del() is called to remove the corresponding entry from the
  reservation map. In this case, region_del is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map. In VERY rare cases, region_del() can fail. This can only
happen in the hole punch case where it has to split an existing file_region
entry and can not allocate a new structure. In this error case, region_del()
will return -ENOMEM. The problem here is that the reservation map will
indicate that there is a reservation for the page. However, the subpool and
global reservation counts will not reflect the reservation.
To handle this situation, the routine hugetlb_fix_reserve_counts() is called
to adjust the counters so that they correspond with the reservation map entry
that could not be deleted.

region_count() is called when unmapping a private huge page mapping. In
private mappings, the lack of an entry in the reservation map indicates that
a reservation exists. Therefore, by counting the number of entries in the
reservation map we know how many reservations were consumed and how many are
outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only interested in reservations for a specific huge
page, so they just pass in an address instead of a range. In addition,
they pass in the associated VMA. From the VMA, the type of mapping (private
or shared) and the location of the reservation map (inode or VMA) can be
determined. These routines simply call the underlying routines described
in the section "Reservation Map Modifications". However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

    long vma_needs_reservation(struct hstate *h,
                               struct vm_area_struct *vma,
                               unsigned long addr)

This routine calls region_chg() for the specified page. If no reservation
exists, 1 is returned. If a reservation exists, 0 is returned::

    long vma_commit_reservation(struct hstate *h,
                                struct vm_area_struct *vma,
                                unsigned long addr)

This calls region_add() for the specified page.
As in the case of region_chg and region_add, this routine is to be called
after a previous call to vma_needs_reservation. It will add a reservation
entry for the page. It returns 1 if the reservation was added and 0 if not.
The return value should be compared with the return value of the previous
call to vma_needs_reservation. An unexpected difference indicates the
reservation map was modified between calls::

    void vma_end_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This calls region_abort() for the specified page. As in the case of region_chg
and region_abort, this routine is to be called after a previous call to
vma_needs_reservation. It will abort/end the in progress reservation add
operation::

    long vma_add_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths. It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt
to add a reservation to the reservation map. It takes into account the
different reservation map semantics for private and shared mappings. Hence,
region_add is called for shared mappings (as an entry present in the map
indicates a reservation), and region_del is called for private mappings (as
the absence of an entry in the map indicates a reservation). See the section
"Reservation cleanup in error paths" for more information on what needs to
be done on error paths.


Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps. First vma_needs_reservation
is called before a page is allocated.
If the allocation is successful, then vma_commit_reservation is called. If
not, vma_end_reservation is called. Global and subpool reservation counts are
adjusted based on success or failure of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is
cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated. In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments. If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_folio
will increment the global reservation count. However, the reservation map
indicates the reservation was consumed. This resulting inconsistent state
will cause the 'leak' of a reserved huge page. The global reserve count will
be higher than it should be and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation. It
is fairly well documented. The intention of this routine is to restore
the reservation map to the way it was before the page allocation. In this
way, the state of the reservation map will correspond to the global reservation
count after the page is freed.

The routine restore_reserve_on_error itself may encounter errors while
attempting to restore the reservation map entry. In this case, it will
simply clear the PagePrivate flag of the page. In this way, the global
reserve count will not be incremented when the page is freed. However, the
reservation map will continue to look as though the reservation was consumed.
A page can still be allocated for the address, but it will not use a reserved
page as originally intended.
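
The interaction between the PagePrivate flag and the global count on this
error path can be sketched in user space. This is a simplified model, not the
kernel implementation: ``model_counts``, ``model_page`` and both functions are
hypothetical, and the boolean ``map_restored`` stands in for whether the
reservation map entry could be put back the way it was.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model, not kernel code. */
struct model_counts {
    long resv_huge_pages;   /* global reservation count */
};

struct model_page {
    bool page_private;      /* stands in for the PagePrivate flag */
};

/* Model of freeing: a set flag hands the reservation back. */
void model_free_page(struct model_counts *c, struct model_page *p)
{
    assert(c->resv_huge_pages >= 0);
    if (p->page_private) {
        c->resv_huge_pages++;
        p->page_private = false;
    }
}

/*
 * Model of the error path for a page that consumed a reservation but
 * was never instantiated.  If the map entry could not be restored, the
 * flag is cleared so that freeing does not raise the global count.
 */
void model_restore_on_error(struct model_page *p, bool map_restored)
{
    if (!map_restored)
        p->page_private = false;
}
```

In the model, a page whose map entry was restored gives its reservation back
when freed, while a page whose restore failed is freed without touching the
count, so the count never ends up higher than the map can account for.
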

There is some code (most notably userfaultfd) which can not call
restore_reserve_on_error. In this case, it simply modifies the PagePrivate
flag so that a reservation will not be leaked when the huge page is freed.


Reservations and Memory Policy
==============================
Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code. The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account. While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy::

    /*
     * When cpuset is configured, it breaks the strict hugetlb page
     * reservation as the accounting is done on a global variable. Such
     * reservation is completely rubbish in the presence of cpuset because
     * the reservation is not checked against page availability for the
     * current cpuset. Application can still potentially OOM'ed by kernel
     * with lack of free htlb page in cpuset that the task is in.
     * Attempt to enforce strict accounting with cpuset is almost
     * impossible (or too ugly) because cpuset is too fluid that
     * task or memory node can be dynamically moved between cpusets.
     *
     * The change of semantics for shared hugetlb mapping with cpuset is
     * undesirable. However, in order to preserve some of the semantics,
     * we fall back to check against current free page availability as
     * a best attempt and hopefully to minimize the impact of changing
     * semantics that cpuset has.
     */

Huge page reservations were added to prevent unexpected page allocation
failures (OOM) at page fault time. However, if an application makes use
of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes.
This is true even if there is a sufficient number of global reservations.

Hugetlbfs regression testing
============================

The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions. In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.

--
Mike Kravetz, 7 April 2017
