1 ================ 2 Futex Requeue PI 3 ================ 4 5 Requeueing of tasks from a non-PI futex to a PI futex requires 6 special handling in order to ensure the underlying rt_mutex is never 7 left without an owner if it has waiters; doing so would break the PI 8 boosting logic [see rt-mutex-design.rst] For the purposes of 9 brevity, this action will be referred to as "requeue_pi" throughout 10 this document. Priority inheritance is abbreviated throughout as 11 "PI". 12 13 Motivation 14 ---------- 15 16 Without requeue_pi, the glibc implementation of 17 pthread_cond_broadcast() must resort to waking all the tasks waiting 18 on a pthread_condvar and letting them try to sort out which task 19 gets to run first in classic thundering-herd formation. An ideal 20 implementation would wake the highest-priority waiter, and leave the 21 rest to the natural wakeup inherent in unlocking the mutex 22 associated with the condvar. 23 24 Consider the simplified glibc calls:: 25 26 /* caller must lock mutex */ 27 pthread_cond_wait(cond, mutex) 28 { 29 lock(cond->__data.__lock); 30 unlock(mutex); 31 do { 32 unlock(cond->__data.__lock); 33 futex_wait(cond->__data.__futex); 34 lock(cond->__data.__lock); 35 } while(...) 36 unlock(cond->__data.__lock); 37 lock(mutex); 38 } 39 40 pthread_cond_broadcast(cond) 41 { 42 lock(cond->__data.__lock); 43 unlock(cond->__data.__lock); 44 futex_requeue(cond->data.__futex, cond->mutex); 45 } 46 47 Once pthread_cond_broadcast() requeues the tasks, the cond->mutex 48 has waiters. Note that pthread_cond_wait() attempts to lock the 49 mutex only after it has returned to user space. This will leave the 50 underlying rt_mutex with waiters, and no owner, breaking the 51 previously mentioned PI-boosting algorithms. 52 53 In order to support PI-aware pthread_condvar's, the kernel needs to 54 be able to requeue tasks to PI futexes. This support implies that 55 upon a successful futex_wait system call, the caller would return to 56 user space already holding the PI futex. The glibc implementation 57 would be modified as follows:: 58 59 60 /* caller must lock mutex */ 61 pthread_cond_wait_pi(cond, mutex) 62 { 63 lock(cond->__data.__lock); 64 unlock(mutex); 65 do { 66 unlock(cond->__data.__lock); 67 futex_wait_requeue_pi(cond->__data.__futex); 68 lock(cond->__data.__lock); 69 } while(...) 70 unlock(cond->__data.__lock); 71 /* the kernel acquired the mutex for us */ 72 } 73 74 pthread_cond_broadcast_pi(cond) 75 { 76 lock(cond->__data.__lock); 77 unlock(cond->__data.__lock); 78 futex_requeue_pi(cond->data.__futex, cond->mutex); 79 } 80 81 The actual glibc implementation will likely test for PI and make the 82 necessary changes inside the existing calls rather than creating new 83 calls for the PI cases. Similar changes are needed for 84 pthread_cond_timedwait() and pthread_cond_signal(). 85 86 Implementation 87 -------------- 88 89 In order to ensure the rt_mutex has an owner if it has waiters, it 90 is necessary for both the requeue code, as well as the waiting code, 91 to be able to acquire the rt_mutex before returning to user space. 92 The requeue code cannot simply wake the waiter and leave it to 93 acquire the rt_mutex as it would open a race window between the 94 requeue call returning to user space and the waiter waking and 95 starting to run. This is especially true in the uncontended case. 96 97 The solution involves two new rt_mutex helper routines, 98 rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which 99 allow the requeue code to acquire an uncontended rt_mutex on behalf 100 of the waiter and to enqueue the waiter on a contended rt_mutex. 101 Two new system calls provide the kernel<->user interface to 102 requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI. 103 104 FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() 105 and pthread_cond_timedwait()) to block on the initial futex and wait 106 to be requeued to a PI-aware futex. The implementation is the 107 result of a high-speed collision between futex_wait() and 108 futex_lock_pi(), with some extra logic to check for the additional 109 wake-up scenarios. 110 111 FUTEX_CMP_REQUEUE_PI is called by the waker 112 (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and 113 possibly wake the waiting tasks. Internally, this system call is 114 still handled by futex_requeue (by passing requeue_pi=1). Before 115 requeueing, futex_requeue() attempts to acquire the requeue target 116 PI futex on behalf of the top waiter. If it can, this waiter is 117 woken. futex_requeue() then proceeds to requeue the remaining 118 nr_wake+nr_requeue tasks to the PI futex, calling 119 rt_mutex_start_proxy_lock() prior to each requeue to prepare the 120 task as a waiter on the underlying rt_mutex. It is possible that 121 the lock can be acquired at this stage as well, if so, the next 122 waiter is woken to finish the acquisition of the lock. 123 124 FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but 125 their sum is all that really matters. futex_requeue() will wake or 126 requeue up to nr_wake + nr_requeue tasks. It will wake only as many 127 tasks as it can acquire the lock for, which in the majority of cases 128 should be 0 as good programming practice dictates that the caller of 129 either pthread_cond_broadcast() or pthread_cond_signal() acquire the 130 mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that 131 nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for 132 signal.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.