
=============
eBPF verifier
=============

The safety of the eBPF program is determined in two steps.

The first step does a DAG check to disallow loops and other CFG validation.
In particular, it will detect programs that have unreachable instructions
(though the classic BPF checker allows them).

The second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.

At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If the verifier sees an insn that does R2=R1, then R2 now has type
PTR_TO_CTX as well and can be used on the right hand side of an expression.
If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes an invalid pointer.
(In 'secure' mode the verifier will reject any type of pointer arithmetic to
make sure that kernel addresses don't leak to unprivileged users.)

If a register was never written to, it's not readable::

  bpf_mov R0 = R2
  bpf_exit

will be rejected, since R2 is unreadable at the start of the program.

After a kernel function call, R1-R5 are reset to unreadable and
R0 has the return type of the function.

Since R6-R9 are callee saved, their state is preserved across the call.

::

  bpf_mov R6 = 1
  bpf_call foo
  bpf_mov R0 = R6
  bpf_exit

is a correct program. If there was R1 instead of R6, it would have
been rejected.

load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example::

  bpf_mov R1 = 1
  bpf_mov R2 = 2
  bpf_xadd *(u32 *)(R1 + 3) += R2
  bpf_exit

will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.
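
The readability rules above (only R1 is readable at entry, R1-R5 become
unreadable after a call, R6-R9 survive it) can be illustrated with a toy model.
This is a minimal sketch and not the verifier's code: ``struct regs``,
``init_regs()``, ``do_mov()``, ``do_mov_imm()`` and ``do_call()`` are
hypothetical names, and only a handful of register types are modelled.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified register types. */
enum reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_CTX, PTR_TO_STACK };

#define NUM_REGS 11 /* R0..R10 */

struct regs {
	enum reg_type type[NUM_REGS];
};

/* Program entry: R1 holds the context, R10 the frame pointer. */
static void init_regs(struct regs *r)
{
	for (int i = 0; i < NUM_REGS; i++)
		r->type[i] = NOT_INIT;
	r->type[1] = PTR_TO_CTX;
	r->type[10] = PTR_TO_STACK;
}

/* dst = src; rejected (returns false) if src was never written. */
static bool do_mov(struct regs *r, int dst, int src)
{
	if (r->type[src] == NOT_INIT)
		return false;
	r->type[dst] = r->type[src];
	return true;
}

/* dst = immediate; the destination becomes a readable scalar. */
static void do_mov_imm(struct regs *r, int dst)
{
	r->type[dst] = SCALAR_VALUE;
}

/* Helper call: R1-R5 are clobbered, R0 gets the return type. */
static void do_call(struct regs *r, enum reg_type ret)
{
	for (int i = 1; i <= 5; i++)
		r->type[i] = NOT_INIT;
	r->type[0] = ret;
}
```

In this model ``do_mov(&r, 0, 2)`` fails right after ``init_regs()``, while
reading R6 after ``do_mov_imm(&r, 6)`` and a ``do_call()`` succeeds, which
mirrors the two small programs shown above.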

At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``).
A callback is used to customize the verifier to restrict eBPF program access to
only certain fields within the ctx structure with specified size and alignment.

For example, the following insn::

  bpf_ld R0 = *(u32 *)(R6 + 8)

intends to load a word from address R6 + 8 and store it into R0.
If R6=PTR_TO_CTX, via the is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then the access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example the offset is 8,
so it will fail verification, since it's out of bounds.

The verifier will allow an eBPF program to read data from the stack only after
it wrote into it.

The classic BPF verifier does a similar check with M[0-15] memory slots.
For example::

  bpf_ld R0 = *(u32 *)(R10 - 4)
  bpf_exit

is an invalid program.
Though R10 is a correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.

Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
The eBPF verifier will check that registers match argument constraints.
After the call register R0 will be set to the return type of the function.

Function calls are the main mechanism to extend the functionality of eBPF
programs. Socket filters may let programs call one set of functions, whereas
tracing filters may allow a completely different set.

If a function is made accessible to eBPF programs, it needs to be thought
through from a safety point of view. The verifier will guarantee that the
function is called with valid arguments.
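
The stack bounds rule described above can be sketched as a small predicate.
This is an illustration under stated assumptions, not the kernel's check:
``stack_access_ok()`` is a hypothetical name, the value 512 for
``MAX_BPF_STACK`` is assumed here, access sizes are assumed to be 1, 2, 4 or 8
bytes, and natural alignment is always enforced.

```c
#include <stdbool.h>

/* Assumed stack size in bytes for this sketch. */
#define MAX_BPF_STACK 512

/*
 * Return true if an access of 'size' bytes at frame-pointer offset 'off'
 * lies within [-MAX_BPF_STACK, 0) and is naturally aligned.
 */
static bool stack_access_ok(int off, int size)
{
	if (size <= 0)
		return false;
	if (off >= 0 || off < -MAX_BPF_STACK)
		return false;	/* starts outside the stack */
	if (off + size > 0)
		return false;	/* runs past the frame pointer */
	return off % size == 0;	/* natural alignment */
}
```

With this model the ``R6 + 8`` load from the example is rejected because the
offset is positive, while a 4-byte access at offset -4 is in bounds. Whether
that slot may be *read* still depends on an earlier store, which this sketch
does not track.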

seccomp and socket filters have different security restrictions for classic BPF.
Seccomp solves this with a two-stage verifier: the classic BPF verifier is
followed by the seccomp verifier. In the case of eBPF one configurable verifier
is shared for all use cases.

See details of the eBPF verifier in kernel/bpf/verifier.c

Register value tracking
=======================

In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with ``struct bpf_reg_state``, defined in
include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer
values. Each register state has a type, which is either NOT_INIT (the register
has not been written to), SCALAR_VALUE (some value which is not usable as a
pointer), or a pointer type. The types of pointers describe their base, as
follows:


    PTR_TO_CTX
                        Pointer to bpf_context.
    CONST_PTR_TO_MAP
                        Pointer to struct bpf_map. "Const" because arithmetic
                        on these pointers is forbidden.
    PTR_TO_MAP_VALUE
                        Pointer to the value stored in a map element.
    PTR_TO_MAP_VALUE_OR_NULL
                        Either a pointer to a map value, or NULL; map accesses
                        (see maps.rst) return this type, which becomes a
                        PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on
                        these pointers is forbidden.
    PTR_TO_STACK
                        Frame pointer.
    PTR_TO_PACKET
                        skb->data.
    PTR_TO_PACKET_END
                        skb->data + headlen; arithmetic forbidden.
    PTR_TO_SOCKET
                        Pointer to struct bpf_sock_ops, implicitly refcounted.
    PTR_TO_SOCKET_OR_NULL
                        Either a pointer to a socket, or NULL; socket lookup
                        returns this type, which becomes a PTR_TO_SOCKET when
                        checked != NULL. PTR_TO_SOCKET is reference-counted,
                        so programs must release the reference through the
                        socket release function before the end of the program.
                        Arithmetic on these pointers is forbidden.

However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.

The verifier's knowledge about the variable offset consists of:

* minimum and maximum values as unsigned
* minimum and maximum values as signed

* knowledge of the values of individual bits, in the form of a 'tnum': a u64
  'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
  1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
  mask and value; no bit should ever be 1 in both. For example, if a byte is read
  into a register from memory, the register's top 56 bits are known zero, while
  the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
  then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
  0x1ff), because of potential carries.

Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
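
The tnum example given earlier (read a byte, OR it with 0x40, then add 1) can
be reproduced with a few lines of C. The sketch below is modelled on the tnum
description in this document; ``tnum_const()``, ``tnum_or()`` and
``tnum_add()`` are written here for illustration and may differ in detail from
the kernel's own helpers.

```c
#include <stdint.h>

/* A tracked number: 'value' holds known-1 bits, 'mask' holds unknown bits. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* A fully known constant has no unknown bits. */
static struct tnum tnum_const(uint64_t v)
{
	struct tnum t = { v, 0 };
	return t;
}

/* OR: a bit known to be 1 on either side is known to be 1 in the result. */
static struct tnum tnum_or(struct tnum a, struct tnum b)
{
	uint64_t v = a.value | b.value;
	uint64_t mu = a.mask | b.mask;
	struct tnum t = { v, mu & ~v };
	return t;
}

/* ADD: any bit that a carry might reach becomes unknown. */
static struct tnum tnum_add(struct tnum a, struct tnum b)
{
	uint64_t sm = a.mask + b.mask;		/* sum of the unknown parts */
	uint64_t sv = a.value + b.value;	/* sum of the known parts */
	uint64_t sigma = sm + sv;		/* largest possible sum */
	uint64_t chi = sigma ^ sv;		/* bits that carries may change */
	uint64_t mu = chi | a.mask | b.mask;
	struct tnum t = { sv & ~mu, mu };
	return t;
}
```

Starting from ``(0x0; 0xff)``, ``tnum_or()`` with ``tnum_const(0x40)`` gives
``(0x40; 0xbf)``, and ``tnum_add()`` with ``tnum_const(1)`` then gives
``(0x0; 0x1ff)``, matching the values quoted in the list above.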

PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding a variable to a packet pointer register A, if you then copy
it to another register B and then add a constant 4 to A, both registers will
share the same 'id' but A will have a fixed offset of +4. Then if A is
bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
now known to have a safe range of at least 4 bytes. See 'Direct packet access',
below, for more on PTR_TO_PACKET ranges.

The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.

As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.

The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
to all copies of the pointer returned from a socket lookup. This has similar
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
represents a reference to the corresponding ``struct sock``. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and, in
the non-NULL case, pass the valid reference to the socket release function.

Direct packet access
====================

In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex::

    1:  r4 = *(u32 *)(r1 +80)  /* load skb->data_end */
    2:  r3 = *(u32 *)(r1 +76)  /* load skb->data */
    3:  r5 = r3
    4:  r5 += 14
    5:  if r5 > r4 goto pc+16
    R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */

This 2-byte load from the packet is safe to do, since the program author
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes. The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
to the packet data, but constant 14 was added to the register, so
it now points to ``skb->data + 14`` and the accessible range is
[R5, R5 + 14 - 14), which is zero bytes.
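
The range arithmetic above can be written as a one-line check. This is a
simplified sketch, assuming non-negative offsets only; ``struct pkt_reg`` and
``pkt_access_ok()`` are hypothetical names that model just the ``off`` and
``r`` fields printed in the verifier log.

```c
#include <stdbool.h>
#include <stdint.h>

/* The 'off' and 'r' (range) parts of a pkt(id=..,off=..,r=..) state. */
struct pkt_reg {
	uint32_t off;	/* constant already added to the pointer */
	uint32_t range;	/* bytes from skb->data proven accessible */
};

/*
 * An access of 'size' bytes at instruction offset 'insn_off' is safe only
 * if it ends within the proven range, measured from skb->data.
 */
static bool pkt_access_ok(const struct pkt_reg *reg, uint32_t insn_off,
			  uint32_t size)
{
	return (uint64_t)reg->off + insn_off + size <= reg->range;
}
```

For R3 (``off=0, r=14``) the 2-byte load at offset 12 passes because
0 + 12 + 2 does not exceed 14. For R5 (``off=14, r=14``) even a 1-byte load at
offset 0 fails, which is the "zero bytes" case noted above.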

More complex packet access may look like::


    R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
    7:  r4 = *(u8 *)(r3 +12)
    8:  r4 *= 14
    9:  r3 = *(u32 *)(r1 +76) /* load skb->data */
    10:  r3 += r4
    11:  r2 = r1
    12:  r2 <<= 48
    13:  r2 >>= 48
    14:  r3 += r2
    15:  r2 = r3
    16:  r2 += 8
    17:  r1 = *(u32 *)(r1 +80) /* load skb->data_end */
    18:  if r2 > r1 goto pc+2
    R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
    19:  r1 = *(u8 *)(r3 +4)

The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
offset within a packet and since the program author did
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add'/'sub' operations on packet registers. Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.

Operation ``r3 += rX`` may overflow and become less than original skb->data,
therefore the verifier has to prevent that. So when it sees ``r3 += rX``
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
against skb->data_end will not give us 'range' information, so attempts to read
through the pointer will give "invalid access to packet" error.

Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
of the register are guaranteed to be zero, and nothing is known about the lower
8 bits.
After insn ``r4 *= 14`` the state becomes
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
value by constant 14 will keep upper 52 bits as zero, also the least significant
bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
extending. This logic is implemented in adjust_reg_min_max_vals() function,
which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
versa) and adjust_scalar_min_max_vals() for operations on two scalars.

The end result is that a bpf program author can access the packet directly
using normal C code as::

  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;
  struct eth_hdr *eth = data;
  struct iphdr *iph = data + sizeof(*eth);
  struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);

  if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
          return 0;
  if (eth->h_proto != htons(ETH_P_IP))
          return 0;
  if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
          return 0;
  if (udp->dest == 53 || udp->source == 9)
          ...;

which makes such programs easier to write compared to the LD_ABS insn
and significantly faster.

Pruning
=======

The verifier does not actually walk all possible paths through the program. For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction. If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well. For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe.
Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe. The implementation is in the function regsafe().
Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().

Some technical details about the state pruning implementation can be found
below.

Register liveness tracking
--------------------------

In order to make state pruning effective, liveness state is tracked for each
register and stack slot. The basic idea is to track which registers and stack
slots are actually used during subsequent execution of the program, until
program exit is reached. Registers and stack slots that were never used could be
removed from the cached state thus making more states equivalent to a cached
state. This could be illustrated by the following program::

  0: call bpf_get_prandom_u32()
  1: r1 = 0
  2: if r0 == 0 goto +1
  3: r0 = 1
  --- checkpoint ---
  4: r0 = r1
  5: exit

Suppose that a state cache entry is created at instruction #4 (such entries are
also called "checkpoints" in the text below). The verifier could reach the
instruction with one of two possible register states:

* r0 = 1, r1 = 0
* r0 = 0, r1 = 0

However, only the value of register ``r1`` is important to successfully finish
verification. The goal of the liveness tracking algorithm is to spot this fact
and figure out that both states are actually equivalent.
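
The effect of liveness on state comparison can be shown with a toy model of the
two-register example above. This is only a sketch of the idea: ``struct
toy_reg``, ``struct toy_state`` and ``toy_states_equal()`` are hypothetical,
registers hold plain constants, and a register counts as live when a
``live_read`` flag is set.

```c
#include <stdbool.h>
#include <stdint.h>

#define NREGS 2 /* r0 and r1 are enough for the example */

struct toy_reg {
	int64_t value;	/* known constant held by the register */
	bool live_read;	/* true if some later instruction reads it */
};

struct toy_state {
	struct toy_reg regs[NREGS];
};

/*
 * The current state matches the cached one if every register that the
 * cached state actually needed (live_read) holds the same value.
 * Registers that were never read afterwards are ignored.
 */
static bool toy_states_equal(const struct toy_state *cached,
			     const struct toy_state *cur)
{
	for (int i = 0; i < NREGS; i++) {
		if (!cached->regs[i].live_read)
			continue;
		if (cached->regs[i].value != cur->regs[i].value)
			return false;
	}
	return true;
}
```

With a cached state of ``r0 = 1`` (not read) and ``r1 = 0`` (read), the state
``r0 = 0, r1 = 0`` compares equal, so the second path can be pruned. If
``r0`` were marked as read in the cached state, the comparison would fail.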

Data structures
~~~~~~~~~~~~~~~

Liveness is tracked using the following data structures::

  enum bpf_reg_liveness {
	REG_LIVE_NONE = 0,
	REG_LIVE_READ32 = 0x1,
	REG_LIVE_READ64 = 0x2,
	REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
	REG_LIVE_WRITTEN = 0x4,
	REG_LIVE_DONE = 0x8,
  };

  struct bpf_reg_state {
 	...
	struct bpf_reg_state *parent;
 	...
	enum bpf_reg_liveness live;
 	...
  };

  struct bpf_stack_state {
	struct bpf_reg_state spilled_ptr;
	...
  };

  struct bpf_func_state {
	struct bpf_reg_state regs[MAX_BPF_REG];
        ...
	struct bpf_stack_state *stack;
  };

  struct bpf_verifier_state {
	struct bpf_func_state *frame[MAX_CALL_FRAMES];
	struct bpf_verifier_state *parent;
        ...
  };

* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
  verifier state creation;

* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
  defined by some instruction verified between this verifier state's parent and
  verifier state itself;

* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
  is read by some child state of this verifier state;

* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
  processing the same verifier state multiple times and for some sanity checks;

* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
  values using bitwise or.

Register parentage chains
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to propagate information between parent and child states, a *register
parentage chain* is established. Each register or stack slot is linked to a
corresponding register or stack slot in its parent state via a ``->parent``
pointer.
This link is established upon state creation in ``is_state_visited()``
and might be modified by ``set_callee_state()`` called from
``__check_func_call()``.

The rules for correspondence between registers / stack slots are as follows:

* For the current stack frame, registers and stack slots of the new state are
  linked to the registers and stack slots of the parent state with the same
  indices.

* For the outer stack frames, only callee saved registers (r6-r9) and stack
  slots are linked to the registers and stack slots of the parent state with the
  same indices.

* When a function call is processed a new ``struct bpf_func_state`` instance is
  allocated, it encapsulates a new set of registers and stack slots. For this
  new frame, parent links for r6-r9 and stack slots are set to nil, parent links
  for r1-r5 are set to match caller r1-r5 parent links.

This could be illustrated by the following diagram (arrows stand for
``->parent`` pointers)::

      ...                    ; Frame #0, some instructions
  --- checkpoint #0 ---
  1 : r6 = 42                ; Frame #0
  --- checkpoint #1 ---
  2 : call foo()             ; Frame #0
      ...                    ; Frame #1, instructions from foo()
  --- checkpoint #2 ---
      ...                    ; Frame #1, instructions from foo()
  --- checkpoint #3 ---
      exit                   ; Frame #1, return from foo()
  3 : r1 = r6                ; Frame #0  <- current state

             +-------------------------------+-------------------------------+
             |           Frame #0            |           Frame #1            |
  Checkpoint +-------------------------------+-------------------------------+
  #0         | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                ^    ^       ^       ^
                |    |       |       |
  Checkpoint +-------------------------------+
  #1         | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                     ^       ^       ^
                     |_______|_______|_______________
                             |       |               |
               nil  nil      |       |               |      nil     nil
                |    |       |       |               |       |       |
  Checkpoint +-------------------------------+-------------------------------+
  #2         | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+-------------------------------+
                             ^       ^               ^       ^       ^
               nil  nil      |       |               |       |       |
                |    |       |       |               |       |       |
  Checkpoint +-------------------------------+-------------------------------+
  #3         | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+-------------------------------+
                             ^       ^
               nil  nil      |       |
                |    |       |       |
  Current    +-------------------------------+
  state      | r0 | r1-r5 | r6-r9 | fp-8 ... |
             +-------------------------------+
                             \
                               r6 read mark is propagated via these links
                               all the way up to checkpoint #1.
                               The checkpoint #1 contains a write mark for r6
                               because of instruction (1), thus read propagation
                               does not reach checkpoint #0 (see section below).

Liveness marks tracking
~~~~~~~~~~~~~~~~~~~~~~~

For each processed instruction, the verifier tracks read and written registers
and stack slots. The main idea of the algorithm is that read marks propagate
back along the state parentage chain until they hit a write mark, which 'screens
off' earlier states from the read. The information about reads is propagated by
function ``mark_reg_read()`` which could be summarized as follows::

  mark_reg_read(struct bpf_reg_state *state, ...):
      parent = state->parent
      while parent:
          if state->live & REG_LIVE_WRITTEN:
              break
          if parent->live & REG_LIVE_READ64:
              break
          parent->live |= REG_LIVE_READ64
          state = parent
          parent = state->parent

Notes:

* The read marks are applied to the **parent** state while write marks are
  applied to the **current** state.
  The write mark on a register or stack slot
  means that it is updated by some instruction in the straight-line code leading
  from the parent state to the current state.

* Details about REG_LIVE_READ32 are omitted.

* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
  might override the first parent link. Please refer to the comments in the
  ``propagate_liveness()`` and ``mark_reg_read()`` source code for further
  details.

Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are
applied conservatively: stack slots are marked as written only if write size
corresponds to the size of the register, e.g. see function ``save_register_state()``.

Consider the following example::

  0: (*u64)(r10 - 8) = 0   ; define 8 bytes of fp-8
  --- checkpoint #0 ---
  1: (*u32)(r10 - 8) = 1   ; redefine lower 4 bytes
  2: r1 = (*u32)(r10 - 8)  ; read lower 4 bytes defined at (1)
  3: r2 = (*u32)(r10 - 4)  ; read upper 4 bytes defined at (0)

As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
it be otherwise, the algorithm above wouldn't be able to propagate the read mark
from (3) to checkpoint #0.

Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is
called to update the ``->branches`` counter for each verifier state in a chain
of parent verifier states. When the ``->branches`` counter reaches zero the
verifier state becomes a valid entry in a set of cached verifier states.

Each entry of the verifier states cache is post-processed by a function
``clean_live_states()``. This function marks all registers and stack slots
without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
Registers/stack slots marked in this way are ignored in function ``stacksafe()``
called from ``states_equal()`` when a state cache entry is considered for
equivalence with a current state.

Now it is possible to explain how the example from the beginning of the section
works::

  0: call bpf_get_prandom_u32()
  1: r1 = 0
  2: if r0 == 0 goto +1
  3: r0 = 1
  --- checkpoint[0] ---
  4: r0 = r1
  5: exit

* At instruction #2 branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }``
  is pushed to states processing queue (pc stands for program counter).

* At instruction #4:

  * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
  * ``checkpoint[0].r0`` is marked as written;
  * ``checkpoint[0].r1`` is marked as read;

* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
  by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
  read mark and all other registers and stack slots are marked as ``NOT_INIT``
  or ``STACK_INVALID``

* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
  and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states
  are considered equivalent.

.. _read_marks_for_cache_hits:

Read marks propagation for cache hits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another point is the handling of read marks when a previously verified state is
found in the states cache. Upon cache hit verifier must behave in the same way
as if the current state was verified to the program exit. This means that all
read marks, present on registers and stack slots of the cached state, must be
propagated over the parentage chain of the current state. Example below shows
why this is important. Function ``propagate_liveness()`` handles this case.

Consider the following state parentage chain (S is a starting state, A-E are
derived states, -> arrows show which state is derived from which)::

                   r1 read
            <-------------                A[r1] == 0
                                          C[r1] == 0
      S ---> A ---> B ---> exit           E[r1] == 1
      |
      ` ---> C ---> D
      |
      ` ---> E      ^
                    |___   suppose all these
             ^           states are at insn #Y
             |
      suppose all these
    states are at insn #X

* Chain of states ``S -> A -> B -> exit`` is verified first.

* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
  propagated up to state ``A``.

* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
  equivalent to state ``B``.

* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
  ``C`` might get mistakenly marked as equivalent to state ``E`` even though
  values for register ``r1`` differ between ``C`` and ``E``.

Understanding eBPF verifier messages
====================================

The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:

Program with unreachable instructions::

  static struct bpf_insn prog[] = {
  BPF_EXIT_INSN(),
  BPF_EXIT_INSN(),
  };

Error::

  unreachable insn 1

Program that reads uninitialized register::

  BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r0 = r2
  R2 !read_ok

Program that doesn't initialize R0 before exiting::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r2 = r1
  1: (95) exit
  R0 !read_ok

Program that accesses stack out of bounds::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 +8) = 0
  invalid stack off=8 size=8

Program that doesn't initialize stack before passing its address into
function::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (b7) r1 = 0x0
  3: (85) call 1
  invalid indirect read from stack off -8+0 size 8

Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0x0
  4: (85) call 1
  fd 0 is not pointing to valid bpf_map

Program that doesn't check return value of map_lookup_elem() before accessing
map element::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0x0
  4: (85) call 1
  5: (7a) *(u64 *)(r0 +0) = 0
  R0 invalid mem access 'map_value_or_null'

Program that correctly checks map_lookup_elem() returned value for NULL, but
accesses the memory with incorrect alignment::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 1
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+1
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +4) = 0
  misaligned access off 4 size 8

Program that correctly checks map_lookup_elem() returned value for NULL and
accesses memory with correct alignment in one side of 'if' branch, but fails
to do so in the other side of 'if' branch::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
  BPF_EXIT_INSN(),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 1
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+2
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +0) = 0
  7: (95) exit

  from 5 to 8: R0=imm0 R10=fp
  8: (7a) *(u64 *)(r0 +0) = 1
  R0 invalid mem access 'imm'

Program that performs a socket lookup then sets the pointer to NULL without
checking it::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_MOV64_IMM(BPF_REG_0, 0),
  BPF_EXIT_INSN(),

Error::

  0: (b7) r2 = 0
  1: (63) *(u32 *)(r10 -8) = r2
  2: (bf) r2 = r10
  3: (07) r2 += -8
  4: (b7) r3 = 4
  5: (b7) r4 = 0
  6: (b7) r5 = 0
  7: (85) call bpf_sk_lookup_tcp#65
  8: (b7) r0 = 0
  9: (95) exit
  Unreleased reference id=1, alloc_insn=7

Program that performs a socket lookup but does not NULL-check the returned
value::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_EXIT_INSN(),

Error::

  0: (b7) r2 = 0
  1: (63) *(u32 *)(r10 -8) = r2
  2: (bf) r2 = r10
  3: (07) r2 += -8
  4: (b7) r3 = 4
  5: (b7) r4 = 0
  6: (b7) r5 = 0
  7: (85) call bpf_sk_lookup_tcp#65
  8: (95) exit
  Unreleased reference id=1, alloc_insn=7