1 .. SPDX-License-Identifier: GPL-2.0 1 .. SPDX-License-Identifier: GPL-2.0 2 2 3 ============================ 3 ============================ 4 LC-trie implementation notes 4 LC-trie implementation notes 5 ============================ 5 ============================ 6 6 7 Node types 7 Node types 8 ---------- 8 ---------- 9 leaf 9 leaf 10 An end node with data. This has a copy 10 An end node with data. This has a copy of the relevant key, along 11 with 'hlist' with routing table entrie 11 with 'hlist' with routing table entries sorted by prefix length. 12 See struct leaf and struct leaf_info. 12 See struct leaf and struct leaf_info. 13 13 14 trie node or tnode 14 trie node or tnode 15 An internal node, holding an array of 15 An internal node, holding an array of child (leaf or tnode) pointers, 16 indexed through a subset of the key. S 16 indexed through a subset of the key. See Level Compression. 17 17 18 A few concepts explained 18 A few concepts explained 19 ------------------------ 19 ------------------------ 20 Bits (tnode) 20 Bits (tnode) 21 The number of bits in the key segment 21 The number of bits in the key segment used for indexing into the 22 child array - the "child index". See L 22 child array - the "child index". See Level Compression. 23 23 24 Pos (tnode) 24 Pos (tnode) 25 The position (in the key) of the key s 25 The position (in the key) of the key segment used for indexing into 26 the child array. See Path Compression. 26 the child array. See Path Compression. 27 27 28 Path Compression / skipped bits 28 Path Compression / skipped bits 29 Any given tnode is linked to from the 29 Any given tnode is linked to from the child array of its parent, using 30 a segment of the key specified by the 30 a segment of the key specified by the parent's "pos" and "bits" 31 In certain cases, this tnode's own "po 31 In certain cases, this tnode's own "pos" will not be immediately 32 adjacent to the parent (pos+bits), but 32 adjacent to the parent (pos+bits), but there will be some bits 33 in the key skipped over because they r 33 in the key skipped over because they represent a single path with no 34 deviations. These "skipped bits" const 34 deviations. These "skipped bits" constitute Path Compression. 35 Note that the search algorithm will si 35 Note that the search algorithm will simply skip over these bits when 36 searching, making it necessary to save 36 searching, making it necessary to save the keys in the leaves to 37 verify that they actually do match the 37 verify that they actually do match the key we are searching for. 38 38 39 Level Compression / child arrays 39 Level Compression / child arrays 40 the trie is kept level balanced moving 40 the trie is kept level balanced moving, under certain conditions, the 41 children of a full child (see "full_ch 41 children of a full child (see "full_children") up one level, so that 42 instead of a pure binary tree, each in 42 instead of a pure binary tree, each internal node ("tnode") may 43 contain an arbitrarily large array of 43 contain an arbitrarily large array of links to several children. 44 Conversely, a tnode with a mostly empt 44 Conversely, a tnode with a mostly empty child array (see empty_children) 45 may be "halved", having some of its ch 45 may be "halved", having some of its children moved downwards one level, 46 in order to avoid ever-increasing chil 46 in order to avoid ever-increasing child arrays. 47 47 48 empty_children 48 empty_children 49 the number of positions in the child a 49 the number of positions in the child array of a given tnode that are 50 NULL. 50 NULL. 51 51 52 full_children 52 full_children 53 the number of children of a given tnod 53 the number of children of a given tnode that aren't path compressed. 54 (in other words, they aren't NULL or l 54 (in other words, they aren't NULL or leaves and their "pos" is equal 55 to this tnode's "pos"+"bits"). 55 to this tnode's "pos"+"bits"). 56 56 57 (The word "full" here is used more in 57 (The word "full" here is used more in the sense of "complete" than 58 as the opposite of "empty", which migh 58 as the opposite of "empty", which might be a tad confusing.) 59 59 60 Comments 60 Comments 61 --------- 61 --------- 62 62 63 We have tried to keep the structure of the cod 63 We have tried to keep the structure of the code as close to fib_hash as 64 possible to allow verification and help up rev 64 possible to allow verification and help up reviewing. 65 65 66 fib_find_node() 66 fib_find_node() 67 A good start for understanding this co 67 A good start for understanding this code. This function implements a 68 straightforward trie lookup. 68 straightforward trie lookup. 69 69 70 fib_insert_node() 70 fib_insert_node() 71 Inserts a new leaf node in the trie. T 71 Inserts a new leaf node in the trie. This is bit more complicated than 72 fib_find_node(). Inserting a new node 72 fib_find_node(). Inserting a new node means we might have to run the 73 level compression algorithm on part of 73 level compression algorithm on part of the trie. 74 74 75 trie_leaf_remove() 75 trie_leaf_remove() 76 Looks up a key, deletes it and runs th 76 Looks up a key, deletes it and runs the level compression algorithm. 77 77 78 trie_rebalance() 78 trie_rebalance() 79 The key function for the dynamic trie 79 The key function for the dynamic trie after any change in the trie 80 it is run to optimize and reorganize. 80 it is run to optimize and reorganize. It will walk the trie upwards 81 towards the root from a given tnode, d 81 towards the root from a given tnode, doing a resize() at each step 82 to implement level compression. 82 to implement level compression. 83 83 84 resize() 84 resize() 85 Analyzes a tnode and optimizes the chi 85 Analyzes a tnode and optimizes the child array size by either inflating 86 or shrinking it repeatedly until it fu 86 or shrinking it repeatedly until it fulfills the criteria for optimal 87 level compression. This part follows t 87 level compression. This part follows the original paper pretty closely 88 and there may be some room for experim 88 and there may be some room for experimentation here. 89 89 90 inflate() 90 inflate() 91 Doubles the size of the child array wi 91 Doubles the size of the child array within a tnode. Used by resize(). 92 92 93 halve() 93 halve() 94 Halves the size of the child array wit 94 Halves the size of the child array within a tnode - the inverse of 95 inflate(). Used by resize(); 95 inflate(). Used by resize(); 96 96 97 fn_trie_insert(), fn_trie_delete(), fn_trie_se 97 fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() 98 The route manipulation functions. Shou 98 The route manipulation functions. Should conform pretty closely to the 99 corresponding functions in fib_hash. 99 corresponding functions in fib_hash. 100 100 101 fn_trie_flush() 101 fn_trie_flush() 102 This walks the full trie (using nextle 102 This walks the full trie (using nextleaf()) and searches for empty 103 leaves which have to be removed. 103 leaves which have to be removed. 104 104 105 fn_trie_dump() 105 fn_trie_dump() 106 Dumps the routing table ordered by pre 106 Dumps the routing table ordered by prefix length. This is somewhat 107 slower than the corresponding fib_hash 107 slower than the corresponding fib_hash function, as we have to walk the 108 entire trie for each prefix length. In 108 entire trie for each prefix length. In comparison, fib_hash is organized 109 as one "zone"/hash per prefix length. 109 as one "zone"/hash per prefix length. 110 110 111 Locking 111 Locking 112 ------- 112 ------- 113 113 114 fib_lock is used for an RW-lock in the same wa 114 fib_lock is used for an RW-lock in the same way that this is done in fib_hash. 115 However, the functions are somewhat separated 115 However, the functions are somewhat separated for other possible locking 116 scenarios. It might conceivably be possible to 116 scenarios. It might conceivably be possible to run trie_rebalance via RCU 117 to avoid read_lock in the fn_trie_lookup() fun 117 to avoid read_lock in the fn_trie_lookup() function. 118 118 119 Main lookup mechanism 119 Main lookup mechanism 120 --------------------- 120 --------------------- 121 fn_trie_lookup() is the main lookup function. 121 fn_trie_lookup() is the main lookup function. 122 122 123 The lookup is in its simplest form just like f 123 The lookup is in its simplest form just like fib_find_node(). We descend the 124 trie, key segment by key segment, until we fin 124 trie, key segment by key segment, until we find a leaf. check_leaf() does 125 the fib_semantic_match in the leaf's sorted pr 125 the fib_semantic_match in the leaf's sorted prefix hlist. 126 126 127 If we find a match, we are done. 127 If we find a match, we are done. 128 128 129 If we don't find a match, we enter prefix matc 129 If we don't find a match, we enter prefix matching mode. The prefix length, 130 starting out at the same as the key length, is 130 starting out at the same as the key length, is reduced one step at a time, 131 and we backtrack upwards through the trie tryi 131 and we backtrack upwards through the trie trying to find a longest matching 132 prefix. The goal is always to reach a leaf and 132 prefix. The goal is always to reach a leaf and get a positive result from the 133 fib_semantic_match mechanism. 133 fib_semantic_match mechanism. 134 134 135 Inside each tnode, the search for longest matc 135 Inside each tnode, the search for longest matching prefix consists of searching 136 through the child array, chopping off (zeroing 136 through the child array, chopping off (zeroing) the least significant "1" of 137 the child index until we find a match or the c 137 the child index until we find a match or the child index consists of nothing but 138 zeros. 138 zeros. 139 139 140 At this point we backtrack (t->stats.backtrack 140 At this point we backtrack (t->stats.backtrack++) up the trie, continuing to 141 chop off part of the key in order to find the 141 chop off part of the key in order to find the longest matching prefix. 142 142 143 At this point we will repeatedly descend subtr 143 At this point we will repeatedly descend subtries to look for a match, and there 144 are some optimizations available that can prov 144 are some optimizations available that can provide us with "shortcuts" to avoid 145 descending into dead ends. Look for "HL_OPTIMI 145 descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. 146 146 147 To alleviate any doubts about the correctness 147 To alleviate any doubts about the correctness of the route selection process, 148 a new netlink operation has been added. Look f 148 a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which 149 gives userland access to fib_lookup(). 149 gives userland access to fib_lookup().
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.