1 ================================= 2 Brief tutorial on CRC computation 3 ================================= 4 5 A CRC is a long-division remainder. You add the CRC to the message, 6 and the whole thing (message+CRC) is a multiple of the given 7 CRC polynomial. To check the CRC, you can either check that the 8 CRC matches the recomputed value, *or* you can check that the 9 remainder computed on the message+CRC is 0. This latter approach 10 is used by a lot of hardware implementations, and is why so many 11 protocols put the end-of-frame flag after the CRC. 12 13 It's actually the same long division you learned in school, except that: 14 15 - We're working in binary, so the digits are only 0 and 1, and 16 - When dividing polynomials, there are no carries. Rather than add and 17 subtract, we just xor. Thus, we tend to get a bit sloppy about 18 the difference between adding and subtracting. 19 20 Like all division, the remainder is always smaller than the divisor. 21 To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial. 22 Since it's 33 bits long, bit 32 is always going to be set, so usually the 23 CRC is written in hex with the most significant bit omitted. (If you're 24 familiar with the IEEE 754 floating-point format, it's the same idea.) 25 26 Note that a CRC is computed over a string of *bits*, so you have 27 to decide on the endianness of the bits within each byte. To get 28 the best error-detecting properties, this should correspond to the 29 order they're actually sent. For example, standard RS-232 serial is 30 little-endian; the most significant bit (sometimes used for parity) 31 is sent last. And when appending a CRC word to a message, you should 32 do it in the right order, matching the endianness. 33 34 Just like with ordinary division, you proceed one digit (bit) at a time. 35 Each step of the division you take one more digit (bit) of the dividend 36 and append it to the current remainder. Then you figure out the 37 appropriate multiple of the divisor to subtract to being the remainder 38 back into range. In binary, this is easy - it has to be either 0 or 1, 39 and to make the XOR cancel, it's just a copy of bit 32 of the remainder. 40 41 When computing a CRC, we don't care about the quotient, so we can 42 throw the quotient bit away, but subtract the appropriate multiple of 43 the polynomial from the remainder and we're back to where we started, 44 ready to process the next bit. 45 46 A big-endian CRC written this way would be coded like:: 47 48 for (i = 0; i < input_bits; i++) { 49 multiple = remainder & 0x80000000 ? CRCPOLY : 0; 50 remainder = (remainder << 1 | next_input_bit()) ^ multiple; 51 } 52 53 Notice how, to get at bit 32 of the shifted remainder, we look 54 at bit 31 of the remainder *before* shifting it. 55 56 But also notice how the next_input_bit() bits we're shifting into 57 the remainder don't actually affect any decision-making until 58 32 bits later. Thus, the first 32 cycles of this are pretty boring. 59 Also, to add the CRC to a message, we need a 32-bit-long hole for it at 60 the end, so we have to add 32 extra cycles shifting in zeros at the 61 end of every message. 62 63 These details lead to a standard trick: rearrange merging in the 64 next_input_bit() until the moment it's needed. Then the first 32 cycles 65 can be precomputed, and merging in the final 32 zero bits to make room 66 for the CRC can be skipped entirely. This changes the code to:: 67 68 for (i = 0; i < input_bits; i++) { 69 remainder ^= next_input_bit() << 31; 70 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 71 remainder = (remainder << 1) ^ multiple; 72 } 73 74 With this optimization, the little-endian code is particularly simple:: 75 76 for (i = 0; i < input_bits; i++) { 77 remainder ^= next_input_bit(); 78 multiple = (remainder & 1) ? CRCPOLY : 0; 79 remainder = (remainder >> 1) ^ multiple; 80 } 81 82 The most significant coefficient of the remainder polynomial is stored 83 in the least significant bit of the binary "remainder" variable. 84 The other details of endianness have been hidden in CRCPOLY (which must 85 be bit-reversed) and next_input_bit(). 86 87 As long as next_input_bit is returning the bits in a sensible order, we don't 88 *have* to wait until the last possible moment to merge in additional bits. 89 We can do it 8 bits at a time rather than 1 bit at a time:: 90 91 for (i = 0; i < input_bytes; i++) { 92 remainder ^= next_input_byte() << 24; 93 for (j = 0; j < 8; j++) { 94 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 95 remainder = (remainder << 1) ^ multiple; 96 } 97 } 98 99 Or in little-endian:: 100 101 for (i = 0; i < input_bytes; i++) { 102 remainder ^= next_input_byte(); 103 for (j = 0; j < 8; j++) { 104 multiple = (remainder & 1) ? CRCPOLY : 0; 105 remainder = (remainder >> 1) ^ multiple; 106 } 107 } 108 109 If the input is a multiple of 32 bits, you can even XOR in a 32-bit 110 word at a time and increase the inner loop count to 32. 111 112 You can also mix and match the two loop styles, for example doing the 113 bulk of a message byte-at-a-time and adding bit-at-a-time processing 114 for any fractional bytes at the end. 115 116 To reduce the number of conditional branches, software commonly uses 117 the byte-at-a-time table method, popularized by Dilip V. Sarwate, 118 "Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM 119 v.31 no.8 (August 1998) p. 1008-1013. 120 121 Here, rather than just shifting one bit of the remainder to decide 122 in the correct multiple to subtract, we can shift a byte at a time. 123 This produces a 40-bit (rather than a 33-bit) intermediate remainder, 124 and the correct multiple of the polynomial to subtract is found using 125 a 256-entry lookup table indexed by the high 8 bits. 126 127 (The table entries are simply the CRC-32 of the given one-byte messages.) 128 129 When space is more constrained, smaller tables can be used, e.g. two 130 4-bit shifts followed by a lookup in a 16-entry table. 131 132 It is not practical to process much more than 8 bits at a time using this 133 technique, because tables larger than 256 entries use too much memory and, 134 more importantly, too much of the L1 cache. 135 136 To get higher software performance, a "slicing" technique can be used. 137 See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm", 138 ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf 139 140 This does not change the number of table lookups, but does increase 141 the parallelism. With the classic Sarwate algorithm, each table lookup 142 must be completed before the index of the next can be computed. 143 144 A "slicing by 2" technique would shift the remainder 16 bits at a time, 145 producing a 48-bit intermediate remainder. Rather than doing a single 146 lookup in a 65536-entry table, the two high bytes are looked up in 147 two different 256-entry tables. Each contains the remainder required 148 to cancel out the corresponding byte. The tables are different because the 149 polynomials to cancel are different. One has non-zero coefficients from 150 x^32 to x^39, while the other goes from x^40 to x^47. 151 152 Since modern processors can handle many parallel memory operations, this 153 takes barely longer than a single table look-up and thus performs almost 154 twice as fast as the basic Sarwate algorithm. 155 156 This can be extended to "slicing by 4" using 4 256-entry tables. 157 Each step, 32 bits of data is fetched, XORed with the CRC, and the result 158 broken into bytes and looked up in the tables. Because the 32-bit shift 159 leaves the low-order bits of the intermediate remainder zero, the 160 final CRC is simply the XOR of the 4 table look-ups. 161 162 But this still enforces sequential execution: a second group of table 163 look-ups cannot begin until the previous groups 4 table look-ups have all 164 been completed. Thus, the processor's load/store unit is sometimes idle. 165 166 To make maximum use of the processor, "slicing by 8" performs 8 look-ups 167 in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed 168 with 64 bits of input data. What is important to note is that 4 of 169 those 8 bytes are simply copies of the input data; they do not depend 170 on the previous CRC at all. Thus, those 4 table look-ups may commence 171 immediately, without waiting for the previous loop iteration. 172 173 By always having 4 loads in flight, a modern superscalar processor can 174 be kept busy and make full use of its L1 cache. 175 176 Two more details about CRC implementation in the real world: 177 178 Normally, appending zero bits to a message which is already a multiple 179 of a polynomial produces a larger multiple of that polynomial. Thus, 180 a basic CRC will not detect appended zero bits (or bytes). To enable 181 a CRC to detect this condition, it's common to invert the CRC before 182 appending it. This makes the remainder of the message+crc come out not 183 as zero, but some fixed non-zero value. (The CRC of the inversion 184 pattern, 0xffffffff.) 185 186 The same problem applies to zero bits prepended to the message, and a 187 similar solution is used. Instead of starting the CRC computation with 188 a remainder of 0, an initial remainder of all ones is used. As long as 189 you start the same way on decoding, it doesn't make a difference.
Linux® is a registered trademark of Linus Torvalds in the United States and other countries.
TOMOYO® is a registered trademark of NTT DATA CORPORATION.