Contents:
That's the end of the theory; now we turn to implementations. To start with, we examine an absolutely straight-down-the-middle boring straightforward low-speed implementation that doesn't use any speed tricks at all. We'll then transform that program progessively until we end up with the compact table-driven code we all know and love and which some of us would like to understand.
To implement a CRC algorithm all we have to do is implement CRC division. There are two reasons why we cannot simply use the divide instruction of whatever machine we are on. The first is that we have to do the divide in CRC arithmetic. The second is that the dividend might be ten megabytes long, and todays processors do not have registers that big.
So to implement CRC division, we have to feed the message through a division register. At this point, we have to be absolutely precise about the message data. In all the following examples the message will be considered to be a stream of bytes (each of 8 bits) with bit 7 of each byte being considered to be the most significant bit (MSB). The bit stream formed from these bytes will be the bit stream with the MSB (bit 7) of the first byte first, going down to bit 0 of the first byte, and then the MSB of the second byte and so on.
With this in mind, we can sketch an implementation of the CRC division. For the purposes of example, consider a poly with W=4 and the poly=10111. Then, the perform the division, we need to use a 4-bit register:
3 2 1 0 Bits +---+---+---+---+ Pop! <-- | | | | | <----- Augmented message +---+---+---+---+ 1 0 1 1 1 = The Poly
(Reminder: The augmented message is the message followed by W zero bits.)
To perform the division perform the following:
Load the register with zero bits. Augment the message by appending W zero bits to the end of it. While (more message bits) Begin Shift the register left by one bit, reading the next bit of the augmented message into register bit position 0. If (a 1 bit popped out of the register during step 3) Register = Register XOR Poly. End The register now contains the remainder.
(Note: In practice, the IF condition can be tested by testing the top bit of R before performing the shift.)
We will call this algorithm "SIMPLE".
This might look a bit messy, but all we are really doing is "subtracting" various powers (i.e. shiftings) of the poly from the message until there is nothing left but the remainder. Study the manual examples of long division if you don't understand this.
It should be clear that the above algorithm will work for any width W.
The SIMPLE algorithm above is a good starting point because it corresponds directly to the theory presented so far, and because it is so SIMPLE. However, because it operates at the bit level, it is rather awkward to code (even in C), and inefficient to execute (it has to loop once for each bit). To speed it up, we need to find a way to enable the algorithm to process the message in units larger than one bit. Candidate quantities are nibbles (4 bits), bytes (8 bits), words (16 bits) and longwords (32 bits) and higher if we can achieve it. Of these, 4 bits is best avoided because it does not correspond to a byte boundary. At the very least, any speedup should allow us to operate at byte boundaries, and in fact most of the table driven algorithms operate a byte at a time.
For the purposes of discussion, let us switch from a 4-bit poly to a 32-bit one. Our register looks much the same, except the boxes represent bytes instead of bits, and the Poly is 33 bits (one implicit 1 bit at the top and 32 "active" bits) (W=32).
3 2 1 0 Bytes +----+----+----+----+ Pop! <-- | | | | | <----- Augmented message +----+----+----+----+ 1<------32 bits------>
The SIMPLE algorithm is still applicable. Let us examine what it does. Imagine that the SIMPLE algorithm is in full swing and consider the top 8 bits of the 32-bit register (byte 3) to have the values:
t7 t6 t5 t4 t3 t2 t1 t0
In the next iteration of SIMPLE, t7 will determine whether the Poly will be XORed into the entire register. If t7=1, this will happen, otherwise it will not. Suppose that the top 8 bits of the poly are g7 g6.. g0, then after the next iteration, the top byte will be:
t6 t5 t4 t3 t2 t1 t0 ?? + t7 * (g7 g6 g5 g4 g3 g2 g1 g0) [Reminder: + is XOR]
The NEW top bit (that will control what happens in the next iteration) now has the value t6 + t7*g7. The important thing to notice here is that from an informational point of view, all the information required to calculate the NEW top bit was present in the top TWO bits of the original top byte. Similarly, the NEXT top bit can be calculated in advance SOLELY from the top THREE bits t7, t6, and t5. In fact, in general, the value of the top bit in the register in k iterations can be calculated from the top k bits of the register. Let us take this for granted for a moment.
Consider for a moment that we use the top 8 bits of the register to calculate the value of the top bit of the register during the next 8 iterations. Suppose that we drive the next 8 iterations using the calculated values (which we could perhaps store in a single byte register and shift out to pick off each bit). Then we note three things:
AND
Now consider the effect of XORing in a constant value at various offsets to a register. For example:
0100010 Register ...0110 XOR this ..0110. XOR this 0110... XOR this ------- 0011000 -------
The point of this is that you can XOR constant values into a register to your heart's delight, and in the end, there will exist a value which when XORed in with the original register will have the same effect as all the other XORs.
Perhaps you can see the solution now. Putting all the pieces together we have an algorithm that goes like this:
While (augmented message is not exhausted) Begin Examine the top byte of the register Calculate the control byte from the top byte of the register Sum all the Polys at various offsets that are to be XORed into the register in accordance with the control byte Shift the register left by one byte, reading a new message byte into the rightmost byte of the register XOR the summed polys to the register End
As it stands this is not much better than the SIMPLE algorithm. However, it turns out that most of the calculation can be precomputed and assembled into a table. As a result, the above algorithm can be reduced to:
While (augmented message is not exhaused) Begin Top = top_byte(Register); Register = (Register << 24) | next_augmessage_byte; Register = Register XOR precomputed_table[Top]; End
There! If you understand this, you've grasped the main idea of table-driven CRC algorithms. The above is a very efficient algorithm requiring just a shift, and OR, an XOR, and a table lookup per byte. Graphically, it looks like this:
3 2 1 0 Bytes +----+----+----+----+ +-----<| | | | | <----- Augmented message | +----+----+----+----+ | ^ | | | XOR | | | 0+----+----+----+----+ v +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ +----->+----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ 255+----+----+----+----+
Algorithm
In C, the algorithm main loop looks like this:
r=0; while (len--) { byte t = (r >> 24) & 0xFF; r = (r << 8) | *++; r^=table[t]; }
where len is the length of the augmented message in bytes, p points to the augmented message, r is the register, t is a temporary, and table is the computed table. This code can be made even more unreadable as follows:
r=0; while (len--) r = ((r << 8) | *++) ^ t[(r >> 24) & 0xFF];
This is a very clean, efficient loop, although not a very obvious one to the casual observer not versed in CRC theory. We will call this the TABLE algorithm.
Despite the terse beauty of the line
r=0; while (len--) r = ((r << 8) | *++) ^ t[(r >> 24) & 0xFF];
those optimizing hackers couldn't leave it alone. The trouble, you see, is that this loop operates upon the AUGMENTED message and in order to use this code, you have to append W/8 zero bytes to the end of the message before pointing p at it. Depending on the run-time environment, this may or may not be a problem; if the block of data was handed to us by some other code, it could be a BIG problem. One alternative is simply to append the following line after the above loop, once for each zero byte:
for (i=0; i<W/4; i++) r = (r << 8) ^ t[(r >> 24) & 0xFF];
This looks like a sane enough solution to me. However, at the further expense of clarity (which, you must admit, is already a pretty scare commodity in this code) we can reorganize this small loop further so as to avoid the need to either augment the message with zero bytes, or to explicitly process zero bytes at the end as above. To explain the optimization, we return to the processing diagram given earlier.
3 2 1 0 Bytes +----+----+----+----+ +-----<| | | | | <----- Augmented message | +----+----+----+----+ | ^ | | | XOR | | | 0+----+----+----+----+ v +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ +----->+----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ 255+----+----+----+----+
Algorithm
Now, note the following facts:
These facts, combined with the XOR property
(A xor B) xor C = A xor (B xor C)
mean that message bytes need not actually travel through the W/4 bytes of the register. Instead, they can be XORed into the top byte just before it is used to index the lookup table. This leads to the following modified version of the algorithm.
+-----<Message (non augmented) | v 3 2 1 0 Bytes | +----+----+----+----+ XOR----<| | | | | | +----+----+----+----+ | ^ | | | XOR | | | 0+----+----+----+----+ v +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ | +----+----+----+----+ +----->+----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ +----+----+----+----+ 255+----+----+----+----+
Algorithm
Note: The initial register value for this algorithm must be the initial value of the register for the previous algorithm fed through the table four times. Note: The table is such that if the previous algorithm used 0, the new algorithm will too.
This is an IDENTICAL algorithm and will yield IDENTICAL results. The C code looks something like this:
r=0; while (len--) r = (r<<8) ^ t[(r >> 24) ^ *++];
and THIS is the code that you are likely to find inside current table-driven CRC implementations. Some FF masks might have to be ANDed in here and there for portability's sake, but basically, the above loop is IT. We will call this the DIRECT TABLE ALGORITHM.
During the process of trying to understand all this stuff, I managed to derive the SIMPLE algorithm and the table-driven version derived from that. However, when I compared my code with the code found in real-implementations, I was totally bamboozled as to why the bytes were being XORed in at the wrong end of the register! It took quite a while before I figured out that theirs and my algorithms were actually the same. Part of why I am writing this document is that, while the link between division and my earlier table-driven code is vaguely apparent, any such link is fairly well erased when you start pumping bytes in at the "wrong end" of the register. It looks all wrong!
If you've got this far, you not only understand the theory, the practice, the optimized practice, but you also understand the real code you are likely to run into. Could get any more complicated? Yes it can.