Minimal ZIP file – part III
This article is part of a series.
- Creating a valid ZIP-file
- Use .NET’s
DeflateStream - No dependencies (coding the DEFLATE algorithm) 👈
- Calculating the CRC-32 (Work in Progress)
Introduction
In the previous part a little cheating was done by using an off-the-shelve DeflateStream.
So how challenging is it to implement the DEFLATE algorithm from scratch? It depends!
About DEFLATE
A DEFLATE stream is a combination of LZ771 and Huffman coding2. Easy, right? I did both of them before, but creating valid ZIP-files with both of them is another animal.
DEFLATE is documented in RFC 19513 which is only 17 pages. However, after reading it many times and looking at existing implementations (some of them over 1000 lines of cryptic C code), it left me puzzled.
Solution
Ultimately, I decided to take a bunch of shortcuts and ended up with a very simple Deflater for educational purposes. Only LZ77 is implemented without dynamic Huffman, and fixed Huffman codes4 are hardcoded. This explains points 1 and 2 below. Fixed Huffman codes only work with small values.
The limits of this implementation are:
- ASCII characters only.
- Maximum back reference is 4 characters, maximum length is 10 characters.
- Pretty buggy due to
misunderstandinglaziness of LSB / MSB on my part.
Despite these horrible limits, a string like ‘BANANABANANABANANABANANABANANA’ gets 20% smaller. Success!
Source code
The idea behind the algorithm is to brute force search for repeating runs of characters, when found a pair (length, distance) is written out instead of the run. If no run is found, a so-called literal is written.
There is no gain or even a loss for small runs, but longer runs take advantage.
public class Deflater(Stream stream){private const int MinLength = 3;private const int MaxLength = 4;private const int MaxBackReference = 4;public void Write(byte[] input){var writer = new BitWriter(stream);// begin of block 0b011 (LSB first)writer.WriteBits(0b110, 3);var cursor = 0;while (cursor < input.Length){// find a matchvar bestIndex = -1;var bestLength = 0;for (var i = Math.Max(0, cursor - MaxBackReference); i < cursor; i++){var len = 0;while (cursor + len < input.Length &&input[i + len] == input[cursor + len] &&len < MaxLength){len++;}if (len <= bestLength){continue;}bestLength = len;bestIndex = i;}if (bestLength >= MinLength){// (length, distance)var length = bestLength - MinLength + 1;var distance = (uint)(cursor - bestIndex - 1);// distancewriter.WriteBits((uint)length, 7);writer.WriteBits(distance, 5);cursor += bestLength;}else{// literalvar literal = (uint)(input[cursor] + 48); // Fixed Huffmanwriter.WriteBits(literal, 8);cursor++;}}// end of blockwriter.WriteBits(0b0000000, 7);writer.Flush();}}
A separate class BitWriter (BitWriter.cs) was introduced to keep a minimum amount of red tape. Although BitWriter does not
have the same contract as e.g. BinaryWriter integrating it in the existing code is easy:
-var deflateStream = new DeflateStream(outputStream, CompressionLevel.Optimal);+var deflateStream = new Deflater(outputStream);
Again, Info-ZIP’s zip agrees:
dotnet runzip -T minimal.zip
test of minimal.zip OK
Writing more extensive Huffman codes (either fixed or dynamic) is left as an exercise for the reader or out of scope
as Bertrik likes to say.
Next up
In the next part another dependency will be removed; calculating the CRC-32.
-
Very important: RFC 1951 Errata ↩︎
-
Doing fixed Huffman instead of LZ77 was also an option, but it actually inflates with the compression data in this case. ↩︎