joriszwart.nl

Data compression

Minimal ZIP file – part III

This article is part of a series.

  1. Creating a valid ZIP-file
  2. Use .NET’s DeflateStream
  3. No dependencies (coding the DEFLATE algorithm) 👈
  4. Calculating the CRC-32 (Work in Progress)

Introduction

In the previous part a little cheating was done by using an off-the-shelve DeflateStream.

So how challenging is it to implement the DEFLATE algorithm from scratch? It depends!

About DEFLATE

A DEFLATE stream is a combination of LZ771 and Huffman coding2. Easy, right? I did both of them before, but creating valid ZIP-files with both of them is another animal.

DEFLATE is documented in RFC 19513 which is only 17 pages. However, after reading it many times and looking at existing implementations (some of them over 1000 lines of cryptic C code), it left me puzzled.

Solution

Ultimately, I decided to take a bunch of shortcuts and ended up with a very simple Deflater for educational purposes. Only LZ77 is implemented without dynamic Huffman, and fixed Huffman codes4 are hardcoded. This explains points 1 and 2 below. Fixed Huffman codes only work with small values.

The limits of this implementation are:

  1. ASCII characters only.
  2. Maximum back reference is 4 characters, maximum length is 10 characters.
  3. Pretty buggy due to misunderstandinglaziness of LSB / MSB on my part.

Despite these horrible limits, a string like ‘BANANABANANABANANABANANABANANA’ gets 20% smaller. Success!

Source code

The idea behind the algorithm is to brute force search for repeating runs of characters, when found a pair (length, distance) is written out instead of the run. If no run is found, a so-called literal is written.

There is no gain or even a loss for small runs, but longer runs take advantage.

public class Deflater(Stream stream)
{
    private const int MinLength = 3;
    private const int MaxLength = 4;
    private const int MaxBackReference = 4;

    public void Write(byte[] input)
    {
        var writer = new BitWriter(stream);

        // begin of block 0b011 (LSB first)
        writer.WriteBits(0b110, 3);

        var cursor = 0;
        while (cursor < input.Length)
        {
            // find a match
            var bestIndex = -1;
            var bestLength = 0;

            for (var i = Math.Max(0, cursor - MaxBackReference); i < cursor; i++)
            {
                var len = 0;
                while (cursor + len < input.Length &&
                       input[i + len] == input[cursor + len] &&
                       len < MaxLength)
                {
                    len++;
                }

                if (len <= bestLength)
                {
                    continue;
                }

                bestLength = len;
                bestIndex = i;
            }

            if (bestLength >= MinLength)
            {
                // (length, distance)
                var length = bestLength - MinLength + 1;
                var distance = (uint)(cursor - bestIndex - 1);

                // distance
                writer.WriteBits((uint)length, 7);
                writer.WriteBits(distance, 5);

                cursor += bestLength;
            }
            else
            {
                // literal
                var literal = (uint)(input[cursor] + 48); // Fixed Huffman
                writer.WriteBits(literal, 8);
                cursor++;
            }
        }

        // end of block
        writer.WriteBits(0b0000000, 7);

        writer.Flush();
    }
}
Source code of LZ77.cs

A separate class BitWriter (BitWriter.cs) was introduced to keep a minimum amount of red tape. Although BitWriter does not have the same contract as e.g. BinaryWriter integrating it in the existing code is easy:

-var deflateStream = new DeflateStream(outputStream, CompressionLevel.Optimal);
+var deflateStream = new Deflater(outputStream);

Again, Info-ZIP’s zip agrees:

dotnet run
zip -T minimal.zip
test of minimal.zip OK

Writing more extensive Huffman codes (either fixed or dynamic) is left as an exercise for the reader or out of scope as Bertrik likes to say.

Next up

In the next part another dependency will be removed; calculating the CRC-32.


  1. en.wikipedia.org/wiki/LZ77_and_LZ78 ↩︎

  2. en.wikipedia.org/wiki/Huffman_coding ↩︎

  3. Very important: RFC 1951 Errata ↩︎

  4. Doing fixed Huffman instead of LZ77 was also an option, but it actually inflates with the compression data in this case. ↩︎

Related