I’m close to finishing up the implementation of the text compression algorithm. Along the way, I ran quite a few experiments with different parts of the program. To keep things open for experimentation, I introduced a few abstractions in the code. These abstractions were useful while I was still tweaking the algorithm, but the downside was that the code became more complicated than it needed to be. Now that I’m done with the experiments and have selected the best solutions, I no longer need to keep these abstractions in the code, or at least I don’t need to expose them to the outside world.

In this post, I’m going to review the work I’ve done so far and see which parts of the program can be removed from the final version.

Working with substrings

My most recent task was to come up with an efficient data structure for managing the list of substrings and their counts while the algorithm learns the common substrings from the source text.

At the very beginning of the project, I started with a simple HashMap, but shortly after switched to a BTreeMap to avoid unnecessary key sorting. In turn, the BTreeMap also proved inefficient for the task, so I ended up implementing a trie data structure. Using the trie gave a huge performance boost to the compression algorithm, thanks to its efficient substring search.
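To give a feel for the structure, here’s a minimal counting trie, roughly along these lines. This is an illustrative sketch with simplified names, not the actual code:

```rust
use std::collections::HashMap;

// A minimal counting trie: each node maps the next byte to a child
// node and stores how many times the substring ending here was seen.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    count: usize,
}

impl TrieNode {
    // Walk the bytes of the substring, creating nodes as needed,
    // and bump the count at the final node.
    fn insert(&mut self, substring: &[u8]) {
        let mut node = self;
        for &byte in substring {
            node = node.children.entry(byte).or_default();
        }
        node.count += 1;
    }

    // Look up a substring's count, if it has been seen at all.
    fn count_of(&self, substring: &[u8]) -> Option<usize> {
        let mut node = self;
        for &byte in substring {
            node = node.children.get(&byte)?;
        }
        (node.count > 0).then_some(node.count)
    }
}
```

The win comes from prefix sharing: substrings with a common prefix share a path in the trie, so a search advances one byte at a time instead of comparing whole keys over and over.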

For experimentation, I created the SubstringCounts trait with two different implementations: BTreeSubstringCounts and TrieSubstringCounts. The latter proved to be the winner, and I don’t need the b-tree implementation anymore, so it’s the first thing to let go. And since that leaves me with a single implementation of SubstringCounts, I don’t think the abstraction is worth keeping either: it doesn’t seem to give me any benefits.

That became my first clean-up task. I removed the unused BTreeSubstringCounts implementation and merged the SubstringCounts trait with TrieSubstringCounts. I liked the name SubstringCounts, so I re-used it for the final struct.
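The shape of that change, roughly (building on the trie sketch above; the real fields and signatures differ):

```rust
// Before: a SubstringCounts trait with two implementations,
// BTreeSubstringCounts (removed) and TrieSubstringCounts (kept).
//
// After: the trait is gone, and the trie implementation takes
// over the name directly.
pub struct SubstringCounts {
    root: TrieNode, // the counting trie from the sketch above
}

impl SubstringCounts {
    pub fn insert(&mut self, substring: &str) {
        self.root.insert(substring.as_bytes());
    }

    pub fn count_of(&self, substring: &str) -> Option<usize> {
        self.root.count_of(substring.as_bytes())
    }
}
```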

Selecting substrings for encoding

Another subject for experiments was determining which substrings make their way into the final encoding table. I had two options to choose from:

  • select the most frequently occurring substrings, looking solely at their counts;
  • select the substrings by their overall compression gain, which takes into account both the frequency and the length of each substring.

As it turned out, it didn’t make much of a difference which approach I chose. Frequency-based selection proved to be marginally better, and also simpler to reason about, so I kept it for the final version. That led to simplifications in the SubstringSelector struct, since I no longer needed to maintain two different ways of selecting substrings.
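Stripped of details, frequency-based selection boils down to something like this (hypothetical names, not the actual SubstringSelector code):

```rust
// Pick the top-N substrings by raw occurrence count.
fn select_by_frequency(mut counts: Vec<(String, usize)>, table_size: usize) -> Vec<String> {
    // Sort descending by count; ties are broken arbitrarily here.
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    counts
        .into_iter()
        .take(table_size)
        .map(|(substring, _count)| substring)
        .collect()
}
```

The gain-based variant would sort by a score along the lines of count * (length - 1), estimating how many bytes each substring’s occurrences could save; in my experiments, that extra sophistication didn’t pay off.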

I also played with the idea of getting rid of SubstringSelector altogether and moving its responsibilities into the SubstringLedger struct, but eventually decided against it. I like keeping that logic separate from the rest of the code: it allows me to test it more easily in isolation.
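That isolation pays off in tests: a selection test needs nothing but counts in and substrings out. With the hypothetical helper from the sketch above, it could look like this:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // No ledger or encoder setup needed.
    #[test]
    fn picks_most_frequent_substrings() {
        let counts = vec![
            ("th".to_string(), 10),
            ("qz".to_string(), 1),
            ("he".to_string(), 7),
        ];

        let selected = select_by_frequency(counts, 2);

        assert_eq!(selected, vec!["th", "he"]);
    }
}
```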

Substring ledger book-keeping

Yet another area where I spent quite a bit of time and effort was limiting the size of the substring ledger. As I usually do, I started with a simple approach, where I kept all repeated substrings. Later I applied a strategy that keeps the ledger size limited: check the size limit whenever a new substring is added to the ledger, and evict rare substrings when we need to free up space for new ones.
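In spirit, the policy amounts to something like this; the interface below is a hypothetical sketch, not the real (and now internal) one:

```rust
// Decide when the ledger is full, and make room by evicting
// the rarest entries.
trait LedgerPolicy {
    fn should_insert(&self, current_size: usize) -> bool;
    fn free_up_space(&self, counts: &mut Vec<(String, usize)>);
}

struct LimitLedgerSize {
    max_entries: usize,
}

impl LedgerPolicy for LimitLedgerSize {
    fn should_insert(&self, current_size: usize) -> bool {
        current_size < self.max_entries
    }

    fn free_up_space(&self, counts: &mut Vec<(String, usize)>) {
        // Keep the most frequent entries and drop the rarest half.
        counts.sort_by(|a, b| b.1.cmp(&a.1));
        counts.truncate(self.max_entries / 2);
    }
}
```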

I experimented with different ledger sizes and found that the limit on the ledger size has a significant impact on the overall compression ratio. However, past a certain threshold, it didn’t make much difference how many substrings we kept.

The main takeaway from these experiments is that it makes sense to limit the ledger size to 65,536 (2^16) entries, since bigger numbers don’t impact the compression ratio much. I adopted this limit in the final version of the main encode function.

As for further simplifications, it proved useful to keep the ability to apply different ledger policies, particularly for testing. By preserving this bit of flexibility, I can plug simpler test versions of the ledger policies into the learning algorithm: a technique known as test doubles.
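For instance, a test double can simply never evict anything, taking eviction behavior out of the picture while a test focuses on the learning itself (again using the hypothetical interface sketched above):

```rust
// A test double: a ledger policy that never evicts.
struct UnlimitedLedger;

impl LedgerPolicy for UnlimitedLedger {
    fn should_insert(&self, _current_size: usize) -> bool {
        true
    }

    fn free_up_space(&self, _counts: &mut Vec<(String, usize)>) {
        // Nothing to do: this double never frees space.
    }
}
```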

However, I no longer expose the ledger policies to the outside world: all the details are hidden inside the encoder module. The only thing exported from this module is the encode() function, which takes care of all the pesky details.
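The resulting module boundary looks roughly like this; the module layout and the constant are my sketch, not the actual code:

```rust
// encoder/mod.rs (sketch): everything but encode() stays private.
//
//   mod counts;    // SubstringCounts (the trie)
//   mod ledger;    // SubstringLedger and the ledger policies
//   mod selector;  // SubstringSelector
//
// The ledger size limit settled on during the experiments,
// kept internal to the module.
const MAX_LEDGER_ENTRIES: usize = 65_536;

// The single public entry point of the encoder module.
pub fn encode(input: &str) -> Vec<u8> {
    // ... learn substrings (capped at MAX_LEDGER_ENTRIES),
    // select the encoding table, emit the encoded bytes ...
    todo!()
}
```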

Next steps

It looks like the core functionality of the text compression algorithm is now ready and polished. To celebrate this important milestone, I’m tagging the current state as version 0.1.0.

One other thing I’d like to focus on next is making the encode() and decode() functions more robust in terms of handling invalid inputs. That leads me into as-yet unexplored territory: error handling in Rust.