
Welcome

The best way to learn a programming language is to use it to build something interesting. Lately I’ve been spending some time learning Rust, and at some point I started to feel the itch to go beyond toy exercises and start something more tangible. I wanted my first project to tackle an interesting problem, but not be too complex, since I’m only taking my first steps in Rust.

An old but classic book, Etudes for Programmers by Charles Wetherell, gave me a plethora of ideas. I decided to take on an etude called Ye Soule of Witte, which challenges the reader to write a program that implements a text compression algorithm.

I’ve also decided to document my journey through the project and record my learnings in a series of blog posts. Learning something new is fun, but the new knowledge can quickly fade away. I’ve found that taking notes on what I’ve learned is a great way to solidify my understanding of the subject.

The source code for the project is available on GitHub.

Project diary

  • Text compression: an overview

    Text data usually contains a lot of redundancy. Indeed, in natural language texts, sequences of characters, words, and even whole phrases are repeated very often throughout the text. Using that fact, we can make the text much shorter by replacing the most commonly occurring patterns with shorter representations.
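
    As a toy illustration of this idea (a hypothetical example, not the scheme used in the project), a single frequent pattern can be swapped for a one-byte placeholder and swapped back on decompression:

        // Toy dictionary substitution: replace one frequent pattern with a
        // short placeholder token. Hypothetical example for illustration only.
        fn toy_compress(text: &str, pattern: &str, token: &str) -> String {
            text.replace(pattern, token)
        }

        fn toy_decompress(text: &str, pattern: &str, token: &str) -> String {
            text.replace(token, pattern)
        }

        fn main() {
            let text = "the cat sat on the mat near the hat";
            let compressed = toy_compress(text, "the ", "\u{1}");
            assert_eq!(toy_decompress(&compressed, "the ", "\u{1}"), text);
            println!("{} -> {} bytes", text.len(), compressed.len());
        }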

  • The first iteration: Make It Work

    At the heart of the text compression is the algorithm that builds the dictionary of the most frequent substrings. The original algorithm focuses heavily on keeping the dictionary size limited. It does so by applying strategies for adding new substrings and removing the least frequent ones when the size limit is reached. There’s a lot of room for experimentation here.
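
    As a bare-bones sketch of the learning idea (hypothetical and much simpler than the strategies discussed in the post), one can slide over the text and count how often short substrings occur, keeping the counts as dictionary candidates:

        use std::collections::HashMap;

        // Count occurrences of two-character substrings in the text.
        // Hypothetical sketch; the project's learning strategies are more elaborate.
        fn count_digrams(text: &str) -> HashMap<&str, usize> {
            let mut counts = HashMap::new();
            for i in 0..text.len().saturating_sub(1) {
                // Only slice on valid UTF-8 character boundaries.
                if text.is_char_boundary(i) && text.is_char_boundary(i + 2) {
                    *counts.entry(&text[i..i + 2]).or_insert(0) += 1;
                }
            }
            counts
        }

        fn main() {
            let counts = count_digrams("the cat sat on the mat");
            assert_eq!(counts["th"], 2);
            println!("{} distinct digrams", counts.len());
        }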

  • Splitting crates

    As I mentioned in the previous post, I’m not happy with the performance of my initial implementation, and I’m eager to dive into exploring the bottlenecks. But to make experiments more convenient, I would like to create a second, simpler executable that I could use in performance testing. It will perform the compression on a single input file, so the execution time will be much shorter compared to the main executable, which processes many files.

  • Profiling with Flamegraphs

    To analyze where my program is spending most of its time, I’m going to need a profiler to collect the execution data. There are a few existing options for Rust, notably:

  • Tackling the performance bottleneck

    So far, I’ve done a round of performance profiling and identified the bottleneck. In this post, I’m going to describe how I tackled the problem. Spoiler alert: for now, I only alleviate the performance issue rather than solve it completely.

  • Some bug fixes

    I’ve played with the program for a while, and found a few bugs that slipped through the unit test suite. As embarrassing as it is, bugs are inevitable, and manual exploratory testing is always necessary, even in the TDD process. The way I usually work with bug fixes is that, once the bug is found, I first write a test to reproduce it, and then I fix the code. This process strengthens the automated test suite and ensures that the bug is fixed and doesn’t come back in the future.

  • Refactoring a messy function

    As I mentioned in the previous post, I’m not very happy with the current implementation of the build_ledger_step function. Today I discuss what my concerns were, and describe the refactoring process that I went through to address them.

  • Limiting the ledger size

    Now it’s time to look into another aspect of building the substring dictionary: how do we keep its size limited? So far, I’ve been skipping this part, letting it grow indefinitely, but the original task is to put a limit on the size of the substring ledger while we’re looking for common substrings.
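
    A minimal sketch of the general idea, assuming a hypothetical ledger keyed by substring with occurrence counts, a hard size cap, and a least-frequent eviction policy (the strategies explored in the post may differ):

        use std::collections::HashMap;

        // Hypothetical capped substring ledger: new substrings bump a counter,
        // and when the cap is reached the least frequent entry is evicted.
        struct SubstringLedger {
            counts: HashMap<String, u32>,
            max_size: usize,
        }

        impl SubstringLedger {
            fn new(max_size: usize) -> Self {
                Self { counts: HashMap::new(), max_size }
            }

            fn record(&mut self, substring: &str) {
                if let Some(count) = self.counts.get_mut(substring) {
                    *count += 1;
                    return;
                }
                if self.counts.len() >= self.max_size {
                    // Find and evict the least frequent substring to make room.
                    let least = self
                        .counts
                        .iter()
                        .min_by_key(|(_, &count)| count)
                        .map(|(s, _)| s.clone());
                    if let Some(least) = least {
                        self.counts.remove(&least);
                    }
                }
                self.counts.insert(substring.to_string(), 1);
            }
        }

        fn main() {
            let mut ledger = SubstringLedger::new(2);
            for s in ["th", "he", "th", "in"] {
                ledger.record(s);
            }
            // "he" (seen once) was evicted to make room for "in".
            assert!(ledger.counts.contains_key("th") && ledger.counts.contains_key("in"));
        }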

  • Experimenting with substring ledger limits

    Having introduced the substring ledger limit, I can now experiment with different values to see how they affect the compression ratio and the time performance of the compression algorithm. My expectation is that the time performance will improve, because a smaller ledger means we’ll need to sift through fewer substrings to find the longest match. It’s not yet clear how the compression ratio will be affected.

  • Wasted space in the encoding table

    The results of my previous experiments puzzled me a bit, so I decided to spend some time analyzing the whole algorithm to see if I could spot any clues that would explain them. It turned out that we might not be using the space in the encoding table as efficiently as we could. Specifically, we may include some substrings in the table that are never used in the encoding phase.

  • Widening the encoding table

    Until now, I’ve been using a simplified encoding scheme, which limited the size of the encoding table to 256 entries. It can be made larger, though. In this post, I describe how I’ve done it, and what impact it has had on the compression ratio.
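
    One possible way to grow past 256 entries (an illustration only; the post may use a different scheme) is to emit each dictionary code as two bytes, which allows up to 65,536 entries:

        // Hypothetical fixed-width scheme: each code is written as two bytes
        // (big-endian u16), so the table can hold up to 65,536 entries.
        fn emit_code(output: &mut Vec<u8>, code: u16) {
            output.extend_from_slice(&code.to_be_bytes());
        }

        fn read_code(input: &[u8]) -> Option<(u16, &[u8])> {
            if input.len() < 2 {
                return None;
            }
            let code = u16::from_be_bytes([input[0], input[1]]);
            Some((code, &input[2..]))
        }

        fn main() {
            let mut encoded = Vec::new();
            emit_code(&mut encoded, 300); // a code that would not fit in one byte
            let (code, rest) = read_code(&encoded).unwrap();
            assert_eq!(code, 300);
            assert!(rest.is_empty());
        }

    The trade-off is that every code now costs two bytes, even for the most common substrings, so variable-width codes are another option worth considering.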

  • More efficient substring search

    So far, my primary goal was to optimize the compression algorithm for the best compression ratio. In the last post, I arrived at the solution that gives quite satisfactory results. Now I’d like to shift the focus and address another elephant in the room: the time efficiency of the compression algorithm. Right now, it’s quite slow on long input texts. I’d like to tackle this issue and speed up the algorithm by introducing a data structure that executes substring search more efficiently.
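
    For reference, here is a minimal sketch of the kind of structure involved, assuming a hypothetical byte-keyed trie with a longest-prefix lookup (simplified relative to the actual implementation):

        use std::collections::HashMap;

        // Minimal byte-keyed trie sketch: insert() stores a code for a substring,
        // longest_match() finds the longest stored prefix of the input.
        // Hypothetical and simplified relative to the project's implementation.
        #[derive(Default)]
        struct TrieNode {
            children: HashMap<u8, TrieNode>,
            code: Option<u16>,
        }

        impl TrieNode {
            fn insert(&mut self, key: &[u8], code: u16) {
                match key.split_first() {
                    None => self.code = Some(code),
                    Some((&byte, rest)) => {
                        self.children.entry(byte).or_default().insert(rest, code)
                    }
                }
            }

            // Returns the code and length of the longest stored prefix of `input`.
            fn longest_match(&self, input: &[u8]) -> Option<(u16, usize)> {
                let mut node = self;
                let mut best = None;
                for (i, &byte) in input.iter().enumerate() {
                    match node.children.get(&byte) {
                        Some(next) => {
                            node = next;
                            if let Some(code) = node.code {
                                best = Some((code, i + 1));
                            }
                        }
                        None => break,
                    }
                }
                best
            }
        }

        fn main() {
            let mut trie = TrieNode::default();
            trie.insert(b"th", 0);
            trie.insert(b"the ", 1);
            // The longest known prefix of "the cat" is "the " (4 bytes, code 1).
            assert_eq!(trie.longest_match(b"the cat"), Some((1, 4)));
        }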

  • Optimizing the encoding process

    In the previous post I implemented a trie data structure for fast substring lookups during the learning phase of the compression algorithm. I also mentioned that there was one more place where tries could improve the performance: the encoding phase. It didn’t bother me much then, but now I’ve decided to bite the bullet and eliminate this inefficiency.

  • Review and cleanup

    I’m close to finishing up the implementation of the text compression algorithm. Along the way, I did quite a few experiments with different parts of the program. To keep things open for experimentation, I introduced a few abstractions in the code. These abstractions were useful while I was still tweaking the algorithm, but the downside was that the code became more complicated than it needed to be. Now that I’m done with the experiments and have selected the best solutions, I no longer need to keep these abstractions in the code, or at least I don’t need to expose them to the outside world.

  • Tidbits of error handling in Rust

    Error handling in Rust is well documented in multiple sources, so let’s not waste space on obvious things. In this post, I’d like to share a few things I’ve learned on top of the basics.

  • Handling invalid inputs

    Up to this point, I was mostly focusing on getting things to work, and blissfully ignoring various error conditions that can occur. Now, I’d like to switch gears and pay more attention to error handling. The two main functions of the compression algorithm are encode() and decode(), so let’s focus on them. In particular, let’s examine whether these functions can receive invalid inputs, and how they should respond to them.
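
    As an illustration of the general shape such checks can take (hypothetical types and signatures, not necessarily the ones used in the project), a decode step can report a code with no table entry as an error instead of panicking:

        use std::collections::HashMap;

        // Hypothetical illustration of input validation in a decode step: a code
        // with no entry in the table is reported as an error instead of panicking.
        #[derive(Debug)]
        enum DecodeError {
            UnknownCode(u16),
        }

        fn decode(codes: &[u16], table: &HashMap<u16, String>) -> Result<String, DecodeError> {
            let mut output = String::new();
            for &code in codes {
                match table.get(&code) {
                    Some(substring) => output.push_str(substring),
                    None => return Err(DecodeError::UnknownCode(code)),
                }
            }
            Ok(output)
        }

        fn main() {
            let table = HashMap::from([(0, "the ".to_string()), (1, "cat".to_string())]);
            assert_eq!(decode(&[0, 1], &table).unwrap(), "the cat");
            assert!(decode(&[0, 7], &table).is_err());
        }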

  • Wrapping up the project

    I think this is a good point to wrap up this learning project. No doubt, there are still a lot of features that could be implemented, and improvements to be made. However, this was meant to be a training exercise, not a complete production-ready solution. I’ve spent quite a bit of time working on this etude, and achieved some interesting results. I will summarize the most significant milestones in this post.
