The results of my previous experiments puzzled me a bit, so I decided to spend some time analyzing the whole algorithm to see if I could spot any clues that would explain them. It turned out that there’s a chance we are not using the space in the encoding table as efficiently as we could. Specifically, we may include substrings in the table that are never used in the encoding phase.

In this post, I’m going to explain why that happens and do some more experiments to see if we can improve the situation.

How we end up with wasted space

Let’s take a look at the following example string:

hello world hello world hello world hello world hello world hello hello world world

When we run the learning algorithm to find common substrings, we end up with the following result:

" hello world"
" hello w"
" hel"
" wor"
"lo w"
"orld"

Notice the second substring, " hello w". It was learned by the algorithm, but it’s actually useless for the encoding process, because it’s only a prefix of " hello world". It was useful in the learning phase, as a stepping stone towards " hello world", but in the end we don’t need it: the encoding algorithm will use " hello world" instead, and nowhere else in the text is " hello w" applicable on its own.
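To make that concrete, here is a minimal sketch of a greedy encoder loop, assuming the encoder always picks the longest table entry matching at the current position (which is exactly what makes the shorter prefix useless). The names below are illustrative, not the project’s actual code:

// Illustrative sketch: greedy longest-match encoding.
// With both " hello world" and " hello w" in the table, the longer entry
// wins everywhere the shorter one matches, so " hello w" is never emitted.
enum Token {
    TableRef(usize), // index of a table entry
    Literal(u8),     // raw byte, when no table entry matches
}

fn encode(text: &[u8], table: &[Vec<u8>]) -> Vec<Token> {
    let mut output = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // Longest table entry that matches at the current position, if any.
        let best = table
            .iter()
            .enumerate()
            .filter(|(_, entry)| !entry.is_empty() && text[pos..].starts_with(entry.as_slice()))
            .max_by_key(|(_, entry)| entry.len());
        match best {
            Some((index, entry)) => {
                output.push(Token::TableRef(index));
                pos += entry.len();
            }
            None => {
                output.push(Token::Literal(text[pos]));
                pos += 1;
            }
        }
    }
    output
}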

My hypothesis is that we end up with quite a few such substrings in the final encoding table, and that’s why the compression ratio degrades for larger ledger sizes. I’m going to explore this further.

Tracking unused encoding table slots

It’s quite easy to track which slots in the encoding table were actually used in the process, with a small addition to the EncodingTable struct.
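As a rough sketch of what this tracking could look like (the real EncodingTable in the project may be laid out differently; the field and method names here are my own):

use std::collections::HashSet;

// Illustrative sketch: remember which table slots were referenced during encoding.
struct EncodingTable {
    entries: Vec<Vec<u8>>, // up to 256 learned substrings
    used: HashSet<usize>,  // indices that were referenced at least once
}

impl EncodingTable {
    fn mark_used(&mut self, index: usize) {
        self.used.insert(index);
    }

    // Reported below as "Used Entries": used slots / total slots.
    fn usage(&self) -> (usize, usize) {
        (self.used.len(), self.entries.len())
    }
}

With that in place, I collected the data on the wasted space: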

Max Ledger Size Used Entries Compression Ratio
CaptureAll 251/256 25.36%
256 21/23 6.89%
512 85/86 13.20%
1 024 228/228 20.26%
2 048 256/256 26.07%
4 096 254/256 27.16%
8 192 255/256 27.28%
16 384 253/256 26.87%
32 768 251/256 26.20%
65 536 250/256 26.02%

Indeed, we can see that there are some unused slots in the encoding table. The number of unused slots isn’t too big, but I still believe it can have an impact on the overall compression ratio, given that we only have 256 available slots in the table. The number of unused slots also correlates well with the drop in the compression ratio.

Experimenting with a different way to select substrings

Initially, my idea was to pick substrings for the final encoding table by their “compression gain”. The compression gain takes into account both the size of a substring and its frequency in the text:

compression_gain = (original_size - encoded_size) * count
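For illustration (with made-up numbers): if a 12-byte substring such as " hello world" is replaced by a 1-byte reference and occurs 5 times, its gain is (12 - 1) * 5 = 55 bytes saved; a 4-byte substring occurring 20 times yields (4 - 1) * 20 = 60 and would rank higher, despite being much shorter.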

It looked like a reasonable approach, but maybe it’s not the best strategy. Both original articles suggest selecting substrings solely based on their occurrence frequency. I’m curious to see how switching to the frequency-based approach would perform, especially when combined with limiting the dictionary size. My intuitive understanding is that this approach will select shorter substrings with higher frequencies, and that may mitigate the wasted space problem, since short strings are more likely to be used throughout the text.

I would still like to keep both approaches in the code, though, because I’d like to have the opportunity to experiment with both strategies in the future. To achieve that, I’m going to extract the substring selection code from the SubstringLedger struct and put it into a separate entity called SubstringSelector. Separating the code makes it easier to work with, and it also looks cleaner in the codebase:

  • SubstringLedger is now only responsible for managing the substrings during the learning phase;
  • SubstringSelector is responsible for selecting the appropriate subset of substrings for the final encoding table. It also gives us a choice of how to select the substrings: either by their compression gain or by their frequency (see the sketch after this list).
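To give an idea of the shape of this split, here is a rough sketch of what the selector could look like. Apart from the name SubstringSelector itself, the types, fields, and method names below are my own guesses, not the project’s actual API:

// Illustrative sketch of the extracted selector and its two strategies.
enum SelectionStrategy {
    ByCompressionGain, // (original_size - encoded_size) * count
    ByFrequency,       // occurrence count only
}

struct Candidate {
    substring: Vec<u8>,
    count: usize,
}

struct SubstringSelector {
    strategy: SelectionStrategy,
    table_size: usize, // e.g. 256 slots
}

impl SubstringSelector {
    // Pick the top `table_size` candidates for the final encoding table.
    fn select(&self, mut candidates: Vec<Candidate>) -> Vec<Candidate> {
        let encoded_size = 1; // assumed size of a table reference, in bytes
        candidates.sort_by_key(|c| {
            let score = match self.strategy {
                SelectionStrategy::ByCompressionGain => {
                    c.substring.len().saturating_sub(encoded_size) * c.count
                }
                SelectionStrategy::ByFrequency => c.count,
            };
            std::cmp::Reverse(score)
        });
        candidates.truncate(self.table_size);
        candidates
    }
}

The learning phase stays the same in either case; only the final selection step changes depending on the chosen strategy.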

The results

Having implemented the necessary changes, I ran my experiments again, this time using the frequency-based approach. The results are presented in the table below:

Max Ledger Size Used Entries Compression Ratio
CaptureAll 256/256 26.46%
256 21/23 6.89%
512 85/86 13.20%
1 024 228/228 20.26%
2 048 256/256 25.87%
4 096 254/256 27.36%
8 192 256/256 27.07%
16 384 255/256 27.62%
32 768 256/256 27.46%
65 536 256/256 27.51%

There’s definitely some improvement here. It’s not super dramatic, and I believe that for a variety of texts the difference between these two approaches will be negligible.

It is remarkable, though, that the baseline CaptureAll strategy shows worse results than the approach with a limited ledger size. However, I find it hard to reason about why that’s so. My best guess is that the approach with a limited ledger size picks up shorter, but more frequent substrings, eventually leading to a slightly better compression ratio.

The main takeaway from these experiments is that it makes no sense to keep all found substrings in memory. Beyond a certain threshold, it doesn’t benefit the compression ratio anymore, so limiting the size of the ledger is the way to go.

Next steps

My current progress is available on GitHub under the tag 0.0.7.

It looks to me that the size of the encoding table is quite an important parameter, since eliminating wasted slots has led to improvements in the compression ratio. So the next step in the project will be expanding the encoding table to fully utilize the benefits of the 2-byte encoding.