Wasted space in the encoding table
The results of my previous experiments puzzled me a bit, so I decided to spend some time analyzing the whole algorithm to see if I could spot any clues that would explain them. It turned out that there’s a chance we are not using the space in the encoding table as efficiently as we could. Specifically, we may include substrings in the table that are never used during the encoding phase.
In this post, I’m going to explain why that happens and do some more experiments to see if we can improve the situation.
How we end up with wasted space
Let’s take a look at the following example string:
hello world hello world hello world hello world hello world hello hello world world
When we run the learning algorithm to find the common substrings, we end up with the following result:
" hello world"
" hello w"
" hel"
" wor"
"lo w"
"orld"
Notice the second substring, " hello w". It was learned by the algorithm, but it’s actually useless for the encoding process, because it’s only a prefix of " hello world". It was useful during the learning phase as a stepping stone towards " hello world", but in the end we don’t need it: the encoding algorithm will always use " hello world" instead, and nowhere else in the text is " hello w" applicable on its own.
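To make that concrete, here is a minimal sketch of a greedy encoder that always takes the longest matching table entry at the current position, which is what the example above implies. The greedyEncode function and its signature are my own illustration, not the project’s actual code:

```swift
// Greedy longest-match tokenizer: at each position, emit the longest table
// entry that matches, otherwise emit a single literal character.
// (Illustration only; the real encoder replaces matches with byte codes.)
func greedyEncode(_ text: String, table: [String]) -> [String] {
    // Try longer entries first, so " hello world" wins over " hello w".
    let entries = table.sorted { $0.count > $1.count }
    var tokens: [String] = []
    var rest = text[...]
    while !rest.isEmpty {
        if let entry = entries.first(where: { rest.hasPrefix($0) }) {
            tokens.append(entry)               // matched a table entry
            rest = rest.dropFirst(entry.count)
        } else {
            tokens.append(String(rest.first!)) // no match: literal character
            rest = rest.dropFirst()
        }
    }
    return tokens
}
```

With this kind of matching, " hello w" can never win against its longer sibling, so the slot it occupies in the table is dead weight.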
My hypothesis is that we end up with quite a few such substrings in the final encoding table, and that’s why the compression ratio degrades for larger ledger sizes. I’m going to explore this further.
Tracking unused encoding table slots
It’s quite easy to track which slots in the encoding table are actually used in the process. A small addition to the EncodingTable struct was enough to collect data on the wasted space.
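The addition I have in mind is roughly the following. This is only a sketch assuming a simple array-backed table; the actual EncodingTable in the project looks different:

```swift
// Records which table slots the encoder actually referenced, so that the
// usage can be reported after an encoding run. (Sketch only; field and
// method names here are assumptions, not the project's real API.)
struct EncodingTable {
    let entries: [String]
    private(set) var usedSlots = Set<Int>()

    // Called by the encoder whenever it emits a reference to slot `index`.
    mutating func markUsed(_ index: Int) {
        usedSlots.insert(index)
    }

    var usageReport: String {
        "\(usedSlots.count)/\(entries.count) entries used"
    }
}
```

With that bookkeeping in place, here is what I measured: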
Max Ledger Size | Used Entries | Compression Ratio |
---|---|---|
CaptureAll | 251/256 | 25.36% |
256 | 21/23 | 6.89% |
512 | 85/86 | 13.20% |
1 024 | 228/228 | 20.26% |
2 048 | 256/256 | 26.07% |
4 096 | 254/256 | 27.16% |
8 192 | 255/256 | 27.28% |
16 384 | 253/256 | 26.87% |
32 768 | 251/256 | 26.20% |
65 536 | 250/256 | 26.02% |
Indeed, we can see that there are some unused slots in the encoding table. The number of unused slots isn’t large, but I still believe it can have an impact on the overall compression ratio, given that we only have 256 available slots in the table. The number of unused slots also correlates well with the drop in the compression ratio.
Experimenting with a different way to select substrings
Initially, my idea was to pick substrings for the final encoding table by their “compression gain”. The compression gain takes into account both the substring’s size and its frequency in the text:
compression_gain = (original_size - encoded_size) * count
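For example, assuming for the sake of the arithmetic that an encoded reference costs a single byte, the 12-byte substring " hello world" occurring 5 times has a gain of (12 − 1) × 5 = 55 bytes saved.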
It looked like a reasonable approach, but maybe it’s not the best strategy. Both original articles suggest selecting substrings solely based on their occurrence frequency. I’m curious to see how switching to the frequency-based approach would perform, especially when combined with limiting the dictionary size. My intuition is that this approach will select shorter substrings with higher frequencies, and that may mitigate the wasted space problem, since short strings are more likely to be used throughout the text.
I would still like to keep both approaches in the code, though, because I want the opportunity to experiment with both strategies in the future. To achieve that, I’m going to extract the substring selection code from the SubstringLedger struct and put it into a separate entity called SubstringSelector. Separating the code makes it easier to work with, and it also looks cleaner in the codebase:
- SubstringLedger is now only responsible for managing the substrings during the learning phase;
- SubstringSelector is responsible for selecting the appropriate subset of substrings for the final encoding table. It also gives us a choice of how to select the substrings: either by their compression gain or by their frequency (see the sketch below).
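A rough sketch of that split might look like this. The names SubstringLedger and SubstringSelector come from the project, but the signatures and the selection logic below are my own simplification, assuming the ledger hands over a substring-to-count map and that a table reference costs one byte:

```swift
// Picks the substrings that go into the final encoding table, either by
// compression gain or by raw frequency. (Sketch with assumed signatures.)
enum SelectionStrategy {
    case byCompressionGain
    case byFrequency
}

struct SubstringSelector {
    let strategy: SelectionStrategy
    let tableSize: Int   // e.g. 256 single-byte slots

    func select(from counts: [String: Int]) -> [String] {
        counts
            .sorted { score($0.key, count: $0.value) > score($1.key, count: $1.value) }
            .prefix(tableSize)
            .map { $0.key }
    }

    private func score(_ substring: String, count: Int) -> Int {
        switch strategy {
        case .byCompressionGain:
            // compression_gain = (original_size - encoded_size) * count
            return (substring.utf8.count - 1) * count
        case .byFrequency:
            return count
        }
    }
}
```

With this shape, the existing gain-based strategy and the new frequency-based one differ only in the scoring function.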
The results
Having implemented the necessary changes, I ran my experiments again, this time using the frequency-based approach. The results are presented in the table below:
Max Ledger Size | Used Entries | Compression Ratio |
---|---|---|
CaptureAll | 256/256 | 26.46% |
256 | 21/23 | 6.89% |
512 | 85/86 | 13.20% |
1 024 | 228/228 | 20.26% |
2 048 | 256/256 | 25.87% |
4 096 | 254/256 | 27.36% |
8 192 | 256/256 | 27.07% |
16 384 | 255/256 | 27.62% |
32 768 | 256/256 | 27.46% |
65 536 | 256/256 | 27.51% |
There’s definitely some improvement here. It’s not super dramatic, and I believe that for a variety of texts the difference between these two approaches will be negligible.
It is remarkable, though, that the CaptureAll baseline performs worse than the approach with a limited ledger size. I find it hard to reason about why that is. My best guess is that a limited ledger picks up shorter but more frequent substrings, which eventually leads to a slightly better compression ratio.
The main takeaway from these experiments is that it makes no sense to keep all discovered substrings in memory. After a certain threshold, it doesn’t benefit the compression ratio anymore, so limiting the size of the ledger is the way to go.
Next steps
My current progress is available on GitHub under the tag 0.0.7.
It looks to me like the size of the encoding table is quite an important parameter, since even eliminating a handful of wasted slots led to improvements in the compression ratio. So the next step in the project will be expanding the encoding table to fully utilize the benefits of the 2-byte encoding.