I’ve played with the program for a while and found a few bugs that slipped through the unit test suite. Embarrassing as it is, bugs are inevitable, and manual exploratory testing is always necessary, even with TDD. The way I usually handle bug fixes is that, once a bug is found, I first write a test to reproduce it, and only then fix the code. This strengthens the automated test suite and gives me confidence that the bug is fixed and won’t come back in the future.

Sometimes I need to refactor the code so that I can write the test more easily. My modus operandi is to strive for simple, readable tests, and if the design of the code doesn’t allow for that, I refactor the code. All too often folks resist making that change and instead try to write the test against the existing code design. In my practice, however, that leads to complicated and obscure test setups, which are hard to understand and maintain. Instead, I try to keep the tests simple and understandable, and let that drive the design of the code under test.

Having said that, let’s get to the bugs I’ve found so far.

Inclusion of single-occurrence strings

I spotted this one while analyzing the substrings found by the algorithm. Going through the list of the top five most impactful substrings, I found that the following substring occurs only once in the original text:

"follow.\n                                                         Exeunt.\n\n\n\n\n"

Indeed, that’s an error. It doesn’t make sense to extract a substring that occurs only once in the original text: we gain nothing from it, since the string’s content has to be present in the compressed output in some form anyway. A simple test revealed the problem:

#[test]
fn most_impactful_strings_skip_single_occurrence() {
    let mut dict = SubstringLedger::new();
    dict.insert_new(substring("aaaaaa"));

    dict.insert_new(substring("bb"));
    dict.increment_count(&substring("bb"));

    let most_impactful = dict.get_most_impactful_strings(&EncoderSpec {
        num_strings: 10,
        encoded_size: 1,
    });
    assert_eq!(vec!["bb"], most_impactful.to_vec());
}

The fix was trivial: just filter out the substrings whose count is less than 2.
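
For illustration, here’s the shape of that filter, assuming the ledger keeps (substring, count) pairs internally; the function and names below are mine, not the project’s actual internals:

fn impactful_candidates(entries: Vec<(String, usize)>) -> Vec<String> {
    entries
        .into_iter()
        // A substring seen only once can't make the output smaller.
        .filter(|(_, count)| *count >= 2)
        .map(|(substring, _)| substring)
        .collect()
}

fn main() {
    let entries = vec![("aaaaaa".to_string(), 1), ("bb".to_string(), 2)];
    assert_eq!(vec!["bb".to_string()], impactful_candidates(entries));
}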

Encoding strings with multi-byte characters

That bug revealed itself when I decided to switch to another test data source. “Hamlet” turned out to be too short, so I went for the longest piece of prose I knew: War and Peace by Leo Tolstoy. I grabbed the text from Project Gutenberg and started preparing the test data from it.

But, lo and behold, the program crashed:

File name: wap-100.txt
thread 'main' panicked at src/encoder/encode_string.rs:23:29:
byte index 1 is not a char boundary; it is inside '\u{feff}' (bytes 0..3) of `The Project Gutenberg eBook of War and Peace
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it u`[...]

To understand this bug, I had to learn more about working with strings in Rust. (The '\u{feff}' in the message, by the way, is the Unicode byte order mark at the start of the Project Gutenberg file: a single character that takes three bytes in UTF-8.)

Strings in Rust

The most common type for working with strings in Rust is String.

Strings are always valid UTF-8. Since UTF-8 is a multi-byte encoding, indexing becomes ambiguous: what should s[i] return, a character or a byte? Rust resolves the ambiguity by simply not supporting such an operation. To access the characters, you use the String::chars() method, which returns an iterator over the characters. To access the bytes, you use the String::as_bytes() method, which gives you a slice of the underlying bytes.
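
A minimal illustration, using nothing beyond the standard library:

fn main() {
    let s = String::from("héllo"); // 'é' takes two bytes in UTF-8

    // let c = s[1]; // does not compile: String can't be indexed by an integer

    // Five characters...
    assert_eq!(5, s.chars().count());
    // ...but six bytes.
    assert_eq!(6, s.as_bytes().len());
}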

The situation is a bit different when it comes to string slices. An operation like s[i..j] accepts byte indices i and j and returns a &str slice. But since the result needs to be a valid UTF-8 string, it panics if the indices don’t fall on character boundaries.
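
Continuing the example above, slicing at a byte index that falls inside a multi-byte character fails at runtime, just like in the panic I hit:

fn main() {
    let s = "héllo"; // 'h' is byte 0, 'é' occupies bytes 1..3
    assert_eq!("h", &s[0..1]); // fine: byte 1 is a character boundary

    // &s[0..2] panics: byte index 2 is not a char boundary,
    // it is inside 'é' (bytes 1..3)
}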

That’s exactly what was happening in my code. It turned out I was not respecting character boundaries when walking through the string during encoding, which caused the panic on text with multi-byte characters.

The test code to reveal the bug:

#[test]
fn encode_multibyte_string() {
    let source = "こんにちはこんにちは世界世界";

    let encoded = encode_string(&source, &make_dictionary(vec![]));
    assert_eq!(source.as_bytes(), encoded);
}

The fix was to make sure that we slice the string at correct character boundaries.
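
The gist of the fix, as a standalone sketch: back up to the nearest character boundary before cutting the slice. The helper below is illustrative, not the project’s actual code:

fn split_at_char_boundary(s: &str, max_len: usize) -> (&str, &str) {
    let mut end = max_len.min(s.len());
    // Byte 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    s.split_at(end)
}

fn main() {
    // Each character of "こんにちは" takes three bytes.
    let (head, rest) = split_at_char_boundary("こんにちは", 4);
    assert_eq!("こ", head); // backed up from byte 4 to byte 3
    assert_eq!("んにちは", rest);
}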

Skipping the substrings

The last bug I want to tackle is actually at the heart of the algorithm, the build_ledger function. I spotted it while preparing one of the previous posts.

The bug manifests itself as the algorithm skipping some substrings during merging. In particular, it finds a substring match at the start of the text and then tries to find a follow-up match. If the follow-up is found, we merge these two substrings and then jump to the rest of the text after the second match. But that way we never try to merge the second match with what may come after it!

To give you an example, let’s consider the following test scenario. Suppose at some step we have a SubstringLedger containing the substrings ["ca", "me", "lot"], and the text we’re processing next is "camelot". We find the first match, "ca", and then look for a follow-up match. We find "me", merge the two substrings into "came", jump to the rest of the text after the second match, and go on to process "lot". But we haven’t tried to merge "me" with "lot"!

Putting it all together into the test code that goes through the process step by step:

#[test]
fn merge_three_consecutive_substrings() {
    let mut ledger = SubstringLedger::new();
    ledger.insert_new(substring("ca"));
    ledger.insert_new(substring("me"));
    ledger.insert_new(substring("lot"));

    let source = "camelot";

    // Processing "ca" + "me" = "came"
    let next_head = build_ledger_step(source, &mut ledger);
    assert_eq!(
        vec![("came", 1), ("lot", 1), ("ca", 2), ("me", 1)],
        ledger.entries()
    );

    // Processing "me" + "lot" = "melot"
    let next_head = build_ledger_step(next_head, &mut ledger);
    assert_eq!(
        vec![("melot", 1), ("came", 1), ("lot", 1), ("ca", 2), ("me", 2)],
        ledger.entries()
    );

    // Processing "lot"
    build_ledger_step(next_head, &mut ledger);
    assert_eq!(
        vec![("melot", 1), ("came", 1), ("lot", 2), ("ca", 2), ("me", 2)],
        ledger.entries()
    );
}

I had to extract the build_ledger_step function from build_ledger so that I could easily set up the test preconditions, and then I applied the fix to make the test pass.

pub fn build_ledger(source: &str) -> SubstringLedger {
    let mut ledger = SubstringLedger::new();
    let mut head: &str = source;

    while head.len() > 0 {
        head = build_ledger_step(head, &mut ledger);
    }
    ledger
}

fn build_ledger_step<'a>(head: &'a str, ledger: &mut SubstringLedger) -> &'a str {
    if let Some(next_char) = head.chars().next() {
        let next_head: &'a str;

        if let Some(substr_match) = ledger.find_longest_match(head) {
            let rest = &head[substr_match.len()..];

            if let Some(follow_up_match) = ledger.find_longest_match(rest) {
                let new_substring = substr_match.concat(&follow_up_match);
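                // The fix: advance only past the first match, so the follow-up
                // match itself gets a chance to merge with what comes next.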
                next_head = &head[substr_match.len()..];
                ledger.insert_new(new_substring);
            } else {
                next_head = rest;
            }

            ledger.increment_count(&substr_match);
        } else {
            let new_substring = Substring::from_char(next_char);
            next_head = &head[new_substring.len()..];
            ledger.insert_new(new_substring);
        }
        next_head
    } else {
        ""
    }
}

Next steps

Now, with the fix, the test passes, but I’m not very happy with the state of the code in that place. For one, build_ledger_step is overloaded with details. The second thing I noticed is an obvious inefficiency in the algorithm: we call find_longest_match twice for the same stretch of text. The first call finds the follow-up match, and then the next step calls it again for that very same stretch! As the next step, I’m going to refactor the code to make it more readable and efficient.

This version of the code is tagged in the GitHub repository.