At the beginning of this project, I implemented a simple parser for .torrent files. It was an interesting exercise to get familiar with the bencoding format. However, there is already an implementation for parsing bencoded data that comes as an extension to Serde, a popular Rust serialization library. I think it is a good opportunity to get familiar with this library and switch to using Serde for working with bencoded data.

Version 0.0.11 on GitHub

What is Serde?

In real-world programming, the need to serialize and deserialize data to and from various text and binary formats comes up very often. Reading configuration files, passing data as JSON in API requests: all of that requires us programmers to represent data in various text or binary formats. At the same time, data serialization is probably one of the most tedious and boring tasks for a programmer to work on. No wonder there’s a multitude of libraries in most programming languages that help developers simplify this task.

Normally, a library for serialization/deserialization helps you convert internal data structures, such as objects or structs, to their serialized representation in a particular format without the need to write code by hand. These libraries can use runtime reflection or other kinds of meta-information to explore the structure of a data object and automatically generate the serialized data.

In Rust, a popular library to help programmers with serialization/deserialization is Serde. Unlike libraries that rely on runtime reflection, Serde uses Rust’s trait system and macros to generate the serialization code at compile time. Another remarkable feature of Serde is that it’s not bound to any particular data format. Instead, Serde defines its own internal data model that provides a level of abstraction between data types and serialization formats. Support for specific data formats is outsourced to third-party implementations that come in separate crates. This separation makes Serde very extensible: to add support for a new data format, anyone can write an implementation in a separate crate, without the need to make changes to the Serde library itself.

Also, Serde gives developers the mechanisms to plug into the serialization/deserialization process to customize the process for their specific needs. There’s a bunch of attributes out of the box to customize generated serialization code for common scenarios. If they are still not enough, a developer can write their own fully custom serialization logic.

Reading a torrent file

We begin by defining the data types that will hold the parsed torrent data:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct Torrent {
    pub announce: String,
    pub info: Info,
}

#[derive(Debug, Deserialize)]
pub struct Info {
    pub name: String,
    pub length: u64,
    #[serde(rename = "piece length")]       // [1]
    pub piece_length: u64,
    #[serde(with = "serde_bytes")]          // [2]
    pub pieces: Vec<u8>,
}

We annotate these data types with the #[derive(Deserialize)] attribute. That provides us with a sensible implementation of Serde’s Deserialize trait. We have to do a few customizations here, though.

First, the piece_length field. In the torrent file, the name of this field contains a space: piece length. Obviously, it can’t be used directly as a field name in Rust because of the space character. To work around that, we use the #[serde(rename = "piece length")] field attribute to tell Serde that the key piece length in the serialized data maps to the field piece_length in our Info struct.

The second trick is the pieces field. In the torrent file, it contains the concatenated SHA-1 hashes of all file pieces. The problem is that this is binary data. If we had pieces defined as a String, we’d get a runtime error saying that we’re trying to deserialize an invalid UTF-8 string. Luckily, we can get round this problem with the help of another crate, serde_bytes. This crate provides us with utilities to efficiently deserialize raw byte data into a Vec<u8>. We plug in this module by using the #[serde(with = "serde_bytes")] attribute on the pieces field.
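To see why a String would not work for pieces, here is a small standard-library sketch. The helper name as_text and the byte values are made up for illustration; they are not taken from a real torrent file.

```rust
/// Tries to interpret raw bytes as UTF-8 text, which is what a `String`
/// field would require of the `pieces` value.
fn as_text(raw: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw.to_vec())
}

/// Made-up bytes standing in for the start of a SHA-1 digest.
/// The byte 0xFF never occurs in valid UTF-8, so this cannot become a `String`.
const FAKE_DIGEST_PREFIX: [u8; 4] = [0xFF, 0xD8, 0x00, 0x9A];
```

Calling as_text(&FAKE_DIGEST_PREFIX) returns an error, while a plain ASCII name converts without trouble. Hash digests are arbitrary bytes, so they have to stay in a Vec<u8>.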

Handling deserialization

Now that we have a Torrent struct that implements the Deserialize trait, we can invoke the Deserialize::deserialize() method:

let mut d: Deserializer<_> = ...;                       // ??? 
let torrent = Torrent::deserialize(&mut d).unwrap();

But hold on a second. The deserialize() method requires an implementation of the Deserializer trait to be passed in as an argument. Where does that implementation come from?

This is where we see the separation of responsibilities between Serde and specific data format implementations. You see, by itself Serde knows nothing about data formats. It works exclusively with the Deserializer trait to do its part of the job: provide a link between a Rust data type and the deserializer. It is the job of a specific implementor of the Deserializer trait to handle the pesky details of parsing data in a specific format. In other words, the Deserializer trait provides an abstract architectural boundary between the Serde core and the data parser implementation.
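The boundary can be illustrated with a toy model. This is only a sketch: MiniDeserializer, MiniDeserialize, BencodeInt and PlainText are hypothetical names invented here, and Serde’s real traits are far richer. The point is that PieceLength is written once against the trait and works with either format.

```rust
/// Format side (hypothetical stand-in for Serde's `Deserializer`):
/// knows how to pull one unsigned integer out of its input.
trait MiniDeserializer {
    fn read_u64(&mut self) -> Result<u64, String>;
}

/// Type side (hypothetical stand-in for Serde's `Deserialize`):
/// builds itself from any format without knowing which one it is.
trait MiniDeserialize: Sized {
    fn deserialize<D: MiniDeserializer>(d: &mut D) -> Result<Self, String>;
}

#[derive(Debug, PartialEq)]
struct PieceLength(u64);

impl MiniDeserialize for PieceLength {
    fn deserialize<D: MiniDeserializer>(d: &mut D) -> Result<Self, String> {
        d.read_u64().map(PieceLength)
    }
}

/// Format 1: a bencoded integer such as `i16384e`.
struct BencodeInt<'a>(&'a str);

impl<'a> MiniDeserializer for BencodeInt<'a> {
    fn read_u64(&mut self) -> Result<u64, String> {
        let digits = self
            .0
            .strip_prefix('i')
            .and_then(|rest| rest.strip_suffix('e'))
            .ok_or_else(|| format!("not a bencoded integer: {}", self.0))?;
        digits.parse::<u64>().map_err(|e| e.to_string())
    }
}

/// Format 2: plain decimal text such as `16384`.
struct PlainText<'a>(&'a str);

impl<'a> MiniDeserializer for PlainText<'a> {
    fn read_u64(&mut self) -> Result<u64, String> {
        self.0.trim().parse::<u64>().map_err(|e| e.to_string())
    }
}
```

Both PieceLength::deserialize(&mut BencodeInt("i16384e")) and PieceLength::deserialize(&mut PlainText("16384")) produce the same value, even though the type-side code never mentions either format.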

To provide an implementation of the Deserializer trait that knows how to parse bencoded data, we need another crate, serde_bencode. Having added this crate to the project’s dependencies, we are now able to read and parse torrent files:

let f = File::open(TORRENT_FILE).unwrap();
let mut d = serde_bencode::Deserializer::new(f);
let torrent = Torrent::deserialize(&mut d).unwrap();

Or, using a utility function serde_bencode::from_bytes(), we can skip the details:

let content = std::fs::read(TORRENT_FILE).unwrap();
let torrent: Torrent = serde_bencode::from_bytes(&content).unwrap();

Bingo! With just a few lines of code, we have a complete implementation of a torrent file parser.

Custom deserialization: Visitor pattern

Our Torrent type is practically ready to use, but there’s one improvement we can make. You see, the pieces field in the Info struct is not very convenient to use yet. As you remember, this field contains the SHA-1 hashes of each piece, all concatenated into a single giant binary blob. It would be much more convenient if we could split that blob into a vector of individual values during deserialization. To achieve that, we can implement a custom deserialization process for that field.

It requires a bit of boilerplate code, so bear with me:


use serde::de::Visitor;
use serde::{Deserialize, Deserializer};

// ---- [1] -- Declare a wrapper Hashes type

#[derive(Debug)]
pub struct Hashes(Vec<Sha1>);

#[derive(Debug, Deserialize)]
pub struct Info {
    ... // other Info fields from above
    pub pieces: Hashes
}

// ---- [2] -- Implement a custom visitor for Hashes 

struct HashesVisitor;

impl<'de> Visitor<'de> for HashesVisitor {
    type Value = Hashes;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        formatter.write_str("a list of SHA1 hashes")
    }

    fn visit_bytes<E>(self, v: &[u8]) -> Result<Self::Value, E>
    where
        E: serde::de::Error,
    {
        let hashes = v.chunks_exact(20).map(Sha1::from_bytes).collect::<Vec<_>>();
        Ok(Hashes(hashes))
    }
}

// ---- [3] -- Plug HashesVisitor into deserializer

impl<'de> Deserialize<'de> for Hashes {
    fn deserialize<D>(deserializer: D) -> Result<Hashes, D::Error>
    where
        D: Deserializer<'de>,
    {
        deserializer.deserialize_bytes(HashesVisitor)
    }
}

First, we introduce a wrapper type, Hashes, over Vec<Sha1>, which will carry the custom deserialization logic.

Second, we need an implementation of the Visitor trait. This is where the custom logic resides. The Visitor trait contains a lot of methods, one for each supported data type. Fortunately, we don’t need to implement all of them. The default implementations return errors that indicate that this particular data type is not supported, which is exactly what we need. For example, when we expect to deserialize the list of hashes, but the deserializer encounters an integer value instead, the default implementation raises an error, and that’s what we want.
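The "default methods reject everything" idea can be sketched with a cut-down trait. MiniVisitor and ByteCount are hypothetical names made up for this illustration, and the error strings only loosely imitate Serde’s wording.

```rust
/// Hypothetical, cut-down imitation of Serde's `Visitor`: every `visit_*`
/// method has a default body that rejects the value, so an implementor
/// overrides only the cases it actually accepts.
trait MiniVisitor: Sized {
    type Value;

    /// Describes what this visitor wanted, for use in error messages.
    fn expecting(&self) -> String;

    fn visit_i64(self, v: i64) -> Result<Self::Value, String> {
        Err(format!("invalid type: integer {}, expected {}", v, self.expecting()))
    }

    fn visit_bytes(self, v: &[u8]) -> Result<Self::Value, String> {
        Err(format!("invalid type: {} bytes, expected {}", v.len(), self.expecting()))
    }
}

/// A visitor that accepts byte strings and reports their length.
struct ByteCount;

impl MiniVisitor for ByteCount {
    type Value = usize;

    fn expecting(&self) -> String {
        "a byte string".to_string()
    }

    // Only byte strings are handled; integers fall through to the default error.
    fn visit_bytes(self, v: &[u8]) -> Result<usize, String> {
        Ok(v.len())
    }
}
```

ByteCount.visit_bytes(b"abc") succeeds, while ByteCount.visit_i64(7) returns the default error without ByteCount having written any code for integers.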

The method that we have to implement is Visitor::visit_bytes(), which the deserializer calls when it encounters a byte array in the input. The implementation takes a slice of u8 values, splits it into chunks of 20 bytes, which is the length of a SHA-1 hash, and converts these chunks into a vector of Sha1 values. Note that chunks_exact() silently skips any trailing bytes that don’t fill a whole chunk, so a malformed blob is not reported as an error here.
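If a stricter check is wanted, the splitting step could look like the sketch below. It is not the code used in this project: split_digests is a hypothetical helper, and plain [u8; 20] arrays stand in for the Sha1 type. It rejects a blob whose length is not a multiple of 20 instead of dropping the leftover bytes.

```rust
/// Length of a SHA-1 digest in bytes.
const SHA1_LEN: usize = 20;

/// Splits a concatenated blob into 20-byte digests.
/// Returns an error if trailing bytes would be left over.
fn split_digests(blob: &[u8]) -> Result<Vec<[u8; SHA1_LEN]>, String> {
    let chunks = blob.chunks_exact(SHA1_LEN);
    // `remainder()` exposes the bytes that `chunks_exact` would otherwise skip.
    if !chunks.remainder().is_empty() {
        return Err(format!(
            "pieces length {} is not a multiple of {}",
            blob.len(),
            SHA1_LEN
        ));
    }
    Ok(chunks
        .map(|chunk| {
            let mut digest = [0u8; SHA1_LEN];
            digest.copy_from_slice(chunk);
            digest
        })
        .collect())
}
```

Inside a real Visitor, the error branch could be reported through the serde::de::Error type parameter rather than a String.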

Finally, we need to implement the Deserialize trait on the Hashes type, to plug our custom visitor into the deserialization process. The implementation is trivial: we simply call deserialize_bytes() on the deserializer, and pass our HashesVisitor as an argument.

Custom deserialization with internal type

There’s one last thing to implement: calculating the SHA-1 hash of the entire info section of the torrent file. As you may remember, this value is required in two interactions:

  • When requesting the list of peers from the torrent tracker;
  • When making the initial handshake to a peer.

Essentially, the SHA-1 value of the info section acts as a unique identifier of the file we’re downloading. I would like to calculate that value once when reading the torrent file and store it as a field in the Info struct:

#[derive(Deserialize)]
pub struct Info {
    pub sha1: Sha1,
    pub name: String,
    pub piece_length: u32,
    pub length: usize,
    pub pieces: Vec<Sha1>,
}

Unfortunately, there’s no easy way to calculate this value using the extension points that Serde provides to us. You see, to be able to calculate SHA-1, we need access to the raw byte representation of the info section. However, with the abstractions that Serde gives us, we are completely isolated from the low-level data bytes: it’s all hidden behind the Deserializer abstraction. There’s no way for us to get our hands on the raw binary data.

This is the flip side of the abstraction coin. On one hand, we are spared from dealing with the pesky low-level details of binary data representation. On the other hand, we lose the ability to do something specific when access to that representation is really needed.

It seems that the only way to calculate the SHA-1 hash of the info section is to serialize it back into bytes first, and hope that the serialized byte array will be exactly the same as the one we read from the torrent file. That holds only as long as our data type captures every key present in the info dictionary: any key we don’t model is ignored when reading and therefore missing when writing, which changes the hash. Luckily, Serde will help us with serialization, and there’s a way to plug this code into the overall deserialization process, with the use of an intermediate data type that holds the raw data from the info section and can be converted to an instance of the Info struct.
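The round trip can be sketched with a tiny hand-written bencode encoder. This is a standard-library illustration, not how serde_bencode is implemented; Value, encode_dict and encode_known_fields are hypothetical names. It relies on the fact that bencode dictionaries store their keys in sorted order, so re-encoding the same four fields yields the same bytes.

```rust
use std::collections::BTreeMap;

/// The two bencode value kinds this sketch needs.
enum Value {
    Int(u64),
    Bytes(Vec<u8>),
}

/// Encodes a dictionary in bencode. `BTreeMap` iterates in sorted key order,
/// which matches the key ordering that bencode requires.
fn encode_dict(dict: &BTreeMap<&str, Value>) -> Vec<u8> {
    let mut out = vec![b'd'];
    for (key, value) in dict {
        // Keys are byte strings: "<length>:<contents>".
        out.extend_from_slice(format!("{}:{}", key.len(), key).as_bytes());
        match value {
            Value::Int(n) => out.extend_from_slice(format!("i{}e", n).as_bytes()),
            Value::Bytes(bytes) => {
                out.extend_from_slice(format!("{}:", bytes.len()).as_bytes());
                out.extend_from_slice(bytes);
            }
        }
    }
    out.push(b'e');
    out
}

/// Re-encodes only the four fields that a single-file info struct models.
fn encode_known_fields(name: &str, length: u64, piece_length: u64, pieces: &[u8]) -> Vec<u8> {
    let mut dict = BTreeMap::new();
    dict.insert("name", Value::Bytes(name.as_bytes().to_vec()));
    dict.insert("length", Value::Int(length));
    dict.insert("piece length", Value::Int(piece_length));
    dict.insert("pieces", Value::Bytes(pieces.to_vec()));
    encode_dict(&dict)
}
```

If the original info dictionary contained exactly these four keys, the output matches it byte for byte. If it also contained a key the struct does not model, the output is shorter than the original and its SHA-1 would differ.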

We begin by declaring the internal data type, InfoInternal:

#[derive(Deserialize, Serialize)]
struct InfoInternal {
    pub name: String,
    #[serde(rename = "piece length")]
    pub piece_length: u32,
    pub length: usize,
    #[serde(with = "serde_bytes")]
    pub pieces: Vec<u8>,
}

Nothing surprising here: it has the same shape as our original Info type, and maps directly to the contents of the info block in the torrent file. Notice, however, that we also need to derive Serde’s Serialize trait. This is because we’ll need to serialize that data structure back to the bencoded format to calculate its SHA-1 hash.

Next, we need a procedure to create an instance of Info struct from InfoInternal. We do that by implementing the TryFrom trait for Info:

impl TryFrom<InfoInternal> for Info {
    type Error = Error;

    fn try_from(info_internal: InfoInternal) -> Result<Info, Self::Error> {
        let sha1 = Sha1::calculate(&serde_bencode::to_bytes(&info_internal)?);
        let pieces = info_internal
            .pieces
            .chunks_exact(20)
            .map(Sha1::from_bytes)
            .collect::<Vec<_>>();

        Ok(Self {
            name: info_internal.name,
            piece_length: info_internal.piece_length,
            length: info_internal.length,
            pieces,
            sha1,
        })
    }
}

We combine all the custom logic inside the try_from() method. First, we serialize the instance of InfoInternal back to a bencoded byte array, and calculate the SHA-1 of its contents. Second, we convert its pieces byte array into a vector of Sha1 values. Previously, we did that by means of a custom Visitor implementation, but we don’t need that approach anymore.

Finally, we can instruct Serde to use InfoInternal during the deserialization process, and convert it into the Info struct automatically, by yet another use of the #[serde] attribute:

#[derive(Deserialize)]
#[serde(try_from = "InfoInternal")]
pub struct Info {
    pub sha1: Sha1,
    pub name: String,
    pub piece_length: u32,
    pub length: usize,
    pub pieces: Vec<Sha1>,
}

#[serde(try_from = "InfoInternal")] instructs Serde to use our intermediate InfoInternal struct to do the low-level deserialization, and then create the resulting Info instance with the help of TryFrom<InfoInternal> that we’ve just defined. No need for any additional coding to do the conversion! All the details will be handled by Serde, and the code for reading the torrent file becomes trivial:

pub fn read_torrent_file() -> Result<Torrent, Error> {
    let contents = fs::read(TORRENT_FILE)?;
    let decoded = serde_bencode::from_bytes(&contents)?;
    Ok(decoded)
}

What we’ve done

It’s been a lengthy programming session with a lot of new information. Let’s recap what we’ve done:

  • We’ve created data structures to contain the data from the torrent file: Torrent and Info. We annotated them with Serde attributes to handle the deserialization from bencoded format;
  • We’ve explored how we can customize the deserialization process using the Visitor pattern built into Serde. We ended up not using it after all, but nonetheless it was a useful exercise in getting to know the mechanisms of deserialization;
  • We’ve created an internal helper type InfoInternal to do some custom work during deserialization: calculate the SHA-1 hash of the info block and split the pieces blob into individual piece hash values.

This improvement significantly reduced the amount of code written by hand. Now I can delete my own implementation of parsing the bencoded format. As a result, with the help of Serde, all the work to read torrent files is accomplished by a small piece of Rust code.

Well done!