CSS-Plus
Rooted in CSS, since branched out

Understanding The UTF-8 Encoding Algorithm in Rust

April 05, 2025

Strings in Rust are always UTF-8 encoded, so understanding the UTF-8 encoding algorithm is key to understanding how String and &str work in Rust.

What is UTF-8?

UTF-8 is a character encoding: a way of storing Unicode code points (the numbers assigned to characters) as bytes. It was designed in 1992 and first presented publicly in 1993 to support characters from all languages, not just English. Before UTF-8, ASCII was the common encoding in English-speaking countries. UTF-8 is backwards compatible with ASCII.

ASCII supports 128 characters (the English alphabet, digits, punctuation and control characters), while UTF-8 can encode 1,112,064 code points.

UTF-8 Encoding Algorithm

A character that spans more than 1 byte of memory adds complexity. UTF-8, designed by Ken Thompson and Rob Pike and later adopted into the Unicode Standard, defines an algorithm that programming languages use to store characters as bytes in memory and deal with this complexity.

Keeping in mind that a Rust String is backed by a Vec<u8>, it must be clear from the bytes alone where each character's byte sequence starts and how many bytes it spans.

To do this, the UTF-8 Encoding Algorithm stores the number of bytes the character spans in the first byte. If there are more bytes, each of them starts with a binary prefix indicating that it's a continuation of the same UTF-8 character.

The code point of a UTF-8 character needs at most 21 bits.
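To make the role of the first byte concrete, here is a minimal sketch that reads the sequence length from the leading bits. The function name sequence_length is made up for this post, and it only checks the bit prefix; a real validator rejects a few more byte values.

```rust
/// Returns how many bytes a UTF-8 sequence spans, judged only by its first byte.
/// Hypothetical helper: returns None for a continuation byte (10xxxxxx)
/// or for a prefix that never starts a sequence (11111xxx).
fn sequence_length(first_byte: u8) -> Option<usize> {
    if first_byte & 0b1000_0000 == 0b0000_0000 {
        Some(1) // 0xxxxxxx
    } else if first_byte & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 110xxxxx
    } else if first_byte & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 1110xxxx
    } else if first_byte & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 11110xxx
    } else {
        None
    }
}

fn main() {
    println!("{:?}", sequence_length(0b0010_0001)); // prints: Some(1)
    println!("{:?}", sequence_length(0b1111_0000)); // prints: Some(4)
    println!("{:?}", sequence_length(0b1001_1111)); // prints: None
}
```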

Let's have a look at what this looks like with 2 examples, a simple character and a more complex character:

Simple Character

  • Character: !
  • Name: EXCLAMATION MARK
  • hexadecimal: U+0021
  • Binary: 100001

Rule: If a character's code point fits in 7 bits or fewer, pad it to 7 bits and prepend a 0. This looks like 0xxxxxxx.

In this case 100001 is padded to the 7-bit value 0100001 and prepended with 0 to create the byte value 00100001.

The leading 0 indicates the UTF-8 character is only 1 byte long.

Note: Since UTF-8 is byte-based and designed to be compatible with ASCII, characters such as a-z, A-Z, 0-9 and '!' are encoded as single-byte values, which makes UTF-8 very space-efficient for English text.

fn main() {
    let binary: u8 = 0b00100001;
    let character = char::from_u32(binary as u32).unwrap();

    println!("{}", character); // prints: !
}
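The snippet above goes from a number to a character. To see the byte that actually ends up in memory, we can go the other way with the standard char::encode_utf8 method. The helper name utf8_bytes is only for illustration.

```rust
/// Returns the UTF-8 bytes that Rust stores for a single character.
/// Hypothetical helper built on the standard `char::encode_utf8`.
fn utf8_bytes(character: char) -> Vec<u8> {
    let mut buffer = [0u8; 4]; // a char never needs more than 4 bytes
    character.encode_utf8(&mut buffer).as_bytes().to_vec()
}

fn main() {
    for byte in utf8_bytes('!') {
        println!("{:08b}", byte); // prints: 00100001
    }
}
```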

Complex Character

  • Character: 🙈
  • Name: SEE-NO-EVIL MONKEY
  • hexadecimal: U+1F648
  • Binary: 11111011001001000
  • 21-bit binary: 000011111011001001000

Rules:

  1. If the UTF-8 byte sequence contains a total of:
    • 2 bytes, the first byte starts with 110 (110xxxxx)
    • 3 bytes, the first byte starts with 1110 (1110xxxx)
    • 4 bytes, the first byte starts with 11110 (11110xxx)
  2. Every following byte starts with 10 (10xxxxxx).
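A nice consequence of rule 2 is that every byte tells us on its own whether it starts a character or continues one. As a rough sketch (the function names are invented for this post, and the input is assumed to be valid UTF-8), we can count characters by counting the bytes that are not continuation bytes:

```rust
/// True when the byte matches 10xxxxxx, i.e. it continues a multi-byte sequence.
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}

/// Counts characters by counting the bytes that start a sequence.
/// Assumes `bytes` holds valid UTF-8.
fn count_characters(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&byte| !is_continuation(byte)).count()
}

fn main() {
    let text = "🙈!";
    println!("{}", text.len()); // prints: 5
    println!("{}", count_characters(text.as_bytes())); // prints: 2
}
```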

The example of the "SEE-NO-EVIL MONKEY" will take up 4 bytes once we run it through the UTF-8 encoding algorithm.

Byte      Bit Pattern  Number of Data Bits
1st byte  11110xxx     3
2nd byte  10xxxxxx     6
3rd byte  10xxxxxx     6
4th byte  10xxxxxx     6

We split the 21-bit binary into chunks: 3 bits for the first byte, then 6 bits each for the remaining three. After inserting those into the UTF-8 format with the proper prefixes, we get:

Byte      Bit Pattern  Number of Data Bits
1st byte  11110000     3
2nd byte  10011111     6
3rd byte  10011001     6
4th byte  10001000     6
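The split into 3 + 6 + 6 + 6 bits can be written out by hand with shifts and masks. This is only a sketch of the 4-byte case: the name encode_four_bytes is made up here, and the function does not check that the code point really lies between U+10000 and U+10FFFF.

```rust
/// Encodes a code point as four UTF-8 bytes by hand.
/// Hypothetical helper: only meaningful for U+10000 to U+10FFFF, no range check.
fn encode_four_bytes(code_point: u32) -> [u8; 4] {
    [
        0b1111_0000 | ((code_point >> 18) & 0b0000_0111) as u8, // top 3 bits
        0b1000_0000 | ((code_point >> 12) & 0b0011_1111) as u8, // next 6 bits
        0b1000_0000 | ((code_point >> 6) & 0b0011_1111) as u8,  // next 6 bits
        0b1000_0000 | (code_point & 0b0011_1111) as u8,         // last 6 bits
    ]
}

fn main() {
    for byte in encode_four_bytes(0x1F648) {
        println!("{:08b}", byte);
    }
    // Prints:
    //
    // 11110000
    // 10011111
    // 10011001
    // 10001000
}
```

In real code there is no need to do this by hand, since Rust's own string types already store exactly these bytes.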

To convert the binary code point of "SEE-NO-EVIL MONKEY" to a character in Rust:

fn main() {
    let binary: u32 = 0b000011111011001001000;
    let character = char::from_u32(binary).unwrap();

    println!("{}", character); // prints: 🙈
}

Note: char::from_u32(...) converts a Unicode code point, not raw UTF-8 bytes. The actual UTF-8 bytes are what you see when calling .as_bytes(). It is used here only to show how a binary value maps to a character.

You may wonder why 🙈 requires 4 bytes and not 3. The answer lies in its code point: U+1F648 (which is 128584 in decimal). UTF-8 uses 3 bytes only for values up to U+FFFF (65535). Since this character is above that range, it falls into the 4-byte category, which is defined for any code point from U+10000 to U+10FFFF.
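We can check these boundaries with the standard char::len_utf8 method. The wrapper byte_length below is a hypothetical helper for this post; it returns None when the number is not a valid character at all.

```rust
/// Returns how many UTF-8 bytes a code point needs,
/// or None if the value is not a valid Unicode scalar value.
fn byte_length(code_point: u32) -> Option<usize> {
    char::from_u32(code_point).map(|c| c.len_utf8())
}

fn main() {
    for code_point in [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x1F648] {
        println!("U+{:04X}: {:?}", code_point, byte_length(code_point));
    }
    // Prints:
    //
    // U+007F: Some(1)
    // U+0080: Some(2)
    // U+07FF: Some(2)
    // U+0800: Some(3)
    // U+FFFF: Some(3)
    // U+10000: Some(4)
    // U+1F648: Some(4)
}
```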

Bonus section

More about bytes and chars

While a String in Rust is a UTF-8 encoded Vec<u8>, printing it displays the characters rather than the raw bytes. We can get hold of the bytes or the characters using the built-in methods .as_bytes() and .chars().

To print the byte values of the "SEE-NO-EVIL MONKEY", we can do the following:

fn main() {
    let character = "🙈";
    for byte in character.as_bytes() {
        // {:08b} pads each byte to 8 binary digits
        println!("{:08b}", byte);
    }
    // Prints:
    //
    // 11110000
    // 10011111
    // 10011001
    // 10001000
}

And to print the UTF-8 characters of a string, we can do the following:

fn main() {
    let value = "🙈!";
    for c in value.chars() {
        println!("{}", c);
    }
    // Prints:
    //
    // 🙈
    // !
}
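Because characters have different byte lengths, a position in a string is a byte offset, not a character number. A short sketch using the standard char_indices and is_char_boundary methods shows where each character starts. The helper name character_offsets is just for this example.

```rust
/// Pairs every character with the byte offset where its sequence starts.
/// Hypothetical helper built on the standard `str::char_indices`.
fn character_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    let value = "🙈!";
    for (offset, c) in character_offsets(value) {
        println!("{} starts at byte {}", c, offset);
    }
    // Prints:
    //
    // 🙈 starts at byte 0
    // ! starts at byte 4

    // Byte 1 is in the middle of 🙈, byte 4 is where ! begins.
    println!("{}", value.is_char_boundary(1)); // prints: false
    println!("{}", value.is_char_boundary(4)); // prints: true
}
```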

UTF-8 Encoding Ranges

Code Point Range     Byte Length  Encoding Format
U+0000 to U+007F     1 byte       0xxxxxxx
U+0080 to U+07FF     2 bytes      110xxxxx 10xxxxxx
U+0800 to U+FFFF     3 bytes      1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF  4 bytes      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
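Reading the table in the other direction gives us a decoder. The following is a simplified sketch, not how Rust's standard library does it: decode_first is a made-up name, and it only checks prefixes and length, so it does not reject overlong encodings or surrogate code points the way a strict decoder would.

```rust
/// Decodes the first code point in `bytes` using the formats in the table.
/// Returns the code point and the number of bytes it used.
/// Hypothetical, simplified helper: no overlong or surrogate checks.
fn decode_first(bytes: &[u8]) -> Option<(u32, usize)> {
    let first = *bytes.first()?;
    // The first byte gives the length and the leading data bits.
    let (length, mut code_point) = if first & 0b1000_0000 == 0 {
        (1, first as u32)
    } else if first & 0b1110_0000 == 0b1100_0000 {
        (2, (first & 0b0001_1111) as u32)
    } else if first & 0b1111_0000 == 0b1110_0000 {
        (3, (first & 0b0000_1111) as u32)
    } else if first & 0b1111_1000 == 0b1111_0000 {
        (4, (first & 0b0000_0111) as u32)
    } else {
        return None; // a continuation byte cannot start a sequence
    };
    // Every following byte must look like 10xxxxxx and adds 6 data bits.
    for &byte in bytes.get(1..length)? {
        if byte & 0b1100_0000 != 0b1000_0000 {
            return None;
        }
        code_point = (code_point << 6) | (byte & 0b0011_1111) as u32;
    }
    Some((code_point, length))
}

fn main() {
    let (code_point, length) = decode_first("🙈!".as_bytes()).unwrap();
    println!("U+{:X} in {} bytes", code_point, length); // prints: U+1F648 in 4 bytes
}
```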

Have a look at UTF-8 Visualizer to get detailed byte and hexadecimal information!

Conclusion

I hope this was helpful in understanding what is going on in the computer's memory when storing UTF-8 strings. The UTF-8 Encoding Algorithm is an elegant solution to storing characters in a sequence of bytes, which in Rust is a Vec<u8>.