Strings in Rust are UTF-8 encoded, so understanding the UTF-8 Encoding Algorithm is key to understanding how String and &str work in Rust.
What is UTF-8?
UTF-8 is a character encoding (a mapping between characters and numbers) created in 1993 to support the characters of all languages, not just English. Before UTF-8, ASCII was the norm in English-speaking countries, and UTF-8 is backwards compatible with ASCII.
ASCII supports 128 characters (the English alphabet, digits, and some punctuation), while UTF-8 supports 1,112,064 characters.
UTF-8 Encoding Algorithm
The designers of UTF-8 understood that a character spanning more than 1 byte of memory adds complexity, and therefore created an algorithm that programming languages can use to store characters as binary bytes in memory while dealing with this complexity.
Keeping in mind that a Rust String is a Vec<u8>, a UTF-8 byte sequence must make it clear where the sequence starts and how many bytes it encompasses. To do this, the UTF-8 Encoding Algorithm stores the number of bytes the character encompasses in the first byte. If there are more bytes, each of them starts with a binary prefix indicating it is a continuation of the UTF-8 character.
The binary value of a UTF-8 character (its Unicode code point) can be a maximum of 21 bits.
Let's have a look at what this means with two examples, a simple character and a more complex character:
Simple Character
- Character: !
- Name: EXCLAMATION MARK
- Hexadecimal: U+0021
- Binary: 100001
Rule: If a character's binary value fits in 7 bits or fewer, it is padded to 7 bits and prepended with 0. This looks like 0xxxxxxx.

In this case 100001 is padded to the 7-bit binary 0100001 and prepended with 0 to create the byte value 00100001.

The leading 0 indicates the UTF-8 character is only 1 byte long.
Note: Since UTF-8 is byte-based and designed to be compatible with ASCII, characters like a-z, A-Z, 0-9, '!', etc. are encoded as single-byte values, making UTF-8 very space-efficient for English text.
```rust
fn main() {
    let binary: u8 = 0b00100001;
    let character = char::from_u32(binary as u32).unwrap();
    println!("{}", character); // prints: !
}
```
Complex Character

- Character: 🙈
- Name: SEE-NO-EVIL MONKEY
- Hexadecimal: U+1F648
- Binary: 11111011001001000
- 21-bit binary: 000011111011001001000
Rules:

- If the UTF-8 byte sequence contains a total of:
  - 2 bytes, the first byte starts with 110 (110xxxxx)
  - 3 bytes, the first byte starts with 1110 (1110xxxx)
  - 4 bytes, the first byte starts with 11110 (11110xxx)
- Any following byte starts with 10 (10xxxxxx).
The "SEE-NO-EVIL MONKEY" character will take up 4 bytes once we run it through the UTF-8 Encoding Algorithm.
Byte | Bit Pattern | Number of Data Bits |
---|---|---|
1st byte | 11110xxx | 3 |
2nd byte | 10xxxxxx | 6 |
3rd byte | 10xxxxxx | 6 |
4th byte | 10xxxxxx | 6 |
We split the 21-bit binary into chunks: 3 bits for the first byte, then 6 bits each for the remaining three. After inserting those into the UTF-8 format with the proper prefixes, we get:
Byte | Bit Pattern | Number of Data Bits |
---|---|---|
1st byte | 11110000 | 3 |
2nd byte | 10011111 | 6 |
3rd byte | 10011001 | 6 |
4th byte | 10001000 | 6 |
To convert the binary value of "SEE-NO-EVIL MONKEY" in Rust:
```rust
fn main() {
    let binary: u32 = 0b000011111011001001000;
    let character = char::from_u32(binary).unwrap();
    println!("{}", character); // prints: 🙈
}
```
Note: char::from_u32(...) converts a Unicode code point, not raw UTF-8 bytes. The actual UTF-8 bytes are what you see when calling .as_bytes(). The code-point form is used here to keep the relationship between binary values and characters easy to follow.
You may wonder why 🙈 requires 4 bytes and not 3. The answer lies in its code point: U+1F648 (which is 128584 in decimal). UTF-8 uses 3 bytes only for values up to U+FFFF (65535). Since this character is above that range, it falls into the 4-byte category, which covers any code point from U+10000 to U+10FFFF.
Bonus section
More about bytes and chars
While a String in Rust is a UTF-8 encoded Vec<u8>, Rust automatically prints it as valid UTF-8. We can get hold of the bytes or characters by converting the string using the built-in methods .as_bytes() and .chars().
To print the byte values of the "SEE-NO-EVIL MONKEY", we can do the following:
```rust
fn main() {
    let character = "🙈";
    // {:08b} pads each byte to 8 binary digits ({:0b} alone would not pad).
    for byte in character.as_bytes().iter().map(|b| format!("{:08b}", b)) {
        println!("{}", byte);
    }
    // Prints:
    //
    // 11110000
    // 10011111
    // 10011001
    // 10001000
}
```
And to print the UTF-8 characters of a string, we can do the following:
```rust
fn main() {
    let value = "🙈!";
    for c in value.chars() {
        println!("{}", c);
    }
    // Prints:
    //
    // 🙈
    // !
}
```
UTF-8 Encoding Ranges
Code Point Range | Byte Length | Encoding Format |
---|---|---|
U+0000 to U+007F | 1 byte | 0xxxxxxx |
U+0080 to U+07FF | 2 bytes | 110xxxxx 10xxxxxx |
U+0800 to U+FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
U+10000 to U+10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Have a look at UTF-8 Visualizer to get detailed byte and hexadecimal information!
Conclusion
I hope this was helpful in understanding what is going on in the computer's memory when storing UTF-8 strings. The UTF-8 Encoding Algorithm is an elegant solution for storing characters in a sequence of bytes, a Vec<u8> in the case of Rust.