When working with strings in Rust, understanding the difference between bytes, characters, and grapheme clusters is essential for writing correct text-processing code. Many programming bugs stem from treating strings as simple arrays of characters, when in reality UTF-8 strings have a much richer structure.
In Rust, a String is a sequence of UTF-8 encoded bytes. There are three ways to view string data:
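Bytes (.bytes()) - The raw UTF-8 bytes. ASCII characters take 1 byte; all other Unicode characters take 2-4 bytes.
Characters (.chars()) - Unicode scalar values. Most characters are a single char, but some displayed characters (like emojis with modifiers) consist of multiple chars.
Grapheme clusters - What a reader perceives as a single character, possibly made up of several chars. The standard library has no method for iterating over these.

let text = "Hello";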
assert_eq!(text.len(), 5); // 5 bytes
assert_eq!(text.chars().count(), 5); // 5 characters
// Family emoji (ZWJ sequence): man, woman and girl joined by U+200D
let emoji = "👨\u{200D}👩\u{200D}👧";
// 18 bytes!
assert_eq!(emoji.len(), 18);
// 5 Unicode scalars
assert_eq!(emoji.chars().count(), 5);
// But visually it's 1 "character"
// (grapheme cluster)

String slicing in Rust must occur at valid UTF-8 character boundaries. Slicing in the middle of a multi-byte character causes a panic:
// Russian "Hello"
let text = "Здравствуйте";
// text[0..1] would panic! 'З' is 2 bytes
// OK - takes full first character
let slice = &text[0..2];
assert_eq!(slice, "З");

To safely extract substrings, work with character positions via .chars(), or use .char_indices() to find the byte offsets of character boundaries:
let text = "Здравствуйте";
let chars: Vec<char> = text.chars().collect();
let first_two: String = chars[..2].iter().collect();
assert_eq!(first_two, "Зд");

Implement the following functions to demonstrate Unicode-aware string handling:
char_count(s: &str) -> usize - Count the number of Unicode characters (not bytes)
byte_count(s: &str) -> usize - Count the number of bytes in the UTF-8 encoding
safe_substring(s: &str, start: usize, end: usize) -> Option<String> - Extract a substring by character indices (not byte indices). Return None if the indices are out of bounds or the range is invalid (start greater than end).
char_at(s: &str, index: usize) -> Option<char> - Get the character at a specific index (by character position, not byte position)
is_single_char(s: &str) -> bool - Check if a string contains exactly one Unicode character

// char_count
assert_eq!(char_count("Hello"), 5);
assert_eq!(char_count("Привет"), 6); // Russian "Hello"
assert_eq!(char_count("你好"), 2); // Chinese "Hello"
assert_eq!(char_count("🎉"), 1); // Single emoji
// byte_count
// ASCII: 1 byte each
assert_eq!(byte_count("Hello"), 5);
// Cyrillic: 2 bytes each
assert_eq!(byte_count("Привет"), 12);
// Chinese: 3 bytes each
assert_eq!(byte_count("你好"), 6);
// Emoji: 4 bytes
assert_eq!(byte_count("🎉"), 4);
// safe_substring
assert_eq!(
safe_substring("Hello", 0, 3),
Some("Hel".to_string())
);
assert_eq!(
safe_substring("Привет", 0, 2),
Some("Пр".to_string())
);
assert_eq!(
safe_substring("Hello", 0, 10),
None
); // Out of bounds
assert_eq!(
safe_substring("Hello", 3, 2),
None
); // Invalid range
// char_at
assert_eq!(char_at("Hello", 0), Some('H'));
assert_eq!(char_at("Привет", 2), Some('и'));
assert_eq!(char_at("Hello", 10), None);
// is_single_char
assert_eq!(is_single_char("a"), true);
assert_eq!(is_single_char("好"), true);
assert_eq!(is_single_char("🎉"), true);
assert_eq!(is_single_char("ab"), false);
assert_eq!(is_single_char(""), false);

Use .chars().count() to count Unicode characters, not .len(), which counts bytes
Use .len() or .bytes().count() for the byte count
For safe_substring, collect characters into a Vec<char> first, then validate indices before extracting
Use .chars().nth(index) to get a character at a specific position
For is_single_char, check that the char count equals exactly 1
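
The hints above can be put together as follows. This is a minimal sketch of one possible solution, not a reference answer. The function names and signatures come from the exercise. Treating `end` as exclusive is inferred from the examples, and accepting an empty range (`start == end`) as `Some("")` is an assumption the exercise does not spell out.

```rust
/// Counts Unicode scalar values (Rust `char`s), not bytes.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// Counts the bytes of the UTF-8 encoding.
fn byte_count(s: &str) -> usize {
    s.len()
}

/// Extracts the characters in `start..end` (character indices, `end` exclusive).
/// Assumption: an in-bounds empty range (`start == end`) yields `Some("")`.
fn safe_substring(s: &str, start: usize, end: usize) -> Option<String> {
    let chars: Vec<char> = s.chars().collect();
    // Reject reversed ranges and ranges that run past the last character.
    if start > end || end > chars.len() {
        return None;
    }
    Some(chars[start..end].iter().collect())
}

/// Returns the character at a character position, or None if out of range.
fn char_at(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

/// True if the string holds exactly one Unicode scalar value.
fn is_single_char(s: &str) -> bool {
    s.chars().count() == 1
}

fn main() {
    assert_eq!(char_count("Привет"), 6);
    assert_eq!(byte_count("Привет"), 12);
    assert_eq!(safe_substring("Привет", 0, 2), Some("Пр".to_string()));
    assert_eq!(safe_substring("Hello", 3, 2), None);
    assert_eq!(char_at("Привет", 2), Some('и'));
    assert!(is_single_char("🎉"));
    assert!(!is_single_char(""));
}
```

Collecting into a Vec<char> keeps the bounds check simple at the cost of one allocation per call. Note that is_single_char counts Unicode scalar values, so a multi-scalar grapheme cluster such as a ZWJ emoji sequence is not a single character by this definition.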