Severity
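CRITICAL - Blocks file_read on files with multi-byte UTF-8 content
- The worker panics whenever a preview boundary falls inside a multi-byte UTF-8 character, so affected files cannot be read
- Affects users whose files contain non-ASCII text (most languages other than English, plus emoji)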
- No workaround available for affected projects
Problem Description
The worker's file_read operation performs unsafe byte-level slicing on UTF-8 string content during preview generation. When the preview logic attempts to slice a string at arbitrary byte boundaries using Rust's byte slicing (&s[start..end]), it can split multi-byte UTF-8 characters in half.
What Happens
- UTF-8 Encoding Background: Cyrillic characters (and many others) use 2+ bytes per character in UTF-8:
  - ASCII characters: 1 byte
  - Cyrillic (Russian, Ukrainian, etc.): 2 bytes
  - Chinese/Japanese/Korean: typically 3 bytes
  - Emoji: typically 4 bytes
- The Bug: Preview generation code slices strings using byte indices:
  &s[start..end] // ❌ Unsafe! Can cut multi-byte characters
- The Panic: Rust's string slicing enforces UTF-8 validity:
  thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'а' (bytes 4..5) of `ау`'
  This panic occurs because the code attempts to slice at byte position 5, which is in the middle of a 2-byte Cyrillic character.
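The byte widths listed above can be checked with the standard library. The sketch below is illustrative only: `utf8_widths` and `char_boundaries` are hypothetical helper names, not part of the worker's code.

```rust
/// Returns the UTF-8 encoded width, in bytes, of each character in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Returns every byte offset at which `s` can be sliced without panicking.
/// Offset 0 and `s.len()` are always included.
fn char_boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}
```

For the string `ау` (two Cyrillic letters, four bytes), `char_boundaries` yields only 0, 2 and 4, so a slice that starts or ends at byte 1 or 3 panics.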
Reproduction Steps
Minimal Reproduction
# 1. Create a test file with Cyrillic content
echo "ау" > /tmp/cyrillic_test.txt
# 2. Attempt to read using worker's file_read
# (In worker context)
file_read(path="/tmp/cyrillic_test.txt", offset=1, limit=10)
# Result: Panic during preview generation
Step-by-Step
- Create a file containing multi-byte UTF-8 characters:
  echo "Привет, мир!" > test.txt
  # Or: echo "你好世界" > test.txt
  # Or: echo "😀" > test.txt
- Trigger file_read with offset/limit parameters (for preview generation):
  // In worker code
  file_read({
    path: "test.txt",
    offset: 1, // Triggers preview path
    limit: 50
  })
- Observe the panic (N is the requested index, A..B the byte range of the character it falls inside):
  panicked at 'byte index N is not a char boundary; it is inside 'X' (bytes A..B) of `...`'
Current Impact
Affected Users
- ❌ All Russian-speaking users
- ❌ All Ukrainian, Bulgarian, Serbian users (Cyrillic script)
- ❌ All Chinese, Japanese, Korean users
- ❌ All users using emoji in files
- ❌ Any project with non-ASCII UTF-8 content
Blocked Operations
- ❌ Reading source code with non-ASCII comments
- ❌ Reading documentation in international languages
- ❌ Reading configuration files with Unicode content
- ❌ Reading any text files containing emoji
- ❌ Preview generation for files with multi-byte characters
Scope
- Geographic: Affects users worldwide (most non-English languages)
- Functional: Complete blocker for file operations on affected projects
- Data Loss Risk: None (panic doesn't corrupt data), but prevents access
Technical Root Cause
The Problem Code Pattern
The preview generation logic uses byte-level slicing:
// ❌ Current implementation (unsafe)
fn generate_preview(content: &str, start: usize, end: usize) -> &str {
    &content[start..end] // PANIC if start/end split a multi-byte char
}
Why This Fails
UTF-8 encoding uses variable-length characters:
Character        Bytes (hex)            Byte indices
----------------------------------------------------
'a'              0x61                   0
'а' (Cyrillic)   0xD0 0xB0              0-1
'你' (Chinese)   0xE4 0xBD 0xA0         0-2
'😀' (Emoji)     0xF0 0x9F 0x98 0x80    0-3
When slicing &s[5..10] on a string like "Привет":
- Byte 5 is in the middle of the Cyrillic 'и' (bytes 4-5)
- Rust validates the slice boundary and panics
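The same boundary validation can be observed without crashing the process. A minimal sketch, assuming a hypothetical helper named `try_slice`: it relies on the standard `str::get`, which performs the check that `&s[start..end]` performs but returns `None` instead of panicking.

```rust
/// Hypothetical helper: returns the requested byte range, or `None` if an
/// index is out of bounds, falls inside a multi-byte character, or if
/// `start > end`.
fn try_slice(content: &str, start: usize, end: usize) -> Option<&str> {
    content.get(start..end)
}
```

With `"Привет"`, `try_slice(text, 5, 10)` returns `None` because byte 5 is inside 'и', while `try_slice(text, 4, 10)` returns `Some("иве")`.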
Where This Occurs
- Preview generation: When offset and limit are provided
- Windowed views: When displaying file excerpts
- Any byte-indexed slicing: On UTF-8 string data
Suggested Fix
Solution: Character-Aware Slicing
Use Rust's char_indices() method to find valid character boundaries:
// ✅ Safe implementation
fn safe_slice(content: &str, start_byte: usize, end_byte: usize) -> &str {
    // Find the nearest character boundary at or after start_byte
    let start = content
        .char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .find(|&byte_pos| byte_pos >= start_byte)
        .unwrap_or(content.len());
    // Find the nearest character boundary at or before end_byte.
    // content.len() is also a valid boundary; without it the last
    // character would be dropped whenever end_byte reaches the end.
    let end = content
        .char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .chain(std::iter::once(content.len()))
        .take_while(|&byte_pos| byte_pos <= end_byte)
        .last()
        .unwrap_or(0);
    &content[start..end.max(start)] // Safe: both are char boundaries
}
Alternative: Use is_char_boundary with Boundary Checking
// ✅ Another safe approach
fn safe_preview(content: &str, start: usize, len: usize) -> &str {
    // Clamp start to the string, then advance to the next char boundary.
    // (Slicing content[start..] to search for it would itself panic
    // when start is inside a character.)
    let mut start = start.min(content.len());
    while !content.is_char_boundary(start) {
        start += 1;
    }
    // Clamp the end to the string, then back up to the previous boundary.
    // It cannot move past start, because start is itself a boundary.
    let mut end = start.saturating_add(len).min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[start..end]
}
Implementation Priority
- Immediate: Add is_char_boundary check before slicing
- Short-term: Implement character-aware slicing using char_indices()
- Long-term: Consider using a crate like unicode-segmentation for grapheme-aware operations
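For the short-term step, one option is to count characters rather than bytes, so that no boundary rounding is needed. The sketch below is an assumption about how that could look, not the worker's actual API: `preview_by_chars`, `skip` and `take` are hypothetical names, and it works on Unicode scalar values, not grapheme clusters.

```rust
/// Hypothetical helper: `skip` and `take` count characters, not bytes.
fn preview_by_chars(content: &str, skip: usize, take: usize) -> &str {
    // Byte offset of the first character to keep, or the end of the string
    // if there are fewer than `skip` characters.
    let start = content
        .char_indices()
        .nth(skip)
        .map_or(content.len(), |(i, _)| i);
    // `start` is a char boundary, so slicing the tail cannot panic.
    let tail = &content[start..];
    // Byte length of the first `take` characters of the tail.
    let len = tail
        .char_indices()
        .nth(take)
        .map_or(tail.len(), |(i, _)| i);
    &tail[..len]
}
```

The trade-off is that each call walks the string from the start, which may matter for large files; the byte-based variants only inspect a few bytes around each index.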
Example Code
Demonstrating the Bug
// utf8_bug_demo.rs
fn main() {
    let text = "Привет"; // "Hello" in Russian
    println!("String: {}", text);
    println!("Bytes: {:?}", text.as_bytes());
    println!("Byte indices:");
    for (i, b) in text.as_bytes().iter().enumerate() {
        println!(" {}: 0x{:02X}", i, b);
    }
    // Uncommenting the slice below panics because byte 5 is inside a character
    println!("\nAttempting unsafe slice &text[5..10]...");
    // let preview = &text[5..10]; // ❌ PANIC!
    // Safe approach
    println!("Using char_indices() for safe slicing...");
    let safe_start = text.char_indices()
        .find(|(pos, _)| *pos >= 5)
        .map(|(pos, _)| pos)
        .unwrap_or(text.len());
    if safe_start < text.len() {
        let safe_end = text.char_indices()
            .skip_while(|(pos, _)| *pos <= safe_start + 5)
            .next()
            .map(|(pos, _)| pos)
            .unwrap_or(text.len());
        println!("Safe slice: &text[{}..{}] = {:?}", safe_start, safe_end, &text[safe_start..safe_end]);
    }
}
Output
String: Привет
Bytes: [208, 159, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130]
Byte indices:
0: 0xD0
1: 0x9F
2: 0xD1
3: 0x80
4: 0xD0
5: 0xB8
6: 0xD0
7: 0xB2
...
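Attempting unsafe slice &text[5..10]...
Using char_indices() for safe slicing...
Safe slice: &text[6..12] = "вет"
With the commented-out slice enabled, the program panics at that line instead:
thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'и' (bytes 4..6) of `Привет`'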
Fix Implementation Example
// file_read_utils.rs
pub fn get_safe_preview(content: &str, offset: usize, limit: usize) -> String {
    if content.is_empty() {
        return String::new();
    }
    // Find safe start boundary: clamp to the string, then move forward
    let mut start = offset.min(content.len());
    while !content.is_char_boundary(start) {
        start += 1;
    }
    // Find safe end boundary: clamp to the string, then move backward.
    // Slicing content[..end_pos] to search for it would itself panic
    // when end_pos is inside a character.
    let mut end = start.saturating_add(limit).min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    content[start..end].to_string()
}
Test Cases
Add these test cases to prevent regression:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cyrillic_preview() {
        let text = "Привет, мир!";
        // Bytes 0..6 cover exactly three 2-byte letters
        assert_eq!(get_safe_preview(text, 0, 6), "При");
    }

    #[test]
    fn test_chinese_preview() {
        let text = "你好世界";
        // Offset 1 rounds up to byte 3; end 8 rounds down to byte 6
        assert_eq!(get_safe_preview(text, 1, 5), "好");
    }

    #[test]
    fn test_emoji_preview() {
        let text = "Hello 😀 World";
        // Bytes 6..10 are the 4-byte emoji
        assert_eq!(get_safe_preview(text, 6, 4), "😀");
    }

    #[test]
    fn test_mixed_content() {
        let text = "Пример: Example 😀";
        // Offset 5 rounds up to byte 6; end 16 is already a boundary
        assert_eq!(get_safe_preview(text, 5, 10), "мер: Ex");
    }
}
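Because the failure depends on exactly where an index lands, a regression check that tries every offset/limit pair may catch cases the hand-picked tests miss. The following is a sketch under stated assumptions: `preview` is a stand-in for whatever function the worker ends up using, and `exhaustive_boundary_check` is a hypothetical helper name.

```rust
/// Stand-in for the function under test (hypothetical; swap in the real one).
fn preview(content: &str, offset: usize, limit: usize) -> &str {
    // Clamp, then round the start up to a char boundary.
    let mut start = offset.min(content.len());
    while !content.is_char_boundary(start) {
        start += 1;
    }
    // Clamp, then round the end down to a char boundary.
    let mut end = start.saturating_add(limit).min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[start..end]
}

/// Calls `preview` with every offset/limit pair from 0 to one past the end
/// of each sample. Returns the number of calls made; a panic in any call
/// fails the surrounding test.
fn exhaustive_boundary_check(samples: &[&str]) -> usize {
    let mut calls = 0;
    for s in samples {
        for offset in 0..=s.len() + 1 {
            for limit in 0..=s.len() + 1 {
                let p = preview(s, offset, limit);
                // The preview never exceeds the requested byte budget.
                assert!(p.len() <= limit);
                calls += 1;
            }
        }
    }
    calls
}
```

Calling `exhaustive_boundary_check(&["Привет, мир!", "你好世界", "Hello 😀 World"])` from a `#[test]` function exercises every boundary position in those strings, including indices past the end.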
References
- char_indices() Method
Follow-Up Actions
Report Prepared By: Spacebot Agent
Priority: P0 - Immediate action required
Estimated Fix Time: 2-4 hours (implementation + testing)