CRITICAL: Worker file_read Operation Panics on Multi-byte UTF-8 Content (Cyrillic, CJK, Emoji) #391

@fargobornfarg-dot

Severity

CRITICAL - Blocks all file operations

  • Prevents reading files whenever a preview slice lands inside a multi-byte UTF-8 character
  • Affects any user whose files contain non-ASCII text
  • No known workaround for affected projects

Problem Description

The worker's file_read operation slices UTF-8 string content by byte index during preview generation without checking character boundaries. When the preview logic slices a string at an arbitrary byte offset using Rust's range indexing (&s[start..end]), the offset can fall in the middle of a multi-byte UTF-8 character, and Rust panics instead of returning a slice.

What Happens

  1. UTF-8 Encoding Background: UTF-8 is a variable-width encoding, so many characters (Cyrillic among them) take 2 or more bytes:

    • ASCII characters: 1 byte
    • Cyrillic (Russian, Ukrainian, etc.): 2 bytes
    • Chinese/Japanese/Korean: typically 3 bytes
    • Emoji: typically 4 bytes per code point
  2. The Bug: Preview generation code slices strings using byte indices:

    &s[start..end]  // ❌ Panics if start or end falls inside a multi-byte character
  3. The Panic: Rust's string slicing requires both indices to fall on character boundaries and panics otherwise:

    thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'а' (bytes 0..2) of `ау`'
    

This panic occurs because the code attempts to slice at byte position 1, which is in the middle of the 2-byte Cyrillic character 'а' (it occupies bytes 0 and 1; Rust reports the range with an exclusive end, 0..2).
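
The byte widths listed above can be checked with the standard library. The following is a minimal sketch for illustration only; `utf8_width` and `boundaries` are hypothetical helper names and are not part of the worker code.

```rust
/// Number of bytes a single character occupies when encoded as UTF-8.
fn utf8_width(c: char) -> usize {
    c.len_utf8()
}

/// Every byte offset in `s` at which slicing is allowed,
/// including 0 and `s.len()`.
fn boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}
```

For the string `ау`, `boundaries` returns `[0, 2, 4]`, so a slice that starts or ends at byte 1 or 3 is rejected.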


Reproduction Steps

Minimal Reproduction

# 1. Create a test file with Cyrillic content
echo "ау" > /tmp/cyrillic_test.txt

# 2. Attempt to read using worker's file_read
# (In worker context)
file_read(path="/tmp/cyrillic_test.txt", offset=1, limit=10)

# Result: Panic during preview generation

Step-by-Step

  1. Create a file containing multi-byte UTF-8 characters:

    echo "Привет, мир!" > test.txt
    # Or: echo "你好世界" > test.txt
    # Or: echo "😀" > test.txt
  2. Trigger file_read with offset/limit parameters (for preview generation):

    // In worker code
    file_read({
      path: "test.txt",
      offset: 1,    // Triggers preview path
      limit: 50
    })
  3. Observe the panic:

    panicked at 'byte index N is not a char boundary; it is inside 'X' (bytes A..B) of `...`'
    

Current Impact

Affected Users

  • ❌ All Russian-speaking users
  • ❌ All Ukrainian, Bulgarian, Serbian users (Cyrillic script)
  • ❌ All Chinese, Japanese, Korean users
  • ❌ All users using emoji in files
  • ❌ Any project with non-ASCII UTF-8 content

Blocked Operations

  • ❌ Reading source code with non-ASCII comments
  • ❌ Reading documentation in international languages
  • ❌ Reading configuration files with Unicode content
  • ❌ Reading any text files containing emoji
  • ❌ Preview generation for files with multi-byte characters

Scope

  • Geographic: Affects users worldwide (most non-English languages)
  • Functional: Complete blocker for file operations on affected projects
  • Data Loss Risk: None (panic doesn't corrupt data), but prevents access

Technical Root Cause

The Problem Code Pattern

The preview generation logic uses byte-level slicing:

// ❌ Current implementation (unsafe)
fn generate_preview(content: &str, start: usize, end: usize) -> &str {
    &content[start..end]  // PANIC if start/end split a multi-byte char
}

Why This Fails

UTF-8 encoding uses variable-length characters:

Character       Bytes                  Byte Indices (inclusive)
----------------------------------------------------------------
'a'             0x61                   0
'а' (Cyrillic)  0xD0 0xB0              0..1
'你' (Chinese)  0xE4 0xBD 0xA0         0..2
'😀' (Emoji)    0xF0 0x9F 0x98 0x80    0..3

When slicing &s[5..10] on a string like "Привет":

  • Byte 5 is the second byte of the Cyrillic 'и', which occupies bytes 4-5
  • The end index 10 is a valid boundary, but the start index is not, so Rust rejects the slice and panics
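
The same boundary check is available without a panic through `str::get`, which returns `None` for an invalid range. A small sketch of that behaviour follows; the wrapper name `try_slice` is hypothetical and only there to make the example testable.

```rust
/// Returns the requested byte range, or `None` if either index is out of
/// bounds or falls inside a multi-byte character. Never panics.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

With "Привет", `try_slice(s, 5, 10)` yields `None`, while `try_slice(s, 4, 10)` yields `Some("иве")` because both 4 and 10 are character boundaries.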

Where This Occurs

  1. Preview generation: When offset and limit are provided
  2. Windowed views: When displaying file excerpts
  3. Any byte-indexed slicing: On UTF-8 string data

Suggested Fix

Solution: Character-Aware Slicing

Use Rust's char_indices() method to find valid character boundaries:

// ✅ Safe implementation
fn safe_slice(content: &str, start_byte: usize, end_byte: usize) -> &str {
    // Find the nearest character boundary at or after start_byte
    let start = content
        .char_indices()
        .map(|(byte_pos, _)| byte_pos)
        .find(|&byte_pos| byte_pos >= start_byte)
        .unwrap_or(content.len());

    // Find the nearest character boundary at or before end_byte.
    // char_indices() never yields content.len(), so handle that case
    // explicitly; otherwise the final character would be dropped.
    let end = if end_byte >= content.len() {
        content.len()
    } else {
        content
            .char_indices()
            .map(|(byte_pos, _)| byte_pos)
            .take_while(|&byte_pos| byte_pos <= end_byte)
            .last()
            .unwrap_or(0)
    };

    &content[start..end.max(start)]  // Safe: both are char boundaries
}

Alternative: Use is_char_boundary Checks

// ✅ Another safe approach
fn safe_preview(content: &str, start: usize, len: usize) -> &str {
    // Clamp start to the string, then advance to the next character boundary.
    // content.len() is always a boundary, so the loop terminates.
    let mut start = start.min(content.len());
    while !content.is_char_boundary(start) {
        start += 1;
    }

    // Clamp end to the string, then step back to the previous boundary.
    // start is itself a boundary, so end never drops below it.
    let mut end = start.saturating_add(len).min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }

    &content[start..end]
}

Implementation Priority

  1. Immediate: Add is_char_boundary check before slicing
  2. Short-term: Implement character-aware slicing using char_indices()
  3. Long-term: Consider using a crate like unicode-segmentation for grapheme-aware operations
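
On the long-term item: a character boundary is not always a boundary between user-perceived characters. The sketch below uses only the standard library to show the gap that a grapheme-aware crate would close; `first_char` is a hypothetical helper, not worker code.

```rust
/// Returns the first `char` of `s` as a slice. The result is always valid
/// UTF-8, but it may be only part of a user-perceived character.
fn first_char(s: &str) -> &str {
    match s.chars().next() {
        // c.len_utf8() is the end of the first char, hence a valid boundary.
        Some(c) => &s[..c.len_utf8()],
        None => "",
    }
}
```

For `"e\u{301}"` (the letter e followed by a combining acute accent, rendered as one accented letter), `first_char` returns just `"e"` and the accent is lost. Boundary-aware slicing prevents the panic, while grapheme-aware slicing would also keep such pairs together.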

Example Code

Demonstrating the Bug

// utf8_bug_demo.rs
fn main() {
    let text = "Привет";  // "Hello" in Russian
    
    println!("String: {}", text);
    println!("Bytes: {:?}", text.as_bytes());
    println!("Byte indices:");
    for (i, b) in text.as_bytes().iter().enumerate() {
        println!("  {}: 0x{:02X}", i, b);
    }
    
    // This will panic because byte 5 is inside a character
    println!("\nAttempting unsafe slice &text[5..10]...");
    // let preview = &text[5..10];  // ❌ PANIC!
    
    // Safe approach
    println!("Using char_indices() for safe slicing...");
    let safe_start = text.char_indices()
        .find(|(pos, _)| *pos >= 5)
        .map(|(pos, _)| pos)
        .unwrap_or(text.len());
    
    if safe_start < text.len() {
        let safe_end = text.char_indices()
            .skip_while(|(pos, _)| *pos <= safe_start + 5)
            .next()
            .map(|(pos, _)| pos)
            .unwrap_or(text.len());
        
        println!("Safe slice: &text[{}..{}] = {:?}", safe_start, safe_end, &text[safe_start..safe_end]);
    }
}

Output

String: Привет
Bytes: [208, 159, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130]
Byte indices:
  0: 0xD0
  1: 0x9F
  2: 0xD1
  3: 0x80
  4: 0xD0
  5: 0xB8
  6: 0xD0
  7: 0xB2
  ...

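Attempting unsafe slice &text[5..10]...
Using char_indices() for safe slicing...
Safe slice: &text[6..12] = "вет"

If the commented-out slice is enabled, the program instead stops after the "Attempting unsafe slice" line with:

thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'и' (bytes 4..6) of `Привет`'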

Fix Implementation Example

// file_read_utils.rs
pub fn get_safe_preview(content: &str, offset: usize, limit: usize) -> String {
    if content.is_empty() {
        return String::new();
    }

    // Find safe start boundary: clamp to the string, then move forward
    let mut start = offset.min(content.len());
    while !content.is_char_boundary(start) {
        start += 1;
    }

    // Find safe end boundary: clamp to the string, then move backward.
    // Slicing content[..end_pos] to search for a boundary would itself panic
    // when end_pos is not a boundary, so test candidate indices instead.
    let mut end = start.saturating_add(limit).min(content.len());
    while !content.is_char_boundary(end) {
        end -= 1;
    }

    content[start..end].to_string()
}
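
The function above treats offset and limit as byte counts. If the worker's parameters are instead meant as character counts (an assumption; the issue does not say which), a preview can be built without any byte arithmetic. This is a rough sketch, and `preview_by_chars` is a hypothetical name.

```rust
/// Preview that interprets `offset` and `limit` as counts of characters
/// (Unicode scalar values) rather than bytes. It cannot split a character
/// because it never indexes by byte.
fn preview_by_chars(content: &str, offset: usize, limit: usize) -> String {
    content.chars().skip(offset).take(limit).collect()
}
```

The trade-off is that this walks the string from the beginning and allocates a new String, and the result may hold up to four bytes per requested character.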

Test Cases

Add these test cases to prevent regression:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cyrillic_preview() {
        let text = "Привет, мир!";
        // Bytes 0..6 cover exactly three 2-byte characters
        let preview = get_safe_preview(text, 0, 6);
        assert_eq!(preview, "При");
    }

    #[test]
    fn test_chinese_preview() {
        let text = "你好世界";
        // Offset 1 and end 8 both fall inside characters and must be adjusted
        let preview = get_safe_preview(text, 1, 5);
        assert_eq!(preview, "好");
    }

    #[test]
    fn test_emoji_preview() {
        let text = "Hello 😀 World";
        // The 4-byte emoji occupies bytes 6..10
        let preview = get_safe_preview(text, 6, 4);
        assert_eq!(preview, "😀");
    }

    #[test]
    fn test_mixed_content() {
        let text = "Пример: Example 😀";
        // Offset 5 is inside 'и' and moves forward to byte 6
        let preview = get_safe_preview(text, 5, 10);
        assert_eq!(preview, "мер: Ex");
    }
}

Follow-Up Actions

  • Implement character-aware slicing in file_read preview generation
  • Add unit tests for multi-byte UTF-8 content
  • Add integration tests for Cyrillic, CJK, and emoji content
  • Audit other string slicing operations in the codebase
  • Consider adding linter rules for unsafe string slicing
  • Update documentation to note UTF-8 safety requirements
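
For the audit and linter items, one candidate is Clippy's `string_slice` lint, which to my knowledge sits in the restriction group (off by default) and flags range indexing on strings. The sketch below opts a single function in and shows a replacement pattern that avoids indexing; `head` is a hypothetical helper, and the exact set of expressions the lint reports should be confirmed against the Clippy documentation.

```rust
// Opt in to Clippy's `string_slice` restriction lint for this item.
// With the lint active, `cargo clippy` is expected to flag `&s[..n]` here.
#[warn(clippy::string_slice)]
fn head(s: &str, max_bytes: usize) -> &str {
    // `get` returns None instead of panicking, so probe backwards until the
    // end index lands on a character boundary. Index 0 always succeeds.
    let mut end = max_bytes.min(s.len());
    while s.get(..end).is_none() {
        end -= 1;
    }
    s.get(..end).unwrap_or("")
}
```

Placing `#![warn(clippy::string_slice)]` at the crate root would apply the same check to the whole codebase, which could serve as a starting point for the audit.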

Report Prepared By: Spacebot Agent
Priority: P0 - Immediate action required
Estimated Fix Time: 2-4 hours (implementation + testing)
