CRITICAL: Worker `file_read` Operation Panics on Multi-byte UTF-8 Content (Cyrillic, CJK, Emoji)

## Severity

 **CRITICAL** - Blocks all file operations

- Completely prevents reading files containing multi-byte UTF-8 characters
- Affects all international users using languages beyond ASCII
- No workaround available for affected projects

---

## Problem Description

The worker's `file_read` operation performs unsafe byte-level slicing on UTF-8 string content during preview generation. When the preview logic attempts to slice a string at arbitrary byte boundaries using Rust's byte slicing (`&s[start..end]`), it can split multi-byte UTF-8 characters in half.

### What Happens

1. **UTF-8 Encoding Background**: Cyrillic characters (and many others) use 2+ bytes per character in UTF-8:
   - ASCII characters: 1 byte
   - Cyrillic (Russian, Ukrainian, etc.): 2 bytes
   - Chinese/Japanese/Korean: 3 bytes
   - Emoji: 4 bytes

2. **The Bug**: Preview generation code slices strings using byte indices:
   ```rust
   &s[start..end]  // ❌ Unsafe! Can cut multi-byte characters
   ```

3. **The Panic**: Rust's string slicing enforces UTF-8 validity:
   ```
   thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'а' (bytes 4..5) of `ау`'
   ```

This panic occurs because the code attempts to slice at byte position 5, which is in the middle of a 2-byte Cyrillic character.

---

## Reproduction Steps

### Minimal Reproduction

```bash
# 1. Create a test file with Cyrillic content
echo "ау" > /tmp/cyrillic_test.txt

# 2. Attempt to read using worker's file_read
# (In worker context)
file_read(path="/tmp/cyrillic_test.txt", offset=1, limit=10)

# Result: Panic during preview generation
```

### Step-by-Step

1. **Create a file** containing multi-byte UTF-8 characters:
   ```bash
   echo "Привет, мир!" > test.txt
   # Or: echo "你好世界" > test.txt
   # Or: echo "" > test.txt
   ```

2. **Trigger file_read** with offset/limit parameters (for preview generation):
   ```javascript
   // In worker code
   file_read({
     path: "test.txt",
     offset: 1,    // Triggers preview path
     limit: 50
   })
   ```

3. **Observe the panic**:
   ```
   panicked at 'byte index N is not a char boundary; it is inside 'X' (bytes N..N+1) of `...`'
   ```

---

## Current Impact

### Affected Users
- ❌ All Russian-speaking users
- ❌ All Ukrainian, Bulgarian, Serbian users (Cyrillic script)
- ❌ All Chinese, Japanese, Korean users
- ❌ All users using emoji in files
- ❌ Any project with non-ASCII UTF-8 content

### Blocked Operations
- ❌ Reading source code with non-ASCII comments
- ❌ Reading documentation in international languages
- ❌ Reading configuration files with Unicode content
- ❌ Reading any text files containing emoji
- ❌ Preview generation for files with multi-byte characters

### Scope
- **Geographic**: Affects users worldwide (most non-English languages)
- **Functional**: Complete blocker for file operations on affected projects
- **Data Loss Risk**: None (panic doesn't corrupt data), but prevents access

---

## Technical Root Cause

### The Problem Code Pattern

The preview generation logic uses byte-level slicing:

```rust
// ❌ Current implementation (unsafe)
fn generate_preview(content: &str, start: usize, end: usize) -> &str {
    &content[start..end]  // PANIC if start/end split a multi-byte char
}
```

### Why This Fails

UTF-8 encoding uses variable-length characters:

```
Character     Bytes          Byte Indices
--------------------------------------------
'a'           0x61           0
'а' (Cyrillic) 0xD0 0xB0      0..1
'你' (Chinese) 0xE4 0xBD 0xA0  0..2
'' (Emoji)   0xF0 0x9F 0x98 0x80  0..3
```

When slicing `&s[5..10]` on a string like `"Привет"`:
- Byte 5 is in the middle of the Cyrillic 'и' (bytes 4-5)
- Rust validates the slice boundary and panics

### Where This Occurs

1. **Preview generation**: When `offset` and `limit` are provided
2. **Windowed views**: When displaying file excerpts
3. **Any byte-indexed slicing**: On UTF-8 string data

---

## Suggested Fix

### Solution: Character-Aware Slicing

Use Rust's `char_indices()` method to find valid character boundaries:

```rust
// ✅ Safe implementation
fn safe_slice(content: &str, start_byte: usize, end_byte: usize) -> &str {
    // Find the nearest character boundary at or after start_byte
    let start = content
        .char_indices()
        .find(|(byte_pos, _)| *byte_pos >= start_byte)
        .map(|(byte_pos, _)| byte_pos)
        .unwrap_or(content.len());
    
    // Find the nearest character boundary at or before end_byte
    let end = content
        .char_indices()
        .take_while(|(byte_pos, _)| *byte_pos <= end_byte)
        .last()
        .map(|(byte_pos, _)| byte_pos)
        .unwrap_or(0);
    
    &content[start..end.max(start)]  // Safe: both are char boundaries
}
```

### Alternative: Use `split_at` with Boundary Checking

```rust
// ✅ Another safe approach
fn safe_preview(content: &str, start: usize, len: usize) -> &str {
    // Ensure start is at a character boundary
    let start = if content.is_char_boundary(start) {
        start
    } else {
        // Find next valid boundary
        content[start..].char_indices()
            .nth(1)
            .map(|(i, _)| start + i)
            .unwrap_or(content.len())
    };
    
    let end = if start + len <= content.len() {
        // Find nearest safe end
        content[start..].char_indices()
            .take_while(|(i, _)| start + i <= start + len)
            .last()
            .map(|(i, _)| start + i)
            .unwrap_or(start)
    } else {
        content.len()
    };
    
    &content[start..end]
}
```

### Implementation Priority

1. **Immediate**: Add `is_char_boundary` check before slicing
2. **Short-term**: Implement character-aware slicing using `char_indices()`
3. **Long-term**: Consider using a crate like `unicode-segmentation` for grapheme-aware operations

---

## Example Code

### Demonstrating the Bug

```rust
// utf8_bug_demo.rs
fn main() {
    let text = "Привет";  // "Hello" in Russian
    
    println!("String: {}", text);
    println!("Bytes: {:?}", text.as_bytes());
    println!("Byte indices:");
    for (i, b) in text.as_bytes().iter().enumerate() {
        println!("  {}: 0x{:02X}", i, b);
    }
    
    // This will panic because byte 5 is inside a character
    println!("\nAttempting unsafe slice &text[5..10]...");
    // let preview = &text[5..10];  // ❌ PANIC!
    
    // Safe approach
    println!("Using char_indices() for safe slicing...");
    let safe_start = text.char_indices()
        .find(|(pos, _)| *pos >= 5)
        .map(|(pos, _)| pos)
        .unwrap_or(text.len());
    
    if safe_start < text.len() {
        let safe_end = text.char_indices()
            .skip_while(|(pos, _)| *pos <= safe_start + 5)
            .next()
            .map(|(pos, _)| pos)
            .unwrap_or(text.len());
        
        println!("Safe slice: &text[{}..{}] = {:?}", safe_start, safe_end, &text[safe_start..safe_end]);
    }
}
```

### Output

```
String: Привет
Bytes: [208, 159, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130]
Byte indices:
  0: 0xD0
  1: 0x9F
  2: 0xD1
  3: 0x80
  4: 0xD0
  5: 0xB8
  6: 0xD0
  7: 0xB2
  ...

Attempting unsafe slice &text[5..10]...
thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'и' (bytes 4..5) of `Привет`'
```

### Fix Implementation Example

```rust
// file_read_utils.rs
pub fn get_safe_preview(content: &str, offset: usize, limit: usize) -> String {
    if content.is_empty() {
        return String::new();
    }
    
    // Find safe start boundary
    let start = if content.is_char_boundary(offset) {
        offset
    } else {
        content.char_indices()
            .find(|(pos, _)| *pos >= offset)
            .map(|(pos, _)| pos)
            .unwrap_or(content.len())
    };
    
    // Find safe end boundary
    let end_pos = (start + limit).min(content.len());
    let end = if content.is_char_boundary(end_pos) {
        end_pos
    } else {
        content[..end_pos].char_indices()
            .last()
            .map(|(pos, _)| pos)
            .unwrap_or(0)
    };
    
    content[start..end].to_string()
}
```

---

## Test Cases

Add these test cases to prevent regression:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cyrillic_preview() {
        let text = "Привет, мир!";
        let preview = get_safe_preview(text, 0, 6);
        assert!(!preview.is_empty());
    }

    #[test]
    fn test_chinese_preview() {
        let text = "你好世界";
        let preview = get_safe_preview(text, 1, 5);
        assert!(!preview.is_empty());
    }

    #[test]
    fn test_emoji_preview() {
        let text = "Hello  World";
        let preview = get_safe_preview(text, 6, 4);
        assert!(!preview.is_empty());
    }

    #[test]
    fn test_mixed_content() {
        let text = "Пример: Example ";
        let preview = get_safe_preview(text, 5, 10);
        assert!(!preview.is_empty());
    }
}
```

---

## References

- [Rust String Slicing Documentation](https://doc.rust-lang.org/std/primitive.str.html#method.slice)
- [UTF-8 Encoding Wikipedia](https://en.wikipedia.org/wiki/UTF-8)
- [Rust `char_indices()` Method](https://doc.rust-lang.org/std/primitive.str.html#method.char_indices)
- [Unicode Standard](https://unicode.org/standard/standard.html)

---

## Follow-Up Actions

- [ ] Implement character-aware slicing in file_read preview generation
- [ ] Add unit tests for multi-byte UTF-8 content
- [ ] Add integration tests for Cyrillic, CJK, and emoji content
- [ ] Audit other string slicing operations in the codebase
- [ ] Consider adding linter rules for unsafe string slicing
- [ ] Update documentation to note UTF-8 safety requirements

---

**Report Prepared By**: Spacebot Agent  
**Priority**: P0 - Immediate action required  
**Estimated Fix Time**: 2-4 hours (implementation + testing)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRITICAL: Worker `file_read` Operation Panics on Multi-byte UTF-8 Content (Cyrillic, CJK, Emoji) #391

Severity

Problem Description

What Happens

Reproduction Steps

Minimal Reproduction

Step-by-Step

Current Impact

Affected Users

Blocked Operations

Scope

Technical Root Cause

The Problem Code Pattern

Why This Fails

Where This Occurs

Suggested Fix

Solution: Character-Aware Slicing

Alternative: Use `split_at` with Boundary Checking

Implementation Priority

Example Code

Demonstrating the Bug

Output

Fix Implementation Example

Test Cases

References

Follow-Up Actions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

CRITICAL: Worker file_read Operation Panics on Multi-byte UTF-8 Content (Cyrillic, CJK, Emoji) #391

Description

Severity

Problem Description

What Happens

Reproduction Steps

Minimal Reproduction

Step-by-Step

Current Impact

Affected Users

Blocked Operations

Scope

Technical Root Cause

The Problem Code Pattern

Why This Fails

Where This Occurs

Suggested Fix

Solution: Character-Aware Slicing

Alternative: Use split_at with Boundary Checking

Implementation Priority

Example Code

Demonstrating the Bug

Output

Fix Implementation Example

Test Cases

References

Follow-Up Actions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

CRITICAL: Worker `file_read` Operation Panics on Multi-byte UTF-8 Content (Cyrillic, CJK, Emoji) #391

Alternative: Use `split_at` with Boundary Checking