[Bug]: non-latin letters in file names are stripped #2703

Open
mkhludnev opened this issue Jun 22, 2024 · 1 comment
Labels
area: backend Related to backend functionality or under the /backend directory bug Something isn't working


What happened?

Non-ASCII letters are removed from file names.
This happens because the file name is used as the key for storing the upload (https://github.com/QuivrHQ/quivr/blob/main/backend/modules/upload/service/upload_file.py#L81), and upload keys are restricted (supabase/storage#133). Thus Quivr removes non-ASCII characters due to #1728.
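
To illustrate the effect, here is a minimal sketch of an ASCII-only sanitizer. It is not Quivr's actual code: `to_ascii_key` is a hypothetical helper, and the normalization strategy is an assumption chosen only to show how non-Latin letters disappear when a name is forced into an ASCII-only key.

```python
import unicodedata


def to_ascii_key(file_name: str) -> str:
    """Hypothetical sanitizer: decompose accented letters, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", file_name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_key("résumé.pdf"))  # → resume.pdf (accents are lost, base letters survive)
print(to_ascii_key("отчёт.pdf"))   # → .pdf (every Cyrillic letter is dropped)
```

Latin names with accents degrade gracefully, but a fully non-Latin name collapses to just its extension, so two different uploads can end up with the same key.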

I made an attempt to fix it, but I am not pleased with the result. I think it is worth adding an original_file_name column to the knowledge table and identifying file uploads by UUID. WDYT?


@mkhludnev mkhludnev added the bug Something isn't working label Jun 22, 2024
@dosubot dosubot bot added the area: backend Related to backend functionality or under the /backend directory label Jun 22, 2024

dosubot bot commented Jun 22, 2024

The proposed solution of adding an original_file_name column to the knowledge table and identifying file uploads by UUID to preserve non-ASCII characters in file names seems reasonable. This approach allows you to store the original file name, including any non-ASCII characters, while using a UUID for identification, which ensures uniqueness and avoids issues with special characters in file names.

Here is how you might modify the Knowledge class to include the original_file_name:

from typing import Optional
from uuid import UUID

from pydantic import BaseModel


class Knowledge(BaseModel):
    id: UUID
    brain_id: UUID
    file_name: Optional[str] = None
    original_file_name: Optional[str] = None  # Added column for original file name
    url: Optional[str] = None
    extension: str = "txt"

This change will allow you to store the original file name with non-ASCII characters while using the file_name field for a sanitized or UUID-based name.
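
As a rough sketch of how the two fields could be filled at upload time, the helper below derives an ASCII-safe storage key from a fresh UUID and keeps the user's name untouched. Everything here is an assumption for illustration: `build_upload`, `KnowledgeRecord`, and the `brain_id/file_name` key layout are hypothetical and not taken from the Quivr codebase.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple
from uuid import UUID, uuid4


@dataclass
class KnowledgeRecord:
    """Hypothetical stand-in for the Knowledge model shown above."""
    id: UUID
    brain_id: UUID
    file_name: str            # UUID-based, always ASCII
    original_file_name: str   # exactly what the user uploaded


def build_upload(brain_id: UUID, original_file_name: str) -> Tuple[str, KnowledgeRecord]:
    """Return an ASCII-only storage key and the record that remembers the real name."""
    knowledge_id = uuid4()
    # Keep the extension only when it is itself ASCII; otherwise drop it.
    extension = PurePosixPath(original_file_name).suffix.lower()
    if not extension.isascii():
        extension = ""
    file_name = f"{knowledge_id}{extension}"
    storage_key = f"{brain_id}/{file_name}"  # assumed key layout
    record = KnowledgeRecord(knowledge_id, brain_id, file_name, original_file_name)
    return storage_key, record


key, record = build_upload(uuid4(), "отчёт.PDF")
print(key)                        # ASCII-only key ending in .pdf
print(record.original_file_name)  # the name shown back to the user
```

Because the key no longer depends on the user's file name, the storage restriction on key characters stops affecting what is displayed.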

Additionally, you will need to update the SQL schema for the knowledge table to include the original_file_name column:

CREATE TABLE IF NOT EXISTS knowledge (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  file_name TEXT,
  original_file_name TEXT,  -- Added column for original file name
  url TEXT,
  brain_id UUID NOT NULL REFERENCES brains(brain_id),
  extension TEXT NOT NULL,
  CHECK ((file_name IS NOT NULL AND url IS NULL) OR (file_name IS NULL AND url IS NOT NULL))
);

This ensures that the database schema is aligned with the changes in your application code [1][2][3].
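
One caveat worth noting: `CREATE TABLE IF NOT EXISTS` does nothing when the table already exists, so an existing deployment would need an `ALTER TABLE` migration as well. The sketch below demonstrates this with Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the simplified table definition and the `add_original_file_name` helper are assumptions for illustration. On PostgreSQL the equivalent statement would be `ALTER TABLE knowledge ADD COLUMN IF NOT EXISTS original_file_name TEXT;`.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> list:
    """List the column names of a table (SQLite-specific PRAGMA)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def add_original_file_name(conn: sqlite3.Connection) -> None:
    """Hypothetical migration step: add the column only if it is missing."""
    if "original_file_name" not in column_names(conn, "knowledge"):
        conn.execute("ALTER TABLE knowledge ADD COLUMN original_file_name TEXT")


conn = sqlite3.connect(":memory:")
# Simulate an existing deployment with the old schema.
conn.execute("CREATE TABLE knowledge (id TEXT PRIMARY KEY, file_name TEXT, url TEXT)")
# Re-running the updated CREATE statement is a no-op on the existing table.
conn.execute(
    "CREATE TABLE IF NOT EXISTS knowledge "
    "(id TEXT PRIMARY KEY, file_name TEXT, original_file_name TEXT, url TEXT)"
)
before = column_names(conn, "knowledge")
add_original_file_name(conn)
after = column_names(conn, "knowledge")
print(before)
print(after)
```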
