groups acl not reflected in vector index and blob storage #2064

Open
cforce opened this issue Oct 22, 2024 · 1 comment

cforce commented Oct 22, 2024

Summary: The script adlsgen2setup.py does not assign the "groups" ACL to individual files, and prepdocs.py does not check folder ACLs; it copies all files into a flat structure without any group ACLs or permissions. Consequently, the "groups" field in the vector index remains empty, and files are accessible to anyone, regardless of which security group they belong to.

After running adlsgen2setup.py, the correct security groups are assigned to the directories in Azure Data Lake Gen2 through their ACLs. However, files within those directories do not inherit these ACLs. To fix this, it may be necessary to set the "default ACL" for the folders or apply the ACLs recursively to include the files, just as it’s done for the folders.
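A minimal sketch of the recursive option, not the actual adlsgen2setup.py code. It assumes `directory_client` is a `DataLakeDirectoryClient` from `azure-storage-file-datalake`, whose `update_access_control_recursive(acl=...)` method merges the given entries into the ACLs of the directory and everything below it. The helper names `build_group_acl` and `apply_acl_recursively` are hypothetical, and the `r-x` permission is only an example.

```python
from typing import Iterable


def build_group_acl(group_ids: Iterable[str], permissions: str = "r-x",
                    include_default: bool = False) -> str:
    """Build a POSIX-style ACL string with one named-group entry per group.

    With include_default=True a matching "default:" entry is added as well.
    Default entries only make sense on directories: newly created children
    inherit them, existing files are not changed by them.
    """
    entries = []
    for group_id in group_ids:
        entries.append(f"group:{group_id}:{permissions}")
        if include_default:
            entries.append(f"default:group:{group_id}:{permissions}")
    return ",".join(entries)


def apply_acl_recursively(directory_client, group_ids: Iterable[str]) -> str:
    """Merge the group entries into the directory and all existing children.

    Assumption: directory_client behaves like
    azure.storage.filedatalake.DataLakeDirectoryClient.
    """
    acl = build_group_acl(group_ids)
    directory_client.update_access_control_recursive(acl=acl)
    return acl
```

Applying the access entries recursively covers files that already exist; setting default entries on the folders would only help for files uploaded afterwards.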

This is crucial because prepdocs.py, when importing data from Azure Data Lake Storage, only looks at file-level ACLs, not those on the directories. Without proper ACLs on the files, the security groups don’t get applied, leaving the "groups" field in the vector index empty.
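To illustrate what a directory-aware lookup could look like, here is a rough sketch. It is not how prepdocs.py is implemented; the function names are hypothetical, and it assumes the ACL is available as the usual comma-separated string (for example `user::rwx,group::r-x,group:<object-id>:r-x,other::---`). Falling back to the parent directory's ACL is one possible policy, not something ADLS does itself.

```python
from typing import List, Optional


def groups_with_read_access(acl: str) -> List[str]:
    """Return the IDs of named groups that have read permission in an ACL string."""
    groups = []
    for entry in acl.split(","):
        parts = entry.strip().split(":")
        # Access entries have three parts (scope:qualifier:permissions).
        # "default:" entries have four parts and are skipped here.
        if len(parts) != 3:
            continue
        scope, qualifier, permissions = parts
        # An empty qualifier is the owning group, not a named security group.
        if scope == "group" and qualifier and permissions.startswith("r"):
            groups.append(qualifier)
    return groups


def effective_groups(file_acl: str, directory_acl: Optional[str] = None) -> List[str]:
    """Use the file's named groups; fall back to the parent directory's if none."""
    groups = groups_with_read_access(file_acl)
    if not groups and directory_acl:
        groups = groups_with_read_access(directory_acl)
    return groups
```

With a fallback like this, the "groups" field would be populated even when only the folders carry the security group entries.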

When running prepdocs.py in "datalake mode" (with AZURE_ADLS_GEN2_STORAGE_ACCOUNT configured), the files still get uploaded to the AZURE_STORAGE_ACCOUNT (using the default container "content"), but the folder structure (category) is removed along with the ACLs. This results in a flat structure without any access control for groups, potentially allowing access to files through filename guessing.

Why do I even need an additional storage sink if the data is already stored in Azure Data Lake Storage? Is it necessary to use the extra storage to display content in the browser UI? Or can I simply use the --skipblobs option and still have documents display by referencing AZURE_ADLS_GEN2_STORAGE_ACCOUNT + AZURE_ADLS_GEN2_FILESYSTEM + AZURE_ADLS_GEN2_FILESYSTEM_PATH + filename?

Additionally, why is the directory structure being discarded when uploading to AZURE_STORAGE_ACCOUNT? With this flat structure and no directories, name duplication issues could arise as more documents are added over time.
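The duplication concern can be shown with a small sketch. The function names and example paths below are made up for illustration and are not taken from prepdocs.py; the point is only that keeping the path relative to the import root as the blob name avoids collisions that a basename-only name produces.

```python
import posixpath


def flat_blob_name(adls_path: str) -> str:
    """Blob name that keeps only the file name (the flat behaviour described above)."""
    return posixpath.basename(adls_path)


def path_preserving_blob_name(adls_path: str, root: str = "") -> str:
    """Blob name that keeps the directory structure relative to `root`."""
    path = adls_path.strip("/")
    prefix = root.strip("/")
    if prefix and path.startswith(prefix + "/"):
        path = path[len(prefix) + 1:]
    return path


# Two different documents with the same file name in different folders.
paths = ["docs/hr/policy.pdf", "docs/finance/policy.pdf"]
flat_names = {flat_blob_name(p) for p in paths}
preserved_names = [path_preserving_blob_name(p, "docs") for p in paths]
```

Here `flat_names` contains a single entry, so one upload would overwrite the other, while `preserved_names` keeps both documents apart.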

mattgotteiner self-assigned this on Oct 25, 2024
cforce changed the title from "groups acl not reflected in vector index" to "groups acl not reflected in vector index and blob storage" on Oct 26, 2024

cforce commented Jan 18, 2025

related #2278
