Summary: The script adlsgen2setup.py does not assign the "groups" ACL to individual files, and prepdocs.py does not check folder ACLs; it also creates a copy of all files in a flat structure without any ACL groups or rights. Consequently, the "groups" field in the vector index remains empty, and files are accessible to anyone, regardless of which security group they belong to.
After running adlsgen2setup.py, the correct security groups are assigned to the directories in Azure Data Lake Gen2 through their ACLs. However, files within those directories do not inherit these ACLs. To fix this, it may be necessary to set the "default ACL" for the folders or apply the ACLs recursively to include the files, just as it’s done for the folders.
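As a rough sketch of the fix suggested above, the following helper builds an ACL string that contains both an access entry and a "default" entry for each group. The function name `build_group_acl` and the group IDs are hypothetical and not taken from adlsgen2setup.py. Note that default entries are only inherited by children created afterwards, so files that already exist would still need a recursive update.

```python
def build_group_acl(group_ids, permissions="r-x", include_default=True):
    """Build an ADLS Gen2 style ACL string for the given group object IDs.

    Hypothetical helper for illustration; not part of adlsgen2setup.py.
    """
    entries = []
    for group_id in group_ids:
        # Access entry: applies to the item the ACL is set on.
        entries.append(f"group:{group_id}:{permissions}")
        if include_default:
            # Default entry: only meaningful on directories, where it is
            # inherited by children created later.
            entries.append(f"default:group:{group_id}:{permissions}")
    return ",".join(entries)


# Placeholder group ID, used only to show the resulting format.
directory_acl = build_group_acl(["00000000-0000-0000-0000-000000000001"])
print(directory_acl)
```

As far as I know, a string in this format is what `DataLakeDirectoryClient.update_access_control_recursive(acl=...)` in the azure-storage-file-datalake package expects, but I have not verified that against this repository's setup script.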
This is crucial because prepdocs.py, when importing data from Azure Data Lake Storage, only looks at file-level ACLs, not those on the directories. Without proper ACLs on the files, the security groups don’t get applied, leaving the "groups" field in the vector index empty.
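To illustrate the alternative on the prepdocs.py side, here is a minimal sketch of a fallback that reads the named groups from a file's ACL string and, if there are none, uses the parent directory's ACL instead. Both function names are hypothetical, the ACL strings are made-up examples, and this is not how prepdocs.py currently behaves; whether falling back to the directory ACL is the right policy is itself an assumption.

```python
def groups_with_read_access(acl):
    """Return the IDs of named groups that have read permission in an ACL string."""
    groups = []
    for entry in acl.split(","):
        parts = entry.strip().split(":")
        if len(parts) != 3:
            # Skips default-scope entries (four parts) and malformed entries.
            continue
        entry_type, entity_id, perms = parts
        # An empty ID is the owning group, not a named security group.
        if entry_type == "group" and entity_id and "r" in perms:
            groups.append(entity_id)
    return groups


def effective_groups(file_acl, directory_acl):
    """Prefer groups from the file ACL; otherwise fall back to the directory ACL."""
    return groups_with_read_access(file_acl) or groups_with_read_access(directory_acl)


# Made-up ACL strings: the file has no named group, the directory has one.
file_acl = "user::rw-,group::r--,other::---"
directory_acl = "user::rwx,group::r-x,group:g1:r-x,other::---,default:group:g1:r-x"
print(effective_groups(file_acl, directory_acl))
```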
When running prepdocs.py in "datalake mode" (with AZURE_ADLS_GEN2_STORAGE_ACCOUNT configured), the files still get uploaded to the AZURE_STORAGE_ACCOUNT (using the default container "content"), but the folder structure (category) is removed along with the ACLs. This results in a flat structure without any access control for groups, potentially allowing access to files through filename guessing.
Why do I even need an additional storage sink if the data is already stored in Azure Data Lake Storage? Is it necessary to use the extra storage to display content in the browser UI? Or can I simply use the --skipblobs option and still have documents display by referencing AZURE_ADLS_GEN2_STORAGE_ACCOUNT + AZURE_ADLS_GEN2_FILESYSTEM + AZURE_ADLS_GEN2_FILESYSTEM_PATH + filename?
Additionally, why is the directory structure being discarded when uploading to AZURE_STORAGE_ACCOUNT? With this flat structure and no directories, name duplication issues could arise as more documents are added over time.
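The duplication concern can be shown with a small, self-contained sketch. The two naming functions below are hypothetical and only model the difference between keeping the file name alone and keeping the full relative path; the example paths are invented.

```python
import posixpath


def flat_blob_name(adls_path):
    """Model of the flat layout: keep only the file name."""
    return posixpath.basename(adls_path)


def nested_blob_name(adls_path):
    """Model of a layout that keeps the directory structure in the blob name."""
    return adls_path.strip("/")


# Invented example: two different documents that share a file name.
paths = ["hr/policy.pdf", "finance/policy.pdf"]

flat_names = {flat_blob_name(p) for p in paths}
nested_names = {nested_blob_name(p) for p in paths}

print(len(flat_names), "distinct blob name(s) in the flat layout")
print(len(nested_names), "distinct blob name(s) when the path is kept")
```

With the flat naming, the second upload would target the same blob name as the first, whereas keeping the relative path gives each document its own name.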