Integrated vectorization - sourcefile and storageUrl is null #2279

Open
poonkuzhali opened this issue Jan 16, 2025 · 4 comments

Comments

@poonkuzhali

poonkuzhali commented Jan 16, 2025

I wanted to enable integrated vectorization for data ingestion, so I followed the instructions in the docs (data_ingestion.md):

  1. Deleted the old index
  2. Enabled the feature with azd env set USE_FEATURE_INT_VECTORIZATION true
  3. Ran azd provision
  4. Ran azd deploy
  5. Confirmed that an indexer, an index, and a data source were created

All my documents are present in blob storage.

The indexer ran successfully, but when I used the app the citation URL was wrong: it was not linked to the file in blob storage.

When I looked at the index, the sourcefile and storageUrl fields were null for the indexed documents, which breaks my citations.
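One way to confirm which fields are empty, without inspecting documents one by one, is to count nulls per field over a batch of results. The following is a minimal sketch that works on plain dictionaries; the helper name `count_null_fields` and the sample rows are hypothetical and only stand in for real search results.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_null_fields(docs: Iterable[Mapping[str, object]], fields: list[str]) -> Counter:
    """Count, per field, how many documents have that field missing or None."""
    nulls = Counter({field: 0 for field in fields})
    for doc in docs:
        for field in fields:
            if doc.get(field) is None:
                nulls[field] += 1
    return nulls


# Hypothetical rows standing in for documents returned by the index.
sample = [
    {"sourcepage": "report", "sourcefile": None, "storageUrl": None},
    {"sourcepage": "notes", "sourcefile": None, "storageUrl": None},
]

result = count_null_fields(sample, ["sourcepage", "sourcefile", "storageUrl"])
print(dict(result))
```

With the azure-search-documents package, the rows could instead come from `SearchClient.search(search_text="*", select=[...])`, which yields dictionary-like results; that wiring is left out here so the sketch runs without a search service.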


The citation shows only the file name, without an extension. I am not sure why.

@poonkuzhali
Author

@pamelafox Any thoughts on this issue would be much appreciated. Thank you

@poonkuzhali changed the title from "Integrated vectorization - sourcepage and storageUrl is null" to "Integrated vectorization - sourcefile and storageUrl is null" on Jan 16, 2025
@timwillittes

I can help with the root cause analysis. In the integrated vectorization file (app/backend/prepdocslib/integratedvectorizerstrategy.py), lines 100-102 are where the index field mappings are defined. The sourcefile and storageUrl fields are not mapped there, and I'm not sure whether that is intentional.

selectors=[
    SearchIndexerIndexProjectionSelector(
        target_index_name=index_name,
        parent_key_field_name="parent_id",
        source_context="/document/pages/*",
        mappings=[
            InputFieldMappingEntry(name="content", source="/document/pages/*"),
            InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
            InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
        ],
    ),
],
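The gap described here can be checked mechanically by comparing the fields the app expects with the fields the projection targets. This is only a sketch: `unmapped_fields` is a hypothetical helper, the mappings are copied from the selector above as plain (name, source) pairs, and the list of expected fields is an assumed subset of the index schema.

```python
def unmapped_fields(index_fields, mappings):
    """Return the index fields that no projection mapping targets, in order."""
    mapped = {name for name, _source in mappings}
    return [field for field in index_fields if field not in mapped]


# Mappings from the selector above, written as (name, source) pairs.
current_mappings = [
    ("content", "/document/pages/*"),
    ("embedding", "/document/pages/*/vector"),
    ("sourcepage", "/document/metadata_storage_name"),
]

# Assumed subset of index fields; the real index may define more.
expected_fields = ["content", "embedding", "sourcepage", "sourcefile", "storageUrl"]

missing = unmapped_fields(expected_fields, current_mappings)
print(missing)  # ['sourcefile', 'storageUrl']
```

A field that no mapping targets is never written by the projection, which is consistent with the null values reported in the issue.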

@cforce
Contributor

cforce commented Jan 21, 2025

InputFieldMappingEntry(name="sourcefile", source="/document/metadata_storage_path"),
InputFieldMappingEntry(name="storageUrl", source="/document/metadata_storage_url"),
],

??

@wvdm1217

Hi, I am also having the same issue.

I tried mapping sourcefile and storageUrl in the skills section (the indexProjections block below), but it does not seem to work. I am not sure where to view the metadata these mappings depend on.

    "indexProjections": {
        "selectors": [
            {
                "targetIndexName": "gptkbindex",
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/pages/*",
                "mappings": [
                    {
                        "name": "content",
                        "source": "/document/pages/*",
                        "inputs": []
                    },
                    {
                        "name": "embedding",
                        "source": "/document/pages/*/vector",
                        "inputs": []
                    },
                    {
                        "name": "sourcepage",
                        "source": "/document/metadata_storage_name",
                        "inputs": []
                    },
                    {
                        "name": "sourcefile",
                        "source": "/document/metadata_storage_path",
                        "inputs": []
                    },
                    {
                        "name": "storageUrl",
                        "source": "/document/metadata_storage_url",
                        "inputs": []
                    }
                ]
            }
        ],
        "parameters": {
            "projectionMode": "skipIndexingParentDocuments"
        }
    }
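Since it is unclear where the metadata comes from, it may help to list exactly which document-level metadata properties the projection depends on, so each one can be checked against what the indexer actually emits. The sketch below parses a trimmed copy of the JSON above; `metadata_sources` is a hypothetical helper, and the sketch makes no claim about whether the indexer populates any of these properties.

```python
import json


def metadata_sources(projections: dict) -> dict[str, str]:
    """Map each target field to the metadata property named by its source path."""
    prefix = "/document/"
    found = {}
    for selector in projections.get("selectors", []):
        for mapping in selector.get("mappings", []):
            source = mapping.get("source", "")
            if source.startswith(prefix + "metadata_"):
                found[mapping["name"]] = source[len(prefix):]
    return found


# Trimmed copy of the indexProjections block above.
projections = json.loads("""
{
  "selectors": [{
    "targetIndexName": "gptkbindex",
    "sourceContext": "/document/pages/*",
    "mappings": [
      {"name": "content", "source": "/document/pages/*"},
      {"name": "sourcepage", "source": "/document/metadata_storage_name"},
      {"name": "sourcefile", "source": "/document/metadata_storage_path"},
      {"name": "storageUrl", "source": "/document/metadata_storage_url"}
    ]
  }]
}
""")

sources = metadata_sources(projections)
for field, prop in sources.items():
    print(f"{field} <- {prop}")
```

Each property name printed this way can then be compared with the metadata properties documented for the blob indexer, or with the enriched document if one is available for inspection.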
