Question: How to Replace sourcefile with url in Indexing and Citing for JSON Files? #2211

Open
Dberg042 opened this issue Dec 3, 2024 · 3 comments
Labels: discussion (No repository changes necessarily needed, just a discussion about how to do something), ingestion

Comments

@Dberg042

Dberg042 commented Dec 3, 2024

Hi,

I'm currently developing an internal chatbot project that utilizes our company wiki pages.

I'm working with a JSON file that includes content, id, and a url field holding the source page the content was scraped from.

My goal is to use the url value from the JSON file instead of sourcefile when indexing the data, and to display that source URL in the citations under the chatbot's responses instead of the default sourcefile citation.

  1. How can I modify the indexing process to replace sourcefile with the url field from my JSON files?
  2. What changes are needed to ensure that the chatbot cites the source URL in its responses instead of the sourcefile?

I'd appreciate some help. Before trying this myself, I wanted to ask which approach you would recommend.

Thank you!

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

By the way, thank you for the great work and all the tutorials, @pamelafox!

@pamelafox
Collaborator

There are several ways you could go about this:

  • You could set "sourcefile" and "sourcepage" equal to that URL when you are indexing. That would be in update_content in searchmanager.py.
    However, the URL would presumably be long, so it might end up looking awkward, and it might also lead to less reliable citations from the LLM. If that's the case, you may want to truncate the URLs once they are retrieved, either in the approach code or in the UI.

  • You could set "sourcefile" and "sourcepage" to some other unique value, and then add an additional field for "sourceurl". You can add more fields in searchmanager.py, similar to the code that adds the storageUrl field. Then you'd pull that URL as well, in the search() function in searchmanager.py, and send it back to the frontend. You can then update Answer.tsx/Chat.tsx to render the desired citation.
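
For illustration, the second option could be sketched roughly as below. This is not the repository's actual code: the helper names (shorten_url, build_index_document, citation_for), the "wiki-<id>" naming scheme, and the example URL are all hypothetical, and it assumes each JSON record carries id, content, and url as described in the question. The "sourceurl" field would also have to be added to the index schema, which is not shown here.

```python
import json
from urllib.parse import urlsplit


def shorten_url(url: str, max_len: int = 40) -> str:
    """Return a compact display label (host + path), truncated to max_len."""
    parts = urlsplit(url)
    label = parts.netloc + parts.path.rstrip("/")
    if len(label) <= max_len:
        return label
    return label[: max_len - 3] + "..."


def build_index_document(record: dict) -> dict:
    """Map one scraped wiki record to a search document (hypothetical shape)."""
    short_key = f"wiki-{record['id']}"  # assumed naming scheme, short and unique
    return {
        "id": str(record["id"]),
        "content": record["content"],
        # Short values keep the citations the LLM has to reproduce compact.
        "sourcepage": short_key,
        "sourcefile": short_key,
        # Assumed extra field holding the original page URL.
        "sourceurl": record["url"],
    }


def citation_for(document: dict) -> dict:
    """Build the citation payload a frontend could render as a link."""
    return {
        "key": document["sourcepage"],  # what the LLM cites in its answer
        "url": document["sourceurl"],   # where the rendered link points
        "label": shorten_url(document["sourceurl"]),  # what the user sees
    }


if __name__ == "__main__":
    raw = (
        '[{"id": 42, "content": "How to deploy.", '
        '"url": "https://wiki.example.com/spaces/ENG/pages/12345/Deploying"}]'
    )
    for record in json.loads(raw):
        print(citation_for(build_index_document(record)))
```

Keeping the short key for "sourcepage" and carrying the full URL in a separate field means the LLM only has to repeat a short identifier, while the UI can still link to the original page and show a truncated label.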

pamelafox added the discussion and ingestion labels Dec 11, 2024
@cforce
Contributor

cforce commented Dec 13, 2024

The "original" source URL (not where the document is hosted for this application, but where it came from, when the source is a web-based app that exposes URIs for its documents, so-called deep links) is a much-needed feature.

@Dberg042
Author

Thank you so much for the answers, @pamelafox @cforce. I'll try the suggestions.

3 participants