store crawled research results in a folder, log research topic/follow-up questions+learnings in output file #105

Open: wants to merge 7 commits into base: main

L3Gaunt commented Feb 20, 2025

This PR stores the web pages accessed during research in a downloaded-urls/ subdirectory, which lets a user validate the report by looking at the sources and build a knowledge base that can be consulted for more detail. Filenames are derived from the URLs, sanitized with sanitize-url, plus a timestamp recording when the page was accessed. Each file contains the title, description, URL, accessed-at timestamp, and the markdown content from firecrawl.
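The file layout described above could be sketched roughly as follows. This is an illustration, not the code in this PR: the `CrawledPage` interface, the function names, the header format, and the regex-based `urlToSlug` (a stand-in for the sanitizer the PR uses) are all assumptions.

```typescript
import * as path from "node:path";

// Hypothetical shape of one crawled page; the field names are assumptions
// that mirror the fields listed in the description above.
interface CrawledPage {
  url: string;
  title: string;
  description: string;
  markdown: string;
  accessedAt: Date;
}

// Stand-in sanitizer (assumption): drop the scheme, then replace every run of
// characters other than letters, digits, ".", "_" and "-" with a single "_".
function urlToSlug(url: string): string {
  return url.replace(/^https?:\/\//, "").replace(/[^a-zA-Z0-9._-]+/g, "_");
}

// Build the target path: <dir>/<sanitized url>_<accessed-at timestamp>.md
function pageFilePath(page: CrawledPage, dir = "downloaded-urls"): string {
  // ":" and "." in the ISO timestamp are replaced so the name is valid everywhere.
  const stamp = page.accessedAt.toISOString().replace(/[:.]/g, "-");
  return path.join(dir, `${urlToSlug(page.url)}_${stamp}.md`);
}

// Render the file body: a short metadata header followed by the page markdown.
function renderPage(page: CrawledPage): string {
  return [
    `Title: ${page.title}`,
    `Description: ${page.description}`,
    `URL: ${page.url}`,
    `Accessed at: ${page.accessedAt.toISOString()}`,
    "",
    page.markdown,
  ].join("\n");
}
```

Writing the file would then be a single call along the lines of `fs.writeFile(pageFilePath(page), renderPage(page))` after making sure the directory exists.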

In the near future, I want to implement storing the log of queries, research and learnings as well, so that a user can judge the quality of the research process for themselves and give feedback.

L3Gaunt commented Feb 21, 2025

I changed things as follows:

  • The accessed-at date is no longer put into the filenames of downloaded URLs. I think overwriting web pages with newer versions is usually the desired behavior; someone who really wants version tracking should add git to their knowledge base. In edge cases, though, the mapping of URLs to filenames is no longer 1-to-1.
  • The final report now includes the download locations of the files we get.
  • The output.md file now contains a timestamp, the initial and follow-up questions, and the final learnings. I want to add intermediate learnings too. I think having the option to supervise and judge the quality of what the tool did during the process is important for quality control, and someone who doesn't want to see it can always just scroll past it.
  • Folder and filename are now put together with path.join (so it should work on Windows now...?)
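
To illustrate the first and last points, here is a minimal sketch of timestamp-free filenames joined with `path.join`. The `urlToFileName` helper and its regex are assumptions standing in for the sanitizer actually used, so the exact URLs that collide in the real code may differ.

```typescript
import * as path from "node:path";

// Stand-in sanitizer (assumption): drop the scheme, then replace every run of
// characters other than letters, digits, ".", "_" and "-" with a single "_".
function urlToFileName(url: string): string {
  const slug = url.replace(/^https?:\/\//, "").replace(/[^a-zA-Z0-9._-]+/g, "_");
  return `${slug}.md`;
}

// Without a timestamp, re-downloading the same URL yields the same name,
// so a newer version of the page overwrites the older file.
const first = urlToFileName("https://example.com/a?b");
const second = urlToFileName("https://example.com/a/b");

// Edge case: two different URLs can sanitize to the same file name,
// which is why the URL-to-filename mapping is not 1-to-1.
const collides = first === second;

// path.join uses the platform's separator ("/" on POSIX, "\" on Windows).
const target = path.join("downloaded-urls", first);
console.log(target, collides);
```

If collisions ever matter in practice, appending a short hash of the full URL to the filename would be one way to keep names unique while still overwriting on re-download.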

Feel free to cherry-pick what you like.

L3Gaunt changed the title from "store crawled research results in a folder" to "store crawled research results in a folder, log questions+learnings in output file" on Feb 22, 2025
L3Gaunt changed the title from "store crawled research results in a folder, log questions+learnings in output file" to "store crawled research results in a folder, log research topic/follow-up questions+learnings in output file" on Feb 24, 2025