@shackmann
Contributor

@shackmann shackmann commented Jul 23, 2025

  • notebook to generate a new HF QA dataset from raw text (potentially private)
  • new private dataset in storage
  • experiment based on the new dataset

@shackmann shackmann linked an issue Jul 23, 2025 that may be closed by this pull request
@shackmann shackmann requested review from alex-dr and mhauskn-dr and removed request for alex-dr and mhauskn-dr July 23, 2025 17:50
@shackmann shackmann requested review from alex-dr and mhauskn-dr July 24, 2025 13:19
@shackmann shackmann self-assigned this Jul 24, 2025
syftr/storage.py Outdated
supporting_facts=[],
difficulty="default",
qtype="default",
gold_evidence=[],
Collaborator

The dataset generation script seems to populate this field - we should include it here.

Collaborator

@alex-dr alex-dr left a comment

Mike had some logic for reviewing and filtering generated QA pairs after generation. From what I've seen, we can end up with partially generated answers and similar defects, so it'd be good if we can look up what he was doing and incorporate it into the notebook.
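
For illustration only, a post-generation filter along these lines might look like the sketch below. It is not Mike's logic, which is not shown in this thread, and it does not use the syftr schema: `QAPair`, `looks_complete`, `filter_pairs`, and the list of truncated endings are all hypothetical names and heuristics chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    # Hypothetical container for one generated pair; not the syftr schema.
    question: str
    answer: str


def looks_complete(pair: QAPair, min_answer_chars: int = 3) -> bool:
    """Heuristic check that a generated pair was not cut off mid-generation."""
    question = pair.question.strip()
    answer = pair.answer.strip()
    # Reject empty questions and empty or very short answers.
    if not question or len(answer) < min_answer_chars:
        return False
    # Assumption: a well-formed generated question ends with a question mark.
    if not question.endswith("?"):
        return False
    # Assumption: a trailing connector or ellipsis suggests a truncated answer.
    truncated_endings = (",", ":", ";", "-", "...", " and", " or", " the", " of")
    return not answer.lower().endswith(truncated_endings)


def filter_pairs(pairs: list[QAPair]) -> list[QAPair]:
    """Keep only the pairs that pass the completeness heuristic."""
    return [pair for pair in pairs if looks_complete(pair)]
```

Heuristics like these only catch obvious truncation; a manual review pass over the rejected and borderline pairs in the notebook would still be worthwhile.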

@shackmann shackmann requested a review from alex-dr August 4, 2025 15:01
Collaborator

@alex-dr alex-dr left a comment

Looks okay, but we need to rebase this

Development

Successfully merging this pull request may close these issues.

QA dataset generation

3 participants