Readonly Pipeline #229

Merged 283 commits on Dec 11, 2024

@haohangyan haohangyan commented Oct 20, 2024

This PR contains all the code we need for the readonly pipeline.

Running bash indra_db/readonly_dumping/test_bash.sh starts the pipeline.
An overview of the pipeline:

  • Set up environment variables and database passwords.
  • Get file paths for required initial dump files and verify them.
  • Dump raw statements, reading text content meta, and text refs principal from the principal database.
  • Get all files that will be loaded into the readonly database.
  • Recreate the local database and import data using the readonly_dumping script.
  • (Not tested) Create a local database dump file, upload it to S3, then remove the local file.
  • (Not tested) Upload an end-date file to S3 and restore the dump to a readonly instance using pg_restore.
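
The "verify" step in the overview above can be pictured with a minimal sketch like the following. It is not the code in this PR: the helper name find_unusable_dumps and the "missing"/"empty" labels are hypothetical, and it only assumes that an initial dump file is usable when it exists and is non-empty.

```python
from pathlib import Path


def find_unusable_dumps(paths):
    """Map each unusable dump file path to the reason it cannot be used.

    Hypothetical helper: a file counts as usable only if it exists
    and has a non-zero size. Usable files are left out of the result.
    """
    problems = {}
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            problems[str(path)] = "missing"
        elif path.stat().st_size == 0:
            problems[str(path)] = "empty"
    return problems
```

A wrapper script could stop before touching the local database whenever the returned dictionary is non-empty.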

The pipeline contains two main parts: export_assembly and readonly_dumping.

The main stages of export_assembly include:

  1. Running the knowledgebase pipeline
  2. Running statement distillation
  3. Running preprocessing
  4. Merging processed knowledgebase statements with processed raw statements
  5. Running grounding and deduplication
  6. Calculating refinements
  7. Calculating the belief score
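
Because each stage above consumes the output of the previous one, the ordering matters. A rough sketch of running the stages in sequence might look like this. The stage keys and the run_stages function are illustrative names only, not identifiers from export_assembly, and the handlers are assumed to be zero-argument callables supplied by the caller.

```python
# Stage keys mirror the numbered list above; the names are hypothetical.
EXPORT_ASSEMBLY_STAGES = (
    "knowledgebase_pipeline",
    "statement_distillation",
    "preprocessing",
    "merge_kb_and_raw_statements",
    "grounding_and_deduplication",
    "refinements",
    "belief_scores",
)


def run_stages(handlers, stages=EXPORT_ASSEMBLY_STAGES):
    """Call one handler per stage, strictly in order.

    Raises KeyError before running anything if a stage has no handler,
    so a misconfigured run fails early rather than halfway through.
    """
    missing = [name for name in stages if name not in handlers]
    if missing:
        raise KeyError(f"no handler for stages: {missing}")
    completed = []
    for name in stages:
        handlers[name]()
        completed.append(name)
    return completed
```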

In the readonly_dumping part, we dump the tables into the local Postgres database in the following order: “belief”, “raw_stmt_src”, “reading_ref_link”, “evidence_counts”, “pa_agent_counts”, “mesh_concept_ref_counts”, “mesh_term_ref_counts”, “name_meta”, “text_meta”, “other_meta”, “source_meta”, “agent_interactions”, “fast_raw_pa_link”, “raw_stmt_mesh_concepts”, “raw_stmt_mesh_terms”, “mesh_concept_meta”, “mesh_term_meta”.
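
The load order can be kept as a single ordered constant so that any script iterating over the tables follows it. The sketch below copies the table names from the list above; the constant name and the remaining_tables helper are hypothetical, and resuming a partial load is an assumption for illustration rather than something this PR states it supports.

```python
# Table names in the load order given above.
READONLY_TABLE_ORDER = (
    "belief",
    "raw_stmt_src",
    "reading_ref_link",
    "evidence_counts",
    "pa_agent_counts",
    "mesh_concept_ref_counts",
    "mesh_term_ref_counts",
    "name_meta",
    "text_meta",
    "other_meta",
    "source_meta",
    "agent_interactions",
    "fast_raw_pa_link",
    "raw_stmt_mesh_concepts",
    "raw_stmt_mesh_terms",
    "mesh_concept_meta",
    "mesh_term_meta",
)


def remaining_tables(already_loaded):
    """Return the tables still to be loaded, preserving the load order."""
    loaded = set(already_loaded)
    unknown = loaded - set(READONLY_TABLE_ORDER)
    if unknown:
        raise ValueError(f"unknown tables: {sorted(unknown)}")
    return [name for name in READONLY_TABLE_ORDER if name not in loaded]
```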

Some tasks that need to be completed:

  • Once the final readonly database is generated, the bash file needs a step that uploads it to an S3 bucket. We still need to decide where in S3 the database should be stored.
  • Everything currently runs through test_bash.sh. This will be fixed soon so that we can use readonly_dumping_bash.sh instead.

Depends on sorgerlab/indra#1460.

@kkaris kkaris force-pushed the readonly_dumping_new branch from 6a3c5d0 to ab27a25 Compare November 1, 2024 21:38
@kkaris kkaris self-requested a review November 1, 2024 23:13

@kkaris kkaris left a comment

There are updates needed from both of us, @haohangyan, but in general I think it looks good.

@bgyori bgyori merged commit b3b719a into gyorilab:master Dec 11, 2024
@kkaris kkaris mentioned this pull request Mar 3, 2025