Switch to reference genome set for AmiGO, all APIs, and data products #48

Open
kltm opened this issue Oct 28, 2022 · 10 comments
Labels: Needs LA approval (needs final approval from the Lead Architect), Needs PI, Needs PM approval (needs final approval from the Project Manager), Needs PO, Needs tech doc, Needs TL

Comments

@kltm
Member

kltm commented Oct 28, 2022

Project link

https://github.com/orgs/geneontology/projects/117

Project description

A potential continuation of #82.

We'd like to:

  1. make all species available in AmiGO and the API; this would require changes in the pipeline, mostly for scaling
  2. add QC (currently ontobio) and GPAD output for these products
  3. do full species sorting for all data products; this needs some adjustments to downloads and an announcement

For examining scope, also see:
geneontology/pipeline#246
geneontology/pipeline#204

PI

Chris

Project owner (PO)

Pascale

Technical lead (TL)

Seth

Other personnel (OP)

Seth, Sierra, Dustin, Patrick

Technical specs

https://docs.google.com/document/d/1CU4Zp7t-wTRlcsXlMndIBr2XNUjeBxINdv-O5XPUERQ/edit

Other comments

N/A

@kltm kltm added the Needs LA approval, Needs PI, Needs PM approval, Needs PO, Needs tech doc, and Needs TL labels Oct 28, 2022
@kltm kltm moved this to Hopper in Project Metadata Overview Nov 17, 2022
@kltm kltm moved this from Hopper to Priority (project triage) in Project Metadata Overview Dec 6, 2022
@kltm
Member Author

kltm commented Dec 6, 2022

@pgaudet Added new project; we'd also likely be using myself, Dustin, and/or Sierra.

@kltm
Member Author

kltm commented Dec 6, 2022

Before digging into what we want to do, we should decide the final scope.

@pgaudet pgaudet moved this from Priority (project triage) to Creation (initial requirements document) in Project Metadata Overview Mar 1, 2023
@pgaudet

pgaudet commented Mar 2, 2023

Updates & discussion on the 2023-03-01 Managers call:

  • Current load time ~ 2 days
  • Seth proposes that this becomes an active project
  • may need to get a new machine
  • advantage is that this would be consistent with Panther & allow species-specific downloads
  • Patrick can also work on this
  • Announce enough in advance so that users don't have to maintain old and new files
  • However: pipeline cleanup and API projects should be wrapped up before getting this started

@kltm kltm changed the title from "Change available download data (GAFs/GPADs) to reflect species rather than resource" to "Switch to reference genome set (was change available download data (GAFs/GPADs) to reflect species rather than resource)" Mar 2, 2023
@kltm
Member Author

kltm commented Mar 11, 2023

Following from @pgaudet, I'm going to put some running notes here on things we'll need to tackle as we figure out the exact scope of this project:

  • load time too long; this seems to be about two days; the downloads are a quarter of this; either:
    • upstream fixes this
    • we create our own cache for EBI files (would need to coordinate our tooling as well)
  • new hardware may help speeds and throughput a lot (@kltm already given a green light to purchase)
    • hopefully related to docker timing out as well
  • communications
    • announce enough in advance so that we don't have to maintain old and new files
  • logging sizes are already an issue; reduce logging in the main tools one way or another
  • docker does not seem to be able to handle the very large product as part of tmpfs (check); may need to figure out a non-clobbering raw filesystem for the mega-step
    • this may force us back to the permission leaking issues we had before using tmpfs
  • the stats generation has a lot of hard-coded values and will need to be fixed up for this
  • sparta-report.json generation fails (check in on exact reasons)
  • if we include species reorientation:
    • downloads pages will need to be updated
    • we'll need a shuffling script at the end for GPAD/GPI and GAF outputs
  • suppress and document IEA filtering (see go-site branch)
  • memory increases required; need new machine (above) or fewer runners

@kltm
Member Author

kltm commented Aug 30, 2023

@pgaudet Talking with @cmungall: to make better progress with smaller incremental steps, maybe we could unbundle the initial task of making the "142" available as GAF 2.2s.

@cmungall
Member

cmungall commented Aug 30, 2023

Proposed new scope for this project:

  • the end-goal is to have the GAF download page (http://current.geneontology.org/products/pages/downloads.html) have the 142 reference species, broken down by species, available as GAFs
  • the non-core GAFs will not go through the same QC process pipeline as the core (MOD + goa human/cow/etc)
  • the implementation will be simply to take the filtered file provided by Alex (gcrp entries for 142 species; ftp://ftp.ebi.ac.uk/pub/contrib/goa/go_reference_species.gaf.gz) and bin these into separate files, one per species
  • a single static HTML file will be produced
  • The UI implementation will abandon the paginated table and be a simple table with 142 rows in one page
  • The table should be sortable?
  • By default the sort order should prioritize the main curated species
  • the columns will be
    • species
    • count
    • link to download
    • potentially: primary source (ebigoa, mgi, ...)
  • there will not be separate rna/isoform files; these will be merged in. One file per species
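
The binning step described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `bin_gaf_by_species` and the `taxon_<id>.gaf.gz` output naming are assumptions made for the example. It relies only on the GAF conventions that header lines start with `!` and that column 13 holds the taxon (`taxon:<id>`, optionally followed by `|taxon:<id>` for an interacting organism).

```python
import gzip
from collections import Counter
from pathlib import Path

# GAF column 13 (0-based index 12) holds the taxon of the annotated gene product.
TAXON_COLUMN = 12


def bin_gaf_by_species(source_gaf, out_dir):
    """Split one gzipped GAF into one gzipped GAF per taxon.

    Returns a Counter mapping taxon ID (as a string) to annotation count.
    The function name and output file naming are hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    header = []      # "!" lines seen so far; copied into each new output file
    handles = {}     # taxon ID -> open output file
    counts = Counter()
    try:
        with gzip.open(source_gaf, "rt", encoding="utf-8") as src:
            for line in src:
                if line.startswith("!"):
                    header.append(line)
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) <= TAXON_COLUMN:
                    continue  # skip blank or malformed lines
                # Only the first taxon identifies the annotated species.
                taxon = fields[TAXON_COLUMN].split("|")[0].replace("taxon:", "")
                if taxon not in handles:
                    path = out_dir / f"taxon_{taxon}.gaf.gz"
                    handles[taxon] = gzip.open(path, "wt", encoding="utf-8")
                    handles[taxon].writelines(header)
                handles[taxon].write(line)
                counts[taxon] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

The returned counts could feed the "count" column of the table. The sketch keeps one output file open per species, which should be fine for roughly 142 files but would need rethinking for a much larger set.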

@kltm
Member Author

kltm commented Aug 30, 2023

@pgaudet, after more talk w/Chris: assuming we're all on the same page about this, I'd probably try to push through on this myself (spinning it out as a separate project first).

@pgaudet

pgaudet commented Aug 30, 2023

Sounds good to me.

One suggestion (if doable) would be to have two tables, one for the MODs and one for all others, since 142 (in fact it's now 143) species is a long list to scroll through.

Thanks, Pascale
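
For illustration, the two-table layout suggested here could be produced as a single static HTML file along these lines. This is only a sketch under assumptions: the function names, the section titles, and the shape of the input rows (`(species, count, url)` tuples) are invented for the example, and the list of "main curated" species would have to come from the project.

```python
from html import escape


def render_table(title, rows):
    """Render one HTML table from (species, count, url) tuples."""
    body = "\n".join(
        f"<tr><td>{escape(species)}</td><td>{count}</td>"
        f'<td><a href="{escape(url, quote=True)}">download</a></td></tr>'
        for species, count, url in rows
    )
    return (
        f"<h2>{escape(title)}</h2>\n<table>\n"
        "<tr><th>Species</th><th>Count</th><th>Download</th></tr>\n"
        f"{body}\n</table>"
    )


def render_downloads_page(rows, curated_species):
    """Build a static page with curated species first, then all others.

    curated_species is an ordered list of species names (hypothetical input);
    its order is preserved in the first table, and the remaining species
    are sorted alphabetically in the second.
    """
    priority = {name: i for i, name in enumerate(curated_species)}
    curated = sorted((r for r in rows if r[0] in priority),
                     key=lambda r: priority[r[0]])
    others = sorted((r for r in rows if r[0] not in priority),
                    key=lambda r: r[0])
    return "\n".join([
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8"><title>GAF downloads</title></head><body>',
        render_table("Main curated species", curated),
        render_table("Other reference species", others),
        "</body></html>",
    ])
```

Splitting by a priority list also covers the "prioritize the main curated species" default ordering from the proposed scope, without needing client-side sorting.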

@kltm
Member Author

kltm commented Aug 30, 2023

@pgaudet @cmungall Re-scoping this as the "strong" version; the narrow version discussed above is now #82

@kltm kltm changed the title from "Switch to reference genome set (was change available download data (GAFs/GPADs) to reflect species rather than resource)" to "Switch to reference genome set for AmiGO, all APIs, and data products" Aug 30, 2023
@kltm
Member Author

kltm commented Aug 30, 2023

Moving this back to "priority", as #82 will meet some initial desirable targets and I'm unsure whether the rest is on the table immediately.

@kltm kltm moved this from Creation (initial requirements document) to Priority (project triage) in Project Metadata Overview Aug 30, 2023
Projects
Status: Priority (project triage)

3 participants