Stable release pre-preprint

@cjcarlson cjcarlson released this 02 Aug 00:33
· 221 commits to main since this release

This release contains ~6 months of new data from GenBank, which is now pulled from an FTP server as part of the pipeline. There have also been:

  • small improvements to the taxonomy pipeline, including handling of "cf." names
  • some small cleanup of the PREDICT dataset
  • metadata tidying including a full removal of the SRA-specific backbones
  • several bug fixes (e.g., issues with sparse columns and vroom column identification led to data overwrites)
  • a massive reduction in the dataset dimensions by collapsing NCBIAccession (much needed given SARS-CoV-2 influx)
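
As a rough illustration of what collapsing NCBIAccession can look like, the sketch below merges rows that are identical in every other column and joins their accessions into one delimited field. It is not the pipeline's actual code: the function name `collapse_accessions`, the `Host` and `Virus` columns, the placeholder accession values, and the `", "` separator are all assumptions made for this example.

```python
def collapse_accessions(rows, key="NCBIAccession", sep=", "):
    """Merge rows that match on every column except `key` (hypothetical helper)."""
    grouped = {}
    for row in rows:
        # Identify a group by all column/value pairs other than the accession.
        group_id = tuple((col, val) for col, val in sorted(row.items()) if col != key)
        grouped.setdefault(group_id, []).append(row[key])

    collapsed = []
    for group_id, accessions in grouped.items():
        merged = dict(group_id)
        # dict.fromkeys removes duplicate accessions while keeping first-seen order.
        merged[key] = sep.join(dict.fromkeys(accessions))
        collapsed.append(merged)
    return collapsed


# Placeholder records; the accession strings are made up for illustration.
rows = [
    {"Host": "homo sapiens", "Virus": "sars-cov-2", "NCBIAccession": "ACC001"},
    {"Host": "homo sapiens", "Virus": "sars-cov-2", "NCBIAccession": "ACC002"},
    {"Host": "rhinolophus affinis", "Virus": "betacoronavirus sp.", "NCBIAccession": "ACC003"},
]
collapsed = collapse_accessions(rows)
```

With many near-identical SARS-CoV-2 records, this kind of grouping shrinks the row count to one row per distinct combination of the remaining columns while keeping every accession recoverable from the joined field.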