Stable release pre-preprint
This release contains ~6 months of new data from GenBank, which the pipeline now retrieves from an FTP server. There have also been:
- small improvements to the taxonomy pipeline, including handling of "cf." names
- some small cleanup of the PREDICT dataset
- metadata tidying including a full removal of the SRA-specific backbones
- several bug fixes (e.g., issues with sparse columns and vroom column identification that led to data overwrites)
- a massive reduction in the dataset dimensions by collapsing NCBIAccession (much needed given the SARS-CoV-2 influx)
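
For readers unfamiliar with what collapsing NCBIAccession means in practice, here is a minimal sketch of the general idea, not the pipeline's actual code. It assumes that rows identical in every other field are merged into one row whose NCBIAccession holds the joined accession values. The function name `collapse_accessions`, the `Host` and `Virus` key fields, the `", "` separator, and the accession values are all hypothetical and chosen only for illustration.

```python
def collapse_accessions(rows, key_fields, accession_field="NCBIAccession", sep=", "):
    """Merge rows sharing the same key fields into one row per key.

    Hypothetical helper: the accession values of merged rows are
    de-duplicated (keeping first-seen order) and joined with `sep`.
    """
    grouped = {}
    for row in rows:
        # The grouping key is the tuple of all non-accession fields of interest.
        key = tuple(row.get(field) for field in key_fields)
        accessions = grouped.setdefault(key, [])
        accession = row.get(accession_field)
        # Skip missing accessions and exact repeats.
        if accession and accession not in accessions:
            accessions.append(accession)

    collapsed = []
    for key, accessions in grouped.items():
        record = dict(zip(key_fields, key))
        record[accession_field] = sep.join(accessions)
        collapsed.append(record)
    return collapsed


# Made-up example rows; the accession strings are placeholders, not real IDs.
rows = [
    {"Host": "homo sapiens", "Virus": "sars-cov-2", "NCBIAccession": "ACC0001"},
    {"Host": "homo sapiens", "Virus": "sars-cov-2", "NCBIAccession": "ACC0002"},
    {"Host": "felis catus", "Virus": "sars-cov-2", "NCBIAccession": "ACC0003"},
]
collapsed = collapse_accessions(rows, ["Host", "Virus"])
for record in collapsed:
    print(record)
```

With one row per accession, a heavily sequenced virus contributes one row for every sequence; after collapsing, it contributes one row per distinct combination of the remaining fields, which is where the reduction in size comes from.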