Agree on outline #2

hammer · 2020-12-17T15:02:02Z

@eric-czech has an initial proposal at https://github.com/pystatgen/sgkit-publication/blob/main/content/01.outline.md.

jeromekelleher · 2020-12-17T16:07:17Z

Yep, overall narrative sounds good to me. We should break things down into some sections, I guess, and then sketch out the outlines of each of these sections as separate issues?

This is also determined by the journal and article format choice (#1), as some journals have pretty weird sectioning requirements.

alimanfoo · 2022-11-07T16:56:50Z

Here's another very rough possible starting point for an outline, similar to Eric's: https://docs.google.com/document/d/1TKSS--28ErsGjdyH6AHXnAcB7ngwCMXm5JK-u0dA_3c/edit?usp=sharing

hammer · 2022-11-15T21:38:58Z

eLife recommends the following outline:

Introduction
Results
Discussion
Methods
Acknowledgements
References
Figures
Tables

hammer · 2022-11-21T15:02:35Z

Pulling some previous discussions that could be useful here:

jeromekelleher · 2022-12-05T14:54:08Z

Here's an outline that @tomwhite, @benjeffery and I came up with last week:

Introduction
- What is Pydata?
- Columnar binary data (zarr)
- Distributed
- numba
- Other related technologies that we're using
Results
Discussion
- Light recap, picking up any meta-points that we haven't made through the rest of the paper.

Then within results we have

PyData for genomics
- sgkit design principles
- Overview of data structures, etc
- Discussion of basic performance characteristics, illustrating that the general strategy scales well in terms of single-threaded compute performance and space utilisation.
Population Genetics (Popgen use-case #8)
Statistical Genetics
Quantitative Genetics (Quantgen use-case #9)
Software Development (Better name needed?)
- Reimplementing REGENIE and gene-e (quick comparison of LoC and rough performance numbers)
- Extending sgkit's zarr on-disk structures in tsinfer.
- Important to stress the point that stuff doesn't need to be in sgkit to benefit from sgkit. You can implement your own methods outside sgkit using the tools and are in no way obliged to contribute stuff into the repo.
Scaling to large datasets
- GPU pairwise distance example (but, could make this an example for a Phylogenetics section also)
- Scaling out with Dask (can refer to Liangde’s thesis/paper?)

What do we think? The first section (pydata for genomics) gets directly to the point of discussing sgkit's design principles and data structures, letting the intro set the scene of the software infrastructure around us.

In terms of display items, we would refer to the Scaling and compute (#7) in the pydata for genomics section, plus the . We probably don't need display items for the rest of the paper.

The PopGen, StatGen, QuantGen (and PhyloGen?) sections are a way to allow readers interested in just those areas to skip in and see what sort of things sgkit can do, without having to trudge through API listings. We want to give one (or two) concrete examples showing useful things being done, giving indicative performance figures without getting bogged down in direct performance comparisons. It also gives us a space to quickly discuss the tools that people use and illustrate how fragmented the ecosystem is.

jeromekelleher · 2022-12-05T14:54:55Z

If we roughly agree on this outline I can make some more issues to track the different sections, and sketch out what we want to say in them.

hammer · 2022-12-07T15:36:28Z

Looks great to me, thanks for moving this forward!

hammer · 2022-12-07T15:38:16Z

I should have asked: how do we define stat, pop, and quant gen? I generally think of pop gen as variation without phenotype and stat gen as variation with phenotype. I’m not sure where that leaves quant, perhaps as the union of the two? If so, do we need to rename to qgkit?

jeromekelleher · 2022-12-07T17:39:50Z

There probably isn't a good definition, but we can just do something pragmatic based on the user communities. PopGen people are mostly interested in evolutionary biology itself, StatGen people mostly in applications to humans, and QuantGen people mostly in applications to agriculture.

The tools they use are mostly nonoverlapping sets I think.

eric-czech · 2023-02-28T14:21:39Z

> Introduction

Should this include a mention of trends in python adoption? And/or why this is an important tailwind to ride given AI progress?

> We want to give one (or two) concrete examples showing useful things being done

FWIW on the StatGen piece, I think #9 is a good template for that. I also think that would probably be a good place to touch on the potential power and relatively nascent state of pathway GWAS (gene-e), GWAS/ExWAS methods in general (e.g. REGENIE), some of the QC ops necessary to get there (HWE, pruning, filtering) and general purpose operations like those for creating LD matrices and kinship coefficients (pc-relate).

@jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it. I'm not sure how this interacts with the Software Development section though -- perhaps you have some thoughts there?

jeromekelleher · 2023-03-02T09:21:11Z

> @jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it.

@eric-czech please do go ahead and create an issue to sketch out your thoughts on StatGen. Don't worry too much about how things fit into the overall structure, just get the key points that you think should get in there down in some form, and I'll bring it together into the document.

jeromekelleher · 2023-12-22T10:19:10Z

I'm going to close this as out-of-date now.

jeromekelleher closed this as completed Dec 22, 2023
