Agree on outline #2

Closed
hammer opened this issue Dec 17, 2020 · 12 comments

Comments

@hammer

hammer commented Dec 17, 2020

@eric-czech has an initial proposal at https://github.com/pystatgen/sgkit-publication/blob/main/content/01.outline.md.

@jeromekelleher
Collaborator

Yep, overall narrative sounds good to me. We should break things down into some sections, I guess, and then sketch out the outlines of each of these sections as separate issues?

This is also determined by the journal and article format choice (#1), as some journals have pretty weird sectioning requirements.

@alimanfoo
Collaborator

Here's another very rough possible starting point for an outline, similar to Eric's: https://docs.google.com/document/d/1TKSS--28ErsGjdyH6AHXnAcB7ngwCMXm5JK-u0dA_3c/edit?usp=sharing

@hammer
Author

hammer commented Nov 15, 2022

eLife recommends the following outline:

  • Introduction
  • Results
  • Discussion
  • Methods
  • Acknowledgements
  • References
  • Figures
  • Tables

@hammer
Author

hammer commented Nov 21, 2022

@jeromekelleher
Collaborator

Here's an outline that @tomwhite, @benjeffery and I came up with last week:

  1. Introduction
    • What is PyData?
    • Columnar binary data (zarr)
    • Distributed
    • numba
    • Other related technologies that we're using
  2. Results
  3. Discussion
    • Light recap, picking up any meta-points that we haven't made through the rest of the paper.

Then within results we have

  • PyData for genomics
    • sgkit design principles
    • Overview of data structures, etc
    • Discussion of basic performance characteristics, illustrating that the general strategy scales well in terms of single-threaded compute performance and space utilisation.
  • Population Genetics (Popgen use-case #8)
  • Statistical Genetics
  • Quantitative Genetics (Quantgen use-case #9)
  • Software Development (Better name needed?)
    • Reimplementing REGENIE and gene-e (quick comparison of LoC and rough performance numbers)
    • Extending sgkit's zarr on-disk structures in tsinfer.
    • Important to stress the point that stuff doesn't need to be in sgkit to benefit from sgkit. You can implement your own methods outside sgkit using the tools and are in no way obliged to contribute stuff into the repo.
  • Scaling to large datasets
    • GPU pairwise distance example (but, could make this an example for a Phylogenetics section also)
    • Scaling out with Dask (can refer to Liangde’s thesis/paper?)

What do we think? The first section (pydata for genomics) gets directly to the point of discussing sgkit's design principles and data structures, letting the intro set the scene of the software infrastructure around us.

In terms of display items, we would refer to the Scaling and compute (#7) in the pydata for genomics section, plus the . We probably don't need display items for the rest of the paper.

The PopGen, StatGen, QuantGen (and PhyloGen?) sections are a way to allow readers interested in just those areas to skip in and see what sort of things sgkit can do, without having to trudge through API listings. We want to give one (or two) concrete examples showing useful things being done, giving indicative performance figures without getting bogged down in direct performance comparisons. It also gives us a space to quickly discuss the tools that people use and illustrate how fragmented the ecosystem is.

@jeromekelleher
Collaborator

If we roughly agree on this outline I can make some more issues to track the different sections, and sketch out what we want to say in them.

@hammer
Author

hammer commented Dec 7, 2022

Looks great to me, thanks for moving this forward!

@hammer
Author

hammer commented Dec 7, 2022

I should have asked: how do we define stat, pop, and quant gen? I generally think of pop gen as variation without phenotype and stat gen as variation with phenotype. I’m not sure where that leaves quant, perhaps as the union of the two? If so, do we need to rename to qgkit?

@jeromekelleher
Collaborator

There probably isn't a good definition, but we can just do something pragmatic based on the user communities. PopGen people are mostly interested in evolutionary biology itself, StatGen people mostly in applications to humans, and QuantGen people mostly in applications to agriculture.

The tools they use are mostly nonoverlapping sets I think.

@eric-czech
Collaborator

> Introduction

Should this include a mention of trends in Python adoption? And/or why this is an important tailwind to ride given AI progress?

> We want to give one (or two) concrete examples showing useful things being done

FWIW on the StatGen piece, I think #9 is a good template for that. I also think that would probably be a good place to touch on the potential power and relatively nascent state of pathway GWAS (gene-e), GWAS/ExWAS methods in general (e.g. REGENIE), some of the QC ops necessary to get there (HWE, pruning, filtering) and general purpose operations like those for creating LD matrices and kinship coefficients (pc-relate).

@jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it. I'm not sure how this interacts with the Software Development section though -- perhaps you have some thoughts there?

@jeromekelleher
Collaborator

> @jeromekelleher I could outline some of those in more detail in a StatGen specific issue at some point if you or someone else (@hammer perhaps?) hasn't already done anything related to it.

@eric-czech please do go ahead and create an issue to sketch out your thoughts on StatGen. Don't worry too much about how things fit into the overall structure, just get the key points that you think should get in there down in some form, and I'll bring it together into the document.

@jeromekelleher
Collaborator

I'm going to close this as out-of-date now.


4 participants