Is long format preferable for the environmental data matrix? Should the example data set dnam_ex be transposed? #17

Open
ellisifnygaard opened this issue Apr 17, 2023 · 1 comment
Labels
question Further information is requested

Comments

@ellisifnygaard
Collaborator

When it comes to the matrix containing the environmental data, are long data matrices more favourable than wide matrices?

I'm guessing that in EWAS data sets with DNAm data, the number of individuals/samples will usually be smaller than the number of probes/CpGs. If this is correct, and if long data really is less demanding to process than wide data, it would make sense as a rule to let the rows of the environmental data matrix represent the probe IDs. (I don't know whether this "long vs. wide" rule applies in HaplinMethyl or to ffdata; let me know if I'm wrong 😊.)

In dnam_ex, the row names = the individual/sample IDs and the column names = the probe/CpG IDs (according to the vignettes). Should we transpose dnam_ex so that the rows represent the probe IDs instead?

The example data set is not large enough for this to make a difference in practice, but perhaps we should change it for illustrative purposes?

It might also be a good idea to explicitly mention in the package documentation and the vignettes that you can have

  • rows = CpGs & columns = sample IDs, or
  • columns = CpGs & rows = sample IDs,

and offer some pointers regarding which of the formats users should use.
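
To make the two layouts concrete, here is a minimal base R sketch. It uses a small made-up matrix: the object `dnam_toy` and its IDs are hypothetical and only stand in for `dnam_ex`, whose actual structure is not reproduced here. It assumes an ordinary in-memory matrix; ff-backed objects may need a different approach.

```r
# Hypothetical toy data: 3 samples x 5 CpGs (not the real dnam_ex)
set.seed(1)
dnam_toy <- matrix(
  runif(15),
  nrow = 3,
  dimnames = list(paste0("id", 1:3), paste0("cg", 1:5))
)

# Layout A: rows = sample IDs, columns = CpG IDs
dim(dnam_toy)

# Layout B: rows = CpG IDs, columns = sample IDs
dnam_toy_t <- t(dnam_toy)
dim(dnam_toy_t)

# The same value is addressed with swapped indices
identical(dnam_toy["id2", "cg4"], dnam_toy_t["cg4", "id2"])
```

Both objects hold the same values; only the orientation, and therefore the indexing order, changes.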

@jromanowska
Owner

Good question! I'm not sure which is better. The ff package does not allow a very large number of columns, which is why the implementation uses a list of ff matrices. That's one point in favour of your idea of pivoting. However, in the original Haplin (and in PLINK genetic datasets), the data are stored with samples as rows. I am not sure which is more common for environmental or epigenetic data; have you checked? I have seen both representations.
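
The "list of matrices" idea can be illustrated in plain base R, without ff. The sketch below splits a wide matrix into column blocks and binds them back together. The object names and the chunk size `max_cols` are made up for illustration, and this is not HaplinMethyl's actual implementation.

```r
# Hypothetical wide matrix: 4 samples x 10 CpGs
wide <- matrix(
  seq_len(40),
  nrow = 4,
  dimnames = list(paste0("id", 1:4), paste0("cg", 1:10))
)

# Assumed maximum number of columns per chunk (illustrative only)
max_cols <- 4

# Assign each column index to a chunk number
chunk_id <- ceiling(seq_len(ncol(wide)) / max_cols)

# Split the column indices by chunk and subset the matrix accordingly
chunks <- lapply(
  split(seq_len(ncol(wide)), chunk_id),
  function(cols) wide[, cols, drop = FALSE]
)

# Number of columns held by each chunk
vapply(chunks, ncol, integer(1))

# Binding the chunks back together should recover the original matrix
all.equal(do.call(cbind, unname(chunks)), wide)
```

With samples as rows, the number of chunks grows with the number of CpGs; with CpGs as rows, the column count is bounded by the number of samples instead.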
