Is long format preferable for the environmental data matrix? Should the example data set `dnam_ex` be transposed? #17

ellisifnygaard · 2023-04-17T17:49:40Z

For the matrix containing the environmental data, is a long matrix (more rows than columns) preferable to a wide one?

I'm guessing that in EWAS data sets with DNAm data, the number of individuals/samples will usually be smaller than the number of probes/CpGs. If that is correct, and if long data is in fact less demanding to process than wide data, it would make sense as a rule to let the rows of the environmental data matrix represent the probe IDs. (I don't know whether this "long vs. wide" rule applies in HaplinMethyl or to ff data; let me know if I'm wrong 😊).

In dnam_ex, the row names = the individual/sample IDs and the column names = the probe/CpG IDs (according to the vignettes). Should we transpose dnam_ex so that the rows represent the probe IDs instead?

The example data set is not large enough for this to make a difference in practice, but perhaps we should change it for illustrative purposes?

It might also be a good idea to explicitly mention in the package documentation and the vignettes that you can have

- rows = CpGs & columns = sample IDs, or
- columns = CpGs & rows = sample IDs,

and offer some pointers regarding which of the formats users should use.
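To make the two orientations concrete, here is a minimal sketch in base R. It uses a small toy matrix as a stand-in (the names `samples_by_cpgs` and `cpgs_by_samples` are made up for illustration, and this is not the real `dnam_ex` object or its storage class), so it only shows what transposing would do to the row and column names.

```r
# Toy stand-in for a DNAm matrix; NOT the real dnam_ex object.
set.seed(1)
samples_by_cpgs <- matrix(
  runif(3 * 5),
  nrow = 3,
  dimnames = list(paste0("id", 1:3), paste0("cg", 1:5))
)

# Orientation used in the vignettes: rows = sample IDs, columns = CpG IDs
dim(samples_by_cpgs)

# Transposed orientation: rows = CpG IDs, columns = sample IDs
cpgs_by_samples <- t(samples_by_cpgs)
dim(cpgs_by_samples)

# The values are unchanged; only the indexing order is swapped
identical(samples_by_cpgs["id2", "cg4"], cpgs_by_samples["cg4", "id2"])
```

Whether one orientation is actually cheaper to process for ff-backed data is exactly the open question here; the sketch does not answer it.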


jromanowska · 2023-04-18T11:10:51Z

Good question! I'm not sure which is better. The ff package does not allow a very large number of columns, which is why the implementation uses a list of ff matrices. That's one point in favour of your idea of pivoting. However, in the original Haplin (and in PLINK genetic data sets), the data are stored with samples as rows. I am not sure which is more common for environmental or epigenetic data; have you checked? I have seen both representations.
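As a rough illustration of the "list of matrices" idea, the sketch below splits a wide matrix into column chunks using plain base R matrices. This is an assumption-laden stand-in, not the actual HaplinMethyl implementation: `chunk_size`, `wide`, and `chunks` are hypothetical names, and real ff matrices would be file-backed rather than held in memory.

```r
# Hypothetical chunk size; the real limit depends on ff and the data.
chunk_size <- 4

# Toy wide matrix: 3 samples (rows) x 10 CpGs (columns)
wide <- matrix(
  seq_len(3 * 10),
  nrow = 3,
  dimnames = list(paste0("id", 1:3), paste0("cg", 1:10))
)

# Assign each column index to a chunk, then cut the matrix column-wise
col_groups <- split(
  seq_len(ncol(wide)),
  ceiling(seq_len(ncol(wide)) / chunk_size)
)
chunks <- lapply(col_groups, function(cols) wide[, cols, drop = FALSE])

# Number of columns in each chunk
vapply(chunks, ncol, integer(1))

# Binding the chunks back together recovers the original matrix
reassembled <- do.call(cbind, unname(chunks))
identical(reassembled, wide)
```

With CpGs as rows instead, the same data would need chunking only if the number of samples became very large, which is the point in favour of pivoting mentioned above.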

ellisifnygaard added the `question` label (Further information is requested) · Apr 17, 2023
