Step away from modifying column names #158

alexg9010 · 2023-02-24T10:46:24Z

The dedupe_sigmut_mat function searches for variants that share the same signature mutations, with the goal of clustering them.
It transforms a matrix of dimension [nAllVar x nMut] into a matrix of dimension [nSharedMutVar x nMut].
The column names of the resulting matrix are the pasted-together names of the variants that share a signature.

While this works for a few variants with highly distinct mutations, it may cause trouble when more variants with less variability in their mutations are introduced, or when there is too little data. Either case would lead to a larger number of similar variants and potentially very long column names, as recently seen here.

To solve this potential issue, I suggest the following:

Instead of returning a modified matrix, it might be easier for dedupe_sigmut_mat to directly return the group_list (aka lineage) defined at

pigx_sars-cov-2/scripts/deconvolution_funs.R

Line 198 in 485f7df

dupe_group_list <- list()

The caller could then iterate over this list, subsetting the original matrix in each step of the iteration. This would potentially simplify the code.
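
A minimal sketch of what this could look like, not the current implementation in deconvolution_funs.R. It assumes variants are the columns of the signature matrix and mutations are the rows. The helper name group_by_signature and the toy data are hypothetical and only serve as illustration:

```r
# Hypothetical toy signature matrix: rows are mutations, columns are variants.
# This orientation is an assumption made for the sketch.
# matrix() fills column by column, so var1 and var2 get the same signature.
sigmut_mat <- matrix(
  c(1, 0, 1,
    1, 0, 1,
    0, 1, 0,
    1, 1, 0),
  nrow = 3,
  dimnames = list(
    c("mutA", "mutB", "mutC"),
    c("var1", "var2", "var3", "var4")
  )
)

# Hypothetical helper: return a list of variant-name vectors, one per distinct
# signature, instead of a matrix with pasted column names.
group_by_signature <- function(sigmut_mat) {
  # Build one string key per variant column, e.g. "1_0_1".
  sig_keys <- apply(sigmut_mat, 2, paste, collapse = "_")
  # Group the variant names by key, keeping the order of first appearance.
  groups <- split(
    colnames(sigmut_mat),
    factor(sig_keys, levels = unique(sig_keys))
  )
  # Drop the key labels so the result is a plain list of name vectors.
  unname(groups)
}

dupe_group_list <- group_by_signature(sigmut_mat)

# Iterate over the groups and subset the original, unmodified matrix.
for (group in dupe_group_list) {
  # drop = FALSE keeps a matrix even for single-variant groups.
  group_mat <- sigmut_mat[, group, drop = FALSE]
  # All columns within a group are identical, so the first one represents it.
  group_signature <- group_mat[, 1]
  message(length(group), " variant(s): ", paste(group, collapse = ", "))
}
```

With this approach the column names of the original matrix stay untouched, and the group membership lives in the list rather than in a concatenated name.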
