Correcting cell-level p-values for multiple comparisons? #94

Open
schroeme opened this issue Aug 5, 2024 · 3 comments

Comments

@schroeme

schroeme commented Aug 5, 2024

Hi, thanks for a great package! I am working with a brain snRNAseq dataset and have run scDRS to test for the enrichment of MDD, ADHD, ALZ, MS, SCZ, and height GWAS hits (using the MAGMA scores from your original publication). For the cell-level MC p-values, is it appropriate to use a cutoff of 0.05 to say something like, X number of cells were significantly associated with X disease? Or should I be doing a B-H p-value correction based on the number of cells (i.e. total number of p-values computed)?

I also ran the group-level downstream analysis and found that very few cell types were significantly associated (FDR < 0.1; as plotted here: https://martinjzhang.github.io/scDRS/notebooks/quickstart.html) with these traits, despite prior studies (including your original paper), showing that many more should be. Any thoughts on this? Is this because of what you noted in the discussion section of the paper?: "Second, the fact that scDRS assesses the statistical significance of an individual cell’s association to disease by implicitly comparing it to other cells via matched control genes may reduce power if most cells in the data are truly causal."

Many thanks,
Margaret

@HelloWorldLTY
Contributor

That is a really good question. I think whether or not to apply a correction really depends on the relative cost of false positives versus false negatives. A B-H correction reduces false positives at the cost of missing some true signals. In this case, I think the cost of missing a truly important cell type for a disease is larger than the cost of accepting a risky one, so I think it is OK to use the current p-value setting. https://stats.libretexts.org/Bookshelves/Applied_Statistics/Biological_Statistics_(McDonald)/06%3A_Multiple_Tests/6.01%3A_Multiple_Comparisons

For the second point, I am considering improving it with more atlas-level datasets 🤔️.

@martinjzhang
Owner

martinjzhang commented Aug 14, 2024

> For the cell-level MC p-values, is it appropriate to use a cutoff of 0.05 to say something like, X number of cells were significantly associated with X disease? Or should I be doing a B-H p-value correction based on the number of cells (i.e. total number of p-values computed)?

I recommend always using FDR control. Detecting cells based on p<0.05 will give you a lot of false positives and is against the statistical principles of hypothesis testing. If it is very underpowered, consider increasing the FDR threshold, e.g., to 0.2.
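
To make the difference concrete, here is a minimal sketch of Benjamini-Hochberg (B-H) adjustment applied to a vector of cell-level p-values. It is an illustration only, not the scDRS implementation: the helper name `bh_adjust` and the synthetic p-values (1,000 evenly spread null cells plus 50 cells with a small p-value) are assumptions made for the example. With real output you would pass in the cell-level p-value column from the scDRS score file instead.

```python
import numpy as np


def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


# Synthetic data (an assumption for illustration, not real scDRS output):
# 1,000 null cells with evenly spread p-values and 50 cells with p = 1e-4.
null_p = (np.arange(1000) + 0.5) / 1000
signal_p = np.full(50, 1e-4)
pvals = np.concatenate([signal_p, null_p])

qvals = bh_adjust(pvals)

n_raw = int((pvals < 0.05).sum())    # cells passing an uncorrected cutoff
n_fdr10 = int((qvals < 0.1).sum())   # cells passing FDR < 0.1
n_fdr20 = int((qvals < 0.2).sum())   # cells passing the relaxed FDR < 0.2

print(n_raw, n_fdr10, n_fdr20)
```

In this toy setup the uncorrected cutoff admits many of the null cells, while the B-H thresholds keep the 50 signal cells and admit only a few nulls. Relaxing the threshold from 0.1 to 0.2 recovers a few more cells at the price of a higher expected share of false discoveries.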

@martinjzhang
Owner

martinjzhang commented Aug 14, 2024

> I also ran the group-level downstream analysis and found that very few cell types were significantly associated (FDR < 0.1; as plotted here: https://martinjzhang.github.io/scDRS/notebooks/quickstart.html) with these traits, despite prior studies (including your original paper), showing that many more should be. Any thoughts on this? Is this because of what you noted in the discussion section of the paper?: "Second, the fact that scDRS assesses the statistical significance of an individual cell’s association to disease by implicitly comparing it to other cells via matched control genes may reduce power if most cells in the data are truly causal."

Yes, this may indeed be the reason: scDRS may be underpowered in this setting. Again, consider increasing the FDR threshold.

Also, consider imputing the data using MAGIC before applying scDRS, a procedure discussed in #32. This seems to be a good workaround for the low power issue, as documented in a recent paper: https://www.biorxiv.org/content/10.1101/2024.02.05.579042v1.abstract

Moreover, we are developing a much more powerful version of scDRS, which I hope to share in a few months.
