Applying SL, TMM, and IRS on log2 data or raw data #4

Open
lisiarend opened this issue Feb 28, 2023 · 2 comments

Comments

@lisiarend

Hey,
I have a question regarding the three normalization methods that you applied. It makes a difference whether you apply, for example, SL to raw data and then log2-transform the data for visualization, or apply SL to data that is already log2-transformed. The same holds for TMM and IRS.

What is best practice for this? And why would you (as in your markdown) use these methods on raw data?

Best,
Lis

@pwilmart
Owner

Hi Lis,
I don't work with log2 data because mathematical operations on logs are not the same as operations on non-logged values. For example, averages of non-logged values are simple (arithmetic) averages, whereas averages of log2 values correspond to geometric means on the linear scale. Internally, routines like TMM might be working with logged values, so you do not want to pass in data that is already logged. You are correct that the normalization methods would not give the same results with logged data. All those normalization methods assume the data is in its native (linear) scale.
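To make the point about averages and scaling concrete, here is a minimal Python sketch. The intensities are made up, and `scale_to_mean_total` is a hypothetical helper that only mimics the idea of SL-style scaling (matching each sample's total to the mean total). It is not the code from the repository and says nothing about how TMM or IRS are implemented.

```python
import math

# Hypothetical intensities (linear scale) for one protein in three samples.
values = [100.0, 200.0, 800.0]

arithmetic_mean = sum(values) / len(values)
mean_of_logs = sum(math.log2(v) for v in values) / len(values)
geometric_mean = 2 ** mean_of_logs  # back-transformed mean of the log2 values

print(f"arithmetic mean: {arithmetic_mean:.2f}")
print(f"geometric mean:  {geometric_mean:.2f}")


def scale_to_mean_total(columns):
    """Scale each sample (column) so that its total equals the mean column total."""
    totals = [sum(col) for col in columns]
    target = sum(totals) / len(totals)
    return [[v * target / t for v in col] for col, t in zip(columns, totals)]


# Two hypothetical samples with the same composition; the second has twice the loading.
raw = [[100.0, 200.0, 800.0], [200.0, 400.0, 1600.0]]

# Route A: normalize on the linear scale, then take log2 for display.
route_a = [[math.log2(v) for v in col] for col in scale_to_mean_total(raw)]

# Route B: take log2 first, then apply the same scaling to the logged values.
route_b = scale_to_mean_total([[math.log2(v) for v in col] for col in raw])

# Largest remaining difference between the two samples after each route.
gap_a = max(abs(a - b) for a, b in zip(*route_a))
gap_b = max(abs(a - b) for a, b in zip(*route_b))
print(f"route A gap: {gap_a:.4f}")
print(f"route B gap: {gap_b:.4f}")
```

In this toy case the arithmetic mean (about 366.67) and the geometric mean (about 251.98) differ noticeably. Route A removes the loading difference between the two samples, while route B leaves the samples different, because multiplying log values by a factor is not the same operation as multiplying the linear values.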

I may use log scales in plots but I try to keep data in its natural scale as much as possible. I also try to avoid ratios (which usually also need a log2 transform). Both logs and ratios change the mathematical space of the numbers and our brains do not mentally visualize those spaces correctly. Our intuition with numbers really only applies to linear numerical spaces.

I think one of the reasons you see a lot of log transformations in R scripts is related to parametric statistical modeling. It is often assumed (i.e. not tested) that data is not normally distributed and that the logged data might be. The argument that the data is not Normal is usually based on the distribution of all data values in a genome or proteome (the full dataset). The statistical modeling is applied per gene/protein, and it is the distribution of the values for single proteins/genes that should be normally distributed. The distribution of all proteins or genes is irrelevant.

Cheers,
Phil
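The distinction between the per-protein distribution and the distribution of the full dataset can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: the protein abundances are drawn from a log-normal distribution, and each protein's values across samples are drawn from a Normal distribution with a 10% coefficient of variation. All numbers and names are hypothetical.

```python
import random
import statistics

random.seed(1)


def skewness(xs):
    """Population skewness: 0 for a symmetric distribution, > 0 for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


n_proteins, n_samples = 200, 50

# Assumption: protein abundances span several orders of magnitude.
protein_means = [random.lognormvariate(10.0, 2.0) for _ in range(n_proteins)]

# Assumption: each protein is normally distributed across samples (CV = 10%).
data = [
    [random.gauss(mu, 0.1 * mu) for _ in range(n_samples)]
    for mu in protein_means
]

# Skewness of every value in the dataset pooled together.
pooled_skew = skewness([x for row in data for x in row])

# Skewness computed separately for each protein, summarized by the median.
median_protein_skew = statistics.median(skewness(row) for row in data)

print(f"pooled skewness:             {pooled_skew:.2f}")
print(f"median per-protein skewness: {median_protein_skew:.2f}")
```

The pooled values are strongly right-skewed even though every protein was simulated as Normal, so a histogram of the whole dataset says little about whether the per-protein values meet the modeling assumption.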

@lisiarend
Author

Okay, thank you very much.

I am currently working with multiple normalization methods for proteomics data and evaluating them on multiple datasets. It is therefore important to know which methods require log2-transformed data and which do not. That isn't always easy to tell, but thanks very much for your quick response :)
