omics-datascience/differential_expression_analysis

differential_expression_analysis

This project performs a differential expression analysis (DEA) with the following characteristics:

  • It uses the lung adenocarcinoma (LUAD) dataset from The Cancer Genome Atlas (TCGA) program, obtained from cBioPortal.
  • It compares gene expression by the SEX variable to find the differentially expressed genes.
  • It uses the limma R package to perform the DEA.

Requirements

  • R version 4.4.2

Download datasets

Run the following command to download all cBioPortal datasets that contain RNA-Seq data:

Rscript download_rnaseq_datasets.r

Run DEA

Use the following command to perform the differential expression analysis:

Rscript aed_limma_cbioportal_TMP.r

The script downloads all necessary libraries before running the analysis.

Operating details

From the dataset downloaded from cBioPortal, only the files "datasets/luad_tcga/data_clinical_patient.txt", "datasets/luad_tcga/data_clinical_sample.txt", and "datasets/luad_tcga/data_mrna_seq_v2_rsem.txt" are used for the analysis. These correspond to the patient clinical data, the sample data, and the gene expression data, respectively.

Expression data: The dataset contains RNA-Seq data expressed as transcripts per million (TPM) values computed with RSEM.
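As a rough illustration of loading these files, the sketch below reads a tab-separated table while skipping lines that start with `#`. It assumes the clinical files begin with `#`-prefixed metadata lines, as cBioPortal clinical files usually do. The helper name `read_cbioportal_table` and the toy file contents are hypothetical and not taken from the project's script.

```r
# Hypothetical helper: read a tab-separated cBioPortal file, skipping the
# '#'-prefixed metadata lines that precede the real header row.
read_cbioportal_table <- function(path) {
  read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
}

# Demo on a tiny temporary file (made-up records, not real TCGA data).
demo_path <- tempfile(fileext = ".txt")
writeLines(c(
  "#Patient Identifier\tSex",
  "#STRING\tSTRING",
  "PATIENT_ID\tSEX",
  "TCGA-AA-0001\tMale",
  "TCGA-AA-0002\tFemale"
), demo_path)

clinical_data <- read_cbioportal_table(demo_path)
str(clinical_data)

# With the real dataset the call would look like:
# clinical_data <- read_cbioportal_table("datasets/luad_tcga/data_clinical_patient.txt")
```

Note that `read.delim` keeps its default `check.names = TRUE`, so when the expression file is read this way, hyphens in the sample column names become periods.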

Processing of clinical and sample data

  • Filter only the PATIENT_ID and SEX columns in clinical data.
  • Convert SEX values to uppercase in clinical data.
  • Remove duplicate records in clinical data.
  • Filter only the SAMPLE_ID and PATIENT_ID columns in sample data.
  • Perform a merge between sample_data and clinical_data using PATIENT_ID as the key.
  • Keep only the SAMPLE_ID and SEX columns.
  • Replace hyphens (-) with periods (.) in SAMPLE_ID.
  • Remove any remaining duplicate rows as a safeguard (none are expected at this point).
  • Count the number of samples per sex and calculate the percentage that each sex represents.
  • Store the processed dataset in a data frame called 'metadata'.
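The steps above can be sketched in base R as follows. This is a minimal illustration on made-up records, not the project's actual code. The extra columns `AGE` and `CANCER_TYPE` are hypothetical and only there to show the column filtering.

```r
# Toy stand-ins for data_clinical_patient.txt and data_clinical_sample.txt
# (hypothetical values, not real TCGA records).
clinical_data <- data.frame(
  PATIENT_ID = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0002", "TCGA-AA-0003"),
  SEX        = c("Male", "female", "female", "Female"),
  AGE        = c(61, 70, 70, 55),
  stringsAsFactors = FALSE
)
sample_data <- data.frame(
  SAMPLE_ID   = c("TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-01"),
  PATIENT_ID  = c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"),
  CANCER_TYPE = "Lung Adenocarcinoma",
  stringsAsFactors = FALSE
)

# Keep PATIENT_ID and SEX, normalize case, drop duplicate patient rows.
clinical_data <- clinical_data[, c("PATIENT_ID", "SEX")]
clinical_data$SEX <- toupper(clinical_data$SEX)
clinical_data <- unique(clinical_data)

# Keep SAMPLE_ID and PATIENT_ID, then join both tables on PATIENT_ID.
sample_data <- sample_data[, c("SAMPLE_ID", "PATIENT_ID")]
metadata <- merge(sample_data, clinical_data, by = "PATIENT_ID")
metadata <- metadata[, c("SAMPLE_ID", "SEX")]

# Replace hyphens with periods in SAMPLE_ID, then drop any remaining duplicates.
metadata$SAMPLE_ID <- gsub("-", ".", metadata$SAMPLE_ID, fixed = TRUE)
metadata <- unique(metadata)

# Number of samples per sex and the percentage each sex represents.
sex_counts  <- table(metadata$SEX)
sex_percent <- round(100 * prop.table(sex_counts), 1)
print(sex_counts)
print(sex_percent)
```

The hyphen-to-period replacement presumably exists so that SAMPLE_ID matches the expression column names, which R rewrites on import when `check.names = TRUE`.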

RNASeq dataset processing

  • Genes with duplicate names (Hugo_Symbol) are identified.
  • Duplicate genes are removed, keeping only the first occurrence of each Hugo_Symbol.
  • The 'Entrez_Gene_Id' column is removed.
  • Rows where Hugo_Symbol is NA are removed.
  • Hugo_Symbol is assigned as row names.
  • Samples in metadata that are not present in rnaseq_tpm_data are removed.
  • The result is returned in a data frame named 'rnaseq_tpm_data'.
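A minimal base R sketch of these steps, run on a toy table, might look like the following. The expression values and sample names are invented for illustration, and the project's script may differ in detail.

```r
# Toy stand-in for data_mrna_seq_v2_rsem.txt (hypothetical values).
rnaseq_tpm_data <- data.frame(
  Hugo_Symbol     = c("XIST", "RPS4Y1", "XIST", NA, "TP53"),
  Entrez_Gene_Id  = c(7503, 6192, 7503, 0, 7157),
  TCGA.AA.0001.01 = c(0.1, 250.0, 0.2, 5.0, 30.0),
  TCGA.AA.0002.01 = c(80.0, 0.0, 75.0, 4.0, 28.0),
  stringsAsFactors = FALSE
)
# Toy metadata; the third sample has no expression column.
metadata <- data.frame(
  SAMPLE_ID = c("TCGA.AA.0001.01", "TCGA.AA.0002.01", "TCGA.AA.0003.01"),
  SEX       = c("MALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

# Identify gene symbols that appear more than once.
symbols <- rnaseq_tpm_data$Hugo_Symbol
dup_symbols <- unique(symbols[duplicated(symbols)])

# Keep only the first occurrence of each Hugo_Symbol.
rnaseq_tpm_data <- rnaseq_tpm_data[!duplicated(rnaseq_tpm_data$Hugo_Symbol), ]

# Drop the Entrez_Gene_Id column and rows without a gene symbol.
rnaseq_tpm_data$Entrez_Gene_Id <- NULL
rnaseq_tpm_data <- rnaseq_tpm_data[!is.na(rnaseq_tpm_data$Hugo_Symbol), ]

# Use Hugo_Symbol as row names so only sample columns remain.
rownames(rnaseq_tpm_data) <- rnaseq_tpm_data$Hugo_Symbol
rnaseq_tpm_data$Hugo_Symbol <- NULL

# Remove metadata rows whose sample is absent from the expression table.
metadata <- metadata[metadata$SAMPLE_ID %in% colnames(rnaseq_tpm_data), ]

print(dup_symbols)
print(rnaseq_tpm_data)
```

Keeping the first occurrence is a simple rule. It discards the other rows for a duplicated symbol rather than averaging or summing them.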

DEA with limma

  • Create boxplots to compare expression distributions between samples.
  • Convert TPM values to a log2 scale, log2(TPM + 1), to stabilize variance.
  • Generate histograms of the expression distribution before and after the log2 transformation.
  • Save the plots to a plots.pdf file.
  • Create a design matrix with the variable SEX as a factor.
  • Calculate the variance of each gene in the log-transformed data.
  • Eliminate genes with a variance less than the threshold (1e-4 by default).
  • Calculate the mean expression of each gene, set a percentile-based threshold (15% by default), and then eliminate genes with mean expression below this threshold.
  • Use the limma package to identify genes differentially expressed by sex (SEX), using the dataset filtered by the two steps above.
  • Fit a linear model (lmFit) and apply the empirical Bayes moderation (eBayes).
  • Extract the results with corrected p-values (Benjamini-Hochberg correction).
  • Return the table of differentially expressed genes.
  • Sort the results of the differential analysis by the adjusted p-value (adj.P.Val), in ascending order (lowest to highest).
  • Sort the results by the fold change (logFC), in descending order (highest to lowest).
  • Sort the results by two keys: first by the adjusted p-value in ascending order, then by logFC in descending order.
  • Extract the 50 genes with the lowest adjusted p-values (the most significant).
  • Generate a volcano plot to visualize the relationship between logFC and -log10(adjusted p-value), with significant genes in red and non-significant genes in black.
  • Save the results in two CSV files (one with all results and one with the top 50).
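The filtering and model-fitting steps above can be sketched as follows, on simulated data with one planted sex-associated gene. This is an illustration under stated assumptions, not the project's script: the helper `filter_genes`, the `~ sex` design formula, and the simulated matrix are all hypothetical. The plots and CSV output are omitted, and limma is only called if it is already installed.

```r
set.seed(1)

# Simulated TPM matrix (hypothetical data): 200 genes x 20 samples.
n_genes <- 200
n_samples <- 20
tpm <- matrix(
  rexp(n_genes * n_samples, rate = 0.05),
  nrow = n_genes,
  dimnames = list(paste0("GENE", seq_len(n_genes)), paste0("S", seq_len(n_samples)))
)
sex <- factor(rep(c("FEMALE", "MALE"), each = n_samples / 2))

# Plant one gene that is strongly higher in MALE samples.
tpm["GENE1", sex == "MALE"] <- tpm["GENE1", sex == "MALE"] + 500

# log2(TPM + 1) transformation.
log_expr <- log2(tpm + 1)

# Hypothetical helper: drop near-constant genes, then drop genes whose mean
# expression falls below the given percentile of all gene means.
filter_genes <- function(log_expr, var_threshold = 1e-4, mean_percentile = 0.15) {
  gene_var <- apply(log_expr, 1, var)
  log_expr <- log_expr[gene_var >= var_threshold, , drop = FALSE]
  gene_mean <- rowMeans(log_expr)
  cutoff <- quantile(gene_mean, probs = mean_percentile)
  log_expr[gene_mean >= cutoff, , drop = FALSE]
}
filtered <- filter_genes(log_expr)

# Design matrix with sex as a factor; FEMALE is the reference level here,
# so the second coefficient ("sexMALE") is the MALE minus FEMALE difference.
design <- model.matrix(~ sex)

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::eBayes(limma::lmFit(filtered, design))
  results <- limma::topTable(fit, coef = "sexMALE", number = Inf, adjust.method = "BH")

  # The three orderings described above.
  by_padj  <- results[order(results$adj.P.Val), ]
  by_logfc <- results[order(-results$logFC), ]
  by_both  <- results[order(results$adj.P.Val, -results$logFC), ]

  # The 50 most significant genes.
  top50 <- head(by_padj, 50)
  print(head(top50, 3))
}
```

With this design, a positive logFC means higher expression in MALE samples. Changing the reference level of the factor flips the sign.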
