Pipeline for dimensionality reduction and resampling of TCGA data, developed for the research on "Evaluation of machine learning-based cancer classification techniques using multi-omics data."
The provided R Markdown files and Python scripts transform and prepare multi-omics datasets, covering genomic (copy number), transcriptomic (gene expression), and epigenomic (DNA methylation) data.
- R Markdown files that apply PCA to the cleaned TCGA data
- R Markdown files that apply Tomek Links and Near Miss to address class imbalance
- Alternative Python script for Tomek Links
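
To illustrate what the Tomek Links step does, here is a minimal, dependency-free Python sketch. It is not the script shipped in this repository: the function names, the toy 2-D points, and the `"normal"`/`"tumor"` labels are all made up for illustration. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; the sketch removes the majority-class member of each such pair. For real data, a library implementation such as `imblearn.under_sampling.TomekLinks` from imbalanced-learn is likely the better choice.

```python
import math
from collections import Counter


def nearest_neighbour(i, X):
    """Index of the sample closest to X[i] (Euclidean distance), excluding i."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbour(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y):
    """Drop the majority-class member of every Tomek link."""
    majority = Counter(y).most_common(1)[0][0]
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Hypothetical toy data: five "normal" samples and two "tumor" samples.
# In the actual pipeline the rows would be PCA-reduced omics profiles.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (3.0, 3.0), (3.2, 3.0), (5.0, 5.0)]
y = ["normal", "normal", "normal", "normal", "normal", "tumor", "tumor"]

links = tomek_links(X, y)
X_res, y_res = remove_majority_in_links(X, y)

print(links)           # [(4, 5)]
print(Counter(y_res))  # class counts after cleaning
```

Only the "normal" sample at `(3.0, 3.0)` is removed, because it sits right next to a "tumor" sample. This brute-force version compares every pair of samples, so it is only suitable for small examples.
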
The cleaned TCGA data is available here or as a Hugging Face Dataset.