Feature Extraction Using Deep Learning Models in Cancer Detection

KaushalNaresh/Genomics

Feature Extraction from HDLSS (High Dimension Low Sample Size) data using Deep Neural Networks

Name : Naresh Kumar Kaushal
Id : 170030027, Department : CSE, Term : Fourth Year
BTP advisor : Dr. Clint Pazhayidam George


Table of Contents

  1. Abstract
  2. Dataset
  3. Algorithm
  4. References

Abstract

Deep neural networks (DNNs) have achieved breakthroughs in applications with large sample sizes. However, on high dimension, low sample size (HDLSS) data, such as phenotype prediction from genetic data in bioinformatics, DNNs suffer from overfitting and high-variance gradients. In this project, we implemented a DNN model tailored to HDLSS data, named Deep Neural Pursuit (DNP). DNP selects a subset of the high-dimensional features to alleviate overfitting and averages gradients over multiple dropouts to reduce their variance. According to its authors, DNP is the first DNN method applied to HDLSS data, and it offers high nonlinearity, robustness to high dimensionality, the ability to learn from a small number of samples, stability in feature selection, and end-to-end training. We apply this technique to classify breast, lung, leukemia and prostate cancer.

Dataset

We chose three biological datasets (Prostate-GE, ALLAML and Lung) provided on this link, plus one breast cancer dataset provided by the professor. All of these datasets are in the data folder, and all of them are HDLSS. Their details are: Breast (4 subtypes) with 574 samples and 1519 features; Lung (5 subtypes) with 203 samples and 3312 features; Prostate (Prostate-GE, 2 subtypes) with 102 samples and 5966 features; and Leukemia (ALLAML, 2 subtypes) with 72 samples and 7129 features. The datasets are already preprocessed; for better prediction, we normalised the gene expression data (mean = 0, std = 1). Each dataset has two tables: one for gene expression values and one for subtype information.
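The normalisation step mentioned above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the expression table is a NumPy array with one row per sample and one column per gene, and that scaling is done per gene. The function name `standardize` and the random toy matrix are hypothetical.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each gene (column) to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std 0; divide by 1 instead so they become all zeros.
    std = np.where(std < eps, 1.0, std)
    return (X - mean) / std

# Toy matrix with the same shape as the Leukemia data: 72 samples, 7129 features.
# The values are random placeholders, not real gene expression data.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(72, 7129))
X_norm = standardize(X_toy)
```

With roughly 100 times more features than samples in this shape, each column's mean and standard deviation are estimated from only 72 values, which is part of what makes HDLSS data hard to model.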

Algorithm

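Input: X ∈ R^(n×d), y ∈ R^n, the maximum number of selected features k.
Initialize: S = {bias}, C = F and W_C = 0.
while |S| ≤ k + 1 do

  1. Fix candidate weights W_C = 0;
  2. Update weights of hidden layer and input W_S;
  3. Dropout multiple times and average out G_{F_c};
  4. j = arg max_{c ∈ C} ||G_{F_c}||_q;
  5. Update learning rates using Adagrad;
  6. Initialize W_{F_j} with Xavier Initializer;
  7. S = S ∪ F_j and C = C \ F_j;

end while

Here F is the set of all d input features, S is the selected set, C is the candidate set, W_S and W_C are the input weights of the selected and candidate features, and G_{F_c} is the gradient of the loss with respect to the input weights of candidate feature F_c.

You can find the source code in the code folder.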
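The loop above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the code in the code folder: it uses one tanh hidden layer, a binary logistic loss, plain gradient descent in place of Adagrad (step 5 is omitted), dropout on hidden units only, and it stops after exactly k features. The names `dnp_select` and `forward_backward` and all hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, y, W, b1, v, b2, mask):
    """Gradients of the mean logistic loss for one dropout mask on hidden units."""
    n = X.shape[0]
    h = np.tanh(X @ W + b1)                # hidden activations, shape (n, hidden)
    p = sigmoid((h * mask) @ v + b2)       # predicted probabilities, shape (n,)
    delta = (p - y) / n                    # gradient w.r.t. the output logit
    dh = np.outer(delta, v) * mask * (1.0 - h ** 2)  # w.r.t. hidden pre-activation
    return X.T @ dh, dh.sum(axis=0), (h * mask).T @ delta, delta.sum()

def dnp_select(X, y, k, hidden=8, n_dropouts=20, keep=0.5,
               epochs=100, lr=0.1, q=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, hidden))              # input weights; candidate rows stay 0
    b1 = np.zeros(hidden)
    v = rng.normal(scale=np.sqrt(1.0 / hidden), size=hidden)
    b2 = 0.0
    selected = []                          # S, without the bias term
    candidates = np.ones(d, dtype=bool)    # C as a boolean mask over features
    no_drop = np.ones((n, hidden))
    while len(selected) < k:
        # Steps 1-2: keep candidate weights at 0 and train everything else
        # (assumption: plain gradient descent, no dropout during this phase).
        for _ in range(epochs):
            gW, gb1, gv, gb2 = forward_backward(X, y, W, b1, v, b2, no_drop)
            gW[candidates] = 0.0
            W -= lr * gW
            b1 -= lr * gb1
            v -= lr * gv
            b2 -= lr * gb2
        # Step 3: average the input-weight gradient over several dropout masks.
        G = np.zeros_like(W)
        for _ in range(n_dropouts):
            mask = rng.binomial(1, keep, size=(n, hidden)) / keep
            G += forward_backward(X, y, W, b1, v, b2, mask)[0]
        G /= n_dropouts
        # Step 4: pick the candidate whose gradient row has the largest q-norm.
        scores = np.linalg.norm(G, ord=q, axis=1)
        scores[~candidates] = -np.inf
        j = int(np.argmax(scores))
        # Step 6: Xavier (Glorot) uniform init; the fan-in choice is an assumption.
        limit = np.sqrt(6.0 / (len(selected) + 1 + hidden))
        W[j] = rng.uniform(-limit, limit, size=hidden)
        # Step 7: move feature j from the candidate set to the selected set.
        selected.append(j)
        candidates[j] = False
    return selected

# Toy HDLSS data (100 samples, 500 features) where only feature 3 drives the label.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 500))
y_toy = (X_toy[:, 3] > 0).astype(float)
selected = dnp_select(X_toy, y_toy, k=3)
print(selected)
```

On this toy data the first selected index is expected to be the informative feature, because its gradient norm is much larger than that of the noise features while all input weights are still zero. Averaging over `n_dropouts` masks is what keeps the gradient estimate in step 3 from being dominated by a single random mask.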

References

  1. Bo Liu, Ying Wei, Yu Zhang and Qiang Yang. Deep Neural Networks for High Dimension, Low Sample Size Data. Hong Kong University of Science and Technology, Hong Kong. (Research Paper)
  2. Slides explaining the DNP algorithm in more detail. (Google Slides)
  3. Slides for the final presentation. (Google Slides)

Further Queries

For further queries, please contact [email protected]
Happy coding!!
