Feature Extraction from HDLSS (High Dimension Low Sample Size) data using Deep Neural Networks

Name : Naresh Kumar Kaushal
Id : 170030027, Department : CSE, Term : Fourth Year
BTP advisor : Dr. Clint Pazhayidam George

Table of Contents

Abstract
Dataset
Algorithm
References
Further Queries

Abstract

Deep neural networks (DNNs) have achieved breakthroughs in applications with large sample sizes. However, on high dimension, low sample size (HDLSS) data, such as phenotype prediction from genetic data in bioinformatics, DNNs suffer from overfitting and high-variance gradients. In this project, we implemented Deep Neural Pursuit (DNP), a DNN model tailored to HDLSS data. DNP selects a subset of the high-dimensional features to alleviate overfitting, and averages gradients over multiple dropouts to reduce their variance. According to its authors, DNP is the first DNN method applied to HDLSS data, and it offers high nonlinearity, robustness to high dimensionality, the ability to learn from a small number of samples, stability in feature selection, and end-to-end training. We apply this technique to classify breast, lung, leukemia and prostate cancer subtypes.

Dataset

We chose three biological datasets (Prostate-GE, ALLAML and Lung) provided on this link, and one breast cancer dataset provided by the professor. All of them are in the data folder, and all are HDLSS: each has more features than samples.

Breast cancer (4 subtypes): 574 samples, 1519 features
Lung cancer (5 subtypes): 203 samples, 3312 features
Prostate cancer (2 subtypes): 102 samples, 5966 features
Leukemia (2 subtypes): 72 samples, 7129 features

The datasets are already preprocessed. For better prediction, we normalised the gene expression data to mean 0 and standard deviation 1. Every dataset has two tables: one with gene expression values and one with subtype information.
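The normalisation step can be sketched as follows. This is a minimal illustration, not the code from the code folder: the function name `standardize`, the `eps` guard and the toy matrix are assumptions, and so is the choice to normalise each feature (gene) column rather than each sample.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each column (assumed: one gene per column) to mean 0, std 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns, whose standard deviation is 0.
    return (X - mean) / np.where(std < eps, 1.0, std)

# Toy stand-in for a gene expression table: 6 samples x 4 genes.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(6, 4))
X_norm = standardize(X_toy)
```

When a dataset is split into training and test sets, the mean and standard deviation are usually computed on the training split only and then reused for the test split, so that no test information leaks into training.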

Algorithm

Input: X ∈ R^{n×d}, y ∈ R^n, the maximum number of selected features k.
Initialize: S = {bias}, C = F and W_C = 0.
while |S| ≤ k + 1 do
    Fix candidate weights W_C = 0;
    Update the input weights W_S and the hidden-layer weights;
    Apply dropout multiple times and average the resulting gradients G_{F_c};
    j = arg max_{c ∈ C} ||G_{F_c}||_q;
    Update learning rates using Adagrad;
    Initialize W_{F_j} with Xavier Initializer;
    S = S ∪ F_j and C = C \ F_j;
end while

You can find the source code in the code folder.
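The selection loop can be illustrated with a small NumPy sketch. It is a simplified reading of the pseudocode under stated assumptions, not the implementation in the code folder: it uses a single tanh hidden layer with a sigmoid output for two classes, plain gradient descent in place of Adagrad, and a Xavier fan-in equal to the number of selected inputs. The names `dnp_select` and `grads`, and all default hyperparameters, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(X, y, W, b, v, c, mask):
    """Gradients of the mean log-loss w.r.t. W, b, v, c under a dropout mask."""
    A = np.tanh(X @ W + b)                      # hidden activations, (n, h)
    H = A * mask                                # dropout on hidden units
    err = (sigmoid(H @ v + c) - y) / len(y)     # d(loss)/d(logit), (n,)
    dZ = np.outer(err, v) * mask * (1.0 - A ** 2)
    return X.T @ dZ, dZ.sum(axis=0), H.T @ err, err.sum()

def dnp_select(X, y, k, hidden=8, n_dropouts=20, p_drop=0.5,
               steps=200, lr=0.1, q=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, hidden))                   # rows of candidates stay 0
    b = np.zeros(hidden)                        # bias, always in S
    v = rng.normal(scale=0.1, size=hidden)
    c = 0.0
    in_S = np.zeros(d, dtype=bool)              # selected features (no bias)
    order = []
    no_drop = np.ones((n, hidden))
    while len(order) < k:
        # Train W_S and the other weights; candidate rows are never updated.
        # Assumption: plain gradient descent instead of Adagrad.
        for _ in range(steps):
            gW, gb, gv, gc = grads(X, y, W, b, v, c, no_drop)
            W[in_S] -= lr * gW[in_S]
            b -= lr * gb
            v -= lr * gv
            c -= lr * gc
        # Average the input-weight gradient over several dropout masks.
        G = np.zeros_like(W)
        for _ in range(n_dropouts):
            mask = (rng.random((n, hidden)) >= p_drop) / (1.0 - p_drop)
            G += grads(X, y, W, b, v, c, mask)[0]
        G /= n_dropouts
        # Pick the candidate whose gradient has the largest q-norm.
        scores = np.linalg.norm(G, ord=q, axis=1)
        scores[in_S] = -np.inf
        j = int(np.argmax(scores))
        # Xavier (Glorot) uniform initialisation; the fan-in is an assumption.
        limit = np.sqrt(6.0 / (len(order) + 1 + hidden))
        W[j] = rng.uniform(-limit, limit, size=hidden)
        in_S[j] = True
        order.append(j)
    return order

# Toy HDLSS-like data: 60 samples, 30 features, label depends on feature 3.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 30))
y_toy = (X_toy[:, 3] > 0).astype(float)
selected = dnp_select(X_toy, y_toy, k=2)
```

The sketch stops once k features have been chosen and keeps the bias outside the `order` list, whereas the pseudocode counts the bias as a member of S.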

References

Bo Liu, Ying Wei, Yu Zhang, Qiang Yang. Deep Neural Networks for High Dimension, Low Sample Size Data. Hong Kong University of Science and Technology, Hong Kong. (Research Paper)
Google Slides explaining the DNP algorithm (Google Slides)
Google Slides for the final presentation (Google Slides)

Further Queries

For further queries, please contact [email protected]
Happy coding!
