Personalized cancer diagnosis

1. Business Problem

1.1. Description

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment

Data: Memorial Sloan Kettering Cancer Center (MSKCC)

Download training_variants.zip and training_text.zip from Kaggle.

Context:

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

Problem statement:

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

1.2. Source/Useful Links

Some articles and reference blogs about the problem statement

1.3. Real-world/Business objectives and constraints.

No low-latency requirement.
Interpretability is important.
Errors can be very costly.
Probability of a data-point belonging to each class is needed.

2. Machine Learning Problem Formulation

2.1. Data

2.1.1. Data Overview

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files: one contains the information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
Both data files share a common column called ID.
Data file information:
- training_variants (ID, Gene, Variation, Class)
- training_text (ID, Text)
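Because the two files share the ID column, they can be joined into one record per mutation. The following is a minimal sketch using only the Python standard library. It assumes that training_variants is comma-separated with a header line and that each training_text row looks like `ID||Text` after a header line; the helper names (`read_variants`, `read_text`, `join_on_id`) and the sample rows are made up for illustration and are not taken from the notebook.

```python
import csv
import io


def read_variants(handle):
    """Read training_variants rows (assumed comma-separated with a header)."""
    return {row["ID"]: row for row in csv.DictReader(handle)}


def read_text(handle):
    """Read training_text rows, assumed to look like `ID||Text` after a header."""
    next(handle)  # skip the header line
    texts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first "||" so the evidence text stays intact.
        row_id, _, text = line.partition("||")
        texts[row_id] = text
    return texts


def join_on_id(variants, texts):
    """Attach the clinical text to each variant that shares its ID."""
    return [
        {**row, "Text": texts[row_id]}
        for row_id, row in variants.items()
        if row_id in texts
    ]


# Tiny in-memory stand-ins for the two files (made-up rows, not real data).
variants_file = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENE_A,Deletion,1\n"
    "1,GENE_B,X123Y,4\n"
)
text_file = io.StringIO(
    "ID,Text\n"
    "0||Evidence for the first variant.\n"
    "1||Evidence for the second variant.\n"
)

merged = join_on_id(read_variants(variants_file), read_text(text_file))
for record in merged:
    print(record["ID"], record["Gene"], record["Class"], record["Text"])
```

With the real data, the `io.StringIO` objects would be replaced by files opened from the unzipped downloads.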

2.2. Mapping the real-world problem to an ML problem

2.2.1. Type of Machine Learning Problem

A genetic mutation can be classified into one of nine classes, so this is a multi-class classification problem.

2.2.2. Performance Metric

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation

Metric(s):

Multi-class log-loss
Confusion matrix
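To make the two metrics concrete, here is a small dependency-free sketch. It assumes class labels are the integers 1 to 9 and that each prediction is a list of nine probabilities; the function names and the sample labels are illustrative only. A library routine for the same metrics could be used instead of these hand-written versions.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: class labels 1..K; y_prob: one list of K probabilities per sample.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label - 1], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i+1 predicted as class j+1."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


n_classes = 9
uniform = [1.0 / n_classes] * n_classes
y_true = [1, 4, 7]  # hypothetical true labels

# A model that spreads probability evenly over nine classes scores ln(9).
baseline = multiclass_log_loss(y_true, [uniform] * len(y_true))
print(f"uniform-guess log-loss: {baseline:.4f}")

y_pred = [1, 4, 2]  # hypothetical hard predictions; the last one is wrong
cm = confusion_matrix(y_true, y_pred, n_classes)
for row in cm:
    print(row)
```

The uniform-guess value gives a reference point: a model with a log-loss above ln(9) is doing no better than guessing every class with equal probability.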

2.2.3. Machine Learning Objectives and Constraints

Objective: Predict the probability of each data-point belonging to each of the nine classes.

Constraints:

* Interpretability is important.
* Class probabilities are needed.
* Errors in class probabilities must be penalized, so the metric is log-loss.
* No latency constraints.
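One way to see why log-loss fits these constraints is to look at the penalty a single sample receives, which is the negative log of the probability given to its true class. The sketch below is illustrative arithmetic only; the probabilities are chosen for the example, not taken from any model.

```python
import math

# Per-sample log-loss penalty for several probabilities assigned to the
# true class. Lower probability on the true class means a larger penalty.
penalties = {p: -math.log(p) for p in (0.9, 0.5, 0.1, 0.01)}

for p, penalty in penalties.items():
    print(f"p(true class) = {p:<4}  penalty = {penalty:.3f}")
```

The penalty grows without bound as the probability on the true class approaches zero, so a confidently wrong prediction costs far more than an uncertain one.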

3. Getting Started

Start by downloading the project, then open and run the "CancerDiagnostic.ipynb" file in IPython/Jupyter Notebook.

4. Prerequisites

Install the following software and libraries before running this project.

Python 3: https://www.python.org/downloads/
Anaconda: https://www.anaconda.com/download/

5. Libraries

scikit-learn: scikit-learn is a Python module for machine learning built on top of SciPy.
- pip install scikit-learn
- conda install -c anaconda scikit-learn
scipy: SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering.
- pip install scipy
- conda install -c anaconda scipy
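
After installing, a quick check can confirm that both libraries are visible to the Python interpreter that will run the notebook. This is a small sketch using the standard-library `importlib.metadata` module (Python 3.8 or later); the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["scikit-learn", "scipy"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```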
