
#Classifying Mendelian Medical Literature With Azure Machine Learning Studio#

##Prerequisites##

##Contents##

  1. Setup Workspace
  2. Getting Data into an Experiment
  3. Feature Selection
  4. Train Model
  5. Benchmark
  6. Put into Production

##1. Sign into ML Studio##

##2. Getting Data into an Experiment##

Step 1: Navigate to the Datasets tab in ML Studio and Click New

Step 2: Select from local file

Step 3: Upload Datasets

  • Click 'Choose File' and select the BethesdaDataset.csv that was generated by the Movie Data Generator Project
  • Name the Dataset BethesdaDataset
  • Select Generic CSV File with a header
  • Click the Checkmark
  • Upload stopwords.csv using the same process

Step 4: Navigate to the Experiments tab in ML Studio and Click New

Step 5: Create a New Blank Experiment

Step 6: Name the Experiment

  • Rename the experiment to Bethesda

Step 7: Drag the dataset into the experiment

  • Expand Saved Datasets -> My Datasets and drag the BethesdaDataset onto the canvas

Step 8: Visualize your dataset

  • Right-click the bottom port of the dataset and select Visualize
  • Visualizing a dataset allows you to see useful analytics and gauge relationships between features.

##3. Feature Selection##

Step 1: Grab Data Using SQL transformation

  • We need to use a SQL Transformation to grab the relevant data from our dataset to build our model.

  • Drag the 'Apply SQL Transformation' module into the experiment

  • Enter the following query into the module

```sql
select PMID, Title || ' ' || Abstract as TextInput, [Research Phase Id] as TwoClassLabel from t1;
```

 * Connect the module as follows 
<img src="https://github.com/ProjectBethesda/ProjectBethesda-ResearchClassificationModel/blob/master/media/sql%20transformation.jpg"/>

**Step 2: Normalization**
 * In this phase we will normalize our data to remove noise from our model. The script below will:
   - Convert all text to lowercase
   - Remove stopwords from the NLTK stopwords corpus
   - Remove standalone numbers (1, 2, 3 but not brca1)
   - Remove number words (one, two, three)
   - Remove punctuation (periods, commas, etc.)
 * Drag the Stopwords dataset into the experiment
 * Drag the 'Execute Python Script' module into the experiment
 * Paste the following Python normalization snippet into the module
```python
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame (the SQL transformation output)
#   Param<dataframe2>: a pandas.DataFrame (the stopwords dataset)
def azureml_main(dataframe1=None, dataframe2=None):
    import re

    # Create the normalized text column
    dataframe1["Normalized"] = dataframe1["TextInput"]

    numWords = ["zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
                "nineteen", "twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety", "hundred", "thousand",
                "million", "billion", "trillion"]

    stopWords = set(dataframe2["StopWords"])

    for i in range(len(dataframe1["TextInput"])):
        abstract = dataframe1["TextInput"][i]
        if abstract is not None:
            # Convert all text to lowercase
            abstract = abstract.lower()
            # Remove punctuation by keeping only word tokens
            tokens = re.findall(r'\w+', abstract, flags=re.UNICODE)
            # Remove standalone numbers (1, 2, 3 but not brca1)
            tokens = [w for w in tokens if not w.isdigit()]
            # Remove number words (one, two, three)
            tokens = [w for w in tokens if w not in numWords]
            # Remove stopwords
            normAbs = " ".join(w for w in tokens if w not in stopWords)
        else:
            normAbs = ''
        dataframe1.loc[i, "Normalized"] = str(normAbs)

    # Return value must be a sequence of pandas.DataFrame
    return dataframe1,
```

Link the modules as follows and run the experiment
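
To sanity check the script outside ML Studio, the entry point can be exercised locally with two small DataFrames (a rough sketch; it assumes pandas is installed and that the column names match those used above):

```python
import pandas as pd

# Tiny stand-ins for the two module inputs (illustrative values only)
texts = pd.DataFrame({
    "PMID": [1, 2],
    "TextInput": ["BRCA1 study, Phase 2 of two trials.", None],
    "TwoClassLabel": [0, 1],
})
stops = pd.DataFrame({"StopWords": ["of", "the", "a"]})

result, = azureml_main(texts, stops)
print(result["Normalized"].tolist())
# -> ['brca1 study phase trials', '']
```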

Step 3: Make TwoClass a feature

  • Expand the Data Transformation and Manipulation tabs and drag a 'Metadata Editor' Module into the experiment.
  • Link the 'Metadata Editor' Module to the python script Module
  • Click the select modules button

Select the TwoClassLabel feature and add it to the list

Click the check mark

Change the Fields property to "Label"; this means the feature is a category.

Step 4: Make PMID a clear feature

  • Expand the Data Transformation and Manipulation tabs and drag a second 'Metadata Editor' Module into the experiment. This will allow you to classify by article abstract and title while using the PMID as an identifier
  • Link the new 'Metadata Editor' Module to the previous Metadata Editor Module
  • Click the select modules button
  • Select the PMID feature and add it to the list

  • Change the Fields property to "Clear Feature"
  • Clear features are passed through the ML pipeline but are not processed by any of the other modules in the experiment
  • Since there is little to no correlation between the PMID number and the research type, this lets us keep an identifier for each article while reducing the noise in our model

Step 5: Hash Features and run experiment

Up to this point we have been dealing with strings as features. Strings are more resource intensive to process than numbers. The best way to address this is to bag the words in our normalized strings and then hash the bags into numerical features. While the new features have a 1-to-1 correspondence with the word bags, hashing is a one-way function: the tradeoff for the performance we gain from numerical features is that we will not know which "word bags", or statistical couplings of words, map to which features. However, we do know that the features will accurately represent our data.

Expand the Text Analytics tab, drag the 'Feature Hashing' module into the experiment, and connect it to the previous 'Metadata Editor' module

Select the Normalized column

Change the hashing bitsize to 15 and the N-gram value to 4
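
The Azure ML module handles the hashing inside the pipeline; as a rough local analogue (an assumption, not the module's implementation), scikit-learn's HashingVectorizer with a 2^15 feature space and 1- to 4-grams illustrates the idea:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 15-bit hash space (2**15 columns) over word 1- to 4-grams, mirroring
# the module settings above; the resulting columns are anonymous by design
vectorizer = HashingVectorizer(n_features=2**15, ngram_range=(1, 4),
                               alternate_sign=False)
X = vectorizer.fit_transform(["brca1 mutation cohort study",
                              "randomized phase iii trial"])
print(X.shape)  # (2, 32768)
```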

Step 6: Project Features

  • Next we need to project the hashed features we generated so they can be used to train our model

Expand the 'Data Transformation' and 'Manipulation' tabs and drag the 'Project Columns' Module into the experiment and connect it to the previous 'Feature Hashing' Module

Exclude the 'TextInput' and 'Normalized' columns

The experiment should look as follows

##4. Train Model##

Now that we have converted our dataset into a set of representative features and labels it is time to train our model.

Step 1: Train/Test Split

  • Expand the Data Transformation and 'Sample and Split' tabs then drag a 'Split' Module into the experiment
  • Connect the Split Module to the Metadata Editor Module
  • Select the Split Module, and Enter "0.7" in the 'Fraction of rows in the first dataset' Field
  • The left output of the Split module is our training set; the right output will be our test/validation set (a local analogue is sketched below)
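
Outside ML Studio, the same 70/30 split can be sketched with scikit-learn (illustrative only; X stands in for the hashed feature matrix and y for the TwoClassLabel column):

```python
from sklearn.model_selection import train_test_split

# 70% of the rows for training, 30% held out for testing, mirroring the
# 'Fraction of rows in the first dataset' value of 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=42)
```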

Step 2: Drag in One Vs All Classifier

Step 3: Configure Two Class Decision Tree

  • Drag a Two Class Decision Tree module into the experiment and connect it to the One vs All module
  • Set the properties of the Two Class Decision Tree to the following empirically chosen values (a rough local analogue follows this list):
    • Max of 32 leaves per tree
    • Min of 50 instances per leaf
    • Learning rate of 0.2
    • 300 trees
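
As a rough local analogue of these settings (an assumption, not the Studio module itself), scikit-learn's OneVsRestClassifier can wrap a gradient-boosted tree classifier with comparable hyperparameters:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier

# Hyperparameters chosen to mirror the module settings above:
# 32 leaves per tree, 50 samples per leaf, learning rate 0.2, 300 trees
base = GradientBoostingClassifier(max_leaf_nodes=32,
                                  min_samples_leaf=50,
                                  learning_rate=0.2,
                                  n_estimators=300)
clf = OneVsRestClassifier(base)
# clf.fit(X_train, y_train) once the training split sketched above is available
```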

Step 4: Train Model

Drag the 'Train Model' module into the experiment and connect it to the 'One Vs All' and 'Split' modules as shown below.

Use the column selector to select the TwoClassLabel column; this tells the model which column we are trying to predict.

Step 5: Score and Evaluate Model

  • Drag the 'Score Model' module into the experiment
  • Link the trained model and the test set to the 'Score Model' module
  • Drag the 'Evaluate Model' module into the experiment and link it to the 'Score Model' module

Step 6: Run the experiment and visualize the evaluate module
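
Locally, scoring and evaluation roughly correspond to predicting on the held-out rows and inspecting standard metrics (a sketch; clf, X_test, and y_test are assumed to come from the analogues sketched above):

```python
from sklearn.metrics import accuracy_score, classification_report

# Assumes clf has already been fit on the training split
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```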

##5. Put into Production##

Step 1: Set up Predictive Web Service

Step 2: Project service inputs

We need to clean our service inputs so that we can classify using just the PMID, Title, and Abstract

  • Drag a 'Project Columns' module into the experiment
  • Connect the BethesdaDataset to the 'Project Columns' module
  • Click the 'Launch column selector' button and select the PMID, Title, Abstract, Research Phase ID columns
  • Connect the 'Project Columns' and the 'Web service input' modules to the same port of the Apply SQL Transformation

Step 3: Project service outputs and run predictive experiment

We need to clean our service results so that they don't return thousands of noisy features.

  • Drag a 'Project Columns' module into the experiment
  • Connect the 'Score model' module to the 'Project Columns' module
  • Click the 'Launch column selector' button and select the PMID, Scored Labels, Scored Probabilities for Class "0", Scored Probabilities for Class "1", Scored Probabilities for Class "2" columns
  • Connect the 'Web Service Output' to the previous 'Project Columns' module

Step 4: Deploy Web service

Step 5: Test the Web service

Step 6: Get code for Web service
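
Once deployed, the web service can be called over HTTPS. The sketch below uses the requests library with placeholder URL and API key values copied from the web service dashboard; the request schema follows the classic ML Studio request/response pattern shown on the service's API help page, so treat it as an approximation rather than the exact generated code:

```python
import requests

# Placeholders: copy the real values from the web service dashboard
URL = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["PMID", "Title", "Abstract", "Research Phase Id"],
            # Illustrative row only; real calls send an article's actual values
            "Values": [["12345678", "Example title", "Example abstract text", "0"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(URL, json=payload,
                         headers={"Authorization": "Bearer " + API_KEY})
print(response.json())
```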