
#Classifying Mendelian Medical Literature With Azure Machine Learning Studio#

##Prerequisites##

##Contents##

  1. Setup Workspace
  2. Getting Data into an Experiment
  3. Feature Selection
  4. Train Model
  5. Benchmark
  6. Put into Production

##1. Sign into ML Studio##

##2. Getting Data into an Experiment##

Step 1: Navigate to the Datasets tab in ML Studio and Click New

Step 2: Select from local file

Step 3: Upload Datasets

  • Click 'Choose File' and select the BethesdaDataset.csv that was generated by the Movie Data Generator Project
  • Name the Dataset BethesdaDataset
  • Select Generic CSV File with a header
  • Click the Checkmark
  • Upload stopwords.csv using the same process

Step 4: Navigate to the Experiments tab in ML Studio and Click New

Step 5: Create a New Blank Experiment

Step 6: Name the Experiment

  • Rename the experiment to Bethesda

Step 7: Drag the dataset into the experiment

  • Expand Saved Datasets -> My Datasets and drag the BethesdaDataset onto the canvas

Step 8: Visualize your dataset

  • Right-click the bottom port of the dataset and select Visualize
  • Visualizing a dataset allows you to see useful analytics and gauge relationships between features.

##3. Feature Selection##

Step 1: Grab Data Using SQL transformation

  • We need to use a SQL Transformation to grab the relevant data from our dataset to build our model.

  • Drag the 'Apply SQL Transformation' module into the experiment

  • Enter the following query into the module

```sql
select PMID, Title || ' ' || Abstract as TextInput, [Research Phase Id] as TwoClassLabel from t1;
```

 * Connect the module as follows 
<img src="https://github.com/ProjectBethesda/ProjectBethesda-ResearchClassificationModel/blob/master/media/sql%20transformation.jpg"/>

**Step 2: Normalization**
 * In this phase we will normalize our data to remove noise from our model. The script below will:
   - Convert all text to lowercase
   - Remove stopwords from the NLTK stopwords corpus
   - Remove standalone numbers (1, 2, 3 but not brca1)
   - Remove number words (one, two, three)
   - Remove punctuation (periods, commas, etc.)
 * Drag the Stopwords dataset into the experiment
 * Drag the 'Execute Python Script' module into the experiment
 * Paste the following Python normalization snippet into the module
```python
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame (the SQL transformation output)
#   Param<dataframe2>: a pandas.DataFrame (the stopwords dataset)
def azureml_main(dataframe1=None, dataframe2=None):
    import re

    # Create the normalized text column
    dataframe1["Normalized"] = dataframe1["TextInput"]

    numWords = ["zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
                "nineteen", "twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety", "hundred", "thousand",
                "million", "billion", "trillion"]

    stopWords = set(dataframe2["StopWords"])

    for i in range(len(dataframe1["TextInput"])):
        abstract = dataframe1["TextInput"][i]
        if abstract is not None:
            # Convert all text to lowercase
            abstract = abstract.lower()
            # Remove punctuation by keeping only word tokens
            tokens = re.findall(r'\w+', abstract, flags=re.UNICODE)
            # Remove standalone numbers (1, 2, 3 but not brca1)
            tokens = [w for w in tokens if not w.isdigit()]
            # Remove number words (one, two, three)
            tokens = [w for w in tokens if w not in numWords]
            # Remove stopwords
            normAbs = " ".join(w for w in tokens if w not in stopWords)
        else:
            normAbs = ''
        dataframe1.loc[i, "Normalized"] = str(normAbs)

    # Return value must be a sequence of pandas.DataFrame
    return dataframe1,
```

Link the modules as follows and run the experiment
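
To sanity check the script outside ML Studio, the entry point can be exercised locally with two small DataFrames (a rough sketch; it assumes pandas is installed and that the column names match those used above):

```python
import pandas as pd

# Tiny stand-ins for the two module inputs (illustrative values only)
texts = pd.DataFrame({
    "PMID": [1, 2],
    "TextInput": ["BRCA1 study, Phase 2 of two trials.", None],
    "TwoClassLabel": [0, 1],
})
stops = pd.DataFrame({"StopWords": ["of", "the", "a"]})

result, = azureml_main(texts, stops)
print(result["Normalized"].tolist())
# -> ['brca1 study phase trials', '']
```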

Step 3: Make TwoClass a feature

  • Expand the Data Transformation and Manipulation tabs and drag a 'Metadata Editor' Module into the experiment.
  • Link the 'Metadata Editor' Module to the python script Module
  • Click the select modules button

Select the TwoClassLabel feature and add it to the list

Click the check mark

Change the Fields property to "Label"; this means the feature is a category.

Step 4: Make PMID a clear feature

  • Expand the Data Transformation and Manipulation tabs and drag a second 'Metadata Editor' Module into the experiment. This will allow you to classify by article abstract and title while using the PMID as an identifier
  • Link the new 'Metadata Editor' Module to the previous Metadata Editor Module
  • Click the select modules button
  • Select the PMID feature and add it to the list

  • Change the Fields property to "Clear Feature"
  • Clear features are passed through the ML pipeline but are not processed by any of the other modules in the experiment
  • Since there is little to no correlation between the PMID number and the research type, this lets us keep an identifier for each article while reducing the noise in our model

Step 5: Hash Features and run experiment

Up to this point we have been dealing with strings as features. Strings are more resource intensive to process than numbers. The best way to address this is to bag the words in our normalized strings and then hash the bags into numerical features. While the new features have a 1-to-1 correspondence with the word bags, hashing is a one-way function: the tradeoff for the performance we gain from numerical features is that we will not know which "word bags", or statistical couplings of words, map to which features. However, we do know that the features will accurately represent our data.

Expand the Text Analytics tab, drag the 'Feature Hashing' module into the experiment, and connect it to the previous 'Metadata Editor' module

Select the Normalized column

Change the hashing bitsize to 15 and the N-gram value to 4
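
The Azure ML module handles the hashing inside the pipeline; as a rough local analogue (an assumption, not the module's implementation), scikit-learn's HashingVectorizer with a 2^15 feature space and 1- to 4-grams illustrates the idea:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 15-bit hash space (2**15 columns) over word 1- to 4-grams, mirroring
# the module settings above; the resulting columns are anonymous by design
vectorizer = HashingVectorizer(n_features=2**15, ngram_range=(1, 4),
                               alternate_sign=False)
X = vectorizer.fit_transform(["brca1 mutation cohort study",
                              "randomized phase iii trial"])
print(X.shape)  # (2, 32768)
```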

Step 6: Project Features

  • Next we need to project the hashed features we generated so they can be used to train our model

Expand the 'Data Transformation' and 'Manipulation' tabs and drag the 'Project Columns' Module into the experiment and connect it to the previous 'Feature Hashing' Module

Exclude the 'TextInput' and 'Normalized' columns

The experiment should look as follows

##4. Train Model##

Now that we have converted our dataset into a set of representative features and labels it is time to train our model.

Step 1: Train/Test Split

  • Expand the Data Transformation and 'Sample and Split' tabs then drag a 'Split' Module into the experiment
  • Connect the Split Module to the Metadata Editor Module
  • Select the Split Module, and Enter "0.7" in the 'Fraction of rows in the first dataset' Field
  • The left output of the Split module is our training set; the right output will be our test/validation set (a local analogue is sketched below)
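
Outside ML Studio, the same 70/30 split can be sketched with scikit-learn (illustrative only; X stands in for the hashed feature matrix and y for the TwoClassLabel column):

```python
from sklearn.model_selection import train_test_split

# 70% of the rows for training, 30% held out for testing, mirroring the
# 'Fraction of rows in the first dataset' value of 0.7
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=42)
```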

Step 2: Drag in One Vs All Classifier

Step 3: Configure Two Class Decision Tree

  • Drag a Two Class Decision Tree module into the experiment and connect it to the One vs All module
  • Set the properties of the Two Class Decision Tree to the following empirically chosen values (a rough local analogue follows this list):
    • Max of 32 leaves per tree
    • Min of 50 instances per leaf
    • Learning rate of 0.2
    • 300 trees
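
As a rough local analogue of these settings (an assumption, not the Studio module itself), scikit-learn's OneVsRestClassifier can wrap a gradient-boosted tree classifier with comparable hyperparameters:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier

# Hyperparameters chosen to mirror the module settings above:
# 32 leaves per tree, 50 samples per leaf, learning rate 0.2, 300 trees
base = GradientBoostingClassifier(max_leaf_nodes=32,
                                  min_samples_leaf=50,
                                  learning_rate=0.2,
                                  n_estimators=300)
clf = OneVsRestClassifier(base)
# clf.fit(X_train, y_train) once the training split sketched above is available
```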

Step 4: Train Model

Drag the 'Train Model' module into the experiment and connect it to the 'One Vs All' and 'Split' modules as shown below.

Use the column selector to select the TwoClassLabel column; this tells the model which column we are trying to predict.

Step 5: Score and Evaluate Model

  • Drag the 'Score Model' module into the experiment
  • Link the trained model and the test set to the 'Score Model' module
  • Drag the 'Evaluate Model' module into the experiment and link it to the 'Score Model' module

Step 6: Run the experiment and visualize the evaluate module
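
Locally, scoring and evaluation roughly correspond to predicting on the held-out rows and inspecting standard metrics (a sketch; clf, X_test, and y_test are assumed to come from the analogues sketched above):

```python
from sklearn.metrics import accuracy_score, classification_report

# Assumes clf has already been fit on the training split
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```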

##5. Put into Production##

Step 1: Set up Predictive Web Service

Step 2: Project service inputs

We need to clean our service inputs so that we can classify using just the PMID, Title, and Abstract

  • Drag a 'Project Columns' module into the experiment
  • Connect the BethesdaDataset to the 'Project Columns' module
  • Click the 'Launch column selector' button and select the PMID, Title, Abstract, Research Phase ID columns
  • Connect the 'Project Columns' and the 'Web service input' modules to the same port of the Apply SQL Transformation

Step 3: Project service outputs and run predictive experiment

We need to clean our service results so that they don't return thousands of noisy features.

  • Drag a 'Project Columns' module into the experiment
  • Connect the 'Score model' module to the 'Project Columns' module
  • Click the 'Launch column selector' button and select the PMID, Scored Labels, Scored Probabilities for Class "0", Scored Probabilities for Class "1", Scored Probabilities for Class "2" columns
  • Connect the 'Web Service Output' to the previous 'Project Columns' module

Step 4: Deploy Web service

Step 5: Test the Web service

Step 6: Get code for Web service
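
Once deployed, the web service can be called over HTTPS. The sketch below uses the requests library with placeholder URL and API key values copied from the web service dashboard; the request schema follows the classic ML Studio request/response pattern shown on the service's API help page, so treat it as an approximation rather than the exact generated code:

```python
import requests

# Placeholders: copy the real values from the web service dashboard
URL = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["PMID", "Title", "Abstract", "Research Phase Id"],
            # Illustrative row only; real calls send an article's actual values
            "Values": [["12345678", "Example title", "Example abstract text", "0"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(URL, json=payload,
                         headers={"Authorization": "Bearer " + API_KEY})
print(response.json())
```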