Supervised learning

Assignment #1 - CS 7641 Machine Learning course - Charles Isbell & Michael Littman - Georgia Tech

Please clone this repository locally if you want to replicate the experiments reported in the assignment paper.

Virtual Environment

This project relies on a virtual environment folder, venv (the folder is too large for GitHub - download it from my Google Drive via the link here). This folder contains all the files needed to create the virtual environment in which the project is meant to run.

requirements.txt

This file lists all the packages required by this project. Running pip install -r requirements.txt will install them into your project's environment (this should not be necessary if you are using the venv folder provided above).

The datasets

These datasets (train_32x32.mat and tumor_classification_data.csv) are the datasets described in the assignment paper. They can be downloaded from their original sources.
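
A minimal loading sketch for the two files is shown below; the 'X'/'y' keys in the .mat file and the CSV layout are assumptions for illustration, and the actual scripts may read and prepare the data differently.

# Hypothetical loading sketch - key names and column layout are assumptions.
from scipy.io import loadmat
import pandas as pd

svhn = loadmat("train_32x32.mat")          # dict of arrays stored in the .mat file
X_digits, y_digits = svhn["X"], svhn["y"]  # assumed keys: image array and labels

tumors = pd.read_csv("tumor_classification_data.csv")  # tabular tumor dataset
print(X_digits.shape, tumors.shape)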

digit_recognition.py and tumor_classification.py

These Python files implement the 5 Machine Learning algorithms studied in this assignment on the two datasets used for this project. They are almost identical, but the way the data is prepared and the hyperparameters of each algorithm differ from one script to the other because of the differences between the datasets. Each script does the following (a minimal sketch of the pipeline follows the list):

  • load the dataset
  • prepare the data before feeding it to the algorithms (we use the scikit-learn implementations in this project, so the data must meet their input requirements for the scripts to work)
  • split the data into training and testing sets
  • plot Learning curves on the training and cross-validation sets for each algorithm (generated by splitting the original training set into smaller training and cross-validation sets)
  • plot validation curves for each algorithm to test some hyperparameters and get an idea of what range to use in the GridSearch (see next step)
  • perform a GridSearch to find the best hyperparameters for each algorithm, searching over the hyperparameters and ranges identified as interesting with the help of the validation curves
  • train the models with the hyperparameters defined by the GridSearch using the training dataset
  • compute the accuracy score of each trained algorithm on the testing dataset (held out since the beginning of the project) and plot a histogram comparing the different accuracy scores
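
The sketch below illustrates this pipeline on a generic dataset with a single classifier. The estimator, hyperparameter names, ranges and output filenames are illustrative assumptions; the actual scripts run several algorithms and differ in their data preparation and plotting details.

# Illustrative pipeline sketch (not the repo's exact code): learning curve,
# validation curve, grid search, final training and test accuracy.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (
    train_test_split, learning_curve, validation_curve, GridSearchCV)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)       # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)         # held-out test set

clf = KNeighborsClassifier()

# Learning curves on growing subsets of the training set (cross-validated).
sizes, train_scores, cv_scores = learning_curve(
    clf, X_train, y_train, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation score")
plt.legend(); plt.savefig("learning_curve.png"); plt.clf()

# Validation curve over one hyperparameter to pick a GridSearch range.
param_range = np.arange(1, 21)
train_scores, cv_scores = validation_curve(
    clf, X_train, y_train, param_name="n_neighbors",
    param_range=param_range, cv=5)
plt.plot(param_range, cv_scores.mean(axis=1), label="cross-validation score")
plt.legend(); plt.savefig("validation_curve.png"); plt.clf()

# GridSearch over the range suggested by the validation curve, then retrain
# on the full training set and score on the held-out test set.
grid = GridSearchCV(clf, {"n_neighbors": range(3, 10)}, cv=5)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_test, y_test))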

Using Google Cloud Compute Engine

If you wish to use Google Cloud Compute Engine, there are some tweaks to make to the Python scripts and some commands to run in a shell. The first thing you want to do is add a way to retrieve the plots generated by the scripts. We can use the Google Cloud Storage API to upload the plots as png files to a Google Cloud Storage bucket, from where we can download them. For this, add:

from google.cloud import storage


def upload_file(filename):
    """ Upload data to a bucket"""

    # Explicitly use service account credentials by specifying the private key
    # file.
    storage_client = storage.Client.from_service_account_json('/path/to/credentials.json')

    bucket = storage_client.get_bucket("ml-assignment-1-graphs")
    blob = bucket.blob(filename)
    blob.upload_from_filename(filename)

    # Returns a public URL.
    return blob.public_url

to the Python scripts, and add the line upload_file("{}.png".format(title)) right after plt.savefig("{}.png".format(title)) at the end of the get_learning_curves and get_validation_curve function definitions.
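
In other words, the end of each of those two functions should look roughly like this (a sketch based on the instructions above, with the surrounding plotting code omitted):

    # ... plotting code inside get_learning_curves / get_validation_curve ...
    plt.savefig("{}.png".format(title))   # save the figure locally first
    upload_file("{}.png".format(title))   # then push it to the Cloud Storage bucket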

For this to work, you will need a Google Cloud Platform account and a project with access to the Google Cloud Storage API. There you can generate a json file containing your credentials (download it and replace '/path/to/credentials.json' with the correct path to this file). Finally, create a bucket from the Google Cloud Storage interface.

You will need to create a repo in the GCP console. Upload your local project to this repo using Google's SDK and the commands git remote add google https://source.developers.google.com/p/[YOUR_PROJECT_ID]/r/[YOUR_REPO_NAME] (insert your project's ID and your repo's name where needed), git commit -am "Commit title" and git push google master.

You will also need to activate Google Compute Engine API on your account.

Use this gcloud command to create a Compute Engine instance and run your Python script on it (you can change the zone and machine-type values if needed):

    gcloud compute instances create my-app-instance \
    --image-family=debian-9 \
    --image-project=debian-cloud \
    --machine-type=g1-small \
    --scopes userinfo-email,cloud-platform \
    --metadata-from-file startup-script=gce/startup-script.sh \
    --zone us-central1-f \
    --tags http-server

You will need to have saved startup-script.sh in a gce/ subdirectory of the directory where you execute this command (the path given to the --metadata-from-file flag). The script is the following:


#!/bin/bash

# Talk to the metadata server to get the project id
PROJECTID=$(curl -s "http://metadata.google.internal/computeMetadata/v1/project/project-id" -H "Metadata-Flavor: Google")

# Install the logging monitor. The monitor will automatically pick up logs
# sent to syslog.
curl -s "https://storage.googleapis.com/signals-agents/logging/google-fluentd-install.sh" | bash
service google-fluentd restart &

# Install dependencies from apt
apt-get update
apt-get install -yq \
    git build-essential supervisor python python-dev python-pip libffi-dev \
    libssl-dev

# Create a pythonapp user. The application will run as this user.
useradd -m -d /home/pythonapp pythonapp

# pip from apt is out of date, so make it update itself and install virtualenv.
pip install --upgrade pip virtualenv

# Get the source code from the Google Cloud Repository
# git requires $HOME and it's not set during the startup script.
export HOME=/root
git config --global credential.helper gcloud.sh
git clone https://source.developers.google.com/p/$PROJECTID/r/[YOUR_REPO_NAME] /opt/app

# Install app dependencies
virtualenv -p python3 /opt/app/venv
source /opt/app/venv/bin/activate
/opt/app/venv/bin/pip install -r /opt/app/requirements.txt

# Make sure the pythonapp user owns the application code
chown -R pythonapp:pythonapp /opt/app

# Configure supervisor to run the application.
cat >/etc/supervisor/conf.d/python-app.conf << EOF
[program:pythonapp]
directory=/opt/app
command=/opt/app/venv/bin/python /opt/app/[YOUR_PYTHON_SCRIPT_NAME].py
autostart=true
autorestart=true
user=pythonapp
# Environment variables ensure that the application runs inside of the
# configured virtualenv.
environment=VIRTUAL_ENV="/opt/app/venv",PATH="/opt/app/venv/bin",\
    HOME="/home/pythonapp",USER="pythonapp"
stdout_logfile=syslog
stderr_logfile=syslog
EOF

supervisorctl reread
supervisorctl update

# Application should now be running under supervisor

Replace [YOUR_REPO_NAME] with the name of the repo you created in the previous steps and [YOUR_PYTHON_SCRIPT_NAME] with the name of the Python script you want to run. (This part can produce errors because the paths can vary from one project to another.)

Once the Compute Engine instance is created, you can access its logs in the Cloud Console's logging page. And if everything goes well, the plots will start appearing in your bucket.
