Auto-generate model training and C# code for a Binary Classification task (Sentiment Analysis scenario)
In this example you automatically train a model and generate the related C# code by simply providing a dataset (the WikiDetox dataset in this case) to the ML.NET CLI tool.
NOTE: A very similar Sentiment Analysis scenario is covered in a longer document, with step-by-step explanations that start from scratch, in this tutorial:
Tutorial: Auto generate a binary classifier using the CLI
The ML.NET CLI (command-line interface) is a tool you run from any command prompt (Windows, Mac or Linux) to generate good quality ML.NET models and C# source code based on the training datasets you provide.
The ML.NET CLI is part of ML.NET. Its main purpose is to "democratize" ML.NET for .NET developers who are learning it, by making it very simple to generate a good quality ML.NET model (a serialized model .zip file) plus the sample C# code to run/score that model. In addition, the C# code to create/train that model is also generated for you, so you can research which algorithm and settings were used for that generated "best model".
From a command prompt (PowerShell, Bash or CMD), move to the 'BinaryClassification CLI sample' folder:
> cd <YOUR_PATH>samples/CLI/BinaryClassification_CLI
Now run the following ML.NET CLI command:
> mlnet auto-train --task binary-classification --dataset wikiDetoxAnnotated40kRows.tsv --label-column-name Label --max-exploration-time 180
While it runs, the command prints the progress of the exploration to the console.
This process performs multiple training explorations, trying multiple trainers/algorithms and multiple hyper-parameters, with a different combination of configuration for each model.
IMPORTANT: In this case the CLI explores multiple trainings, looking for "best models", for only 3 minutes. That is enough when you are just learning the CLI usage and the generated C# code for the model. When you are trying to optimize the model to achieve high quality, however, you might need to run the CLI 'auto-train' command for many more minutes or even hours, depending on the size of the dataset.
As a rule of thumb, a high quality model might need hundreds of iterations (hundreds of models explored automatically by the CLI).
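A comparable time-boxed exploration can be sketched directly against the AutoML API that ships in the Microsoft.ML.AutoML NuGet package. This is only an illustration of the idea and not the code the CLI runs internally. It assumes the dataset file sits in the working directory, and the class name `AutoTrainSketch` is hypothetical.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical class name; a sketch of a time-boxed AutoML exploration.
public static class AutoTrainSketch
{
    public static void Main()
    {
        // Assumption: the dataset file is in the current working directory.
        const string dataPath = "wikiDetoxAnnotated40kRows.tsv";
        // Mirrors "--max-exploration-time 180" from the CLI command.
        const uint explorationSeconds = 180;

        var mlContext = new MLContext();

        // Let AutoML infer the column layout from the file instead of hard-coding it.
        ColumnInferenceResults columns = mlContext.Auto()
            .InferColumns(dataPath, labelColumnName: "Label", groupColumns: false);

        TextLoader loader = mlContext.Data.CreateTextLoader(columns.TextLoaderOptions);
        IDataView trainData = loader.Load(dataPath);

        // Try trainers and hyper-parameters until the time budget is spent.
        ExperimentResult<BinaryClassificationMetrics> result = mlContext.Auto()
            .CreateBinaryClassificationExperiment(explorationSeconds)
            .Execute(trainData, columns.ColumnInformation);

        // Report how many models were explored and which one scored best.
        RunDetail<BinaryClassificationMetrics> best = result.BestRun;
        Console.WriteLine($"Models explored: {result.RunDetails.Count()}");
        Console.WriteLine($"Best trainer:    {best.TrainerName}");
        Console.WriteLine($"Accuracy:        {best.ValidationMetrics.Accuracy:F4}");
        Console.WriteLine($"AUC:             {best.ValidationMetrics.AreaUnderRocCurve:F4}");
    }
}
```

Raising `explorationSeconds` gives the experiment time to try more models, which is the same trade-off as a larger --max-exploration-time value on the command line.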
When the command finishes the training explorations, it prints a summary of the results, including quality metrics for the models it explored.
To understand the 'quality metrics', read this doc: Model evaluation metrics in ML.NET.
That command generates the following assets in a new folder (if no --name parameter was specified, its name is 'SampleBinaryClassification'):
- A serialized "best model" (MLModel.zip) ready to use.
- Sample C# code to run/score that generated model (to make predictions in your end-user apps with that model).
- Sample C# code with the training code used to generate that model (for learning purposes or direct training with the API).
The first two assets (the model .zip file and the C# code to run that model) can be used directly in your end-user apps (ASP.NET Core web app, services, desktop app, etc.) to make predictions with that generated ML model.
The third asset, the training code, shows you what ML.NET API code was used by the CLI to train the generated model, so you can investigate which specific trainer/algorithm and hyper-parameters were selected by the CLI.
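As a rough idea of what consuming the serialized model looks like, the following minimal sketch loads MLModel.zip and scores one comment. The `ModelInput` and `ModelOutput` classes here are hypothetical stand-ins, and the `comment` column name is an assumption about the dataset. In a real app, use the input/output classes generated by the CLI, because the input class must expose every column the trained pipeline reads.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema: assumes the model reads a text column named "comment".
// Prefer the input class generated by the CLI, which matches the trained pipeline.
public class ModelInput
{
    [ColumnName("comment")]
    public string Comment { get; set; }
}

// Hypothetical output schema exposing the standard binary classification outputs.
public class ModelOutput
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    public float Score { get; set; }
}

public static class ConsumeModelSketch
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Assumption: MLModel.zip was copied next to the executable.
        ITransformer model = mlContext.Model.Load("MLModel.zip", out DataViewSchema inputSchema);

        // A prediction engine scores one example at a time and is not thread-safe.
        var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

        var sample = new ModelInput { Comment = "Thanks for the helpful edit!" };
        ModelOutput result = engine.Predict(sample);

        Console.WriteLine($"Predicted label: {result.Prediction}, score: {result.Score}");
    }
}
```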
Go ahead and explore the generated C# projects and compare their code with the 'Sentiment Analysis' sample in this repo. The accuracy and performance of the model generated by the CLI should be better than those of the sample in the repo, which uses simpler ML.NET code with no additional hyper-parameters.
For instance, the configuration of one of the trainers used in the Sentiment Analysis ML.NET sample (SdcaLogisticRegression) is simplified to make ML.NET easier to learn (although it might not produce an optimal model), so it looks like the following code, with no hyper-parameters:
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features");
On the other hand, after 1 hour of exploration time with the CLI, the selected trainer/algorithm (SdcaLogisticRegression) was configured with the following code, which includes additional hyper-parameters, all of it generated for you:
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
    new SdcaLogisticRegressionBinaryTrainer.Options()
    {
        ConvergenceTolerance = 0.2f,
        MaximumNumberOfIterations = 100,
        Shuffle = true,
        BiasLearningRate = 1f,
        LabelColumnName = "Label",
        FeatureColumnName = "Features"
    });
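To see where such a trainer configuration sits in a complete pipeline, here is a minimal sketch that plugs the same options into a text-featurization pipeline and trains on a few in-memory rows. The rows, the class names (`CommentRow`, `CommentPrediction`) and the choice of FeaturizeText are assumptions made for illustration. The pipeline generated by the CLI for your dataset may use different transforms.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

// Hypothetical row type standing in for the real dataset schema.
public class CommentRow
{
    public bool Label { get; set; }
    public string Text { get; set; }
}

// Hypothetical prediction type for a calibrated binary classifier.
public class CommentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool IsToxic { get; set; }

    public float Probability { get; set; }
}

public static class TrainerOptionsSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up toy rows, only so the sketch runs without the dataset file.
        var rows = new[]
        {
            new CommentRow { Label = true,  Text = "You are a terrible person" },
            new CommentRow { Label = true,  Text = "This is stupid and so are you" },
            new CommentRow { Label = false, Text = "Thank you for the helpful edit" },
            new CommentRow { Label = false, Text = "Great article, very informative" },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Same hyper-parameters as in the configuration shown above.
        var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            new SdcaLogisticRegressionBinaryTrainer.Options
            {
                ConvergenceTolerance = 0.2f,
                MaximumNumberOfIterations = 100,
                Shuffle = true,
                BiasLearningRate = 1f,
                LabelColumnName = "Label",
                FeatureColumnName = "Features"
            });

        // Turn raw text into a numeric "Features" vector, then train.
        var pipeline = mlContext.Transforms.Text
            .FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(CommentRow.Text))
            .Append(trainer);

        ITransformer model = pipeline.Fit(data);

        var engine = mlContext.Model.CreatePredictionEngine<CommentRow, CommentPrediction>(model);
        CommentPrediction prediction = engine.Predict(new CommentRow { Text = "What a helpful comment" });

        Console.WriteLine($"Toxic: {prediction.IsToxic}, probability: {prediction.Probability:F3}");
    }
}
```

With only four rows the prediction itself is meaningless. The point is how the options object is passed to the trainer and how the trainer is appended to the pipeline.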
If you run the CLI for a longer time, exploring additional algorithms/trainers, the algorithm configuration would probably change and improve.
Finding those hyper-parameters by yourself could be a very long and tedious trial-and-error process. The CLI and AutoML simplify this considerably.
You can generate the assets explained above from your own datasets without writing the code yourself, so the CLI also improves your productivity even if you already know ML.NET. Try your own dataset with the CLI!
For a step-by-step CLI tutorial that starts from scratch, see "Tutorial: Auto generate a binary classifier using the CLI".