# MLJ Cheatsheet

## Starting an interactive MLJ session

```@repl cheat
using MLJ
MLJ_VERSION # version of MLJ for this cheatsheet
```

## Model search and code loading

`info("PCA")` retrieves registry metadata for the model called "PCA".

`info("RidgeRegressor", pkg="MultivariateStats")` retrieves metadata for "RidgeRegressor", which is provided by multiple packages.

`doc("DecisionTreeClassifier", pkg="DecisionTree")` retrieves the model document string for the classifier, without loading model code.

`models()` lists metadata of every registered model.

`models("Tree")` lists models with "Tree" in the model or package name.

`models(x -> x.is_supervised && x.is_pure_julia)` lists all supervised models written in pure Julia.

`models(matching(X))` lists all unsupervised models compatible with input `X`.

`models(matching(X, y))` lists all supervised models compatible with input `X` and target `y`.

With additional conditions:
```julia
models() do model
    matching(model, X, y) &&
        model.prediction_type == :probabilistic &&
        model.is_pure_julia
end
```

`Tree = @load DecisionTreeClassifier pkg=DecisionTree` imports the "DecisionTreeClassifier" type and binds it to `Tree`.

`tree = Tree()` instantiates a `Tree`.

`tree2 = Tree(max_depth=2)` instantiates a tree with a different hyperparameter.

`Ridge = @load RidgeRegressor pkg=MultivariateStats` imports a type for a model provided by multiple packages.

For interactive loading instead, use `@iload`.

## Scitypes and coercion

`scitype(x)` is the scientific type of `x`. For example, `scitype(2.4) == Continuous`.

![scitypes_small.png](img/scitypes_small.png)

type                                       | scitype
-------------------------------------------|----------------------------------
`AbstractFloat`                            | `Continuous`
`Integer`                                  | `Count`
`CategoricalValue` and `CategoricalString` | `Multiclass` or `OrderedFactor`
`AbstractString`                           | `Textual`

*Figure and table for common scalar scitypes.*

Use `schema(X)` to get the column scitypes of a table `X`.

`coerce(y, Multiclass)` attempts coercion of all elements of `y` into scitype `Multiclass`.

`coerce(X, :x1 => Continuous, :x2 => OrderedFactor)` coerces columns `:x1` and `:x2` of table `X`.

`coerce(X, Count => Continuous)` coerces all columns with `Count` scitype to `Continuous`.
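
A minimal coercion sketch on a hypothetical two-column table (column names and values are illustrative):

```julia
using MLJ

# Hypothetical table: a count-valued and a string-valued column
X = (x1 = [1, 2, 3, 4], x2 = ["a", "b", "a", "c"])

schema(X)                     # x1 has scitype Count, x2 has Textual
Xnew = coerce(X, :x1 => Continuous, :x2 => Multiclass)
schema(Xnew)                  # x1 is now Continuous, x2 is Multiclass
```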

## Ingesting data

Split the table `channing` into target `y` (the `:Exit` column) and features `X` (everything else), after a seeded row shuffling:

```julia
using RDatasets
channing = dataset("boot", "channing")
y, X = unpack(channing, ==(:Exit); rng=123)
```

Same as above, but excluding the `:Time` column from `X`:

```julia
using RDatasets
channing = dataset("boot", "channing")
y, X = unpack(channing,
              ==(:Exit),  # y is the :Exit column
              !=(:Time);  # X is the rest, except :Time
              rng=123)
```

Splitting row indices into train/validation/test sets, with seeded shuffling:

`train, valid, test = partition(eachindex(y), 0.7, 0.2, rng=1234)` for a 70:20:10 ratio.

For a stratified split:

`train, test = partition(eachindex(y), 0.8, stratify=y)`

To split a table or matrix `X`, instead of indices:

`Xtrain, Xvalid, Xtest = partition(X, 0.5, 0.3, rng=123)`

Getting data from [OpenML](https://www.openml.org):

`table = OpenML.load(91)`

Creating synthetic classification data:

`X, y = make_blobs(100, 2)` (also: `make_moons`, `make_circles`)

Creating synthetic regression data:

`X, y = make_regression(100, 2)`
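
Putting these together, a short sketch with synthetic data (the proportions and seed are illustrative):

```julia
using MLJ

X, y = make_blobs(100, 2)                     # synthetic classification data
train, test = partition(eachindex(y), 0.8,
                        stratify=y, rng=123)  # stratified 80:20 row split
```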

## Machine construction

Supervised case:

`model = KNNRegressor(K=1)` and `mach = machine(model, X, y)`

Unsupervised case:

`model = OneHotEncoder()` and `mach = machine(model, X)`

## Fitting

`fit!(mach, rows=1:100, verbosity=1, force=false)` (defaults shown)

## Prediction

Supervised case: `predict(mach, Xnew)` or `predict(mach, rows=1:100)`

Similarly, for probabilistic models: `predict_mode`, `predict_mean`, and `predict_median`.

Unsupervised case: `transform(mach, rows=1:100)` or `inverse_transform(mach, rows)`, etc.
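
An end-to-end sketch of construct-fit-predict, assuming the DecisionTree package is installed:

```julia
using MLJ

X, y = make_blobs(100, 2)   # synthetic classification data
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

mach = machine(Tree(max_depth=3), X, y)
fit!(mach, rows=1:80)               # train on the first 80 rows
predict(mach, rows=81:100)          # probabilistic predictions on held-out rows
predict_mode(mach, rows=81:100)     # corresponding point predictions
```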

## Inspecting objects

`@more` gets detail on the last object in the REPL.

`params(model)` gets a nested named-tuple of all hyperparameters, even nested ones.

`info(ConstantRegressor())`, `info("PCA")`, and `info("RidgeRegressor", pkg="MultivariateStats")` get all properties (aka traits) of registered models.

`info(rms)` gets all properties of a performance measure.

`schema(X)` gets the column names, types, and scitypes, and the number of rows, of a table `X`.

`scitype(X)` gets the scientific type of `X`.

`fitted_params(mach)` gets the learned parameters of a fitted machine.

`report(mach)` gets other training results (e.g. feature rankings).
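
For example (a sketch assuming MultivariateStats is installed):

```julia
using MLJ

X, y = make_regression(100, 2)
Ridge = @load RidgeRegressor pkg=MultivariateStats verbosity=0

mach = machine(Ridge(), X, y)
fit!(mach)

fitted_params(mach)   # learned coefficients and intercept
report(mach)          # any extra training byproducts
schema(X)             # column names, types, and scitypes of the input
```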

## Saving and retrieving machines using the Julia serializer

`MLJ.save("trained_for_five_days.jls", mach)` saves machine `mach` (without data).

`predict_only_mach = machine("trained_for_five_days.jls")` deserializes it.
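
A round-trip sketch (the file name is illustrative):

```julia
using MLJ

X, y = make_blobs(100, 2)
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

mach = machine(Tree(), X, y)
fit!(mach)

MLJ.save("tree.jls", mach)     # serialize the trained machine, minus data
mach2 = machine("tree.jls")    # deserialize; ready for prediction only
predict_mode(mach2, X)
```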

## Performance estimation

`evaluate(model, X, y, resampling=CV(), measure=rms, operation=predict, weights=..., verbosity=1)`

`evaluate!(mach, resampling=Holdout(), measure=[rms, mav], operation=predict, weights=..., verbosity=1)`

`evaluate!(mach, resampling=[(fold1, fold2), (fold2, fold1)], measure=rms)`
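
For example, a runnable sketch with synthetic data:

```julia
using MLJ

X, y = make_regression(100, 2)
Ridge = @load RidgeRegressor pkg=MultivariateStats verbosity=0

evaluate(Ridge(), X, y,
         resampling=CV(nfolds=5, rng=123),
         measure=[rms, mav])
```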

## Resampling strategies (`resampling=...`)

`Holdout(fraction_train=0.7, rng=1234)` for simple holdout

`CV(nfolds=6, rng=1234)` for cross-validation

`StratifiedCV(nfolds=6, rng=1234)` for stratified cross-validation

`TimeSeriesCV(nfolds=4)` for time-series cross-validation

or a list of pairs of row indices:

`[(train1, eval1), (train2, eval2), ..., (traink, evalk)]`

## Tuning

### Tuning model wrapper

`tuned_model = TunedModel(model=…, tuning=RandomSearch(), resampling=Holdout(), measure=…, operation=predict, range=…)`

### Ranges for tuning (`range=...`)

If `r = range(KNNRegressor(), :K, lower=1, upper=20, scale=:log)`

then `Grid()` search uses `iterator(r, 6) == [1, 2, 3, 6, 11, 20]`.

`lower=-Inf` and `upper=Inf` are allowed.

Non-numeric ranges: `r = range(model, :parameter, values=…)`

Nested ranges: use dot syntax, as in `r = range(EnsembleModel(atom=tree), :(atom.max_depth), ...)`

Multiple ranges can be specified, as in `range=[r1, r2, r3]`. For more range options, do `?Grid` or `?RandomSearch`.

### Tuning strategies

`RandomSearch(rng=1234)` for basic random search

`Grid(resolution=10)` or `Grid(goal=50)` for basic grid search

Also available: `LatinHypercube` and `Explicit` (built-in), plus `MLJTreeParzenTuning`, `ParticleSwarm`, and `AdaptiveParticleSwarm` (from third-party packages)
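
Putting the pieces together, a tuning sketch (assuming NearestNeighborModels is installed for `KNNRegressor`):

```julia
using MLJ

X, y = make_regression(100, 2)
KNN = @load KNNRegressor pkg=NearestNeighborModels verbosity=0
knn = KNN()

r = range(knn, :K, lower=1, upper=20)
tuned_knn = TunedModel(model=knn, tuning=Grid(resolution=10),
                       resampling=CV(nfolds=5), range=r, measure=rms)

mach = machine(tuned_knn, X, y)   # fitting tunes, then retrains on all data
fit!(mach)
fitted_params(mach).best_model    # the optimal K found by the search
```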

### Learning curves

To generate a plot of performance against the parameter specified by `range`:

`curve = learning_curve(mach, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)`

`curve = learning_curve(model, X, y, resolution=30, resampling=Holdout(), measure=…, operation=predict, range=…, n=1)`

If using Plots.jl:

`plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)`
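
A runnable sketch (assuming Plots.jl and NearestNeighborModels are installed):

```julia
using MLJ, Plots

X, y = make_regression(100, 2)
KNN = @load KNNRegressor pkg=NearestNeighborModels verbosity=0
knn = KNN()
r = range(knn, :K, lower=1, upper=20)

mach = machine(knn, X, y)
curve = learning_curve(mach; range=r, resampling=CV(nfolds=5), measure=rms)

plot(curve.parameter_values, curve.measurements,
     xlab=curve.parameter_name, xscale=curve.parameter_scale)
```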

## Controlling iterative models

Requires: `using MLJIteration`

`iterated_model = IteratedModel(model=…, resampling=Holdout(), measure=…, controls=…, retrain=false)`

### Controls

Increment training: `Step(n=1)`

Stopping: `TimeLimit(t=0.5)` (in hours), `NumberLimit(n=100)`, `NumberSinceBest(n=6)`, `NotANumber()`, `Threshold(value=0.0)`, `GL(alpha=2.0)`, `PQ(alpha=0.75, k=5)`, `Patience(n=5)`

Logging: `Info(f=identity)`, `Warn(f="")`, `Error(predicate, f="")`

Callbacks: `Callback(f=mach->nothing)`, `WithNumberDo(f=n->@info(n))`, `WithIterationsDo(f=i->@info("num iterations: $i"))`, `WithLossDo(f=x->@info("loss: $x"))`, `WithTrainingLossesDo(f=v->@info(v))`

Snapshots: `Save(filename="machine.jlso")`

Wraps: `MLJIteration.skip(control, predicate=1)`, `IterationControl.with_state_do(control)`
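
For instance, a sketch wrapping an iterative gradient-boosting model (EvoTrees is an assumption here; any model supporting MLJ's iteration interface works):

```julia
using MLJ

X, y = make_regression(200, 2)
EvoTree = @load EvoTreeRegressor pkg=EvoTrees verbosity=0   # assumed installed

iterated = IteratedModel(model=EvoTree(),
                         resampling=Holdout(fraction_train=0.7),
                         measure=rms,
                         controls=[Step(10),          # add 10 iterations per cycle
                                   Patience(5),       # stop after 5 non-improving losses
                                   NumberLimit(50)])  # hard cap on control cycles

mach = machine(iterated, X, y)
fit!(mach)   # trains until a stopping control triggers
```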

## Performance measures (metrics)

Do `measures()` to get the full list.

`info(rms)` lists the properties (aka traits) of the `rms` measure.

## Transformers

Built-ins include: `Standardizer`, `OneHotEncoder`, `UnivariateBoxCoxTransformer`, `FeatureSelector`, `FillImputer`, `UnivariateDiscretizer`, `ContinuousEncoder`, `UnivariateTimeTypeToContinuous`

External ones include: `PCA` (in MultivariateStats), and `KMeans` and `KMedoids` (in Clustering).

Do `models(m -> !m.is_supervised)` to get the full list.

## Ensemble model wrapper

`EnsembleModel(atom=…, weights=Float64[], bagging_fraction=0.8, rng=GLOBAL_RNG, n=100, parallel=true, out_of_bag_measure=[])`
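
For example, bagging 100 decision trees (a sketch assuming DecisionTree is installed):

```julia
using MLJ

X, y = make_blobs(100, 2)
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

forest = EnsembleModel(atom=Tree(), n=100, bagging_fraction=0.8)
mach = machine(forest, X, y)
fit!(mach)
predict_mode(mach, X)   # point predictions from the bagged ensemble
```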

## Target transformation wrapper

`TransformedTargetModel(model=ConstantClassifier(), target=Standardizer())`

## Pipelines

`pipe = (X -> coerce(X, :height=>Continuous)) |> OneHotEncoder |> KNNRegressor(K=3)`

Unsupervised:

`pipe = Standardizer |> OneHotEncoder`

Concatenation:

`pipe1 |> pipe2`, or `model |> pipe`, or `pipe |> model`, etc.
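
A runnable pipeline sketch (assuming NearestNeighborModels for `KNNRegressor`; the data is illustrative):

```julia
using MLJ

KNN = @load KNNRegressor pkg=NearestNeighborModels verbosity=0

# Hypothetical mixed-scitype data:
X = (height = [1.85, 1.67, 1.50, 1.67],
     gender = coerce(["m", "f", "f", "m"], Multiclass))
y = [70.0, 60.0, 52.0, 64.0]

pipe = OneHotEncoder() |> KNN(K=3)   # encode categoricals, then regress
mach = machine(pipe, X, y)
fit!(mach)
predict(mach, X)
```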

## Defining a supervised learning network

`Xs = source(X)` and `ys = source(y)`

... define further nodal machines and nodes ...

`yhat = predict(knn_machine, W)` (final node)
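
A complete minimal network sketch (assuming `Standardizer` supplies the intermediate node `W`, and NearestNeighborModels provides `KNNRegressor`):

```julia
using MLJ

X, y = make_regression(100, 2)
KNN = @load KNNRegressor pkg=NearestNeighborModels verbosity=0

Xs = source(X)
ys = source(y)

stand_machine = machine(Standardizer(), Xs)
W = transform(stand_machine, Xs)         # standardized features (a node)

knn_machine = machine(KNN(), W, ys)
yhat = predict(knn_machine, W)           # final node

fit!(yhat)                               # trains all machines in the network
yhat()                                   # call the node to get predictions
```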

## Exporting a learning network as a stand-alone model

Supervised, with final node `yhat` returning point predictions:

```julia
@from_network machine(Deterministic(), Xs, ys; predict=yhat) begin
    mutable struct Composite
        reducer=network_pca
        regressor=network_knn
    end
end
```

Here `network_pca` and `network_knn` are models appearing in the learning network.

Supervised, with final node `yhat` returning probabilistic predictions:

```julia
@from_network machine(Probabilistic(), Xs, ys; predict=yhat) begin
    mutable struct Composite
        reducer=network_pca
        classifier=network_tree
    end
end
```

Unsupervised, with final node `Xout`:

```julia
@from_network machine(Unsupervised(), Xs; transform=Xout) begin
    mutable struct Composite
        reducer1=network_pca
        reducer2=clusterer
    end
end
```