
fully implement MLJ #22

Merged: 12 commits merged into master from the implement_mlj branch on Dec 3, 2024

Conversation

tiemvanderdeure
Owner

Adds an MLJ-compliant docstring and tests, so this can be added to the registry.

See JuliaAI/MLJModels.jl#572

@tiemvanderdeure
Owner Author

@abloam Do you mind reviewing this?

Currently the integration tests fail. This is because in the test interface the binary data is partitioned in two such that the machine only sees data of one class, which this model cannot handle (the lasso regression outputs just don't make any sense). I've added an informative error, but the test will still fail.

This is where the partitioning happens:
https://github.com/JuliaAI/MLJTestInterface.jl/blob/7d32440f67d271c5c52bc6a087c4937a1fe8531f/src/attemptors.jl#L81-L82
Is this intentional or would it be better to shuffle before partitioning?
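Something along these lines would avoid the single-class split (assuming the test uses MLJBase.partition; the actual call in MLJTestInterface may differ):

# hypothetical: shuffle rows before splitting so both classes appear in each part
train, test = MLJBase.partition(eachindex(y), 0.5; shuffle=true, rng=123)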

@tiemvanderdeure
Owner Author

@ablaom Just noticed I misspelled your username. What do you think is the best way forward here?

@ablaom

ablaom commented Nov 23, 2024

This has come up before. I think the best resolution is to handle the case of only one seen class. Can't you just predict that class (or that class with high probability, if predictions are probabilistic)?

@tiemvanderdeure
Owner Author

Can't you just predict that class (or that class with high probability, if predictions are probabilistic)?

That wouldn't be my preferred solution. Under the hood Maxnet uses a lasso regression, which just doesn't make any sense if there is only one class. I could hard-code something in (it would be giving all predictions a probability of 0.5), but as this algorithm should never be used in this way I would rather just let it error.

Would it be acceptable to change the test itself from @test isempty(failures) to testing that only that particular case fails with the expected error?
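Concretely, the adjusted test could look roughly like this (the shape of the failures entries is an assumption on my part, so the field access may need adapting):

failures, summary = MLJTestInterface.test(
    [MaxnetBinaryClassifier,],
    MLJTestInterface.make_binary()...;
    mod=@__MODULE__,
    verbosity=0,
    throw=false,
)
# instead of @test isempty(failures): allow only the expected single-class failure
@test all(f -> occursin("one class", string(f.exception)), failures)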

@ablaom

ablaom commented Nov 24, 2024

Would it be acceptable to change the test itself from @test isempty(failures) to testing that only that particular case fails with the expected error?

That would be acceptable.

@ablaom

ablaom commented Nov 24, 2024

However, it would mean I need to exclude the model from MLJ's integration test suite. Perhaps you could make some prediction but throw a warning?

@tiemvanderdeure
Owner Author

I just made docs pass by testing for the error. I think an error is more appropriate than a warning for this case, and allowing for it would also require me to add more code.

However, it would mean I need to exclude the model from MLJ's integration test suite.

Does that mean it can't be in the model registry, or not?

@ablaom

ablaom commented Nov 28, 2024

My apologies @tiemvanderdeure, I think I messed up. I don't actually understand the source of your original error. The data set you are using, MLJTestInterface.make_binary(), has both target classes appearing, and MLJTestInterface does not do any subsampling, just a few basic things like fitting to all the data. Are you subsampling internally, and is that limiting the datasets you can operate on?

If that is the case, then I'd be happy for you to substitute a bigger dataset in your tests. But it is likely your model will fail MLJ integration tests, which do involve subsampling with small datasets (an intentional decision). We can exclude your model from those tests (which live at MLJ.jl).

@tiemvanderdeure
Owner Author

MLJTestInterface trains on a subset of rows though, and that subset happens to contain only one class.

It's in these lines:
https://github.com/JuliaAI/MLJTestInterface.jl/blob/7d32440f67d271c5c52bc6a087c4937a1fe8531f/src/attemptors.jl#L81-L82

There's no internal subsampling and no minimum amount of data - even one false and one true would work. I added a check for allequal and that's exactly the error that it spits out.
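For reference, the guard is essentially the following (the message wording here is illustrative, not a verbatim quote from the package):

allequal(y) && throw(ArgumentError(
    "Maxnet requires both presences and absences in the training data, but y contains only one class."
))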

@ablaom

ablaom commented Nov 29, 2024

MLJTestInterface trains on a subset of rows though, and that subset happens to contain only one class.

Okay, I stand corrected. I will investigate and possibly change the test there. I'm guessing this never came up because we mostly use multiclass data...

Thanks @tiemvanderdeure for your continued patience.

@ablaom

ablaom commented Dec 1, 2024

@tiemvanderdeure Can you please try re-instating the original version of the test and see if that works now?

@tiemvanderdeure
Owner Author

Tests pass now, thanks @ablaom!

@tiemvanderdeure merged commit 1742c96 into master on Dec 3, 2024
5 checks passed
@tiemvanderdeure deleted the implement_mlj branch on December 3, 2024 at 12:22
@ablaom

ablaom commented Dec 4, 2024

Good to see the progress. Thank you for helping me sort this one out.

Okay, I've had a closer look and have some concerns about the docstring. They may be just doc-issues but I think they are linked to some other weaknesses:

Your model should run without the need to force scitype_check_level = 0. The problem with your docstring example is that you are passing Count (i.e. integer) data where Multiclass is expected. One possible fix is to include Count in your input_scitype and/or target_scitype declarations, if you really mean to support Count data. As far as the target is concerned, I doubt you intend this. As far as inputs (features) are concerned, perhaps you do mean to support Count, with the understanding that your model (some kind of NN?) treats integers the same way as continuous data, which should probably be documented, but we wouldn't press you on this.
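Schematically, widening the declarations might look something like the following. This is only a sketch: I'm assuming you declare traits with MLJModelInterface.metadata_model, and the exact scitype unions are up to you.

import MLJModelInterface as MMI

# sketch: also accept Count feature columns, keeping the target a two-class categorical
MMI.metadata_model(MaxnetBinaryClassifier;
    input_scitype = MMI.Table(MMI.Continuous, MMI.Count, MMI.Finite),
    target_scitype = AbstractVector{<:MMI.Finite{2}},
)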

The MLJ idiom for treating categorical data encoded as integers is to coerce them to Multiclass (or Binary also works):

using MLJ # or MLJBase
y, X = Maxnet.bradypus();
y = coerce(y, Multiclass)

Your example still doesn't work, because X has Count features (not allowed by your input_scitype declarations):

schema(X)
┌─────────────┬────────────────┬─────────────────────────────────┐
│ names       │ scitypes       │ types                           │
├─────────────┼────────────────┼─────────────────────────────────┤
│ cld6190_ann │ Count          │ Int64                           │
│ dtr6190_ann │ Count          │ Int64                           │
│ ecoreg      │ Multiclass{14} │ CategoricalValue{Int64, UInt32} │
│ frs6190_ann │ Count          │ Int64                           │
│ h_dem       │ Count          │ Int64                           │
│ pre6190_ann │ Count          │ Int64                           │
│ pre6190_l1  │ Count          │ Int64                           │
│ pre6190_l10 │ Count          │ Int64                           │
│ pre6190_l4  │ Count          │ Int64                           │
│ pre6190_l7  │ Count          │ Int64                           │
│ tmn6190_ann │ Count          │ Int64                           │
│ tmp6190_ann │ Count          │ Int64                           │
│ tmx6190_ann │ Count          │ Int64                           │
│ vap6190_ann │ Count          │ Int64                           │
└─────────────┴────────────────┴─────────────────────────────────┘

If you don't extend the input_scitype declaration, you will need the following additional step to get the example to work without scitype_check_level=0:

X = coerce(X, Count=>Continuous)
mach = machine(MaxnetBinaryClassifier(features = "lqp"), X, y) |> fit!
MLJBase.predict(mach, X)

The good news is that MLJTestIntegration tests work on your canned dataset (with the coercions indicated). These tests include things like tuning and so on. (However, for reasons already discussed, we will not be including your model in the integration part of MLJ CI.)

Further comments on the docstring:

  • It is not complete. You can find a checklist of everything required here. In particular, we need:

    • An exhaustive description of the hyper-parameters (fields of your MaxnetBinaryClassifier). I understand you may want to avoid doc duplication. If you prefer not to include this, please give a complete url for where to find this information. And if possible, ensure this external documentation qualifies all objects in the Maxnet namespace with Maxnet.object (see Tip below).

    • A clear statement of supported scitypes of the training data (matching your input_scitype/target_scitype declarations).

    You can find lots of examples in the Model Browser. The DecisionTree.DecisionTreeClassifier docstring is probably a good model to follow.

  • I don't understand the docstring sentence "The keywords link, and clamp are passed to predict, while all other keywords are passed to maxnet": aren't link and clamp set when instantiating MaxnetBinaryClassifier? If they are passed to predict, that is presumably using a non-MLJ API. What are the "other keywords", and what is the function maxnet? Does the MLJ user need to know about this function?

  • Tip: Generally, in writing an MLJ docstring, it is understood that the MLJ namespace has been imported but the package providing the implementation of the MLJ API (Maxnet.jl, in this case) is only imported (true if the user loads the model code with @load; see the sketch below). So all references to Maxnet objects need to be qualified, as in Maxnet.maxnet (if this is really needed). It's best if such names can be avoided (e.g., with the use of symbols like :auto instead of MyPkg.Auto()), but we don't insist on this.
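To spell out the Tip, the docstring can assume a user session along the following lines (the registry name pkg=Maxnet is my assumption here):

using MLJ
MaxnetBinaryClassifier = @load MaxnetBinaryClassifier pkg=Maxnet  # loads Maxnet.jl without bringing its names into scope
model = MaxnetBinaryClassifier(features = "lqp")
# any package-specific name therefore has to be written qualified, e.g. Maxnet.maxnet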

@tiemvanderdeure
Owner Author

Thanks for the elaborate feedback! I'm going to lump this together with some other improvements to the docs over in #24.

As to the input types, I'll add a coercion step. This model is similar to e.g. a GLM in that it treats count and continuous data in exactly the same way. So nothing breaks if the input is integer, but I think it would be most appropriate to exclude it from the input scitype declaration (as MLJGLMInterface does, I'm pretty sure).

@ablaom

ablaom commented Dec 10, 2024

Okay, thank you. Ping me when you're done, or if you want a review at #24.
