
fully implement MLJ #22

Merged: 12 commits merged into master from the implement_mlj branch on Dec 3, 2024

Conversation

tiemvanderdeure
Owner

Adds an MLJ-compliant docstring and tests, so this can be added to the registry.

See JuliaAI/MLJModels.jl#572

@tiemvanderdeure
Owner Author

@abloam Do you mind reviewing this?

Currently the integration tests fail. This is because in the test interface the binary data is partitioned in two such that the machine only sees data of one class, which this model cannot handle (the lasso regression outputs just don't make any sense). I've added an informative error, but the test will still fail.

This is where the partitioning happens:
https://github.com/JuliaAI/MLJTestInterface.jl/blob/7d32440f67d271c5c52bc6a087c4937a1fe8531f/src/attemptors.jl#L81-L82
Is this intentional or would it be better to shuffle before partitioning?
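Something along these lines would avoid the single-class split (assuming the test uses MLJBase.partition; the actual call in MLJTestInterface may differ):

# hypothetical: shuffle rows before splitting so both classes appear in each part
train, test = MLJBase.partition(eachindex(y), 0.5; shuffle=true, rng=123)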

@tiemvanderdeure
Owner Author

@ablaom Just noticed I misspelled your username. What do you think is the best way forward here?

@ablaom

ablaom commented Nov 23, 2024

This has come up before. I think the best resolution is to handle the case of only one seen class. Can't you just predict that class (or that class with high probability, if predictions are probabilistic)?

@tiemvanderdeure
Owner Author

Can't you just predict that class (or that class with high probability, if predictions are probabilistic)?

That wouldn't be my preferred solution. Under the hood Maxnet uses a lasso regression, which just doesn't make any sense if there is only one class. I could hard-code something in (it would be giving all predictions a probability of 0.5), but as this algorithm should never be used in this way I would rather just let it error.

Would it be acceptable to change the test itself from @test isempty(failures) to testing that only that particular case fails with the expected error?
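Concretely, the adjusted test could look roughly like this (the shape of the failures entries is an assumption on my part, so the field access may need adapting):

failures, summary = MLJTestInterface.test(
    [MaxnetBinaryClassifier,],
    MLJTestInterface.make_binary()...;
    mod=@__MODULE__,
    verbosity=0,
    throw=false,
)
# instead of @test isempty(failures): allow only the expected single-class failure
@test all(f -> occursin("one class", string(f.exception)), failures)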

@ablaom

ablaom commented Nov 24, 2024

Would it be acceptable to change the test itself from @test isempty(failures) to testing that only that particular case fails with the expected error?

That would be acceptable.

@ablaom

ablaom commented Nov 24, 2024

However, it would mean I need to exclude the model from MLJ's integration test suite. Perhaps you could make some prediction but throw a warning?

@tiemvanderdeure
Owner Author

I just made docs pass by testing for the error. I think an error is more appropriate than a warning for this case, and allowing for it would also require me to add more code.

However, it would mean I need to exclude the model from MLJ's integration test suite.

Does that mean it can't be in the model registry, or not?

@ablaom

ablaom commented Nov 28, 2024

My apologies @tiemvanderdeure, I think I messed up. I don't actually understand the source of your original error. The data set you are using, MLJTestInterface.make_binary(), has both target classes appearing, and MLJTestInterface does not do any subsampling, just a few basic things like fitting to all the data. Are you subsampling internally, and is that limiting the datasets you can operate on?

If that is the case, then I'd be happy for you to substitute a bigger dataset in your tests. But it is likely your model will fail MLJ integration tests, which do involve subsampling with small datasets (an intentional decision). We can exclude your model from those tests (which live at MLJ.jl).

@tiemvanderdeure
Owner Author

MLJTestInterface trains on a subset of rows though, and that subset happens to contain only one class.

It's in these lines:
https://github.com/JuliaAI/MLJTestInterface.jl/blob/7d32440f67d271c5c52bc6a087c4937a1fe8531f/src/attemptors.jl#L81-L82

There's no internal subsampling and no minimum amount of data - even one false and one true would work. I added a check for allequal and that's exactly the error that it spits out.
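For reference, the guard is essentially the following (the message wording here is illustrative, not a verbatim quote from the package):

allequal(y) && throw(ArgumentError(
    "Maxnet requires both presences and absences in the training data, but y contains only one class."
))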

@ablaom

ablaom commented Nov 29, 2024

MLJTestInterface trains on a subset of rows though, and that subset happens to contain only one class.

Okay, I stand corrected. I will investigate and possibly change the test there. I'm guessing this never came up because we mostly use multiclass data...

Thanks @tiemvanderdeure for your continued patience.

@ablaom

ablaom commented Dec 1, 2024

@tiemvanderdeure Can you please try re-instating the original version of the test and see if that works now?

@tiemvanderdeure
Owner Author

Tests pass now, thanks @ablaom!

@tiemvanderdeure merged commit 1742c96 into master on Dec 3, 2024
5 checks passed
@tiemvanderdeure deleted the implement_mlj branch on December 3, 2024 at 12:22
@ablaom

ablaom commented Dec 4, 2024

Good to see the progress. Thank you for helping me sort this one out.

Okay, I've had a closer look and have some concerns about the docstring. They may be just doc-issues but I think they are linked to some other weaknesses:

Your model should run without the need to force scitype_check_level = 0. The problem with your docstring example is that you are passing Count (i.e. integer) data where Multiclass is expected. One possible fix is to include Count in your input_scitype and/or target_scitype declarations, if you really mean to support Count data. As far as the target is concerned, I doubt you intend this. As far as inputs (features) are concerned, perhaps you do mean to support Count, with the understanding that your model (some kind of NN?) treats integers the same way as continuous data, which should probably be documented, but we wouldn't press you on this.
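Schematically, widening the declarations might look something like the following. This is only a sketch: I'm assuming you declare traits with MLJModelInterface.metadata_model, and the exact scitype unions are up to you.

import MLJModelInterface as MMI

# sketch: also accept Count feature columns, keeping the target a two-class categorical
MMI.metadata_model(MaxnetBinaryClassifier;
    input_scitype = MMI.Table(MMI.Continuous, MMI.Count, MMI.Finite),
    target_scitype = AbstractVector{<:MMI.Finite{2}},
)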

The MLJ idiom for treating categorical data encoded as integers is to coerce them to Multiclass (or Binary also works):

using MLJ # or MLJBase
y, X = Maxnet.bradypus();
y = coerce(y, Multiclass)

Your example still doesn't work, because X has Count features (not allowed by your input_scitype declarations):

schema(X)
┌─────────────┬────────────────┬─────────────────────────────────┐
│ names       │ scitypes       │ types                           │
├─────────────┼────────────────┼─────────────────────────────────┤
│ cld6190_ann │ Count          │ Int64                           │
│ dtr6190_ann │ Count          │ Int64                           │
│ ecoreg      │ Multiclass{14} │ CategoricalValue{Int64, UInt32} │
│ frs6190_ann │ Count          │ Int64                           │
│ h_dem       │ Count          │ Int64                           │
│ pre6190_ann │ Count          │ Int64                           │
│ pre6190_l1  │ Count          │ Int64                           │
│ pre6190_l10 │ Count          │ Int64                           │
│ pre6190_l4  │ Count          │ Int64                           │
│ pre6190_l7  │ Count          │ Int64                           │
│ tmn6190_ann │ Count          │ Int64                           │
│ tmp6190_ann │ Count          │ Int64                           │
│ tmx6190_ann │ Count          │ Int64                           │
│ vap6190_ann │ Count          │ Int64                           │
└─────────────┴────────────────┴─────────────────────────────────┘

If you don't extend the input_scitype declaration, you will need the following additional step to get the example to work without scitype_check_level=0:

X = coerce(X, Count=>Continuous)
mach = machine(MaxnetBinaryClassifier(features = "lqp"), X, y) |> fit!
MLJBase.predict(mach, X)

The good news is that MLJTestIntegration tests work on your canned dataset (with the coercions indicated). These tests include things like tuning and so on. (However, for reasons already discussed, we will not be including your model in the integration part of MLJ CI.)

Further comments on the docstring:

  • It is not complete. You can find a checklist of everything required here. In particular, we need:

    • An exhaustive description of the hyper-parameters (fields of your MaxnetBinaryClassifier). I understand you may want to avoid doc duplication. If you prefer not to include this, please give a complete url for where to find this information. And if possible, ensure this external documentation qualifies all objects in the Maxnet namespace with Maxnet.object (see Tip below).

    • A clear statement of supported scitypes of the training data (matching your input_scitype/target_scitype declarations).

    You can find lots of examples in the Model Browser. The DecisionTree.DecisionTreeClassifier docstring is probably a good model to follow.

  • I don't understand the docstring sentence "The keywords link, and clamp are passed to predict, while all other keywords are passed to maxnet": aren't link and clamp set when instantiating MaxnetBinaryClassifier? If they are passed to predict, that is presumably using a non-MLJ API. What are the "other keywords", and what is the function maxnet? Does the MLJ user need to know about this function?

  • Tip: Generally, in writing an MLJ docstring, it is understood that the MLJ namespace has been imported but the package providing the implementation of the MLJ API (Maxnet.jl, in this case) is only imported (true if the user loads the model code with @load; see the sketch below). So all references to Maxnet objects need to be qualified, as in Maxnet.maxnet (if this is really needed). It's best if such names can be avoided (e.g., with the use of symbols like :auto instead of MyPkg.Auto()), but we don't insist on this.
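To spell out the Tip, the docstring can assume a user session along the following lines (the registry name pkg=Maxnet is my assumption here):

using MLJ
MaxnetBinaryClassifier = @load MaxnetBinaryClassifier pkg=Maxnet  # loads Maxnet.jl without bringing its names into scope
model = MaxnetBinaryClassifier(features = "lqp")
# any package-specific name therefore has to be written qualified, e.g. Maxnet.maxnet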

@tiemvanderdeure
Owner Author

Thanks for the elaborate feedback! I'm going to lump this together with some other improvements to the docs over in #24.

As to the input types, I'll add a coercion step. This model is similar to e.g. a GLM in that it treats count and continuous data in exactly the same way. So nothing breaks if the input is integer, but I think it would be most appropriate to exclude it from the input scitype declaration (as MLJGLMInterface does, I'm pretty sure).

@ablaom

ablaom commented Dec 10, 2024

Okay, thank you. Ping me when you're done, or if you want a review at #24.
