fully implement MLJ #22
Conversation
@abloam Do you mind reviewing this? Currently the integration tests fail. This is because in the test interface the binary data is partitioned into two, such that the machine only sees data of one class, which this model cannot handle (the lasso regression outputs just don't make any sense). I've added an informative error, but the test will still fail. This is where the partitioning happens:
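(For illustration: a minimal, hypothetical sketch of how an unshuffled split of a sorted binary target leaves only one class in the training fold. This is not the actual test code; `partition` and `categorical` are MLJBase exports.)

```julia
using MLJBase: partition, categorical

# 100 sorted binary labels: the first 50 are true, the last 50 false.
y = categorical(vcat(fill(true, 50), fill(false, 50)))

# An unshuffled 50/50 split: `train` holds indices of one class only.
train, test = partition(eachindex(y), 0.5)
unique(y[train])  # 1-element vector: true
```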
@ablaom Just noticed I misspelled your username. What do you think is the best way forward here?
This has come up before. I think the best resolution is to handle the case of only one seen class. Can't you just predict that class (or that class with high probability, if predictions are probabilistic)?
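(A sketch of such a fallback, for illustration only. The helper name is hypothetical; `UnivariateFinite` is re-exported by MLJBase.)

```julia
using MLJBase: UnivariateFinite

# Hypothetical fallback: with a single seen class, return a probabilistic
# prediction putting all mass on that class, once per input row.
predict_single_class(seen_class, n) =
    fill(UnivariateFinite([seen_class], [1.0], pool=missing), n)

predict_single_class("presence", 3)  # 3 identical point-mass predictions
```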
That wouldn't be my preferred solution. Under the hood, Maxnet uses a lasso regression, which just doesn't make any sense if there is only one class. I could hard-code something in (it would be giving all predictions a probability of 0.5), but as this algorithm should never be used in this way, I would rather just let it error. Would it be acceptable to change the test itself from
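(A minimal sketch of the kind of informative error described. The function name is assumed; this is not the actual Maxnet.jl source.)

```julia
# Hypothetical guard, run at the start of fitting:
function check_two_classes(y::AbstractVector)
    if length(unique(y)) < 2
        throw(ArgumentError(
            "Maxnet requires both presence and absence records, but the " *
            "training data contains only a single class."))
    end
    return nothing
end
```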
That would be acceptable.
However, it would mean I need to exclude the model from MLJ's integration test suite. Perhaps you could make some prediction but throw a warning?
I just made the docs pass by testing for the error. I think an error is more appropriate than a warning for this case, and allowing for it would also require me to add more code.
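(Testing for the error rather than for a successful fit might look like this, assuming the hypothetical guard sketched above.)

```julia
using Test

# Single-class data should raise the informative error, not fit silently:
@test_throws ArgumentError check_two_classes(fill(true, 10))
```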
Does that mean it can't be in the model registry, or not?
My apologies @tiemvanderdeure, I think I messed up. I don't actually understand the source of your original error. The data set you are using, If that is the case, then I'd be happy for you to substitute a bigger dataset in your tests. But it is likely your model will fail MLJ integration tests, which do involve subsampling with small datasets (an intentional decision). We can exclude your model from those tests (which live at MLJ.jl).
It's in these lines: There's no internal subsampling and no minimum amount of data - even one
Okay, I stand corrected. I will investigate and possibly change the test there. I'm guessing this never came up because we mostly use multiclass data... Thanks @tiemvanderdeure for your continued patience.
Waiting for:
@tiemvanderdeure Can you please try reinstating the original version of the test and see if that works now?
Tests pass now, thanks @ablaom!
Good to see the progress. Thank you for helping me sort this one out.

Okay, I've had a closer look and have some concerns about the docstring. They may be just doc-issues, but I think they are linked to some other weaknesses:

Your model should run without the need to force coercions on the user. The MLJ idiom for treating categorical data encoded as integers is to coerce them to Multiclass:

```julia
using MLJ # or MLJBase

y, X = Maxnet.bradypus();
y = coerce(y, Multiclass)
```

Your example still doesn't work, because of the scitypes of the feature columns:

```julia
julia> schema(X)
┌─────────────┬────────────────┬─────────────────────────────────┐
│ names       │ scitypes       │ types                           │
├─────────────┼────────────────┼─────────────────────────────────┤
│ cld6190_ann │ Count          │ Int64                           │
│ dtr6190_ann │ Count          │ Int64                           │
│ ecoreg      │ Multiclass{14} │ CategoricalValue{Int64, UInt32} │
│ frs6190_ann │ Count          │ Int64                           │
│ h_dem       │ Count          │ Int64                           │
│ pre6190_ann │ Count          │ Int64                           │
│ pre6190_l1  │ Count          │ Int64                           │
│ pre6190_l10 │ Count          │ Int64                           │
│ pre6190_l4  │ Count          │ Int64                           │
│ pre6190_l7  │ Count          │ Int64                           │
│ tmn6190_ann │ Count          │ Int64                           │
│ tmp6190_ann │ Count          │ Int64                           │
│ tmx6190_ann │ Count          │ Int64                           │
│ vap6190_ann │ Count          │ Int64                           │
└─────────────┴────────────────┴─────────────────────────────────┘
```

If you don't extend the input scitype declaration, the user has to coerce before fitting:

```julia
X = coerce(X, Count => Continuous)
mach = machine(MaxnetBinaryClassifier(features = "lqp"), X, y) |> fit!
MLJBase.predict(mach, X)
```

The good news is that the MLJTestIntegration tests work on your canned dataset (with the coercions indicated). These tests include things like tuning and so on. (However, for reasons already discussed, we will not be including your model in the integration part of MLJ CI.)

Further comments on the docstring:
Thanks for the elaborate feedback! I'm going to lump this together with some other improvements to the docs over in #24. As to the input types, I'll add a coercion step. This model is similar to e.g. a GLM in that it treats count and continuous data in exactly the same way. So nothing breaks if the input is integer, but I think it would be most appropriate to exclude it from the input class declaration (as MLJGLMInterface does, I'm pretty sure).
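(A sketch of what those two changes could look like. `MaxnetBinaryClassifier` is the model from this PR; the helper name and the exact scitypes are assumptions, loosely following MLJGLMInterface.)

```julia
using MLJBase: coerce, Count, Continuous, Multiclass, Table, Finite
import MLJModelInterface as MMI

# 1. A hypothetical internal coercion step, so integer-valued features
#    are treated exactly like continuous ones:
coerce_features(X) = coerce(X, Count => Continuous)

# 2. An input scitype declaration that leaves out Count, so users are
#    warned when binding integer columns (to be called inside the package,
#    where MaxnetBinaryClassifier is defined; scitypes assumed):
MMI.metadata_model(MaxnetBinaryClassifier,
    input_scitype  = Table(Continuous, Multiclass),
    target_scitype = AbstractVector{<:Finite{2}},
)
```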
Okay, thank you. Ping me when you're done, or if you want a review at #24.
Adds an MLJ-compliant docstring and tests, so this can be added to the registry.
See JuliaAI/MLJModels.jl#572
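(For reference, a minimal sketch of the standard MLJ-compliant docstring pattern, which interpolates a generated header. It assumes this runs inside the package, after the model and its metadata are defined; the section text is illustrative only.)

```julia
import MLJModelInterface as MMI

# The documented MLJ pattern: start from the generated header, then add
# the standard sections (training data, hyperparameters, operations, ...).
"""
$(MMI.doc_header(MaxnetBinaryClassifier))

# Training data

In MLJ or MLJBase, bind an instance `model` to data with
`mach = machine(model, X, y)`.
"""
MaxnetBinaryClassifier
```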