MLJ Integration #16
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##             main      #16       +/-  ##
===========================================
- Coverage   92.00%   73.61%   -18.39%
===========================================
  Files           1        6        +5
  Lines          25      163      +138
===========================================
+ Hits           23      120       +97
- Misses          2       43       +41
```
Happy to provide a code review for the MLJ interface if required (although it may have to wait until Jan). Ping me when you're ready.
@ablaom I would love it if you looked at the MLJ interface when you get a chance. To summarize the changes: the core interface is the same, but there is now an `MLJCatBoostInterface` submodule. Here are some items that I have questions about:
@tylerjthomas9 Thanks for this work. I shall take a look at it shortly.
There is no need
I'll need more context, but will possibly get this when I start to review.
I get the following warning when running the
I have tried playing with the input scitypes quite a bit, but I can't get it to resolve. I may be missing something obvious here.
Looks like a lot of work has gone into this, thanks! So far I am only addressing the scitype issues in this review, and would like a further opportunity to review more carefully later.

1. My first objection is that you expose the hyper-parameters `cat_features` and `text_features` in the MLJ models. In MLJ we avoid hyper-parameters that pass characteristics of the data; the data should speak for itself. And this is not a problem, because MLJ requires the user to represent data in a way that allows inference of which features are categorical from their scitype. In other words, `cat_features` and `text_features` (and maybe some others??) are things your interface should be determining internally from the data presented (see the sketch below).

I'm guessing that any input column that has `OrderedFactor` scitype you will want to convert internally to integers for catboost (which you can do with `MLJModelInterface.int`), and that any input column that has `Multiclass` scitype you will flag internally using `cat_features` - and maybe you can leave this, but probably you need to convert to integers for catboost as well. Any `Count` features are integers and are not to be treated as categorical but as "numerical", which catboost will do automatically.

I'm not sure what the `text_features` hyper-parameter does in catboost. Does catboost support some kind of semantic analysis? Or is this just a different kind of categorical feature? You'll have to explain to me how catboost interprets such features for me to comment further. But as far as the `String` types I'm seeing in the test examples go, I think ordinary `Multiclass` is intended, and so those string types need coercion to categorical vectors in the test scripts, as I have indicated.
2. It seems strange to me that your `target_scitype` declaration for the classifier includes `Continuous`. Is there a reason for this?
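Not from the PR or the review, but as a rough sketch of what inferring these features internally could look like, assuming the full MLJ data front-end is available (so `MMI.schema` and `MMI.int` work); the helper name `prepare_input` is purely hypothetical:

```julia
import MLJModelInterface as MMI

# Sketch only: infer which columns should be treated as categorical or text
# from the scitypes of the input table, instead of taking `cat_features` /
# `text_features` as hyper-parameters.
function prepare_input(X)
    sch = MMI.schema(X)                    # column names and scitypes
    cols = Dict{Symbol,Any}()
    cat_features = Symbol[]
    text_features = Symbol[]
    for (name, st) in zip(sch.names, sch.scitypes)
        col = MMI.selectcols(X, name)
        if st <: MMI.OrderedFactor
            cols[name] = MMI.int(col)      # ordered factors -> integer codes
        elseif st <: MMI.Multiclass
            cols[name] = MMI.int(col)      # unordered factors -> integer codes...
            push!(cat_features, name)      # ...and flagged as categorical
        elseif st <: MMI.Textual
            cols[name] = col               # raw strings, flagged as text
            push!(text_features, name)
        else
            cols[name] = col               # Continuous / Count passed through
        end
    end
    return cols, cat_features, text_features
end
```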
Thank you for taking the time to review.
Oops. I accidentally added this. I have removed this.
I have added automatic detection of these columns using the function
I think that we can leave this column. Catboost should be able to handle it.
I am not too familiar with how catboost handles text features, but it seems like the text is split on spaces, then converted to numeric features using Bag of Words. Here is a little more info if you're interested: https://catboost.ai/en/docs/concepts/algorithm-main-stages_text-to-numeric.
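Purely to illustrate the idea (a toy sketch, not catboost's actual algorithm): a bag-of-words encoding just splits each document on spaces and counts token occurrences, something like:

```julia
# Toy bag-of-words: one row per document, one column per vocabulary token.
docs = ["red green red", "green blue"]

vocab = sort!(unique(vcat(split.(docs)...)))             # ["blue", "green", "red"]
counts = [count(==(t), split(doc)) for doc in docs, t in vocab]
# 2×3 Matrix{Int64}:
#  0  1  2
#  1  1  0
```

catboost's own pipeline layers tokenization, dictionary, and n-gram options on top of this basic idea (see the linked docs).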
Ah ha! In that case you may want to consider enlarging

Edit: Looks like you already did just this. Great!
src/MLJCatBoostInterface.jl (Outdated)

```julia
function MMI.selectrows(::CatBoostModels, I, data_pool)
    py_I = numpy.array(numpy.array(I))
    return (data_pool.slice(py_I))
end
```
Something wrong here. Why should `data_pool` have `slice` as a field? See my other comment. Also, I think we're missing a comma:

```diff
- return (data_pool.slice(py_I))
+ return (data_pool.slice(py_I),)
```
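As an aside on why the comma matters: in Julia, parentheses alone are just grouping, so without the trailing comma `selectrows` returns the `Pool` itself rather than a 1-tuple wrapping it. A quick illustration:

```julia
data = [1, 2, 3]
(data) === data          # true  — parentheses alone do not make a tuple
(data,) isa Tuple        # true  — the trailing comma creates a 1-tuple
first((data,)) === data  # true  — the tuple wraps the original object
```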
Should `selectrows` handle data in the format that `MMI.update`/`MMI.fit` uses? I can add methods for handling generic table subsetting too. I have `slice` here because I assume that the data is in a catboost `Pool` at this point, and you have to slice the pool to subset it.

https://catboost.ai/en/docs/concepts/python-reference_pool_slice
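For reference, a standalone sketch of what `Pool.slice` does (assuming PythonCall and the Python `catboost` package are installed; this is not code from the PR):

```julia
using PythonCall

catboost = pyimport("catboost")
np = pyimport("numpy")

# Build a tiny 4-row pool, then slice out the rows at Python (0-based)
# indices 0 and 2; `slice` returns a new Pool restricted to those rows.
X = np.reshape(np.arange(8), (4, 2))
y = np.array([0, 1, 0, 1])
pool = catboost.Pool(X, label=y)

sub = pool.slice(np.array([0, 2]))
@assert pyconvert(Int, sub.num_row()) == 2
```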
Yes, you are right! That's not the issue. But I have this error, and I think the missing comma is the problem:

```julia
using MLJ
model = CatBoostClassifier()
mach = machine(model, X, y)
train, test = partition(eachindex(y), 0.7)

julia> fit!(mach, rows=train)
[ Info: Training machine(CatBoostClassifier(iterations = 1000, …), …).
┌ Error: Problem fitting the machine machine(CatBoostClassifier(iterations = 1000, …), …).
└ @ MLJBase ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:682
[ Info: Running type checks...
[ Info: Type checks okay.
ERROR: Python: TypeError: 'Pool' object is not iterable
Python stacktrace: none
Stacktrace:
  [1] pythrow()
    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/err.jl:94
  [2] errcheck
    @ ~/.julia/packages/PythonCall/3GRYN/src/err.jl:10 [inlined]
  [3] pyiter(x::PythonCall.Py)
    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/abstract/iter.jl:6
  [4] iterate(x::PythonCall.Py)
    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/Py.jl:339
  [5] fit_only!(mach::Machine{CatBoostClassifier, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:680
  [6] #fit!#63
    @ ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:778 [inlined]
  [7] top-level scope
    @ REPL[29]:1
```
Added the missing comma.
Resolved.

There's still something wrong with that

0-based indexing adjustment?
Ah, you're right. I just pushed a fix.
Thanks for that progress. My stacking integration tests are failing and I have a pretty good idea what the problem is. In these tests the folds can get quite small, and you have folds where the target (iris in this case) is missing a class. It's the responsibility of the implementation to track all classes, even those not explicitly manifest. Currently this tracking is not happening, because we have

The

Now that I think about it, it may be that your

Adjust the signatures of
Perhaps this is helpful:

```julia
julia> v = ["X", "Y", "Z", "Z"] |> categorical
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "X"
 "Y"
 "Z"
 "Z"

julia> probs = UnivariateFinite(["Z", "X"], [0.1 0.9; 0.2 0.8], pool=v[1])
2-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
 UnivariateFinite{Multiclass{3}}(X=>0.9, Z=>0.1)  # <--- notice the Multiclass{3} indicating 3 classes in total
 UnivariateFinite{Multiclass{3}}(X=>0.8, Z=>0.2)

julia> pdf(probs, levels(v))
2×3 Matrix{Float64}:
 0.9  0.0  0.1
 0.8  0.0  0.2
```
Oh, and can I request that that
Good news. If you just add these two missing methods, then all the level-3 MLJIntegrationTests pass, for both the classifier and the regressor:

```julia
function MMI.selectrows(::CatBoostClassifier, I, data_pool)
    py_I = numpy.array(numpy.array(I .- 1))
    return (data_pool.slice(py_I),)
end

function MMI.selectrows(::CatBoostClassifier, I::Colon, data_pool)
    return (data_pool,)
end
```

These are needed so we can subsample data for use in
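As a usage note (a sketch, not code from the PR; it assumes the MLJ interface is loaded as in the demo further down): MLJ's resampling machinery calls `selectrows` whenever it restricts a machine to a subset of rows, for example during cross-validation:

```julia
using MLJBase
using CatBoost.MLJCatBoostInterface

# Cross-validation repeatedly fits and predicts on row subsets, which is
# exactly what the `selectrows` methods above make possible on the Pool
# representation of the data.
X, y = @load_iris
mach = machine(CatBoostClassifier(), X, y)
evaluate!(mach, resampling=CV(nfolds=3, shuffle=true), measure=log_loss)
```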
Done! I just expanded the types for the regressor method back to both models.
Just a few comments on the docstring for the classifier. The same would apply to the regressor.
Otherwise good to go.
@tylerjthomas9 Thanks for this valuable contribution and your perseverance.
Great. Here's a short demo to show what all this hard work helps you do:

```julia
using CatBoost.MLJCatBoostInterface
using MLJBase
using MLJIteration

X, y = @load_iris
model = CatBoostClassifier()
iterated_model = IteratedModel(model, retrain=true)

iterated_model.controls # show default controls
# 5-element Vector{Any}:
#  Step(10)
#  Patience(5)
#  GL(2.0)
#  TimeLimit(Dates.Millisecond(108000))
#  InvalidValue()

mach = machine(iterated_model, X, y) |> fit!
# [ Info: Training machine(ProbabilisticIteratedModel(model = CatBoostClassifier(iterations = 1000, …), …), …).
# [ Info: No iteration parameter specified. Using `iteration_parameter=:(iterations)`.
# [ Info: No measure specified. Using `measure=LogLoss(tol = 2.220446049250313e-16)`.
# [ Info: final loss: 0.6239726325698093
# [ Info: Stop triggered by Patience(5) stopping criterion.
# [ Info: Retraining on all provided data. To suppress, specify `retrain=false`.
# [ Info: Total of 480 iterations.

predict(mach, rows=1:3)
# 3-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
#  UnivariateFinite{Multiclass{3}}(setosa=>0.996, versicolor=>0.00243, virginica=>0.00169)
#  UnivariateFinite{Multiclass{3}}(setosa=>0.98, versicolor=>0.0153, virginica=>0.0043)
#  UnivariateFinite{Multiclass{3}}(setosa=>0.995, versicolor=>0.00287, virginica=>0.00203)
```

For more, see the MNIST Flux example and the official docs. Ping me when a new release is tagged and I will update the MLJ model registry.
@ericphanson How does this PR look to you? The core functionality should remain the same, and nothing that currently works should break. However, there is now an `MLJCatBoostInterface` submodule.
This looks good to me. Thanks so much to both of you for your hard work, @ablaom & @tylerjthomas9.
I think it is technically breaking, since we had documented that certain functions return DataFrames and they no longer do. However, I think it's totally fine to make a breaking release here. So I think we can just bump the release & merge/tag.
In reference to #9: I added an MLJ interface for `CatBoostRegressor`. I am not sure what the best way is to deal with the Python vs MLJ interfaces (we could just put them in separate modules inside the package). This was just a quick port of `CatBoostRegressor` to MLJ, so I am sure it'll change with time. Let me know what you think.
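For illustration only, a hypothetical sketch (not necessarily how this PR ends up organized) of the "separate modules inside the package" idea:

```julia
# Hypothetical layout: the existing Python-style wrapper stays at the top
# level, and the MLJ glue lives in a submodule, so loading the MLJ models
# is an explicit `using CatBoost.MLJCatBoostInterface`.
module CatBoost

# ... existing wrapper around the Python catboost package ...

module MLJCatBoostInterface

import MLJModelInterface as MMI

# The MLJ model structs and their MMI.fit / MMI.predict methods would live
# here; for example, a minimal model definition might look like:
MMI.@mlj_model mutable struct CatBoostRegressor <: MMI.Deterministic
    iterations::Int = 1000::(_ ≥ 1)
end

end # module MLJCatBoostInterface

end # module CatBoost
```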