
MLJ Integration #16

Merged: 41 commits into JuliaAI:main, Feb 2, 2023

Conversation

@tylerjthomas9 (Collaborator)

In reference to #9.

I added an MLJ interface for CatBoostRegressor. I am not sure of the best way to handle the Python vs. MLJ interfaces (we could put them in separate modules inside the package). This was just a quick port of CatBoostRegressor to MLJ, so I am sure it will change with time. Let me know what you think.

@codecov bot commented Oct 18, 2022

Codecov Report

Merging #16 (d199643) into main (9cab2d6) will decrease coverage by 18.39%.
The diff coverage is 72.61%.

@@             Coverage Diff             @@
##             main      #16       +/-   ##
===========================================
- Coverage   92.00%   73.61%   -18.39%     
===========================================
  Files           1        6        +5     
  Lines          25      163      +138     
===========================================
+ Hits           23      120       +97     
- Misses          2       43       +41     
Impacted Files                  Coverage Δ
src/mlj_serialization.jl        0.00% <0.00%> (ø)
src/MLJCatBoostInterface.jl     69.49% <69.49%> (ø)
src/wrapper.jl                  91.17% <91.17%> (ø)
src/mlj_catboostregressor.jl    93.33% <93.33%> (ø)
src/mlj_catboostclassifier.jl   95.65% <95.65%> (ø)
src/CatBoost.jl                 100.00% <100.00%> (+8.00%) ⬆️


@ablaom (Member) commented Nov 30, 2022

FYI: https://github.com/JuliaAI/MLJTestInterface.jl
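
For reference, here is a hedged sketch of how those generic tests might be invoked; the data generator and keyword arguments are assumptions based on MLJTestInterface's documentation, not this PR's test code:

using CatBoost.MLJCatBoostInterface
using MLJTestInterface

# Sketch: run the generic MLJ interface tests against the classifier.
X, y = MLJTestInterface.make_multiclass()
failures, summary = MLJTestInterface.test(
    [CatBoostClassifier], X, y;
    mod=@__MODULE__, level=3, throw=false, verbosity=1,
)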

@ablaom (Member) commented Dec 4, 2022

Happy to provide a code review for the MLJ interface if required (although it may have to wait until Jan). Ping me when you're ready.

@tylerjthomas9 (Collaborator, Author)

@ablaom I would love it if you looked at the MLJ interface when you get a chance.

The summary of the changes: the core interface is the same, but there is now a submodule, MLJCatBoostInterface, that holds all the MLJ code.

Here are some items that I have questions about:

  • `using CatBoost` and `using CatBoost.MLJCatBoostInterface` both export CatBoostRegressor and CatBoostClassifier. I do not picture a user having issues with this, but it doesn't seem right. My first thought was to rename the base interface versions to `PyCatBoostRegressor` and `PyCatBoostClassifier`, because they are the actual Python classes, not structs. However, I am not sure this breaking change is necessary.
  • I am getting MLJ data type warnings on CatBoostClassifier. Do you know why these are appearing?

@ablaom (Member) commented Jan 13, 2023

@tylerjthomas9 Thanks for this work. I shall take a look at it shortly.

> using CatBoost and using CatBoost.MLJCatBoostInterface both export CatBoostRegressor and CatBoostClassifier

There is no need to export the MLJ models, so if I understand right, they don't need different names. For example, when you use @load in MLJ you only import the module-qualified version of CatBoostRegressor, which will not conflict with the CatBoost version if, for some reason, that has been imported.

> I am getting MLJ data type warnings on CatBoostClassifier. Do you know why these are appearing?

I'll need more context, but will possibly get this when I start to review.

@tylerjthomas9 (Collaborator, Author) commented Jan 13, 2023

> I'll need more context, but will possibly get this when I start to review.

I get the following warning when running the MLJTestInterface tests for the CatBoostClassifier:

┌ Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc CatBoost.jl.CatBoostClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{AbstractVector{Count}}, AbstractVector{Count}}
│ 
│ fit_data_scitype(model) = Tuple{Union{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractMatrix{Continuous}}, Union{AbstractVector{<:Finite}, AbstractVector{<:Continuous}}} 
└ @ MLJBase C:\Users\TylerThomas\.julia\packages\MLJBase\9Nkjh\src\machines.jl:230

I have tried playing with the input scitypes quite a bit, but I can't get it to resolve. I may be missing something obvious here.
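
A likely reading of the warning above: the mismatch is in the target, since AbstractVector{Count} is neither AbstractVector{<:Finite} nor AbstractVector{<:Continuous}. A minimal data-side sketch, assuming ScientificTypes is loaded:

using ScientificTypes

# An integer-valued target has scitype Count by default; a classifier
# needs a Finite target, so coerce it to Multiclass first.
y = coerce(y, Multiclass)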

@ablaom (Member) left a review comment:

Looks like a lot of work has been put into this, thanks! So far I am only addressing the scitype issues in this review and would like a further opportunity to review more carefully later.

1. My first objection is that you expose the hyper-parameters cat_features and text_features in the MLJ models. In MLJ we avoid hyper-parameters that pass characteristics of the data; the data should speak for itself. And this is not a problem, because MLJ requires the user to represent data in a way that allows inference of which features are categorical from their scitype. In other words, cat_features and text_features (and maybe some others?) are things your interface should be determining internally from the data presented (see the sketch following this comment).

I'm guessing that any input column that has OrderedFactor scitype you will want to internally convert to integer for catboost (which you can do with MLJModelInterface.int), and that any input column that has Multiclass scitype you will flag internally using cat_features; maybe you can leave such a column as-is, but probably you need to convert it to integers for catboost also. Any Count features are integers and are not to be treated as categorical but as "numerical", which catboost will do automatically.

I'm not sure what the text_features hyperparameter does in catboost. Does catboost support some kind of semantic analysis? Or is this just a different kind of categorical feature? You'll have to explain to me how catboost interprets such features for me to comment further. But as far as the String types I'm seeing in the test examples, I think ordinary Multiclass is intended, and so those String types need coercion to categorical vectors in the test scripts, as I have indicated.

2. It seems strange to me that your target_scitype declaration for the classifier includes Continuous. Is there a reason for this?
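
To illustrate point 1, here is a minimal sketch of scitype-based feature detection; the helper name is hypothetical and the PR's eventual implementation may differ:

using ScientificTypes

# Hypothetical helper: derive catboost's cat_features/text_features from
# element scitypes instead of exposing them as hyper-parameters.
function detect_feature_kinds(X)
    sch = schema(X)
    cat_features, text_features = Int[], Int[]
    for (i, st) in enumerate(sch.scitypes)
        if st <: Multiclass
            push!(cat_features, i - 1)   # catboost column indices are 0-based
        elseif st <: Textual
            push!(text_features, i - 1)
        end
    end
    return cat_features, text_features
end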

@tylerjthomas9 (Collaborator, Author)

Thank you for taking the time to review.

> 2. It seems strange to me that your target_scitype declaration for the classifier includes Continuous. Is there a reason for this?

Oops, I accidentally added this. It has been removed.

> 1. My first objection is that you expose the hyper-parameters cat_features and text_features in the MLJ models. In MLJ we avoid hyper-parameters that pass characteristics of the data; the data should speak for itself. And this is not a problem, because MLJ requires the user to represent data in a way that allows inference of which features are categorical from their scitype. In other words, cat_features and text_features (and maybe some others?) are things your interface should be determining internally from the data presented.

I have added automatic detection of these columns using the function `prepare_input`; a rough sketch of the idea follows this comment. Let me know what you think.

> ...any input column that has Multiclass scitype you will flag internally using cat_features; maybe you can leave such a column as-is, but probably you need to convert it to integers for catboost also.

I think that we can leave this column. Catboost should be able to handle it.

> I'm not sure what the text_features hyperparameter does in catboost. Does catboost support some kind of semantic analysis?

I am not too familiar with how catboost handles text features, but it seems like the text is split on spaces, then converted to numeric features using Bag of Words. Here is a little more info if you're interested: https://catboost.ai/en/docs/concepts/algorithm-main-stages_text-to-numeric.
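
For illustration, a rough sketch of what such a `prepare_input` step might do; the names and details here are assumptions, not the PR's exact code:

import MLJModelInterface as MMI
using ScientificTypes, Tables

# Coerce OrderedFactor columns to integers via MMI.int, leaving
# Multiclass/Textual columns as-is for catboost to handle once flagged.
function prepare_input_sketch(X)
    cols = Tables.columntable(X)
    vals = map(keys(cols)) do name
        col = getproperty(cols, name)
        elscitype(col) <: OrderedFactor ? MMI.int.(col) : col
    end
    return NamedTuple{keys(cols)}(vals)
end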

@ablaom (Member) commented Jan 16, 2023

> I am not too familiar with how catboost handles text features, but it seems like the text is split on spaces, then converted to numeric features using Bag of Words.

Ah ha! In that case you may want to consider enlarging input_scitype for both models to include columns with elscitype Textual, and pass such columns on to catboost as-is, but with the relevant columns flagged in text_features, which you set internally, like cat_features. It will be important to document carefully in the docstring that String data will be interpreted as Textual and not Multiclass; some users are likely going to use strings thinking they will be interpreted as Multiclass.

edit: Looks like you already did just this. Great!


function MMI.selectrows(::CatBoostModels, I, data_pool)
    py_I = numpy.array(numpy.array(I))
    return (data_pool.slice(py_I))
end
@ablaom (Member) commented Jan 26, 2023:

Something is wrong here. Why should data_pool have slice as a field? See my other comment. Also, I think we're missing a comma:

Suggested change:
-    return (data_pool.slice(py_I))
+    return (data_pool.slice(py_I),)

@tylerjthomas9 (Collaborator, Author) replied:

Should selectrows handle data in the format that MMI.update/MMI.fit uses? I can add methods for handling generic table subsetting too. I have slice here because I assume the data is in a catboost Pool at this point, and you have to slice the pool to subset it.

https://catboost.ai/en/docs/concepts/python-reference_pool_slice
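
For illustration, a sketch of Pool.slice via PythonCall, assuming `pool` is a catboost Pool and `numpy` is the wrapped Python module; note that catboost expects 0-based row indices:

# Take the first three rows (Julia's 1:3) of a Pool.
py_I = numpy.array([0, 1, 2])   # 0-based on the Python side
sub_pool = pool.slice(py_I)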

@ablaom (Member) replied:

Yes, you are right! That's not the issue. But I get this error, and I think the comma is the problem.

using MLJ
model = CatBoostClassifier()
mach = machine(model, X, y)
train, test = partition(eachindex(y), 0.7)

julia> fit!(mach, rows=train)
[ Info: Training machine(CatBoostClassifier(iterations = 1000, ), ).
┌ Error: Problem fitting the machine machine(CatBoostClassifier(iterations = 1000, ), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: Python: TypeError: 'Pool' object is not iterable
Python stacktrace: none
Stacktrace:
 [1] pythrow()
   @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/err.jl:94
 [2] errcheck
   @ ~/.julia/packages/PythonCall/3GRYN/src/err.jl:10 [inlined]
 [3] pyiter(x::PythonCall.Py)
   @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/abstract/iter.jl:6
 [4] iterate(x::PythonCall.Py)
   @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/Py.jl:339
 [5] fit_only!(mach::Machine{CatBoostClassifier, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool, composite::Nothing)                                                     
   @ MLJBase ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:680
 [6] #fit!#63
   @ ~/.julia/packages/MLJBase/9kTjj/src/machines.jl:778 [inlined]
 [7] top-level scope
   @ REPL[29]:1

@tylerjthomas9 (Collaborator, Author) replied:

added the missing comma

@ablaom (Member) replied:

resolved

@ablaom (Member) commented Jan 30, 2023

There's still something wrong with that selectrows implementation. Here's an MWE:

using CatBoost.MLJCatBoostInterface
using MLJBase

model = CatBoostClassifier()
X, y = make_blobs(100) # 100 observations
train, test = partition(eachindex(y), 0.7)

Xpy = MLJBase.reformat(model, X)

# works fine:
selectrows(model, 71:99, Xpy...)

# fails:
selectrows(model, 71:100, Xpy...)
# ERROR: Python: CatBoostError: catboost/libs/helpers/array_subset.h:605: TIndexedSubset[29] (100) >= srcSize (100)
# Python stacktrace:
#  [1] _catboost._PoolBase._take_slice
#    @ _catboost.pyx:4425
#  [2] _catboost._PoolBase._take_slice
#    @ _catboost.pyx:4419
#  [3] slice
#    @ catboost.core ~/.julia/environments/catboost/.CondaPkg/env/lib/python3.10/site-packages/catboost/core.py:1052
# Stacktrace:
#  [1] pythrow()
#    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/err.jl:94
#  [2] errcheck
#    @ ~/.julia/packages/PythonCall/3GRYN/src/err.jl:10 [inlined]
#  [3] pycallargs(f::PythonCall.Py, args::PythonCall.Py)
#    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/abstract/object.jl:210
#  [4] pycall(f::PythonCall.Py, args::PythonCall.Py; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})                                        
#    @ PythonCall ~/.julia/packages/PythonCall/3GRYN/src/abstract/object.jl:228
#  [5] pycall
#    @ ~/.julia/packages/PythonCall/3GRYN/src/abstract/object.jl:218 [inlined]
#  [6] #_#11
#    @ ~/.julia/packages/PythonCall/3GRYN/src/Py.jl:352 [inlined]
#  [7] Py
#    @ ~/.julia/packages/PythonCall/3GRYN/src/Py.jl:352 [inlined]
#  [8] selectrows(#unused#::CatBoostClassifier, I::UnitRange{Int64}, data_pool::PythonCall.Py)
#    @ CatBoost.MLJCatBoostInterface ~/Julia/CatBoost/src/MLJCatBoostInterface.jl:105
#  [9] top-level scope
#    @ REPL[41]:1

0-based indexing adjustment?

@tylerjthomas9 (Collaborator, Author)

Ah, you're right. I just pushed a fix.

@ablaom (Member) commented Jan 31, 2023

Thanks for that progress.

My stacking integration tests are failing, and I have a pretty good idea what the problem is. In these tests the folds can get quite small, and you have folds where the target (iris in this case) is missing a class. It is the responsibility of the implementation to track all classes, even those not explicitly manifest.

Currently this tracking is not happening, because we have pool=missing in this line. This means only the actual classes that Python knows about are tracked. We only need to specify the correct pool, and the UnivariateFinite constructor will automatically add zero probabilities for the classes not seen (and so, in stacking, always return a probability array of the correct dimension to the meta-learner, on every fold). See the example in the next comment.

The pool can be any instance of CategoricalArray, CategoricalValue or CategoricalPool. Elsewhere, we have just grabbed the first element of the target y (a CategoricalValue) in fit and passed that, along with the other learned parameters (you call this model; the docs call this the fitresult), to make it available to predict, where the UnivariateFinite constructor appears. I suggest you do that.

Now that I think about it, it may be that your reformat operation is destroying this important information (when transforming y). So you may need to address the issue there first, so that the output of reformat includes the CategoricalValue first(y), which carries the pool. Something like

reformat(model, X, y) -> (X_py, y_py, first(y))

Adjust the signatures of fit and selectrows accordingly.
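
A minimal sketch of that flow, assuming PythonCall is loaded, `to_pool` is a hypothetical conversion helper, and fit stashes y_first in the fitresult alongside the trained Python model:

import MLJModelInterface as MMI

# Thread first(y), a CategoricalValue carrying the full pool, from
# reformat through fit to predict.
function MMI.reformat(model::CatBoostClassifier, X, y)
    return (to_pool(X, y), first(y))   # to_pool is hypothetical
end

# ... fit returns (clf, y_first) as the fitresult ...

function MMI.predict(model::CatBoostClassifier, fitresult, data_pool)
    clf, y_first = fitresult
    classes = pyconvert(Vector, clf.classes_)
    probs = pyconvert(Matrix{Float64}, clf.predict_proba(data_pool))
    # pool=y_first makes classes unseen in training appear with zero probability:
    return MMI.UnivariateFinite(classes, probs; pool=y_first)
end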

@ablaom (Member) commented Jan 31, 2023

Perhaps this is helpful:

julia> v = ["X", "Y", "Z", "Z"] |> categorical
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "X"
 "Y"
 "Z"
 "Z"

julia> probs = UnivariateFinite(["Z", "X"], [0.1 0.9; 0.2 0.8], pool=v[1])
2-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
 UnivariateFinite{Multiclass{3}}(X=>0.9, Z=>0.1)     # <--- notice the Multiclass{3} indicating  3 classes in total
 UnivariateFinite{Multiclass{3}}(X=>0.8, Z=>0.2)

julia> pdf(probs, levels(v))
2×3 Matrix{Float64}:
 0.9  0.0  0.1
 0.8  0.0  0.2

@ablaom (Member) commented Jan 31, 2023

Oh, and can I request that fit not be so verbose? Preferably verbosity <= 1 is silent, but at present verbosity seems to have no effect at all.
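
One plausible way to honor that request (a sketch, not the PR's code; it assumes `catboost` is the PythonCall-wrapped Python module, whose fit does accept a verbose keyword):

import MLJModelInterface as MMI

# Map MLJ's verbosity to catboost's verbose flag so verbosity <= 1 is silent.
function MMI.fit(model::CatBoostClassifier, verbosity::Int, data_pool, y_first)
    clf = catboost.CatBoostClassifier(; iterations=model.iterations)
    clf.fit(data_pool; verbose=verbosity >= 2)
    return (clf, y_first), nothing, NamedTuple()
end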

@ablaom (Member) commented Feb 2, 2023

Good news. If you just add these two missing methods, then all the level-3 MLJIntegrationTests pass, for both the classifier and the regressor:

function MMI.selectrows(::CatBoostClassifier, I, data_pool)
    # shift to 0-based indices for the Python side before slicing the Pool
    py_I = numpy.array(numpy.array(I .- 1))
    return (data_pool.slice(py_I),)
end

function MMI.selectrows(::CatBoostClassifier, I::Colon, data_pool)
    # selecting all rows is a no-op on the Pool
    return (data_pool,)
end

These are needed so we can subsample data for use in predict methods, which will not involve the extra y_first piece that fit gets.

@tylerjthomas9 (Collaborator, Author)

> Good news. If you just add these two missing methods, then all the level-3 MLJIntegrationTests pass, for both the classifier and the regressor:

Done! I just expanded the type annotations so these methods cover both models.

@ablaom (Member) left a review comment:

Just a few comments on the docstring for the classifier. The same would apply to the regressor.

Otherwise good to go.

@tylerjthomas9 Thanks for this valuable contribution and your perseverance.

@ablaom (Member) commented Feb 2, 2023

Great. Here's a short demo to show what all this hard work helps you do:

using CatBoost.MLJCatBoostInterface
using MLJBase
using MLJIteration

X, y = @load_iris
model = CatBoostClassifier()
iterated_model = IteratedModel(model, retrain=true)

iterated_model.controls # show default controls
# 5-element Vector{Any}:
#  Step(10)
#  Patience(5)
#  GL(2.0)
#  TimeLimit(Dates.Millisecond(108000))
#  InvalidValue()

mach = machine(iterated_model, X, y) |> fit!
# [ Info: Training machine(ProbabilisticIteratedModel(model = CatBoostClassifier(iterations = 1000, …), …), …).
# [ Info: No iteration parameter specified. Using `iteration_parameter=:(iterations)`. 
# [ Info: No measure specified. Using `measure=LogLoss(tol = 2.220446049250313e-16)`. 
# [ Info: final loss: 0.6239726325698093
# [ Info: Stop triggered by Patience(5) stopping criterion. 
# [ Info: Retraining on all provided data. To suppress, specify `retrain=false`. 
# [ Info: Total of 480 iterations. 

predict(mach, rows=1:3)
# 3-element UnivariateFiniteVector{Multiclass{3}, String, UInt32, Float64}:
#  UnivariateFinite{Multiclass{3}}(setosa=>0.996, versicolor=>0.00243, virginica=>0.00169)
#  UnivariateFinite{Multiclass{3}}(setosa=>0.98, versicolor=>0.0153, virginica=>0.0043)
#  UnivariateFinite{Multiclass{3}}(setosa=>0.995, versicolor=>0.00287, virginica=>0.00203)

For more, see MNIST Flux example and the official docs.

Ping me when a new release is tagged and I will update the MLJ model registry.

@tylerjthomas9 (Collaborator, Author)

@ericphanson How does this PR look to you? The core functionality should remain the same, and nothing that currently works should break. However, there is now a CatBoost.MLJCatBoostInterface module that implements the MLJ interface. Code coverage has dropped, but from what I can see, it is only from the MLJ interface code.

@ericphanson (Collaborator) left a review comment:

This looks good to me. Thanks so much to both of you for your hard work, @ablaom & @tylerjthomas9.

I think it is technically breaking, since we had documented that certain functions return DataFrames and they no longer do. However, I think it's totally fine to make a breaking release here, so we can just bump the version and merge/tag.

@ericphanson merged commit bde0968 into JuliaAI:main on Feb 2, 2023
@ericphanson mentioned this pull request on Feb 2, 2023