
Port to MLJ? #9

Closed
azev77 opened this issue Jun 7, 2021 · 12 comments

@azev77

azev77 commented Jun 7, 2021

Hey and thank you for this package!
I've been hoping for a CatBoost interface for a while!!

Have you considered porting this to MLJ.jl?
This would be an awesome addition, as MLJ currently supports XGBoost & LightGBM.
@ablaom @tlienart

BTW, I noticed some Julia wrappers wrap ML models via their high-level code (Python/R), while others wrap the underlying low-level code (e.g., GLMNet.jl wraps the Fortran code behind R's glmnet). Wrapping the underlying CatBoost code would probably be a pain, but would there be a performance difference?

@ablaom
Member

ablaom commented Jun 8, 2021

For the record, there is also an MLJ interface for EvoTrees.jl, another pure-Julia implementation of gradient tree boosting, so I would expect it to make a good template for adding an MLJ interface to CatBoost.jl. It includes, for example, an appropriate implementation of MLJ's update method, which makes "warm restarts" possible and allows one to wrap these models in an iterated control strategy (e.g., early stopping based on out-of-sample losses).
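
For a flavor of what that buys you, here's a rough, untested sketch of wrapping EvoTrees in MLJ's iterated control (the model choice, control settings, and the data X, y are just placeholders):

using MLJ  # re-exports IteratedModel, Holdout, rms, and the iteration controls

EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees

iterated_model = IteratedModel(model=EvoTreeRegressor(),
                               resampling=Holdout(fraction_train=0.8),
                               measure=rms,
                               controls=[Step(10),           # add 10 iterations per control cycle
                                         Patience(5),        # stop after 5 non-improving cycles
                                         NumberLimit(100)])  # hard cap on the number of cycles

mach = machine(iterated_model, X, y)
fit!(mach)  # early-stops using the out-of-sample rms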

cc @jeremiedb

@femtomc
Collaborator

femtomc commented Jun 25, 2021

Hi everyone -- thanks for the interest.

I suspect wrapping the low-level code would be a pain. In terms of performance, a native wrapper would of course be faster than calling through the Python runtime -- but the penalty incurred by going through the runtime should be negligible compared to the time it takes to train a model. So we have no intention of trying to wrap the native C++ code (this may change if CatBoost ever offers a C API, although IIRC they don't export one).

We considered implementing the MLJ interface previously -- but ultimately decided that the way CatBoost does things and the way MLJ does things are different enough that the impedance mismatch was not worth seriously trying to fix, given our priorities. Our perspective became: CatBoost.jl would be a pure wrapper package -- and if someone wants to implement an MLJCatBoost.jl package, we would welcome it.

In particular, consider the process of fitting with MLJ (https://alan-turing-institute.github.io/MLJ.jl/dev/quick_start_guide_to_adding_models/#Model-type-and-constructor), and compare it to the (essentially API-restricted) way of fitting CatBoost models:

using CatBoost  # exports `Pool` and the `catboost` Python module (via PyCall)

# Create pools (x_train, y_train, queries_train, etc. assumed already defined).
train = Pool(; data=x_train, label=y_train, group_id=queries_train)
test = Pool(; data=x_test, label=y_test, group_id=queries_test)

# Small number of iterations so as not to slow down CI too much.
default_parameters = Dict("iterations" => 10, "loss_function" => "RMSE",
                          "custom_metric" => ["MAP:top=10", "PrecisionAt:top=10",
                                              "RecallAt:top=10"], "verbose" => false,
                          "random_seed" => 314159)

function fit_model(params, train_pool, test_pool)
    model = catboost.CatBoost(params)  # Python CatBoost object
    model.fit(train_pool; eval_set=test_pool, plot=false)
    return model
end

Hyperparameters are passed over the line in Dict form to the Python runtime -- and there is a very large number of them available for user customization. So supporting a generic mutable CatBoostModel struct that satisfies the MLJ interfaces seemed more restrictive than just exposing this API to the user directly.
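
For contrast, a rough, untested sketch of what an MLJ-style struct might look like (the name CatBoostRegressorSketch and its three fields are hypothetical; a real wrapper would need an explicit field for every one of those many options):

import MLJModelInterface as MMI
using CatBoost  # for `catboost` and `Pool`

# Hypothetical: every hyperparameter becomes an explicit, typed field with a
# default, whereas the Python API takes an open-ended Dict of options.
MMI.@mlj_model mutable struct CatBoostRegressorSketch <: MMI.Deterministic
    iterations::Int       = 1000::(_ > 0)
    loss_function::String = "RMSE"
    random_seed::Int      = 0
end

function MMI.fit(model::CatBoostRegressorSketch, verbosity, X, y)
    params = Dict("iterations" => model.iterations,
                  "loss_function" => model.loss_function,
                  "random_seed" => model.random_seed,
                  "verbose" => verbosity > 0)
    booster = catboost.CatBoost(params)    # Python object, via PyCall
    booster.fit(Pool(; data=X, label=y))
    return booster, nothing, NamedTuple()  # (fitresult, cache, report)
end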

Again, if either of you are interested in creating an MLJCatBoost wrapper library -- we would welcome it! But we are not prioritizing it.

Thank you.

@ablaom
Member

ablaom commented Jun 28, 2021

Comment to self: this is not a pure-Julia implementation, but a wrapper around Python code (which wraps C, presumably).

@ericphanson
Collaborator

ericphanson commented Jun 28, 2021

Yep, a popular C++ library: https://github.com/catboost/catboost (if it were C, we might try to wrap it directly instead of its Python interface). This is a pretty minimal wrapper that just uses PyCall and tries to make it a bit more convenient to send and receive tabular data.
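
For example, sending a DataFrame through should look something like this (untested sketch; assumes Pool accepts any Tables.jl-compatible table):

using CatBoost, DataFrames

df = DataFrame(feature1=[1.0, 2.0, 3.0], feature2=[0.5, 0.6, 0.7])
pool = Pool(; data=df, label=[0, 1, 0])  # the table is converted on the way in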

@ericphanson
Collaborator

I’d be interested in adding an MLJ interface directly to CatBoost.jl here; I think it would add a lot of value. I bet we can find a way to make the interfaces work.

@azev77
Author

azev77 commented Oct 3, 2022

Any progress?

@ericphanson
Collaborator

No, sorry. I was building a new model and considered doing it with CatBoost, but ended up going with XGBoost after a quick check showed similar performance in this case (I have seen CatBoost do noticeably better in other cases, though). Hopefully we can find time to do it at some point, but for now it's not a priority for me.

@ablaom
Member

ablaom commented Oct 5, 2022

BTW, it looks like XGBoost.jl is getting a much-needed rewrite: dmlc/XGBoost.jl#111

🤞🏾

@ericphanson
Collaborator

Closed by #16

v0.3.0 will have MLJ integration, thanks to @tylerjthomas9 and @ablaom!
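
Usage should look roughly like this (untested here; the model name CatBoostRegressor and the data X, y, Xnew are assumptions -- see the docs for the authoritative example):

using MLJ, CatBoost

CatBoostRegressor = @load CatBoostRegressor pkg=CatBoost

model = CatBoostRegressor(iterations=100)
mach = machine(model, X, y)
fit!(mach)
ŷ = predict(mach, Xnew)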

@azev77
Author

azev77 commented Feb 4, 2023

It would be great if the MLJ docs were updated to reflect this.

@ablaom
Member

ablaom commented Feb 4, 2023

That will happen when I update the model registry shortly. I'll re-open this to flag that it hasn't happened yet.

@ablaom
Member

ablaom commented Feb 4, 2023

Oh, I can't reopen. I'll create the issue at MLJModels now instead.
