
Enhancement request: use xgboost as base learner #250

Open · ivan-marroquin opened this issue Apr 28, 2021 · 18 comments

@ivan-marroquin

Hi all,

I have Python 3.6.5 with xgboost 1.1.0 and ngboost 0.3.10

When I train an NGBRegressor with xgboost as the base learner, I get the following warning message:

c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py:445: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
"memory consumption")

which may be the source of the poor result shown in the left plot of the attached image.

Is it possible to use xgboost as a base learner? Please advise.

The source code is as follows:

import numpy as np
import xgboost as xgb
import ngboost
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)
    y = y.astype(np.float32)
    x = ((x - np.mean(x, axis=0)) / np.std(x, axis=0)).astype(np.float32)

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

    # Using xgboost with ngboost
    learner = xgb.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                               booster='gbtree', tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                               gamma=0.15, reg_alpha=0.20, reg_lambda=0.50, random_state=1969)

    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                                 natural_gradient=True, n_estimators=1, learning_rate=0.01, verbose=False,
                                 random_state=1969)

    ngb_1.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
    y_preds_1 = ngb_1.predict(x_validation)
    median_abs_error_1 = median_absolute_error(y_validation, y_preds_1)

    # Using only ngboost
    learner = DecisionTreeRegressor(max_depth=6, criterion='friedman_mse', min_impurity_decrease=0, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, natural_gradient=True,
                                 n_estimators=300, learning_rate=0.01, verbose=False, random_state=1969)

    ngb_2.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)
    y_preds_2 = ngb_2.predict(x_validation)
    median_abs_error_2 = median_absolute_error(y_validation, y_preds_2)

    # Generate plot to compare results
    fig, ax = plt.subplots(nrows=1, ncols=2)

    ax[0].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[0].plot(range(0, len(y_validation)), y_preds_1, '--r')
    ax[0].set_title("XGBOOST + NGBOOST: \n MedianAbsError {:.4f}".format(median_abs_error_1))

    ax[1].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[1].plot(range(0, len(y_validation)), y_preds_2, '--r')
    ax[1].set_title("NGBOOST \n MedianAbsError {:.4f}".format(median_abs_error_2))

comparison_xgboost-ngboost_against_only_ngboost.zip

@avati
Collaborator

avati commented Apr 28, 2021

You would want to make at least two changes to your code:

  1. The base learner needs to be a Python constructor, so that each boosting stage gets its own model. In your case it is a pre-instantiated object that gets repurposed/refit (i.e. modified) for every subsequent boosting stage, so in effect your whole boosted model is no more expressive than a single base learner.

  2. Ideally you want your base learner xgboost to have n_estimators=1 and the NGBoost model to have n_estimators=300, and not the other way around.

This is an interesting experiment and I would love to see how it works out! Thanks for giving it a shot and sharing the results!
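For concreteness, here is a rough sketch of the second change applied to the snippet above; the hyperparameter values are simply carried over from the original post and are not a recommendation:

# Sketch: a single tree per xgboost fit, with the 300 boosting stages driven by NGBoost itself.
learner = xgb.XGBRegressor(max_depth=6, n_estimators=1, objective='reg:squarederror',
                           tree_method='exact', n_jobs=cpu_count, learning_rate=0.05,
                           random_state=1969)

ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore,
                             Base=learner, natural_gradient=True, n_estimators=300,
                             learning_rate=0.01, verbose=False, random_state=1969)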

@ivan-marroquin
Author

Hi @avati

Thanks for your prompt answer. I made the change to the code, setting xgboost n_estimators=1 and NGBoost n_estimators=300. Unfortunately, I still get the same result.

By any chance, do you have a Python code example on how to change the xgboost model to be more like a Python constructor?

Ivan

@avati
Collaborator

avati commented Apr 28, 2021

Here's one way. Instead of:

learner = xgb.XGBRegressor(...)

do:

learner = lambda args: xgb.XGBRegressor(args)

@ivan-marroquin
Author

Hi @avati

Thanks for the suggestion. Before pursuing more work with xgboost, I tried the following code:

#_________________
from sklearn.ensemble import GradientBoostingRegressor

learner = GradientBoostingRegressor(loss='ls', learning_rate=0.05, n_estimators=1, criterion='mse',
                                    max_depth=6, min_impurity_decrease=0, random_state=1969)

ngb = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.CRPScore, Base=learner,
                           natural_gradient=True, n_estimators=300, learning_rate=0.01, verbose=False,
                           random_state=1969)
ngb.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)

y_preds = ngb.predict(x_validation)
#_________________

It gave a reasonable result, which could be improved by tuning the hyperparameters.

This shows the strength of NGBoost in taking learners from the scikit-learn library.

On the other hand, xgboost (although I am using its scikit-learn API) does not seem to work well with NGBoost, as you explained. Could it be that xgboost's scikit-learn API is missing something required by NGBoost?

Do you have more suggestions?

Ivan

@avati
Collaborator

avati commented May 2, 2021

The same suggestion as in my previous comment: use a learner with a 'lambda' as shown, whether it is for XGB or GBR.

@ivan-marroquin
Author

Hi @avati

Thanks for the suggestion. I tried the lambda version and got this message:

Cannot clone object '<function at 0x000001F05A98A840>' (type <class 'function'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods

I am pretty sure that I am missing something on how to implement this approach. Could you provide a more detailed code example?

Ivan

@caiquanyou

I also want to use LightGBM as a base learner and have the same issue as @ivan-marroquin. Could you provide some advice?

@ivan-marroquin
Author

Hi @caiquanyou

I think I found a way to run xgboost with ngboost (and perhaps it applies to lightgbm as well). I found this publication:
https://www.researchgate.net/publication/349528379_Reliable_Evapotranspiration_Predictions_with_a_Probabilistic_Machine_Learning_Framework

and the source code used in this publication can be found at:
https://codeocean.com/capsule/5244281/tree/v1

To make it work with xgboost, you need to set its number of estimators (along with the number of boosting iterations used in ngboost). I have xgboost 1.1.0 and ngboost 0.3.10.

I used the toy example from ngboost (adapted to work with xgboost):

import numpy as np
import ngboost
import xgboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import multiprocessing
import matplotlib.pyplot as plt

if __name__ == '__main__':
    cpu_count = 2 if (multiprocessing.cpu_count() < 4) else (multiprocessing.cpu_count() - 2)

    x, y = load_boston(return_X_y=True)

    mean_scaler = np.mean(x, axis=0)
    std_scaler = np.std(x, axis=0)
    x = (x - mean_scaler) / std_scaler

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

    # using only ngboost
    ngb_1 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE,
                                 natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                 verbose=False, random_state=1969)

    ngb_1.fit(x_train, y_train)
    y_preds_ngboost = ngb_1.predict(x_validation)

    # using xgboost with ngboost
    learner = xgboost.XGBRegressor(max_depth=6, n_estimators=300, verbosity=1, objective='reg:squarederror',
                                   booster='gbtree', tree_method='exact', n_jobs=cpu_count,
                                   learning_rate=0.05, gamma=0.15, reg_alpha=0.20,
                                   reg_lambda=0.50, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.MLE, Base=learner,
                                 natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                 verbose=False, random_state=1969)

    ngb_2.fit(x_train, y_train)
    y_preds_hyboost = ngb_2.predict(x_validation)

    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 5))

    ax[0].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[0].plot(range(0, len(x_validation)), y_preds_ngboost, '--r', label='ngboost')
    ax[0].set_title("NGBOOST: validation & prediction")
    ax[0].legend()

    ax[1].plot(range(0, len(x_validation)), y_validation, '-k', label='validation')
    ax[1].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[1].set_title("HYBOOST: validation & prediction")
    ax[1].legend()

    ax[2].plot(range(0, len(x_validation)), y_preds_ngboost, '-k', label='ngboost')
    ax[2].plot(range(0, len(x_validation)), y_preds_hyboost, '--r', label='hyboost')
    ax[2].set_title("NGBOOST - HYBOOST: prediction")
    ax[2].legend()

    plt.show()

Note that xgboost will raise the following warning message:
Warning (from warnings module):
File "C:\Temp\Python\Python3.6.5\lib\site-packages\xgboost\core.py", line 445
"memory consumption")
UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption

I don't know whether this issue may influence the quality of the result. Let me know what you find on your side.

Hope this helps,

Ivan

@thomasaarholt

thomasaarholt commented Aug 6, 2021

That warning shouldn't influence the predictions, but it will increase the RAM consumption of the computation. I'd be interested in hearing more experiences with using other packages as the Base learner.

@CDonnerer

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.
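For readers who want to try it, here is a minimal, hedged sketch of what using that package (xgboost-distribution, named later in this thread) might look like. The import path, the XGBDistribution class, and the attributes of the predict output are assumptions, so check the project's documentation for the actual interface:

# Hedged sketch; the package, class, and attribute names below are assumptions, not verified against a release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from xgboost_distribution import XGBDistribution  # assumed import path

X, y = load_boston(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=1969)

model = XGBDistribution(distribution="normal", n_estimators=300)
model.fit(X_train, y_train)

preds = model.predict(X_val)        # assumed to return the fitted distribution parameters
mean, std = preds.loc, preds.scale  # assumed attribute names for a normal distribution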

@thomasaarholt

Exciting! Looking forward to checking it out!

@alejandroschuler
Collaborator

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.

This is fantastic @CDonnerer. If you're willing, I'd love to have features like these ported into the core NGBoost library. We've had previous discussions on how to make ngboost faster and easier to develop that you would be more than welcome to contribute to.

@astrogilda

In case it's useful, I've written a "native" xgboost version of ngboost, implemented in the xgboost scikit-learn API.

Really cool library! Related question: does xgboost-distribution offer a GPU implementation like xgboost, or nah? I'm assuming the relative performance numbers are for runs on the CPU, right?

@CDonnerer

@alejandroschuler Thanks! Sure, I'll have a look at those discussions; there might be options to port those features across in a generic way.

@astrogilda No GPU support for xgboost-distribution yet; indeed, the performance numbers refer to CPU runs.

@kmedved
Contributor

kmedved commented Aug 16, 2021

@CDonnerer - just want to say that's a fantastic library you've written. I don't know how practical it would be to port the features over to NGBoost as @alejandroschuler suggested (the coding is way over my head), but if that's at all possible, as a user, that would be a great solution, rather than having forked development across two different probabilistic libraries. It would be especially helpful for adding additional distribution support in a consistent way.

@StatMixedML

@CDonnerer it seems like there is quite some overlap with XGBoostLSS, an approach I developed in 2019:

https://github.com/StatMixedML/XGBoostLSS

@ivan-marroquin
Author

@StatMixedML thanks for sharing the link of your approach!

@tkzeng

tkzeng commented Jun 11, 2022

@ivan-marroquin
I think this should work. It looks like the learning rate has an effect even when there is just one tree, and the way this interacts with NGBoost's learning rate might cause unexpected behavior.

learner = xgb.XGBRegressor(max_depth=3, n_estimators=1, learning_rate=1)
ngb_1 = ngboost.NGBRegressor(Base=learner)
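For completeness, a hedged sketch of how that configuration might slot into the earlier Boston-housing example; the same pattern may carry over to LightGBM's LGBMRegressor for the question above, although that is untested here:

import xgboost as xgb
import ngboost
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=1969)

# One tree per NGBoost stage; learning_rate=1 so only NGBoost's own learning rate shrinks the updates.
learner = xgb.XGBRegressor(max_depth=3, n_estimators=1, learning_rate=1)

ngb = ngboost.NGBRegressor(Base=learner, n_estimators=300, learning_rate=0.01)
ngb.fit(X_train, y_train)

y_pred = ngb.predict(X_val)      # point predictions
y_dist = ngb.pred_dist(X_val)    # predicted distribution (e.g. a Normal with loc and scale)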
