The prediction of XGBoostClassifier doesn't match the output of ShapIQ #353

Open
Linh-nk opened this issue Mar 24, 2025 · 7 comments
Labels
explainer 🔍 All issues that are linked to explainers question ❔ Further information is requested

Comments

@Linh-nk

Linh-nk commented Mar 24, 2025

The log odds of f(x) in the waterfall plot does not match the prediction I get from the XGBClassifier model's predict_proba method.

Also, should the baseline value be close to the log odds of the average of y? The log odds of E[f(X)] I get from the waterfall plot does not match the average of the ground truth. Please correct me if I'm wrong.
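To make the second question concrete, this is what I mean by the log odds of the average of y (a small sketch with hypothetical names, where y is my array of 0/1 targets):

import numpy as np

# sketch: log odds of the mean of the binary targets
p = y.mean()
log_odds_of_mean = np.log(p / (1 - p))
print(log_odds_of_mean)  # I would expect E[f(X)] in the waterfall plot to be close to this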

@mmschlk mmschlk added the question ❔ Further information is requested label Mar 25, 2025
@mmschlk
Owner

mmschlk commented Mar 25, 2025

I don't see an image/waterfall plot, so I can only speculate. But yes, you are right, the Shapley values do not match what comes out of XGBoost when you run predict_proba. Similar to SHAP, we explain the margin of the prediction and not the predicted probabilities (see this test case for a reference):

# explain with shapiq
explainer_shapiq = TreeExplainer(
    model=xgb_clf_model, max_order=1, index="SV", class_index=class_label
)
sv = explainer_shapiq.explain(x=x_explain_shapiq)
sum_of_sv = sum(sv.values)  # includes the baseline (order-0) value

# get the margin prediction of the model
prediction = xgb_clf_model.predict(x_explain_shapiq.reshape(1, -1), output_margin=True)
prediction = prediction[0, class_label]

print(prediction == sum_of_sv)
True

There is a paper arguing that margins are not a very nice object to explain models with, but up to now neither shap nor shapiq supports another means of explanation here.
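As a side note, if you want to go back from the explained margin to a probability in the binary case, here is a minimal sketch (assuming a binary XGBClassifier, a correctly extracted baseline, and hypothetical names model, sv, and x_explain):

import numpy as np

# sketch for binary classification: the explained margin is a log-odds score,
# so the logistic sigmoid of the summed Shapley values (incl. baseline) should
# reproduce predict_proba for the positive class
margin = sum(sv.values)  # baseline (order 0) + all order-1 Shapley values
probability = 1.0 / (1.0 + np.exp(-margin))

print(probability)
print(model.predict_proba(x_explain.reshape(1, -1))[0, 1])  # should (nearly) match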

Does this answer your question?

@mmschlk mmschlk self-assigned this Mar 25, 2025
@mmschlk mmschlk added the explainer 🔍 All issues that are linked to explainers label Mar 25, 2025
@Linh-nk
Author

Linh-nk commented Mar 25, 2025

Thank you for the prompt response! However, I'm still not getting the same output from shapiq and XGBClassifier. Here is the code that I used

import shapiq

explainer_shapiq = shapiq.TreeExplainer(
    model=model, max_order=1, index="SV"
)
sv = explainer_shapiq.explain(x=X.iloc[0].to_numpy())
sum_of_sv = sv.get_n_order_values(1).sum()
base_value = sv.baseline_value

prediction = model.predict(X.iloc[0].to_numpy().reshape(1, -1), output_margin=True)

print(f'prediction: {prediction}')
print(f'shapiq prediction: {base_value + sum_of_sv}')

where model is an XGBClassifier model. And here is the output I got

prediction: [0.91437876]
shapiq prediction: 2.201628711526583

@Linh-nk
Author

Linh-nk commented Mar 25, 2025

When I use the shap package, I do get the correct prediction:

import shap

explainer_shap = shap.TreeExplainer(model=model)
sv = explainer_shap(X.iloc[0].to_numpy().reshape(1, -1))

prediction = model.predict(X.iloc[0].to_numpy().reshape(1, -1), output_margin=True)

print(f'prediction: {prediction}')
print(f'shap prediction: {sv.values.sum() + sv.base_values}')

prediction: [0.91437876]
shap prediction: [0.9143772]

At first glance, the Shapley values output by shap and shapiq seem to match, but their baseline values do not.

@mmschlk
Owner

mmschlk commented Mar 25, 2025

Ah okay, just to make sure: First, are you certain you are explaining the correct class in the shapiq case?
Omitting class_index in TreeExplainer might default to a different class than the one you are getting predictions for. Can you check your case for all class indices?

explainer_shapiq = shapiq.TreeExplainer(
    model=model, max_order=1, index="SV", class_index=class_index
)

Second, what happens if you do not run:

sum_of_sv = sv.get_n_order_values(1).sum()
base_value = sv.baseline_value

but

sum_of_sv = sum(sv.values)

since with shapiq v1.2.3 we made sure that min_order is set to 0 for TreeExplainer, which should include the baseline value inside the sv.values array.
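As a quick sanity check (just a sketch, assuming shapiq >= 1.2.3), the two ways of summing should then agree:

import numpy as np

# with min_order=0, the order-0 (baseline) term is part of sv.values,
# so summing sv.values should equal baseline_value plus the order-1 values
total_from_values = sum(sv.values)
total_from_orders = sv.baseline_value + sv.get_n_order_values(1).sum()
print(np.isclose(total_from_values, total_from_orders))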

Third, is it possible to create a minimal reproducible example where shap recovers the prediction and shapiq does not? Because in my example above they are the same.

Best
Max

@Linh-nk
Author

Linh-nk commented Mar 25, 2025

Hi Max,

I'm running the same example as yours, but for the binary case:

import copy

import numpy as np
import xgboost
from sklearn.datasets import make_classification

def background_clf_dataset() -> tuple[np.ndarray, np.ndarray]:
    """Return a simple background dataset."""
    X, y = make_classification(
        n_samples=100,
        n_features=10,
        random_state=42,
        n_classes=2,  # binary here
        n_informative=5,
        n_repeated=0,
        n_redundant=0,
    )
    return copy.deepcopy(X), copy.deepcopy(y)

def xgb_clf_model(background_clf_dataset):
    """Return a simple xgboost classification model."""

    X, y = background_clf_dataset
    model = xgboost.XGBClassifier(random_state=42, n_estimators=3)
    model.fit(X, y)
    return model

background_clf_dataset = background_clf_dataset()
xgb_clf_model = xgb_clf_model(background_clf_dataset)
background_clf_data, y = background_clf_dataset

explanation_instance = 1
class_label = 1

# the following code is used to get the shap values from the SHAP implementation
import shap
model_copy = copy.deepcopy(xgb_clf_model)
explainer_shap = shap.TreeExplainer(model=model_copy)
baseline_shap = float(explainer_shap.expected_value)

x_explain_shap = copy.deepcopy(background_clf_data[explanation_instance].reshape(1, -1))
sv_shap_all_classes = explainer_shap.shap_values(x_explain_shap)
sv_shap = sv_shap_all_classes  # binary case: shap_values already returns shape (1, n_features)

# compute with shapiq
import shapiq
explainer_shapiq = shapiq.TreeExplainer(
    model=xgb_clf_model, max_order=1, index="SV", class_index=class_label
)
x_explain_shapiq = copy.deepcopy(background_clf_data[explanation_instance])
sv_shapiq = explainer_shapiq.explain(x=x_explain_shapiq)
sv_shapiq_values = sv_shapiq.get_n_order_values(1)
baseline_shapiq = sv_shapiq.baseline_value

prediction = xgb_clf_model.predict(x_explain_shapiq.reshape(1, -1), output_margin=True)

# compare the two explanations and the margin prediction
print(f'baseline_shap: {baseline_shap}')
print(f'SHAP values: {sv_shap}')
print(f'baseline_shapiq: {baseline_shapiq}')
print(f'SHAPIQ values: {sv_shapiq_values}')
print(f'baseline_shap + SHAP values: {baseline_shap + sv_shap.sum()}')
print(f'baseline_shapiq + SHAPIQ values: {baseline_shapiq + sv_shapiq_values.sum()}')
print(f'prediction: {prediction}')

And here is the output that I got

baseline_shap: 0.0
SHAP values: [[ 0.21501832  0.01152     0.10970242  0.12228977  0.09716809  0.02027433
   0.8166808   0.         -0.01836885 -0.22206578]]
baseline_shapiq: 0.4894029824013125
SHAPIQ values: [ 0.21501832  0.01152     0.10970243  0.12228977  0.09716809  0.02027433
  0.81668073  0.         -0.01836885 -0.22206577]
baseline_shap + SHAP values: 1.1522190570831299
baseline_shapiq + SHAPIQ values: 1.6416220384585056
prediction: [1.1416221]

@mmschlk
Owner

mmschlk commented Mar 26, 2025

Ah okay, I will take a closer look at this! However, the Shapley values are the same for shap and shapiq, which is at least the most important part. It might be that the baseline_value is not properly extracted from the xgboost model; I just remember that this was actually not that easy to get right. Thank you for pointing this out!
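If you want to double-check on your side, here is a hypothetical sketch (not part of shapiq) for reading the base_score that xgboost itself stores, which is what any extracted baseline should ultimately correspond to:

import json

# sketch: read the intrinsic base_score from the booster's JSON config
# (for a logistic objective this is stored in probability space,
# so the margin baseline would be its log-odds)
config = json.loads(xgb_clf_model.get_booster().save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])
print(base_score)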

Could you let me know which xgboost and shapiq versions you are using?

@Linh-nk
Author

Linh-nk commented Mar 26, 2025

It's xgboost 2.1.4 and shapiq 1.2.3. Thank you!
