Interpretable (or explainable) machine learning models, such as decision trees, play a crucial role in trustworthy AI. However, finding optimal decision trees (i.e., minimum-size, maximum-accuracy trees) is not a simple task and remains an active area of research. While a single decision tree has limited expressivity, an ensemble of decision trees can effectively capture the complex structures found in many real-world applications. Many existing tree ensemble methods are greedy and suboptimal, and often suffer from randomness in the tree generation process. In this paper, we introduce DT-sampler, a SAT-based decision tree ensemble method that allows explicit control over both the size and accuracy of the sampled trees. We develop a novel SAT encoding that uses only branch nodes, resulting in a compact representation of the decision tree space. Additionally, standard point predictions made by decision tree ensembles offer no statistical guarantee on the miscoverage rate. We employ conformal prediction (CP), a distribution-free statistical framework that provides a valid finite-sample coverage guarantee, to show that DT-sampler is statistically more efficient and produces more stable results than a random forest classifier. We demonstrate the effectiveness of our method on several benchmark and real-world datasets.
More details about DT-sampler can be found in our paper. The previous version of DT-sampler (ICML 2023 Workshop paper) is available at https://arxiv.org/abs/2307.13333, and its original codebase at https://github.com/tsudalab/DT-sampler.
We propose an efficient SAT-based approach (branch-node-only encoding) for constructing decision trees. By introducing only a small set of variables and constraints, this method ensures high training accuracy while reducing the search space.
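For intuition, the sketch below shows one constraint family of this flavor in CNF form ("each branch node tests exactly one feature"), written with the python-sat (pysat) package. This is a toy illustration of the style of encoding; the variable semantics and constraints are assumptions for illustration, not the paper's exact formulation.

```python
from pysat.formula import CNF
from pysat.solvers import Glucose3

n_branch_nodes, n_features = 3, 4

def var(v, j):
    # Boolean variable f[v, j]: "branch node v tests feature j" (1-indexed DIMACS id).
    return v * n_features + j + 1

cnf = CNF()
for v in range(n_branch_nodes):
    # At least one feature per branch node.
    cnf.append([var(v, j) for j in range(n_features)])
    # At most one feature per branch node (pairwise encoding).
    for j in range(n_features):
        for k in range(j + 1, n_features):
            cnf.append([-var(v, j), -var(v, k)])

with Glucose3(bootstrap_with=cnf) as solver:
    if solver.solve():
        # Positive literals identify the (node, feature) choices in one solution.
        print([lit for lit in solver.get_model() if lit > 0])
```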
DT-sampler is an ensemble model based on decision tree sampling. Unlike random forest, DT-sampler uniformly samples decision trees from a given space, which yields more stable results and higher interpretability. DT-sampler has only two key parameters: #node and threshold. #node constrains the size of the decision trees generated by DT-sampler, and threshold enforces a minimum training accuracy for each decision tree.
We use conformal prediction to conformalize the trees sampled by DT-sampler, each offering a coverage guarantee at level 1 − α.
By combining SAT-based tree generation with conformal calibration, we achieve flexible, controllable decision tree ensembles that retain the benefits of smaller, interpretable models without sacrificing performance guarantees.
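As a concrete illustration of the conformal step, below is a minimal split-conformal classification sketch. The `model` object (anything with a scikit-learn-style `predict_proba`), the function names, and the score choice are illustrative assumptions, not DT-sampler's API.

```python
import numpy as np

def conformal_quantile(model, X_calib, y_calib, alpha=0.1):
    # Nonconformity score: 1 - predicted probability of the true class
    # (assumes y_calib holds integer class indices aligned with predict_proba columns).
    probs = model.predict_proba(X_calib)
    scores = 1.0 - probs[np.arange(len(y_calib)), y_calib]
    n = len(scores)
    # Finite-sample corrected quantile level: ceil((n + 1) * (1 - alpha)) / n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def conformal_predict(model, X_test, qhat):
    # Prediction set: all classes whose nonconformity score is at most qhat;
    # such sets cover the true label with probability at least 1 - alpha.
    probs = model.predict_proba(X_test)
    return [np.where(1.0 - p <= qhat)[0] for p in probs]
```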
Feature importance is defined as the contribution of each feature to the high-accuracy space, measured by how often the feature appears in trees sampled from that space.
① Encode the construction of decision trees as a SAT problem.
② Use a SAT sampler (UniGen) to uniformly sample multiple satisfying assignments from the high-accuracy space.
③ Decode the satisfying assignments back into decision trees.
④ Estimate the training accuracy distribution of the decision trees in the high accuracy space.
⑤ Measure feature importance by calculating the emergence probability of each feature (see the sketch after this list).
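As a sketch of step ⑤: assuming each sampled tree can be reduced to the set of feature indices used at its branch nodes (this representation is an assumption for illustration), the emergence probability is simply the fraction of sampled trees containing each feature.

```python
import numpy as np

def emergence_probability(sampled_trees, n_features):
    # Fraction of sampled trees in which each feature appears at a branch node.
    counts = np.zeros(n_features)
    for used_features in sampled_trees:
        for j in set(used_features):
            counts[j] += 1.0
    return counts / len(sampled_trees)

# Example: 3 sampled trees over 4 features, given as lists of feature indices.
trees = [[0, 2], [0, 1], [2, 3]]
print(emergence_probability(trees, 4))  # -> approx. [0.667, 0.333, 0.667, 0.333]
```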
We recommend creating a virtual environment using conda or venv. A `requirements.txt` file is provided to reproduce the environment. Our implementation was tested with Python 3.12.8.
- Create a Conda Virtual Environment

```bash
conda create -n dtsampler python=3.12.8 -y
conda activate dtsampler
```

- Install Dependencies

```bash
pip install -r requirements.txt
```

To encode a decision tree, invoke the following function in `encode.py`:

```python
get_solution(X_train, y_train, target_nodes, true_n, export_path, is_leaf_sampling=True)
```

To train and generate sampled decision trees, execute the following code snippet:

```python
dt_sampler = DT_sampler(X_train, y_train, node_n, threshold, cnf_path)
dt_sampler.run(num_samples, method="unigen", sample_seed=seed)
```

See example.ipynb for a detailed walkthrough.
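For orientation, below is a hypothetical end-to-end sketch. The import path, dataset, binarization step, and parameter values are all illustrative assumptions; consult example.ipynb for the authoritative usage.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from dt_sampler import DT_sampler  # module path is an assumption; see the repo layout

X, y = load_breast_cancer(return_X_y=True)
# Binarize features (assumption: the SAT encoding expects binary features).
X = (X > np.median(X, axis=0)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

node_n, threshold, cnf_path = 5, 0.85, "tree.cnf"  # illustrative values
dt_sampler = DT_sampler(X_train, y_train, node_n, threshold, cnf_path)
dt_sampler.run(1000, method="unigen", sample_seed=0)
```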
Tsuda Laboratory (https://www.tsudalab.org/)
Department of Computational Biology and Medical Science, The University of Tokyo


