Interpretable (or explainable) machine learning models, such as decision trees, play a crucial role in trustworthy AI. However, finding optimal decision trees (i.e., minimum-size, maximum-accuracy trees) is not a simple task and remains an active area of research. While a single decision tree has limited expressivity, an ensemble of decision trees can effectively capture the complex structures found in many real-world applications. Many existing tree ensemble methods are greedy and suboptimal, and often suffer from randomness in the tree generation process. In this paper, we introduce DT-sampler, a SAT-based decision tree ensemble method that allows explicit control over both the size and accuracy of the sampled trees. We develop a novel SAT encoding that uses only branch nodes, resulting in a compact representation of the decision tree space. Additionally, standard point predictions made by decision tree ensembles offer no statistical guarantee on the miscoverage rate. We employ conformal prediction (CP), a distribution-free statistical framework that provides a valid finite-sample coverage guarantee, to show that DT-sampler is statistically more efficient and produces more stable results than a random forest classifier. We demonstrate the effectiveness of our method on several benchmark and real-world datasets.
More details about DT-sampler can be found in our paper. The previous version of DT-sampler (ICML 2023 Workshop paper) is available at https://arxiv.org/abs/2307.13333, and its original codebase at https://github.com/tsudalab/DT-sampler.
We propose an efficient SAT-based approach (branch-node-only encoding) for constructing decision trees. By introducing only a small set of variables and constraints, this method ensures high training accuracy while reducing the search space.
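For intuition, the sketch below shows one constraint family of this flavor in CNF form ("each branch node tests exactly one feature"), written with the python-sat (pysat) package. This is a toy illustration of the style of encoding; the variable semantics and constraints are assumptions for illustration, not the paper's exact formulation.

```python
from pysat.formula import CNF
from pysat.solvers import Glucose3

n_branch_nodes, n_features = 3, 4

def var(v, j):
    # Boolean variable f[v, j]: "branch node v tests feature j" (1-indexed DIMACS id).
    return v * n_features + j + 1

cnf = CNF()
for v in range(n_branch_nodes):
    # At least one feature per branch node.
    cnf.append([var(v, j) for j in range(n_features)])
    # At most one feature per branch node (pairwise encoding).
    for j in range(n_features):
        for k in range(j + 1, n_features):
            cnf.append([-var(v, j), -var(v, k)])

with Glucose3(bootstrap_with=cnf) as solver:
    if solver.solve():
        # Positive literals identify the (node, feature) choices in one solution.
        print([lit for lit in solver.get_model() if lit > 0])
```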
DT-sampler is an ensemble model based on decision tree sampling. Unlike random forest, DT-sampler uniformly samples decision trees from a given space, which yields more stable results and higher interpretability. DT-sampler has only two key parameters: #node and threshold. #node constrains the size of the decision trees generated by DT-sampler, and threshold enforces a minimum training accuracy for each decision tree.
We use conformal prediction to conformalize the trees sampled by DT-sampler, each offering a coverage guarantee at level 1 − α.
By combining SAT-based tree generation with conformal calibration, we achieve flexible, controllable decision tree ensembles that retain the benefits of smaller, interpretable models without sacrificing performance guarantees.
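As a concrete illustration of the conformal step, below is a minimal split-conformal classification sketch. The `model` object (anything with a scikit-learn-style `predict_proba`), the function names, and the score choice are illustrative assumptions, not DT-sampler's API.

```python
import numpy as np

def conformal_quantile(model, X_calib, y_calib, alpha=0.1):
    # Nonconformity score: 1 - predicted probability of the true class
    # (assumes y_calib holds integer class indices aligned with predict_proba columns).
    probs = model.predict_proba(X_calib)
    scores = 1.0 - probs[np.arange(len(y_calib)), y_calib]
    n = len(scores)
    # Finite-sample corrected quantile level: ceil((n + 1) * (1 - alpha)) / n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def conformal_predict(model, X_test, qhat):
    # Prediction set: all classes whose nonconformity score is at most qhat;
    # such sets cover the true label with probability at least 1 - alpha.
    probs = model.predict_proba(X_test)
    return [np.where(1.0 - p <= qhat)[0] for p in probs]
```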
Feature importance is defined as the contribution of each feature to the high-accuracy space, measured by how often the feature appears in trees sampled from that space.
① Encode the construction of decision trees as a SAT problem.
② Use a SAT sampler (UniGen) to uniformly sample multiple satisfying assignments from the high-accuracy space.
③ Decode the satisfying assignments back into decision trees.
④ Estimate the training accuracy distribution of the decision trees in the high accuracy space.
⑤ Measure feature importance by calculating the emergence probability of each feature (see the sketch after this list).
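As a sketch of step ⑤: assuming each sampled tree can be reduced to the set of feature indices used at its branch nodes (this representation is an assumption for illustration), the emergence probability is simply the fraction of sampled trees containing each feature.

```python
import numpy as np

def emergence_probability(sampled_trees, n_features):
    # Fraction of sampled trees in which each feature appears at a branch node.
    counts = np.zeros(n_features)
    for used_features in sampled_trees:
        for j in set(used_features):
            counts[j] += 1.0
    return counts / len(sampled_trees)

# Example: 3 sampled trees over 4 features, given as lists of feature indices.
trees = [[0, 2], [0, 1], [2, 3]]
print(emergence_probability(trees, 4))  # -> approx. [0.667, 0.333, 0.667, 0.333]
```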
We recommend creating a virtual environment using conda or venv. A `requirements.txt` file is provided to reproduce the environment. Our implementation was tested with Python 3.12.8.
- Create a Conda Virtual Environment

```bash
conda create -n dtsampler python=3.12.8 -y
conda activate dtsampler
```

- Install Dependencies

```bash
pip install -r requirements.txt
```

To encode a decision tree, invoke the following function in `encode.py`:

```python
get_solution(X_train, y_train, target_nodes, true_n, export_path, is_leaf_sampling=True)
```

To train and generate sampled decision trees, execute the following code snippet:

```python
dt_sampler = DT_sampler(X_train, y_train, node_n, threshold, cnf_path)
dt_sampler.run(num_samples, method="unigen", sample_seed=seed)
```

See example.ipynb for a detailed walkthrough.
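For orientation, below is a hypothetical end-to-end sketch. The import path, dataset, binarization step, and parameter values are all illustrative assumptions; consult example.ipynb for the authoritative usage.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from dt_sampler import DT_sampler  # module path is an assumption; see the repo layout

X, y = load_breast_cancer(return_X_y=True)
# Binarize features (assumption: the SAT encoding expects binary features).
X = (X > np.median(X, axis=0)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

node_n, threshold, cnf_path = 5, 0.85, "tree.cnf"  # illustrative values
dt_sampler = DT_sampler(X_train, y_train, node_n, threshold, cnf_path)
dt_sampler.run(1000, method="unigen", sample_seed=0)
```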
Tsuda Laboratory (https://www.tsudalab.org/)
Department of Computational Biology and Medical Science, The University of Tokyo


