Skip to content

Latest commit

 

History

History
275 lines (202 loc) · 9.46 KB

README.md

File metadata and controls

275 lines (202 loc) · 9.46 KB

TabPFN

PyPI version Downloads Discord Documentation colab Python Versions

TabPFN Summary

TabPFN is a foundation model for tabular data that outperforms traditional methods while being dramatically faster. This repository contains the core PyTorch implementation with CUDA optimization.

⚠️ Major Update: Version 2.0: Complete codebase overhaul with new architecture and features. Previous version available at v1.0.0 and pip install tabpfn<2.

📚 For detailed usage examples and best practices, check out Interactive Colab Tutorial

🌐 TabPFN Ecosystem

Choose the right TabPFN implementation for your needs:

  • TabPFN Client: Easy-to-use API client for cloud-based inference
  • TabPFN Extensions: Community extensions and integrations
  • TabPFN (this repo): Core implementation for local deployment and research
  • TabPFN UX: No-code TabPFN usage

Try our Interactive Colab Tutorial to get started quickly.

🏁 Quick Start

Installation

# Simple installation
pip install tabpfn

# Local development installation
git clone https://github.com/PriorLabs/TabPFN.git
pip install -e "TabPFN[dev]"

Basic Usage

Classification

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))

# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))

Regression

from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assuming there is a TabPFNRegressor (if not, a different regressor should be used)
from tabpfn import TabPFNRegressor  

# Load Boston Housing data
df = fetch_openml(data_id=531, as_frame=True)  # Boston Housing dataset
X = df.data
y = df.target.astype(float)  # Ensure target is float for regression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize the regressor
regressor = TabPFNRegressor()  
regressor.fit(X_train, y_train)

# Predict on the test set
predictions = regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)

Best Results

For optimal performance, use the AutoTabPFNClassifier or AutoTabPFNRegressor for post-hoc ensembling. These can be found in the TabPFN Extensions repository. Post-hoc ensembling combines multiple TabPFN models into an ensemble.

Steps for Best Results:

  1. Install the extensions:

    git clone https://github.com/priorlabs/tabpfn-extensions.git
    pip install -e tabpfn-extensions
  2. from tabpfn_extensions.post_hoc_ensembles.sklearn_interface import AutoTabPFNClassifier
    
    clf = AutoTabPFNClassifier(max_time=120, device="cuda") # 120 seconds tuning time
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)

See our Colab

🤝 Join Our Community

We're building the future of tabular machine learning and would love your involvement:

  1. Connect & Learn:

  2. Contribute:

    • Report bugs or request features
    • Submit pull requests
    • Share your research and use cases
  3. Stay Updated: Star the repo and join Discord for the latest updates

📜 License

Prior Labs License (Apache 2.0 with additional attribution requirement): here

📚 Citation

You can read our paper explaining TabPFN here.

@article{hollmann2025tabpfn,
 title={Accurate predictions on small data with a tabular foundation model},
 author={Hollmann, Noah and M{\"u}ller, Samuel and Purucker, Lennart and
         Krishnakumar, Arjun and K{\"o}rfer, Max and Hoo, Shi Bin and
         Schirrmeister, Robin Tibor and Hutter, Frank},
 journal={Nature},
 year={2025},
 month={01},
 day={09},
 doi={10.1038/s41586-024-08328-6},
 publisher={Springer Nature},
 url={https://www.nature.com/articles/s41586-024-08328-6},
}

@inproceedings{hollmann2023tabpfn,
  title={TabPFN: A transformer that solves small tabular classification problems in a second},
  author={Hollmann, Noah and M{\"u}ller, Samuel and Eggensperger, Katharina and Hutter, Frank},
  booktitle={International Conference on Learning Representations 2023},
  year={2023}
}

❓ FAQ

Usage & Compatibility

Q: What dataset sizes work best with TabPFN?
A: TabPFN is optimized for datasets up to 10,000 rows. For larger datasets, consider using Random Forest preprocessing or other extensions. See our Colab notebook for strategies.

Q: Why can't I use TabPFN with Python 3.8?
A: TabPFN v2 requires Python 3.9+ due to newer language features. Compatible versions: 3.9, 3.10, 3.11, 3.12.

Installation & Setup

Q: How do I use TabPFN without an internet connection?

TabPFN automatically downloads model weights when first used. For offline usage:

Manual Download

  1. Download the model files manually from HuggingFace:

  2. Place the file in one of these locations:

    • Specify directly: TabPFNClassifier(model_path="/path/to/model.ckpt")
    • Set environment variable: os.environ["TABPFN_MODEL_CACHE_DIR"] = "/path/to/dir"
    • Default OS cache directory:
      • Windows: %APPDATA%\tabpfn\
      • macOS: ~/Library/Caches/tabpfn/
      • Linux: ~/.cache/tabpfn/

Quick Download Script

import requests
from tabpfn.utils import _user_cache_dir
import sys

# Get default cache directory using TabPFN's internal function
cache_dir = _user_cache_dir(platform=sys.platform)
cache_dir.mkdir(parents=True, exist_ok=True)

# Define models to download
models = {
    "tabpfn-v2-classifier.ckpt": "https://huggingface.co/Prior-Labs/TabPFN-v2-clf/resolve/main/tabpfn-v2-classifier.ckpt",
    "tabpfn-v2-regressor.ckpt": "https://huggingface.co/Prior-Labs/TabPFN-v2-reg/resolve/main/tabpfn-v2-regressor.ckpt",
}

# Download each model
for name, url in models.items():
    path = cache_dir / name
    print(f"Downloading {name} to {path}")
    with open(path, "wb") as f:
        f.write(requests.get(url).content)

print(f"Models downloaded to {cache_dir}")

Q: I'm getting a pickle error when loading the model. What should I do?
A: Try the following:

  • Download the newest version of tabpfn pip install tabpfn --upgrade
  • Ensure model files downloaded correctly (re-download if needed)

Performance & Limitations

Q: Can TabPFN handle missing values?
A: Yes!

Q: How can I improve TabPFN’s performance?
A: Best practices:

  • Use AutoTabPFNClassifier from TabPFN Extensions for post-hoc ensembling
  • Feature engineering: Add domain-specific features to improve model performance Not effective:
  • Adapt feature scaling
  • Convert categorical features to numerical values (e.g., one-hot encoding)

🛠️ Development

  1. Setup environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
git clone https://github.com/PriorLabs/TabPFN.git
cd tabpfn
pip install -e ".[dev]"
pre-commit install
  1. Before committing:
pre-commit run --all-files
  1. Run tests:
pytest tests/

Built with ❤️ by Prior Labs - Copyright (c) 2025 Prior Labs GmbH