Transformers, despite their success in a variety of sequence modeling tasks, have a significant limitation: they are inherently data-hungry, which can lead to overfitting when data are scarce. In such cases, common practice is to build a foundation model, a model trained on large amounts of publicly available data, that can then be fine-tuned for a specific task. Another known problem of foundation models is training data leakage: it has been demonstrated that excerpts of the training data can be extracted from a foundation model by prompt engineering, which poses a high risk of exposing confidential data. This repository implements Unified Lookup Tables (ULT), a data preprocessing step for building and fine-tuning foundation models in a privacy-preserving manner, which simultaneously enables the reuse of a trained model on new datasets without exposing any training data. The method relies on data compression methods as efficient modality tokenizers, and on a common representation vocabulary shared across all datasets.
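As a rough illustration of the idea (not the package's actual implementation), a compression-based tokenizer can map any byte stream, text or image alike, into one shared 256-entry vocabulary. In the minimal sketch below, zlib stands in for whichever compressor is used, and the function names are ours:

```python
import zlib

# Illustration only: the function names and the choice of zlib are assumptions,
# not the ULT package API. Any modality is first serialized to bytes, then
# compressed; each compressed byte is a token id in one shared 256-entry
# vocabulary, so text and images land in the same token space.
VOCAB_SIZE = 256  # one token id per possible byte value

def ult_encode(raw: bytes) -> list[int]:
    """Compress the input and emit token ids from the shared vocabulary."""
    compressed = zlib.compress(raw)
    return list(compressed)  # each byte is already an id in [0, 255]

def ult_decode(tokens: list[int]) -> bytes:
    """Invert the encoding: rebuild the compressed stream and decompress."""
    return zlib.decompress(bytes(tokens))

text_tokens = ult_encode("hello world".encode("utf-8"))
assert ult_decode(text_tokens) == b"hello world"
```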
The repository contains a Python package that implements ULT. Dependencies are managed with poetry:

```bash
poetry install
```
See examples of ULT data processing for text and images in `notebooks/`.
To pretrain SmolLM-135M on the wikitext dataset using our ULT, run the following script; the file `train_args.yaml` contains training arguments that can be modified as desired.

```bash
poetry run python scripts/training_ult.py --dataset_name wikitext --train_args_path train_args.yaml
```
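The exact keys accepted in `train_args.yaml` are defined by `scripts/training_ult.py`; a hypothetical example, assuming standard HuggingFace-style training arguments, might look like:

```yaml
# Hypothetical contents; the accepted keys are defined by scripts/training_ult.py.
output_dir: ./checkpoints/wikitext-ult
per_device_train_batch_size: 16
learning_rate: 3.0e-4
num_train_epochs: 3
warmup_steps: 500
save_steps: 1000
logging_steps: 100
```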
To pretrain on the PASS images dataset:

```bash
poetry run python scripts/training_ult.py --dataset_name pass --train_args_path train_args.yaml
```
To finetune the image pretraining on the CIFAR10 dataset, run the same script as above, but replace loading SmolLM-135M from scratch with the checkpoint obtained from PASS pretraining; a sketch of the swap follows.
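Assuming the training script builds the model with HuggingFace `transformers` (as the SmolLM-135M reference suggests), the swap amounts to something like the following; the checkpoint path is hypothetical and should point at wherever your PASS run saved its output:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Pretraining initializes the SmolLM-135M architecture from scratch:
config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM-135M")
model = AutoModelForCausalLM.from_config(config)

# Finetuning on CIFAR10 instead resumes from the PASS checkpoint
# (the path below is a placeholder for your own run's output directory):
model = AutoModelForCausalLM.from_pretrained("./checkpoints/pass-ult")
```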
- `scripts` contains the basic training pipelines leveraging the package.
- `ult.data` contains data loaders for the three datasets from the paper "Unified Lookup Tables: Privacy Preserving Foundation Models".
- `ult.transforms` and `ult.configuration` contain the code for encoding and decoding the data, as well as building a lookup table (a hypothetical usage sketch follows this list).
- `tests` contains unit tests.
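For orientation only, end-to-end usage of these modules might look like the sketch below; every imported name here is an assumption about the package layout, not the actual API:

```python
# Hypothetical sketch: the real function and class names live in
# ult.transforms / ult.configuration and may differ from what is shown here.
from ult.transforms import encode, decode          # assumed API
from ult.configuration import build_lookup_table   # assumed API

table = build_lookup_table(["wikitext", "pass"])   # shared vocabulary across datasets
tokens = encode(b"some raw bytes", table)          # compress + map to token ids
assert decode(tokens, table) == b"some raw bytes"  # encoding is lossless
```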
NOTE: to run the tests:

```bash
poetry run pip install pytest
poetry run pytest -sv tests
```