
Python library for embedding inference of relational tables.



A Python library for embedding inference of relational tabular data. This repository evolved from the codebase of our VLDB 2024 paper, Observatory: Characterizing Embeddings of Relational Tables.

We are open-sourcing the Observatory library for beta testing. The library is under active development, and we welcome feedback and contributions. Please feel free to open an issue or submit a pull request.


Install from source

These instructions assume Miniconda for Python package management on a Linux machine.

  1. Clone the repository and go to the project directory:

    git clone <repo url>
    cd observatory-library

  2. Create and activate the environment:

    conda env create -f cpu_environment.yml
    conda activate observatory

    If you have access to GPUs, create the GPU environment that matches your CUDA version (11.8 or 12.1) instead:

    conda env create -f cuda_<11.8/12.1>_environment.yml
    conda activate observatory

Install from Conda

Coming soon, after the beta test.

Quick Start

import os

import torch

from observatory.datasets.sotab import SotabDataset, collate_fn
from observatory.models.bert_family import BertModelWrapper
from observatory.preprocessing.columnwise import (
    ColumnwiseCellDocumentFrequencyBasedPreprocessor,
)
from torch.utils.data import DataLoader

# Initialize data (the metadata file simply lists all the table file names)
data_dir = "./tests/sample_data/wiki_tables"
metadata_filepath = os.path.join(data_dir, "table_inventory.csv")
sotab_dataset = SotabDataset(data_dir, metadata_filepath)

sotab_dataloader = DataLoader(
    sotab_dataset,
    batch_size=4,  # batch size for loading tables
    collate_fn=collate_fn,
)

# Initialize model
model_name = "bert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_wrapper = BertModelWrapper(model_name, device)

# Create cell document frequency-based preprocessor for inferring column embeddings
cell_frequencies = sotab_dataset.compute_cell_document_frequencies()
columnwise_preprocessor = ColumnwiseCellDocumentFrequencyBasedPreprocessor(
    ...  # arguments elided here (presumably including cell_frequencies)
)

# Infer column embeddings
for batch_tables in sotab_dataloader:
    encoded_inputs, _ = columnwise_preprocessor.serialize(batch_tables)

    column_embeddings = model_wrapper.batch_infer_embeddings(
        encoded_inputs, batch_size=16  # batch size for embedding inference
    )

You can find more examples of embedding inference in the tests directory.
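
To make the idea of a cell document frequency more concrete, here is a minimal standalone sketch. It assumes that each table is treated as a "document" and that the frequency of a cell value is the number of tables containing it. The function name `count_cell_document_frequencies` and the toy tables are hypothetical; this is not the library's implementation of `compute_cell_document_frequencies`.

```python
from collections import Counter


def count_cell_document_frequencies(tables):
    """Count, for each distinct cell value, how many tables contain it.

    Hypothetical helper for illustration only; each table is a list of rows,
    and each row is a list of cell values.
    """
    frequencies = Counter()
    for table in tables:
        # A set ensures each value is counted at most once per table.
        distinct_cells = {str(cell) for row in table for cell in row}
        frequencies.update(distinct_cells)
    return frequencies


# Three toy tables; "Paris" appears in the first two.
tables = [
    [["Paris", "France"], ["Berlin", "Germany"]],
    [["Paris", "Texas"], ["Berlin", "Germany"]],
    [["Tokyo", "Japan"]],
]
cell_frequencies = count_cell_document_frequencies(tables)
print(cell_frequencies["Paris"])  # 2
```

A preprocessor can then use such counts to decide which cells of a column to keep when the full column does not fit into the model input.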


Leave serialization, encoding, and (batch) inference to Observatory

We currently support the following preprocessors:

Preprocessor                              Embedding Inference         Source
CellDocumentFrequencyBasedPreprocessor    column                      DeepJoin
MaxRowsPreprocessor                       column, row, table, cell    Observatory
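
As a rough illustration of what a max-rows strategy could look like, the sketch below finds the largest number of leading rows whose serialization stays within a token budget. Everything here is an assumption made for illustration: the function `max_rows_that_fit`, the `" | "` and `" [SEP] "` separators, and the whitespace token counter are hypothetical and are not taken from the library.

```python
def max_rows_that_fit(rows, max_tokens, count_tokens):
    """Return the largest n such that the first n rows fit within max_tokens.

    Hypothetical helper; uses binary search, which is valid because the
    token count never decreases as more rows are added.
    """
    low, high = 0, len(rows)
    while low < high:
        mid = (low + high + 1) // 2
        # Assumed serialization: cells joined by " | ", rows joined by " [SEP] ".
        serialized = " [SEP] ".join(" | ".join(row) for row in rows[:mid])
        if count_tokens(serialized) <= max_tokens:
            low = mid  # the first `mid` rows fit; try more
        else:
            high = mid - 1  # too long; try fewer
    return low


def whitespace_token_count(text):
    """Toy stand-in for a real tokenizer: count whitespace-separated tokens."""
    return len(text.split())


rows = [["Paris", "France"], ["Berlin", "Germany"], ["Tokyo", "Japan"]]
n = max_rows_that_fit(rows, max_tokens=8, count_tokens=whitespace_token_count)
print(n)  # 2
```

In practice the token counter would be the tokenizer of the chosen model, and the budget would be its maximum input size.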

Easy integration with Hugging Face models

We currently support any BERT-like encoder model, including BERT, RoBERTa, and ALBERT. To add support for another model, implement a wrapper class that inherits from BERTFamilyModelWrapper and implements the get_model method. For example:

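from transformers import AlbertModel

# Assumed import path: the base class is expected to live next to BertModelWrapper.
from observatory.models.bert_family import BERTFamilyModelWrapper


class AlbertModelWrapper(BERTFamilyModelWrapper):
    def get_model(self) -> AlbertModel:
        try:
            # Prefer a locally cached copy of the model.
            model = AlbertModel.from_pretrained(
                self.model_name, local_files_only=True
            )
        except OSError:
            # Fall back to downloading the model.
            model = AlbertModel.from_pretrained(self.model_name)

        # Assumption: the wrapper stores the target device as self.device.
        model = model.to(self.device)

        return model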
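
The reason a subclass only needs to supply get_model can be illustrated with a small standalone sketch of the pattern. It does not use Observatory or transformers; `FamilyModelWrapper`, `DummyModel`, `DummyModelWrapper`, and `describe` are hypothetical names, and the real base class may be organized differently.

```python
from abc import ABC, abstractmethod


class FamilyModelWrapper(ABC):
    """Hypothetical base wrapper: shared logic lives here."""

    def __init__(self, model_name, device):
        self.model_name = model_name
        self.device = device
        # The subclass decides which model to load.
        self.model = self.get_model()

    @abstractmethod
    def get_model(self):
        """Load and return the underlying model."""

    def describe(self):
        # Shared behavior that works for any subclass.
        return f"{type(self.model).__name__} ({self.model_name}) on {self.device}"


class DummyModel:
    """Placeholder standing in for a real encoder model."""


class DummyModelWrapper(FamilyModelWrapper):
    def get_model(self):
        return DummyModel()


wrapper = DummyModelWrapper("dummy-base", "cpu")
print(wrapper.describe())  # DummyModel (dummy-base) on cpu
```

With this layout, everything except model loading is written once in the base class, so adding a model family means overriding a single method.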

Citing Observatory

If you find Observatory useful for your work, please cite the following BibTeX:

@article{cong2023observatory,
  author  = {Tianji Cong and
             Madelon Hulsebos and
             Zhenjie Sun and
             Paul Groth and
             H. V. Jagadish},
  title   = {Observatory: Characterizing Embeddings of Relational Tables},
  journal = {Proc. {VLDB} Endow.},
  volume  = {17},
  number  = {4},
  pages   = {849--862},
  year    = {2023}
}

@inproceedings{cong2023observatorylibrary,
  author    = {Cong, Tianji and
               Sun, Zhenjie and
               Groth, Paul and
               Jagadish, H. V. and
               Hulsebos, Madelon},
  title     = {Introducing the Observatory Library for End-to-End Table Embedding Inference},
  booktitle = {The 2nd Table Representation Learning Workshop at NeurIPS 2023},
  publisher = {},
  year      = {2023}
}

