
Observatory

A Python library for embedding inference of relational tabular data. This repository evolved from the codebase of our VLDB 2024 paper Observatory: Characterizing Embeddings of Relational Tables.

We are open-sourcing the Observatory library as a beta release. The library is under active development, and we welcome feedback and contributions: please feel free to open an issue or submit a pull request.

Installation

Install from source

The instructions below assume Miniconda is used for Python package management on a Linux machine.

  1. Clone the repository and go to the project directory:

    git clone <repo url>
    cd observatory-library
    
  2. Create and activate the environment:

    conda env create -f cpu_environment.yml
    conda activate observatory
    

    If you have access to GPUs, create and activate the corresponding CUDA environment instead:

    conda env create -f cuda_<11.8/12.1>_environment.yml
    conda activate observatory
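
To verify the installation, you can check that PyTorch and the library import cleanly and that a GPU is visible if you installed a CUDA environment. The snippet below is a minimal sketch; it only uses imports that appear in the Quick Start section.

import torch

from observatory.models.bert_family import BertModelWrapper

# Confirm PyTorch sees a GPU (only expected for the CUDA environments) and
# that the Observatory model wrapper is importable.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Imported wrapper: {BertModelWrapper.__name__}")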
    

Install from Conda

Coming soon after the beta test.

Quick Start

import os

import torch

from observatory.datasets.sotab import SotabDataset, collate_fn
from observatory.models.bert_family import BertModelWrapper
from observatory.preprocessing.columnwise import (
    ColumnwiseCellDocumentFrequencyBasedPreprocessor,
)
from torch.utils.data import DataLoader

# Initialize data (the metadata file simply lists all the table file names)
data_dir = "./tests/sample_data/wiki_tables"
metadata_filepath = os.path.join(data_dir, "table_inventory.csv")
sotab_dataset = SotabDataset(data_dir, metadata_filepath)

sotab_dataloader = DataLoader(
    sotab_dataset,
    batch_size=4,  # batch size for loading tables
    shuffle=False,
    collate_fn=collate_fn,
)

# Initialize model
model_name = "bert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_wrapper = BertModelWrapper(model_name, device)

# Create cell document frequency-based preprocessor for inferring column embeddings
cell_frequencies = sotab_dataset.compute_cell_document_frequencies()
columnwise_preprocessor = ColumnwiseCellDocumentFrequencyBasedPreprocessor(
    tokenizer=model_wrapper.tokenizer,
    max_input_size=model_wrapper.max_input_size,
    cell_frequencies=cell_frequencies,
    include_table_name=True,
    include_column_names=True,
    include_column_stats=True,
)

# Infer column embeddings
for batch_tables in sotab_dataloader:
    encoded_inputs, _ = columnwise_preprocessor.serialize(batch_tables)

    column_embeddings = model_wrapper.batch_infer_embeddings(
        encoded_inputs, batch_size=16  # batch size for embedding inference
    )

You can find more examples of embedding inference in the tests directory.
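
Once you have column embeddings, a typical next step is to compare columns, for example with cosine similarity as in DeepJoin-style joinable-column search. The snippet below is a minimal sketch that continues from the loop above; it assumes column_embeddings holds one vector per serialized column and can be stacked into a 2-D tensor, so adapt it to the actual return type.

import torch
import torch.nn.functional as F

# Hypothetical follow-up: compare two columns from the last batch.
# Assumes `column_embeddings` stacks into a (num_columns, hidden_size) tensor.
embeddings = torch.stack(list(column_embeddings))
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Column 0 vs column 1 cosine similarity: {similarity.item():.4f}")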

Features

Leave serialization, encoding, and (batch) inference to Observatory

We currently support the following preprocessors:

Preprocessor                             | Embedding Inference       | Source
CellDocumentFrequencyBasedPreprocessor   | column                    | DeepJoin
MaxRowsPreprocessor                      | column, row, table, cell  | Observatory
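
Every preprocessor plugs into the same serialize-then-infer pipeline shown in the Quick Start. The sketch below swaps in MaxRowsPreprocessor for table embeddings; note that the import path and constructor arguments here are assumptions modeled on the columnwise example above, so consult the tests directory for the actual interface.

# Hypothetical sketch: import path and constructor arguments are assumed to
# mirror ColumnwiseCellDocumentFrequencyBasedPreprocessor; see the tests
# directory for the real interface.
from observatory.preprocessing.rowwise import MaxRowsPreprocessor  # assumed path

max_rows_preprocessor = MaxRowsPreprocessor(
    tokenizer=model_wrapper.tokenizer,
    max_input_size=model_wrapper.max_input_size,
)

for batch_tables in sotab_dataloader:
    encoded_inputs, _ = max_rows_preprocessor.serialize(batch_tables)
    table_embeddings = model_wrapper.batch_infer_embeddings(
        encoded_inputs, batch_size=16
    )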

Easy integration with Hugging Face models

We currently support any BERT-like encoder model, including BERT, RoBERTa, and ALBERT. To extend support to other models, implement a wrapper class that inherits from BERTFamilyModelWrapper and implements the get_model method. For example:

from transformers import AlbertModel

# The import path for the base class is assumed here (it is not shown in the
# original example); adjust it to wherever BERTFamilyModelWrapper lives.
from observatory.models.bert_family import BERTFamilyModelWrapper


class AlbertModelWrapper(BERTFamilyModelWrapper):
    def get_model(self) -> AlbertModel:
        # Prefer a locally cached checkpoint and fall back to downloading
        # from the Hugging Face Hub if none is available.
        try:
            model = AlbertModel.from_pretrained(
                self.model_name, local_files_only=True
            )
        except OSError:
            model = AlbertModel.from_pretrained(self.model_name)

        model = model.to(self.device)
        model.eval()

        return model
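
With the wrapper defined, it can be dropped into the Quick Start pipeline in place of BertModelWrapper. This assumes the base class constructor takes a model name and a device, matching how BertModelWrapper is instantiated above; "albert-base-v2" is just an example checkpoint. Remember to rebuild the preprocessor so it uses the new wrapper's tokenizer.

# Hypothetical usage: swap the ALBERT wrapper into the Quick Start pipeline.
albert_wrapper = AlbertModelWrapper("albert-base-v2", device)

# Rebuild the preprocessor with the ALBERT tokenizer and input size limit.
columnwise_preprocessor = ColumnwiseCellDocumentFrequencyBasedPreprocessor(
    tokenizer=albert_wrapper.tokenizer,
    max_input_size=albert_wrapper.max_input_size,
    cell_frequencies=cell_frequencies,
    include_table_name=True,
    include_column_names=True,
    include_column_stats=True,
)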

Citing Observatory

If you find Observatory useful for your work, please cite it with the following BibTeX entries:

@article{cong2023observatory,
  author  = {Tianji Cong and
             Madelon Hulsebos and
             Zhenjie Sun and
             Paul Groth and
             H. V. Jagadish},
  title   = {Observatory: Characterizing Embeddings of Relational Tables},
  journal = {Proc. {VLDB} Endow.},
  volume  = {17},
  number  = {4},
  pages   = {849--862},
  year    = {2023},
}
@inproceedings{cong2023observatorylibrary,
  author    = {Cong, Tianji and
               Sun, Zhenjie and
               Groth, Paul and
               Jagadish, H. V. and
               Hulsebos, Madelon},
  title     = {Introducing the Observatory Library for End-to-End Table Embedding Inference},
  booktitle = {The 2nd Table Representation Learning Workshop at NeurIPS 2023},
  publisher = {https://table-representation-learning.github.io},
  year      = {2023}
}
