DIALITE: Discover, Align and Integrate Open Data Tables

This repository contains the implementation code for our demo paper, DIALITE, presented at SIGMOD 2023.

Authors: Aamod Khatiwada, Roee Shraga and Renée J. Miller

Paper and demonstration video: https://dl.acm.org/doi/10.1145/3555041.3589732

Link to the demo website: https://tinyurl.com/dialite-sigmod

Block Diagram of DIALITE System

Abstract

We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze the integration result by applying different downstream tasks over it. Our pipeline has a flexible architecture such that the user can easily add and compare additional discovery and integration algorithms.
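
The three stages and the pluggable design described in the abstract can be illustrated with a small sketch. This is not DIALITE's actual code: the registries `DISCOVERY` and `INTEGRATION`, the `register` decorator, the `run_pipeline` function, and the list-of-dicts table representation are all hypothetical, and the two registered algorithms are toy stand-ins rather than SANTOS or ALITE.

```python
from typing import Callable, Dict, List

Table = List[dict]  # hypothetical representation: a table is a list of row dicts

# Hypothetical registries; the real system may organize its algorithms differently.
DISCOVERY: Dict[str, Callable[[Table, List[Table]], List[Table]]] = {}
INTEGRATION: Dict[str, Callable[[List[Table]], Table]] = {}


def register(registry, name):
    """Decorator that stores a function in `registry` under `name`."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap


@register(DISCOVERY, "shared_column")
def shared_column(query, lake):
    """Toy discovery: keep lake tables that share a column name with the query."""
    query_cols = {c for row in query for c in row}
    return [t for t in lake if query_cols & {c for row in t for c in row}]


@register(INTEGRATION, "outer_union")
def outer_union(tables):
    """Toy integration: stack all rows, filling absent columns with None."""
    cols = sorted({c for t in tables for row in t for c in row})
    return [{c: row.get(c) for c in cols} for t in tables for row in t]


def run_pipeline(query, lake, discover, integrate, analyze):
    """Stage 1: discover. Stage 2: integrate. Stage 3: analyze."""
    found = DISCOVERY[discover](query, lake)
    integrated = INTEGRATION[integrate]([query] + found)
    return analyze(integrated)
```

Under this sketch, adding an algorithm means registering one more function, and comparing two algorithms means calling `run_pipeline` twice with different names.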

Repository Organization

The alite folder contains the ALITE code, adapted from its original implementation.
The data folder contains sub-folders for the sample datasets and a placeholder for additional datasets.
The dialite.jpg file shows the block diagram of the DIALITE system.
The join folder contains the code for the joinability system.
The santos folder contains the SANTOS code and indexes, adapted from its original implementation.
The README.md file explains the repository.
The requirements.txt file lists the packages needed to run the project.
The templates folder contains the frontend code for the demo website.
The yago folder is a placeholder for the YAGO knowledge base files.
The *.ipynb files are example notebooks for running the demo without the web API.
The *.py files contain the Python Flask backend code for the demo website.

Setup

Clone the repository.
Change to the repository directory, then create and activate a virtual environment for this project. We recommend Python 3.7 or higher.

On macOS or Linux:

```
python3 -m venv env
source env/bin/activate
which python
```

On Windows:

```
python -m venv env
.\env\Scripts\activate.bat
where.exe python
```

Install necessary packages.
```
pip install -r requirements.txt
```
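
Once the environment is activated, it can help to confirm that the interpreter in use belongs to the virtual environment and meets the version recommendation above. The following is a minimal sketch and not part of the repository; the names `in_virtualenv` and `meets_minimum` are illustrative.

```python
import sys

MIN_VERSION = (3, 7)  # recommendation stated in the Setup section


def in_virtualenv() -> bool:
    """Return True when running inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def meets_minimum(version=None, minimum=MIN_VERSION) -> bool:
    """Return True when the (major, minor) version is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("Inside a virtual environment:", in_virtualenv())
    print("Meets version recommendation:", meets_minimum())
```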

Reproducibility

Change to the repository directory.

Run the following command, which downloads the preprocessed YAGO files from Google Drive and places them in the yago folder.

```
cd yago && gdown --folder https://drive.google.com/drive/folders/1FhvwxE0_iDO8Xy4jI7uq7roZSNXOJGr1 && mv yago/* ../ && rm -r yago && cd ../
```

Preprocess your data lake using SANTOS and upload the indexes to the santos/hashmap folder. All index file names must start with dialite_datalake, as in the following list.

dialite_datalake_main_relation_index.pickle
dialite_datalake_main_triple_index.pickle
dialite_datalake_main_yago_index.pickle
dialite_datalake_synth_relation_inverted_index.pbz2
dialite_datalake_synth_relation_kb.pbz2
dialite_datalake_synth_type_inverted_index.pbz2
dialite_datalake_synth_type_kb.pbz2
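
A quick way to confirm that the indexes are in place is to check the folder for the file names listed above. The sketch below is not part of the repository; it assumes the seven names above are the complete set, and the function name `missing_index_files` is illustrative.

```python
from pathlib import Path

# File names copied from the list above; assumed to be the complete set.
EXPECTED_INDEX_FILES = [
    "dialite_datalake_main_relation_index.pickle",
    "dialite_datalake_main_triple_index.pickle",
    "dialite_datalake_main_yago_index.pickle",
    "dialite_datalake_synth_relation_inverted_index.pbz2",
    "dialite_datalake_synth_relation_kb.pbz2",
    "dialite_datalake_synth_type_inverted_index.pbz2",
    "dialite_datalake_synth_type_kb.pbz2",
]


def missing_index_files(index_dir):
    """Return the expected index files that are absent from `index_dir`."""
    index_dir = Path(index_dir)
    return [name for name in EXPECTED_INDEX_FILES
            if not (index_dir / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_index_files("santos/hashmap")
    if missing:
        print("Missing index files:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected index files are present.")
```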

Alternatively, run the following command, which downloads the preprocessed indexes for the SANTOS Small Benchmark from Google Drive.

```
cd santos/hashmap && gdown --folder https://drive.google.com/drive/folders/1-1smQ5aD6iZLQcvdW6l_n2RhjhzY1UT_ && mv santos_hashmap/* ../ && rm -r santos_hashmap && cd ../../
```

Upload your data lake tables to the data/dialite_datalake folder. Alternatively, run the following command, which downloads the SANTOS Small Benchmark for use as a data lake. The preprocessed indexes provided in the previous step are for this benchmark.
```
cd data/dialite_datalake && zenodo_get 7758091 && unzip santos_benchmark.zip && cd santos_benchmark && mv * ../ && cd ../ && rm -r santos_benchmark && rm *.zip && cd ../../
```
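
To confirm that the data lake folder is populated, the tables in it can be listed. The sketch below is not part of the repository; it assumes the tables are stored as CSV files, possibly in sub-folders, and the function name `list_tables` is illustrative.

```python
from pathlib import Path


def list_tables(datalake_dir, pattern="*.csv"):
    """Return the sorted paths under `datalake_dir` that match `pattern`.

    The CSV pattern is an assumption; adjust it if your tables use
    another extension.
    """
    datalake_dir = Path(datalake_dir)
    if not datalake_dir.is_dir():
        return []
    # rglob also descends into sub-folders.
    return sorted(datalake_dir.rglob(pattern))


if __name__ == "__main__":
    # Run from the repository root.
    tables = list_tables("data/dialite_datalake")
    print("Found", len(tables), "tables")
    for path in tables[:5]:
        print("  " + str(path))
```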
Set Environment Variables.

On macOS or Linux:

```
export FLASK_APP=main.py
export FLASK_DEBUG=1
```

On Windows:
```
set FLASK_APP=main.py
set FLASK_DEBUG=1
```
If you want to turn off the debug mode, set FLASK_DEBUG=0.

Start Flask Application in a terminal.
```
python main.py
```
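
The two steps above, setting the variables and starting the application, can also be combined in a small cross-platform launcher. This is a sketch and not part of the repository; the function names `flask_env` and `launch` are illustrative, and the script is assumed to run from the repository root.

```python
import os
import subprocess
import sys


def flask_env(debug=True, base_env=None):
    """Return a copy of the environment with the two Flask variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["FLASK_APP"] = "main.py"
    env["FLASK_DEBUG"] = "1" if debug else "0"
    return env


def launch(debug=True):
    """Start main.py with the current interpreter and the Flask variables set."""
    # Equivalent to exporting the variables and then running `python main.py`.
    return subprocess.run([sys.executable, "main.py"],
                          env=flask_env(debug), check=True)
```

Calling `launch(debug=False)` corresponds to setting FLASK_DEBUG=0 before starting the application.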
Open the link shown in the terminal in any web browser that supports HTML5 and JavaScript.

Citation

@inproceedings{DBLP:conf/sigmod/KhatiwadaSM23,
  author       = {Aamod Khatiwada and
                  Roee Shraga and
                  Ren{\'{e}}e J. Miller},
  title        = {{DIALITE:} Discover, Align and Integrate Open Data Tables},
  booktitle    = {Companion of the 2023 International Conference on Management of Data,
                  {SIGMOD/PODS} 2023, Seattle, WA, USA, June 18-23, 2023},
  pages        = {187--190},
  publisher    = {{ACM}},
  year         = {2023},
  url          = {https://doi.org/10.1145/3555041.3589732},
  doi          = {10.1145/3555041.3589732}
}
