This repository contains the implementation codes for our demo paper: DIALITE, presented at SIGMOD 2023.
Authors: Aamod Khatiwada, Roee Shraga and Renée J. Miller
Paper and demonstration video: https://dl.acm.org/doi/10.1145/3555041.3589732
Link to the demo website: https://tinyurl.com/dialite-sigmod
Block Diagram of DIALITE System
We demonstrate a novel table discovery pipeline called DIALITE that allows users to discover, integrate and analyze open data tables. DIALITE has three main stages. First, it allows users to discover tables from open data platforms using state-of-the-art table discovery techniques. Second, DIALITE integrates the discovered tables to produce an integrated table. Finally, it allows users to analyze the integration result by applying different downstreaming tasks over it. Our pipeline has a flexible architecture such that the user can easily add and compare additional discovery and integration algorithms.
- alite folder contains ALITE codes adopted from its original implementation.
- data folder contains the sub-folders for sample datasets and placeholder for additional datasets.
- dialite.png file shows the block diagram of DIALITE system.
- join folder contains the joinability system codes.
- santos folder contains SANTOS codes and indexes adopted from its original implementation.
- README.md file explains the repository.
- requirements.txt file contains necessary packages to run the project.
- templates folder contains frontend code for the demo website.
- yago folder is placeholder for YAGO knowledge base files.
- *.ipynb files are example notebooks to run the demo without using web api.
- *.py files contain python flask backend codes for the demo website.
-
Clone the repo
-
CD to the repo directory. Create and activate a virtual environment for this project. We recommend using python version 3.7 or higher.
- On macOS or Linux:
python3 -m venv env source env/bin/activate which python - On windows:
python -m venv env .\env\Scripts\activate.bat where.exe python
- Install necessary packages.
pip install -r requirements.txt
-
CD to the repo.
-
Run the following command that downloads preprocessed yago files from this link and uploads them to yago folder.
cd yago && gdown --folder https://drive.google.com/drive/folders/1FhvwxE0_iDO8Xy4jI7uq7roZSNXOJGr1 && mv yago/* ../ && rm -r yago && cd ../ -
Preprocess your data lake using SANTOS and upload the indexes to santos/hashmap folder. All index file names must start with: dialite_datalake as shown below.
dialite_datalake_main_relation_index.pickle dialite_datalake_main_triple_index.pickle dialite_datalake_main_yago_index.pickle dialite_datalake_synth_relation_inverted_index.pbz2 dialite_datalake_synth_relation_kb.pbz2 dialite_datalake_synth_type_inverted_index.pbz2 dialite_datalake_synth_type_kb.pbz2Alternatively, you can also run the following command that downloads the preprocessed indexes for SANTOS Small Benchmark from this link.
cd santos/hashmap && gdown --folder https://drive.google.com/drive/folders/1-1smQ5aD6iZLQcvdW6l_n2RhjhzY1UT_ && mv santos_hashmap/* ../ && rm -r santos_hashmap && cd ../../ -
Upload your data lake tables to data/dialite_datalake folder. You can run the following command that downloads SANTOS Small Benchmark and use it as a data lake. Note that we also provide the preprocessed indexes in the previous step for this benchmark.
cd data/dialite_datalake && zenodo_get 7758091 && unzip santos_benchmark.zip && cd santos_benchmark && mv * ../ && cd ../ && rm -r santos_benchmark && rm *.zip & cd ../ -
Set Environment Variables.
-
On macOS or Linux:
export FLASK_APP=main.py export FLASK_DEBUG=1 -
On Windows:
set FLASK_APP=main.py set FLASK_DEBUG=1If you want to turn off the debug mode, set FLASK_DEBUG=0.
-
Start Flask Application in a terminal.
python main.py -
Open the link shown in the terminal using any web browser. The browser must support HTML5 and javascript.
@inproceedings{DBLP:conf/sigmod/KhatiwadaSM23,
author = {Aamod Khatiwada and
Roee Shraga and
Ren{\'{e}}e J. Miller},
title = {{DIALITE:} Discover, Align and Integrate Open Data Tables},
booktitle = {Companion of the 2023 International Conference on Management of Data,
{SIGMOD/PODS} 2023, Seattle, WA, USA, June 18-23, 2023},
pages = {187--190},
publisher = {{ACM}},
year = {2023},
url = {https://doi.org/10.1145/3555041.3589732},
doi = {10.1145/3555041.3589732}
}