The source code and the artifacts required to reproduce the results of the paper “Discovering Similarity Inclusion Dependencies” [1] are available at https://github.com/HPI-Information-Systems/Sawfish.
First, clone the repository on your machine by running:

```sh
git clone https://github.com/HPI-Information-Systems/Sawfish.git
```
The repository uses Git LFS for the datasets, so make sure that Git LFS is installed. You can download it from https://git-lfs.com.
Ensure that the LFS file has been downloaded correctly by checking the size of `datasets/SawfishDatasets.zip`. It might be necessary to run `git lfs pull` yourself. You do not need to extract the ZIP file; this is handled automatically later.
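A minimal verification sketch (Git LFS pointer files are plain-text stubs of roughly 130 bytes, so a tiny ZIP means the real data has not been fetched yet):

```sh
git lfs install                      # one-time setup of the Git LFS hooks
git lfs pull                         # replace LFS pointer files with the actual data
ls -lh datasets/SawfishDatasets.zip  # a file of only ~130 bytes is an unresolved pointer
```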
In case you hit an LFS rate-limit error, you can download the datasets from here instead. Simply place the ZIP in the `datasets/` directory.
We do not require any special hardware, but we use a main-memory threshold of 32 GB for the largest dataset.
The repository is structured as follows:

- `src/` contains the source code of the Sawfish algorithm, including the algorithm itself (`sawfish/main/java/de/metanome/algorithms/sawfish`) and a Metanome mock version to test execution (`sawfish/test/java/de/metanome/algorithms/mock`). Additionally, it contains all baselines from the paper.
- `docker-compose.yml` lists the Docker images that are used to execute and test the Sawfish algorithm in different ways.
- The `datasets/` folder includes the four datasets used to reproduce the results in the paper.
- The `metanome/` folder contains the compiled JAR of the algorithm (created with `mvn clean package`) and a version of the metanome-cli (obtained from the central repository https://github.com/HPI-Information-Systems/Metanome).
- The `plot_generation/` folder includes all scripts to generate the plot output data.
- The `paper_generation/` folder includes all scripts to generate the final paper based on the results and generated plots. The paper is then created in `final_paper/`.
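At a glance, the layout looks roughly like this (abridged):

```
Sawfish/
├── docker-compose.yml     # Docker images for the individual steps
├── master-script.sh       # one-command reproduction script (see below)
├── src/                   # Sawfish algorithm and all baselines
├── datasets/              # SawfishDatasets.zip (Git LFS)
├── metanome/              # compiled algorithm JAR and metanome-cli
├── plot_generation/       # scripts that produce the plot output data
├── paper_generation/      # paper build scripts; figures/ appears after execution
└── final_paper/           # compiled paper PDF
```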
We offer different ways to execute Sawfish, but each of them requires `docker` and `docker-compose`. If they are not yet installed on your machine, please follow these instructions: https://docs.docker.com/engine/install/
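You can verify that both are available by running:

```sh
docker --version
docker-compose --version
```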
The methods differ in how much of the output you can customize.
1. Master Script
The Master Script is the least customizable option, but with a single command, all of the following is done:

- Fetch the required input data for Sawfish by unzipping `datasets/SawfishDatasets.zip`
- Execute all experiments for the datasets used in the paper (this may take up to a week to finish; a single experiment should take at most around two hours). Since the IMDB dataset consistently timed out in ED mode, those experiments are excluded altogether to save time.
- Generate all plots & graphs that can be found in the paper (visible in the `paper_generation/figures/` directory after execution)
- Fully compile the paper with all new statistics, graphs & plots (visible as `final_paper/paperSINDsKaminsky.pdf` after execution)
The master script is the shell script `master-script.sh`.
If you are on macOS or Linux, do the following steps:

- Make the script executable by running `chmod +x master-script.sh`
- Execute `master-script.sh` by running `sh master-script.sh`
If you are on Windows, do the following steps:

- Ensure that you are using Git Bash, Cygwin, or Windows Subsystem for Linux (WSL).
- Make the script executable by running `chmod +x master-script.sh`
- Execute `master-script.sh` by running `./master-script.sh`
2. Run each step separately
You can also run each step separately, each with its own Docker image:

- To extract the datasets, use `docker-compose up sawfish-datasets-extraction`
- To run all the scripts that generate the results, execute `docker-compose up sawfish-result-generation`
- After generating the results, you can generate the plots seen in the paper with `docker-compose up sawfish-plot-generation`
- To generate the final paper with the new plots, execute `docker-compose up sawfish-paper-generation`
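Taken together, a full step-by-step reproduction looks like this; the stages build on each other's output, so keep this order:

```sh
docker-compose up sawfish-datasets-extraction  # unzip datasets/SawfishDatasets.zip
docker-compose up sawfish-result-generation    # run all experiments (may take days)
docker-compose up sawfish-plot-generation      # generate the plots from the results
docker-compose up sawfish-paper-generation     # compile final_paper/paperSINDsKaminsky.pdf
```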
The project contains default values for most configuration options. To run the algorithm, only an input file and the edit distance threshold are required.
Other configuration options are listed below:
- `editDistanceThreshold`: absolute edit distance threshold
- `similarityThreshold`: Jaccard similarity threshold / normalized edit distance threshold
- `tokenMode`: turns on token mode, which computes the Jaccard similarity instead of the edit distance
- `ignoreShortString`: ignores strings that are shorter than the edit distance threshold; only required for the comparison to PassJoin
- `memoryCheckFrequency`: number of values processed until a memory check occurs
- `maxMemoryUsagePercentage`: percentage of available memory SAWFISH should use
- `writeDataErrors`: creates an output file with the indirect, i.e., non-equal, matches that were found during validation
- `measureTime`: creates an output file with fine-grained time statistics
- `ignoreNumericColumns`: ignores columns that contain only numeric values
- `hybridMode`: turns on the hybrid mode of the normalized edit distance computation in SAWFISH
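If you want to run the algorithm manually instead of through Docker, these options can be passed via the bundled metanome-cli as `--algorithm-config` key:value pairs. The following is only a sketch: the JAR file names, the algorithm class name (only the package `de.metanome.algorithms.sawfish` is known from the repository layout), the file key, and the input path are assumptions for illustration; check the `metanome/` folder and the mock tests for the actual values.

```sh
# Sketch only: JAR names, main class, file key, and input path are assumptions.
java -cp metanome/metanome-cli.jar:metanome/sawfish.jar de.metanome.cli.App \
  --algorithm de.metanome.algorithms.sawfish.SawfishAlgorithm \
  --files datasets/example.csv \
  --file-key INPUT_FILES \
  --algorithm-config editDistanceThreshold:2 ignoreNumericColumns:true
```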