Skip to content

Integrate new methods

Kyriakos Psarakis edited this page Mar 2, 2020 · 2 revisions

Introduction

This is a guide on how to add new methods to the Valentine suite.

Steps

  1. Create a new branch from master.

  2. Add your implementation inside the algorithms folder in a folder of its own.

  3. The "main" class of your algorithm must extend the BaseMatcher class.

The base matcher implemented method get_matches takes as an input the two data loaders (source and target) and the dataset name. Based on the type of your method (schema-based, instance-based or both) select the appropriate data loader from the data_loader module. There is also the BaseLoader that just contains the paths of the schema and instance files if you do not want to use one of the given ones.

  1. Implement the get_matches method that returns a dictionary entry with a tuple as a key that looks like this (column1, column2) and a value that is a similarity metric between the two columns.

Important: Make sure that the column names are formatted as a tuple in the following way: (table_name, column_name). This is needed because two columns might have the same name in two different tables.

Important: We need a similarity metric because we sort the results in descending order. If you have a distance metric use the following formula to convert it to a similarity metric 1/(1+D).

  1. Import the class that you extended inside the algorithms.py script so that it is visible to the framework.

  2. Create a .json file with the grid-search that you would like to be performed on the parameters of your method. In this file, you also have to specify the data loader type used in the get_matches method of the "main" class (step 3). Take as an example the algorithm_configurations.json file. In this file, two types of parameters are supported:

    1. range: where you have to specify the max, min and step
    2. values: where you create a list of all the possible values that you want to run the grid search on.

A good example to start with is the Jaccard Levenshtein implementation due to its simplicity.

If you have a non-python implementation take a look at the wrapper around COMA which was a java project.

Clone this wiki locally