Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add batch matching method addressing #62 #67

Merged
merged 1 commit into from
Oct 13, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
[![PyPI version](https://badge.fury.io/py/valentine.svg)](https://badge.fury.io/py/valentine)
![PyPI - Downloads](https://img.shields.io/pypi/dm/valentine)
[![Python 3.8+](https://img.shields.io/badge/python-3.8|3.9|3.10|3.11-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/85cfebfc9c6a43359c5b2e56a5fdf3a3)](https://app.codacy.com/gh/delftdata/valentine/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)

A python package for capturing potential relationships among columns of different tabular datasets, which are given in the form of pandas DataFrames. Valentine is based on [Valentine: Evaluating Matching Techniques for Dataset Discovery](https://ieeexplore.ieee.org/abstract/document/9458921)

Expand Down Expand Up @@ -70,15 +71,25 @@ In order to do so, the user can choose one of the following 5 matching methods:
* **coeff_policy**(*str*) - Policy for deciding the weight coefficients of the propagation graph. Choice of "inverse\_product" or "inverse\_average" (default).
* **formula**(*str*) - Formula on which iterative fixpoint computation is based. Choice of "basic", "formula\_a", "formula\_b" and "formula\_c" (default).

### Matching DataFrames
### Matching DataFrame Pair

After selecting one of the 5 matching methods, the user can initiate the matching process in the following way:
After selecting one of the 5 matching methods, the user can initiate the pairwise matching process in the following way:

```python
matches = valentine_match(df1, df2, matcher, df1_name, df2_name)
```

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a dictionary storing as keys column pairs from the two DataFrames and as keys the corresponding similarity scores.
where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a dictionary storing as keys column pairs from the two DataFrames and as values the corresponding similarity scores.

### Matching DataFrame Batch

After selecting one of the 5 matching methods, the user can initiate the batch matching process in the following way:

```python
matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)
```

where df_iter_1 and df_iter_2 are the two iterable structures containing pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function ```valentine_match_batch``` returns a dictionary storing as keys column pairs from the DataFrames and as values the corresponding similarity scores.

### Measuring effectiveness

Expand Down
22 changes: 20 additions & 2 deletions tests/test_valentine.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@

from valentine.data_sources import DataframeTable

from valentine import valentine_match, valentine_metrics, NotAValentineMatcher
from valentine import valentine_match, valentine_match_batch, valentine_metrics, NotAValentineMatcher
from tests import df1, df2
from valentine.algorithms import Coma
from valentine.algorithms import Coma, DistributionBased


class TestValentine(unittest.TestCase):
Expand All @@ -28,3 +28,21 @@ def test_metrics(self):
('EID', 'EID')]
metrics = valentine_metrics.all_metrics(matches, golden_standard)
assert metrics['recall_at_sizeof_ground_truth'] == 1.0

def test_batch_generator(self):
n = 3

def generate_df1():
for _ in range(n):
yield df1

def generate_df2():
for _ in range(n):
yield df2

matches = valentine_match_batch(generate_df1(), generate_df2(), DistributionBased())
assert len(matches) > 0

def test_batch_list(self):
matches = valentine_match_batch([df1, df1, df1], [df2, df2, df2], DistributionBased())
assert len(matches) > 0
44 changes: 37 additions & 7 deletions valentine/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
from typing import Iterable, List, Union

import pandas as pd

import valentine.metrics as valentine_metrics
Expand All @@ -9,17 +11,45 @@ class NotAValentineMatcher(Exception):
pass


def validate_matcher(matcher):
if not isinstance(matcher, valentine.algorithms.BaseMatcher):
raise NotAValentineMatcher('The method that you selected is not supported by Valentine')


def valentine_match(df1: pd.DataFrame,
df2: pd.DataFrame,
matcher: valentine.algorithms.BaseMatcher,
df1_name: str = 'table_1',
df2_name: str = 'table_2'):
if isinstance(matcher, valentine.algorithms.BaseMatcher):
table_1 = valentine.data_sources.DataframeTable(df1, name=df1_name)
table_2 = valentine.data_sources.DataframeTable(df2, name=df2_name)
matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
key=lambda item: item[1], reverse=True))
else:
raise NotAValentineMatcher('The method that you selected is not supported by Valentine')

validate_matcher(matcher)

table_1 = valentine.data_sources.DataframeTable(df1, name=df1_name)
table_2 = valentine.data_sources.DataframeTable(df2, name=df2_name)
matches = dict(sorted(matcher.get_matches(table_1, table_2).items(),
key=lambda item: item[1], reverse=True))

return matches


def valentine_match_batch(df_iter_1: Iterable[pd.DataFrame],
df_iter_2: Iterable[pd.DataFrame],
matcher: valentine.algorithms.BaseMatcher,
df_iter_1_names: Union[List[str], None] = None,
df_iter_2_names: Union[List[str], None] = None):

validate_matcher(matcher)

matches = {}

for df1_idx, df1 in enumerate(df_iter_1):
for df2_idx, df2 in enumerate(df_iter_2):
table_1_name = df_iter_1_names[df1_idx] if df_iter_1_names is not None else f'table_1_{df1_idx}'
table_2_name = df_iter_2_names[df2_idx] if df_iter_2_names is not None else f'table_2_{df2_idx}'
table_1 = valentine.data_sources.DataframeTable(df1, name=table_1_name)
table_2 = valentine.data_sources.DataframeTable(df2, name=table_2_name)
matches.update(matcher.get_matches(table_1, table_2))

matches = dict(sorted(matches.items(), key=lambda item: item[1], reverse=True))

return matches
Loading