API Refactor - MatcherResults and metrics (#70)
Archer6621 committed Jan 31, 2024
1 parent 96430e7 commit dd15f95
Showing 13 changed files with 659 additions and 341 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -5,4 +5,6 @@ __pycache__/
dist
valentine.egg-info
build
.vscode/
.vscode/
valentine.sublime-workspace
valentine.sublime-project
64 changes: 45 additions & 19 deletions README.md
@@ -76,7 +76,7 @@ After selecting one of the 5 matching methods, the user can initiate the pairwis
matches = valentine_match(df1, df2, matcher, df1_name, df2_name)
```

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a dictionary storing as keys column pairs from the two DataFrames and as values the corresponding similarity scores.
where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a MatcherResults object, which is a dictionary with additional convenience methods, such as `one_to_one`, `take_top_percent`, `get_metrics` and more. It stores as keys column pairs from the two DataFrames and as values the corresponding similarity scores.
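
As a minimal illustration (assuming `matches` was produced by the call above), the result reads like any other dictionary:

```python
# `matches` maps ((table_name, column_name), (table_name, column_name)) pairs
# to similarity scores and is sorted from high to low upon instantiation.
for (col_1, col_2), similarity in matches.items():
    print(col_1, col_2, similarity)
```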

### Matching DataFrame Batch

@@ -86,23 +86,48 @@ After selecting one of the 5 matching methods, the user can initiate the batch m
matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)
```

where df_iter_1 and df_iter_2 are the two iterable structures containing pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function ```valentine_match_batch``` returns a dictionary storing as keys column pairs from the DataFrames and as values the corresponding similarity scores.
where df_iter_1 and df_iter_2 are the two iterable structures containing pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function ```valentine_match_batch``` returns a MatcherResults object, which is a dictionary with additional convenience methods, such as `one_to_one`, `take_top_percent`, `get_metrics` and more. It stores as keys column pairs from the two DataFrames and as values the corresponding similarity scores.
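
A minimal sketch of a batch call (the file names and table names here are hypothetical, and `valentine_match_batch` is assumed to be exported from the top-level `valentine` package like `valentine_match`):

```python
import pandas as pd
from valentine import valentine_match_batch   # assumed top-level export, like valentine_match
from valentine.algorithms import JaccardDistanceMatcher

# Hypothetical input files; plain lists are valid iterables of DataFrames.
df_iter_1 = [pd.read_csv('authors1.csv'), pd.read_csv('papers1.csv')]
df_iter_2 = [pd.read_csv('authors2.csv'), pd.read_csv('papers2.csv')]

matches = valentine_match_batch(df_iter_1, df_iter_2, JaccardDistanceMatcher(),
                                ['authors_1', 'papers_1'],   # hypothetical names for df_iter_1
                                ['authors_2', 'papers_2'])   # hypothetical names for df_iter_2
```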

### Measuring effectiveness

Based on the matches retrieved by calling `valentine_match` the user can use
### MatcherResults instance
The `MatcherResults` instance has some convenience methods that the user can use to either obtain a subset of the data or to transform the data. This instance is a dictionary and is sorted upon instantiation, from high similarity to low similarity.
```python
top_n_matches = matches.take_top_n(5)

top_n_percent_matches = matches.take_top_percent(25)

one_to_one_matches = matches.one_to_one()
```


### Measuring effectiveness
The MatcherResults instance that is returned by `valentine_match` or `valentine_match_batch` also has a `get_metrics` method that the user can use

```python
metrics = valentine_metrics.all_metrics(matches, ground_truth)
metrics = matches.get_metrics(ground_truth)
```

in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others as described in the original Valentine paper. In order to do so, the user needs to also input the ground truth of matches based on which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold.
in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others as described in the original Valentine paper. In order to do so, the user needs to also input the ground truth of matches based on which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold (see example below).

By default, all the core metrics are used with their default parameters, but the user can also customize which metrics to run and with which parameters, and implement their own custom metrics by extending the `Metric` base class (a sketch of such a subclass follows the code block below). Some predefined sets of metrics are also available.

```python
from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)

```
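
The diff does not show the `Metric` base-class interface itself, so the subclass below is only a sketch: the import of `Metric`, the `apply` hook, and the returned `{name: score}` mapping are assumptions, while `get_tp_fn` comes from `valentine.metrics.metric_helpers` as used in the new tests. Metrics are passed around in sets and compared for equality, so a frozen dataclass is a natural fit.

```python
from dataclasses import dataclass

from valentine.metrics import Metric                    # assumed export of the base class
from valentine.metrics.metric_helpers import get_tp_fn  # helper used in the new tests


@dataclass(eq=True, frozen=True)
class RecallTopN(Metric):
    """Hypothetical custom metric: recall computed over the top-n matches only."""
    n: int = 10

    def apply(self, matches, ground_truth):  # assumed hook name on the Metric base class
        tp, fn = get_tp_fn(matches, ground_truth, n=self.n)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return {f"RecallTop{self.n}": recall}


# Usage, assuming `matches` and `ground_truth` as in the example below:
# matches.get_metrics(ground_truth, metrics={RecallTopN(n=5)})
```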


### Example
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (as found in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):

```python
import os
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
@@ -120,25 +145,26 @@ ground_truth = [('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')]

metrics = valentine_metrics.all_metrics(matches, ground_truth)
metrics = matches.get_metrics(ground_truth)

print(metrics)
```

The output of the above code block is:

```
{(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057}
{'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0,
'precision_at_10_percent': 1.0,
'precision_at_30_percent': 1.0,
'precision_at_50_percent': 1.0,
'precision_at_70_percent': 1.0,
'precision_at_90_percent': 1.0,
'recall_at_sizeof_ground_truth': 1.0}
{
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.86994505,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.8679843,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8571245
}
{
'Recall': 1.0,
'F1Score': 1.0,
'RecallAtSizeofGroundTruth': 1.0,
'Precision': 1.0,
'PrecisionTop10Percent': 1.0
}
```

## Cite Valentine
36 changes: 25 additions & 11 deletions examples/valentine_example.py
@@ -1,8 +1,10 @@
import os
import pandas as pd
from valentine import valentine_match, valentine_metrics
from valentine.algorithms import Coma
from valentine.metrics import F1Score, PrecisionTopNPercent
from valentine import valentine_match
from valentine.algorithms import JaccardDistanceMatcher
import pprint
pp = pprint.PrettyPrinter(indent=4, sort_dicts=False)


def main():
@@ -13,28 +15,40 @@ def main():
df2 = pd.read_csv(d2_path)

# Instantiate matcher and run
# Coma requires java to be installed on your machine
# If java is not an option, all the other algorithms are in Python (e.g., Cupid)
matcher = Coma(use_instances=False)
matcher = JaccardDistanceMatcher()
matches = valentine_match(df1, df2, matcher)

# MatcherResults is a wrapper object that has several useful
# utility/transformation functions
print("Found the following matches:")
pp.pprint(matches)

print("\nGetting the one-to-one matches:")
pp.pprint(matches.one_to_one())

# If ground truth is available, valentine can calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')]

metrics = valentine_metrics.all_metrics(matches, ground_truth)

pp = pprint.PrettyPrinter(indent=4)
print("Found the following matches:")
pp.pprint(matches)
metrics = matches.get_metrics(ground_truth)

print("\nAccording to the ground truth:")
pp.pprint(ground_truth)

print("\nThese are the scores of the matcher:")
print("\nThese are the scores of the default metrics for the matcher:")
pp.pprint(metrics)

print("\nYou can also get specific metric scores:")
pp.pprint(matches.get_metrics(ground_truth, metrics={
PrecisionTopNPercent(n=80),
F1Score()
}))

print("\nThe MatcherResults object is a dict and can be treated such:")
for match in matches:
print(f"{str(match): <60} {matches[match]}")


if __name__ == '__main__':
main()
86 changes: 86 additions & 0 deletions tests/test_matcher_results.py
@@ -0,0 +1,86 @@
import unittest
import math

from tests import df1, df2
from valentine.algorithms.matcher_results import MatcherResults
from valentine.algorithms import JaccardDistanceMatcher
from valentine.metrics import Precision
from valentine import valentine_match


class TestMatcherResults(unittest.TestCase):
def setUp(self):
self.matches = valentine_match(df1, df2, JaccardDistanceMatcher())
self.ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')
]

def test_dict(self):
assert isinstance(self.matches, dict)

def test_get_metrics(self):
metrics = self.matches.get_metrics(self.ground_truth)
assert all([x in metrics for x in {"Precision", "Recall", "F1Score"}])

metrics_specific = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
assert "Precision" in metrics_specific

def test_one_to_one(self):
m = self.matches

# Add multiple matches per column
pairs = list(m.keys())
for (ta, ca), (tb, cb) in pairs:
m[((ta, ca), (tb, cb + 'foo'))] = m[((ta, ca), (tb, cb))] / 2

# Verify that len gets corrected from 6 to 3
m_one_to_one = m.one_to_one()
assert len(m_one_to_one) == 3 and len(m) == 6

# Verify that none of the lower similarity "foo" entries made it
for (ta, ca), (tb, cb) in pairs:
assert ((ta, ca), (tb, cb + 'foo')) not in m_one_to_one

# Verify that the cache resets on a new MatcherResults instance
m_entry = MatcherResults(m)
assert m_entry._cached_one_to_one is None

# Add one new entry with lower similarity
m_entry[(('table_1', 'BLA'), ('table_2', 'BLA'))] = 0.7214057

# Verify that the new one_to_one is different from the old one
m_entry_one_to_one = m_entry.one_to_one()
assert m_one_to_one != m_entry_one_to_one

# Verify that all remaining values are above the median
median = sorted(list(m_entry.values()), reverse=True)[math.ceil(len(m_entry)/2)]
for k in m_entry_one_to_one:
assert m_entry_one_to_one[k] >= median

def test_take_top_percent(self):
take_0_percent = self.matches.take_top_percent(0)
assert len(take_0_percent) == 0

take_40_percent = self.matches.take_top_percent(40)
assert len(take_40_percent) == 2

take_100_percent = self.matches.take_top_percent(100)
assert len(take_100_percent) == len(self.matches)

def test_take_top_n(self):
take_none = self.matches.take_top_n(0)
assert len(take_none) == 0

take_some = self.matches.take_top_n(2)
assert len(take_some) == 2

take_all = self.matches.take_top_n(len(self.matches))
assert len(take_all) == len(self.matches)

take_more_than_all = self.matches.take_top_n(len(self.matches)+1)
assert len(take_more_than_all) == len(self.matches)

def test_copy(self):
assert self.matches.get_copy() is not self.matches
97 changes: 63 additions & 34 deletions tests/test_metrics.py
@@ -1,47 +1,76 @@
import unittest
from valentine.metrics import *
from valentine.algorithms.matcher_results import MatcherResults
from valentine.metrics.metric_helpers import get_fp, get_tp_fn

import math
from valentine.metrics.metrics import one_to_one_matches
from copy import deepcopy
class TestMetrics(unittest.TestCase):
def setUp(self):
self.matches = MatcherResults({
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057,
(('table_1', 'Title'), ('table_2', 'DUMMY1')): 0.8214057,
(('table_1', 'Title'), ('table_2', 'DUMMY2')): 0.8114057,
})
self.ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID'),
('Title', 'Title'),
('DUMMY3', 'DUMMY3')

matches = {
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057,
}
]

ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')
]
def test_precision(self):
precision = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
assert 'Precision' in precision and precision['Precision'] == 0.75

precision_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={Precision(one_to_one=False)})
assert 'Precision' in precision_not_one_to_one and precision_not_one_to_one['Precision'] == 0.6

class TestMetrics(unittest.TestCase):
def test_recall(self):
recall = self.matches.get_metrics(self.ground_truth, metrics={Recall()})
assert 'Recall' in recall and recall['Recall'] == 0.6

recall_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={Recall(one_to_one=False)})
assert 'Recall' in recall_not_one_to_one and recall_not_one_to_one['Recall'] == 0.6

def test_f1(self):
f1 = self.matches.get_metrics(self.ground_truth, metrics={F1Score()})
assert 'F1Score' in f1 and round(100*f1['F1Score']) == 67

f1_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={F1Score(one_to_one=False)})
assert 'F1Score' in f1_not_one_to_one and f1_not_one_to_one['F1Score'] == 0.6

def test_precision_top_n_percent(self):
precision_0 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=0)})
assert 'PrecisionTop0Percent' in precision_0 and precision_0['PrecisionTop0Percent'] == 0

def test_one_to_one(self):
m = deepcopy(matches)
precision_50 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=50)})
assert 'PrecisionTop50Percent' in precision_50 and precision_50['PrecisionTop50Percent'] == 1.0

# Add multiple matches per column
pairs = list(m.keys())
for (ta, ca), (tb, cb) in pairs:
m[((ta, ca), (tb, cb + 'foo'))] = m[((ta, ca), (tb, cb))] / 2
precision = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
precision_100 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=100)})
assert 'PrecisionTop100Percent' in precision_100 and precision_100['PrecisionTop100Percent'] == precision['Precision']

# Verify that len gets corrected to 3
m_one_to_one = one_to_one_matches(m)
assert len(m_one_to_one) == 3 and len(m) == 6
precision_70_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=70, one_to_one=False)})
assert 'PrecisionTop70Percent' in precision_70_not_one_to_one and precision_70_not_one_to_one['PrecisionTop70Percent'] == 0.75

# Verify that none of the lower similarity "foo" entries made it
for (ta, ca), (tb, cb) in pairs:
assert ((ta, ca), (tb, cb + 'foo')) not in m_one_to_one
def test_recall_at_size_of_ground_truth(self):
recall = self.matches.get_metrics(self.ground_truth, metrics={RecallAtSizeofGroundTruth()})
assert 'RecallAtSizeofGroundTruth' in recall and recall['RecallAtSizeofGroundTruth'] == 0.6

# Add one new entry with lower similarity
m_entry = deepcopy(matches)
m_entry[(('table_1', 'BLA'), ('table_2', 'BLA'))] = 0.7214057
def test_metric_helpers(self):
limit = 2
tp, fn = get_tp_fn(self.matches, self.ground_truth, n=limit)
assert tp <= len(self.ground_truth) and fn <= len(self.ground_truth)

m_entry_one_to_one = one_to_one_matches(m_entry)
fp = get_fp(self.matches, self.ground_truth, n=limit)
assert fp <= limit
assert tp == 2 and fn == 3 # Since we limit to 2 of the matches
assert fp == 0

# Verify that all remaining values are above the median
median = sorted(set(m_entry.values()), reverse=True)[math.ceil(len(m_entry)/2)]
for k in m_entry_one_to_one:
assert m_entry_one_to_one[k] >= median
def test_metric_equals(self):
assert PrecisionTopNPercent(n=10, one_to_one=False) == PrecisionTopNPercent(n=10, one_to_one=False)
assert PrecisionTopNPercent(n=10, one_to_one=False) != PrecisionTopNPercent(n=10, one_to_one=True)
assert PrecisionTopNPercent(n=10, one_to_one=False) != Precision()