New feature: Get top n columns #55

michaelkonstantinou · 2023-07-11T01:21:53Z

Resolves #52

As stated in issue #52, it would be useful to be able to get the top n similar columns when analyzing the data. Since the issue is still open, I decided to add this feature myself, as I could use it during my data preprocessing.

Solution

This pull request adds two new methods to the metrics.py file:

get_top_n_columns, which returns a complete dictionary of the top-n columns for each column in the two datasets
For instance: {('Table1', 'Authors'): ['Authors', 'AccessList']}
get_top_n_columns_for_column, which could be more useful in my opinion: it returns the top columns for a specific column of a dataset
For instance: What are the top two matches for column 'Access' of table 1?
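To make the first return shape concrete, here is a minimal sketch of the idea. It is not the code from this PR: the function name `top_n_columns_sketch` is hypothetical, and it assumes the matcher output is a plain dict that maps a pair of (table, column) tuples to a similarity score.

```python
from collections import defaultdict


def top_n_columns_sketch(matches, n=2):
    """Hypothetical helper, not the PR's implementation.

    Assumes `matches` looks like {((table, column), (table, column)): score}.
    For every column that appears in a pair, returns the names of its
    n best-scoring partner columns, best first.
    """
    partners = defaultdict(list)
    for (left, right), score in matches.items():
        # Record the pairing in both directions so columns of either
        # dataset can be looked up.
        partners[left].append((score, right[1]))
        partners[right].append((score, left[1]))
    return {
        column: [
            name
            for _, name in sorted(scored, key=lambda p: p[0], reverse=True)[:n]
        ]
        for column, scored in partners.items()
    }


# Made-up scores, for illustration only.
example_matches = {
    (("Table1", "Authors"), ("Table2", "Authors")): 0.95,
    (("Table1", "Authors"), ("Table2", "AccessList")): 0.40,
    (("Table1", "Authors"), ("Table2", "Year")): 0.10,
}
top = top_n_columns_sketch(example_matches, n=2)
print(top[("Table1", "Authors")])  # ['Authors', 'AccessList']
```

The per-column question ("top two matches for column 'Access' of table 1") is then just a lookup of one key in the returned dictionary.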

I am not quite sure what exactly the OP wanted or what the team would prefer, but at least a boilerplate is established, and it can easily be modified if more information should be added (e.g. adding a float value next to each column).

Additional changes

Added a new example to demonstrate the new feature. It uses a different algorithm, though, because COMA compares the names as well, so in this case the result might not be very informative.

Notes

I didn't find any existing test cases for metrics, which is why I didn't add one.
The code style complies with PEP 8.

I hope this is useful. Let me know if you prefer any changes or any additional functionality.

Archer6621 · 2023-09-18T19:54:34Z

Hello @Mikhail-Konstantinou, first of all, thank you for your contribution. It's great to see contributions being made from the outside.

Overall the code looks good!

I have a couple of comments for you to take into consideration:

I feel like the two methods have significant overlap in functionality. It would make sense if get_top_n_columns used get_top_n_columns_for_column somehow. Another option is to give get_top_n_columns a keyword argument that allows you to specify a list of specific columns of df1 to use for the top n in df2 (and by default have it pick all columns), so you could get rid of the second method that has an overly long name :)
It might also be nice to return a list of dicts, with the column name as key and the score as value, instead of just a list of column names. Doing this gives insight into the distribution of the scores. I think this is also what you suggested with the "add float value" remark.

EDIT: After a second look I dropped some of my comments, so I've adjusted the post.


michaelkonstantinou · 2023-10-22T19:10:58Z

@Archer6621

Hello, and thanks for your input. I believe the final changes address both of the suggestions you mentioned:

Indeed, get_top_n_columns_for_column is long and not needed anymore. I refactored get_top_n_columns to accept a list of keys. If no keys are given, it returns all columns by default, as you suggested. However, I changed it a bit: instead of choosing only which columns from df1 you want, you can choose columns from either df1 or df2. I believe this is stronger, more flexible and cleaner.
Yes, it now returns a list of dicts. The column name is the key and the score is the value.
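The refactored behaviour described above could look roughly like the sketch below. This is an illustration, not the merged code: the name `top_n_with_scores_sketch`, the `keys` parameter name, and the assumed match format {((table, column), (table, column)): score} are all assumptions made for the example.

```python
def top_n_with_scores_sketch(matches, n=2, keys=None):
    """Hypothetical sketch, not the PR's implementation.

    `keys` is an optional list of (table, column) tuples taken from either
    dataset; None means "all columns". Each selected column maps to a list
    of {partner_column_name: score} dicts, best score first.
    """
    wanted = None if keys is None else set(keys)
    collected = {}
    for (left, right), score in matches.items():
        # Look at the pair from both sides so keys from df1 or df2 work.
        for column, partner in ((left, right), (right, left)):
            if wanted is None or column in wanted:
                collected.setdefault(column, []).append({partner[1]: score})
    return {
        column: sorted(
            entries,
            key=lambda entry: next(iter(entry.values())),
            reverse=True,
        )[:n]
        for column, entries in collected.items()
    }


# Made-up scores, for illustration only.
example_matches = {
    (("Table1", "Access"), ("Table2", "AccessList")): 0.8,
    (("Table1", "Access"), ("Table2", "Authors")): 0.3,
    (("Table1", "Access"), ("Table2", "Year")): 0.1,
    (("Table1", "Authors"), ("Table2", "Authors")): 0.9,
}
result = top_n_with_scores_sketch(
    example_matches, n=2, keys=[("Table1", "Access")]
)
print(result)
# {('Table1', 'Access'): [{'AccessList': 0.8}, {'Authors': 0.3}]}
```

Passing a key from the second table, such as `("Table2", "Authors")`, works the same way, which is the flexibility mentioned above. Keeping the scores next to the names makes it easy to see how far apart the candidates are.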

PS. I checked the conflicting files that GitHub complains about, and they are not related to this function. I believe you can merge it easily by selecting the lines of code you think are correct.

michaelkonstantinou added 2 commits July 11, 2023 03:01

Added: Return top n columns metrics

43c2986

Updated: Comments

6c8ac55

michaelkonstantinou added 2 commits October 22, 2023 19:19

Added: Score value in get_top_n_columns result

ae8442d

Refactored: get_top_n_columns accepts keys and get_top_n_columns_for_…

4976891

…column has been deleted

Archer6621 mentioned this pull request Jan 24, 2024

API Refactor - MatcherResults and metrics #70

Merged
