Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: Get top n columns #55

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

michaelkonstantinou
Copy link

Resolves #52

As stated in issue #52 , it would be useful to be able to get the top n similar columns when analyzing the data. Since the issue is still open, I decided to add this feature myself as I could use it during my data preprocessing

Solution

This pull request adds two new methods into the metrics.py file

  • get_top_n_columns which returns a complete dictionary of all top-n columns for each column in the two datasets
  • For instance: {('Table1', 'Authors'): ['Authors, 'AccessList']}
  • get_top_n_columns_for_column Something that could be more useful in my opinion, to return the top columns for a specific dataset
  • For instance: What are the top two matches for column 'Access' of table 1?

I am not quite sure what exactly the OP wanted or what the team would prefer to, but at least a boilerplate is established and in case more information should be added that can be easily modified. (e.g. add float value next to it)

Additional changes

Added a new example to demonstrate the new feature. It uses a different algorithm though as COMA compares the names as well and in this case it might not be much informative

Notes

  • I didn't find a test case for metrics that's why I didn't add one
  • The code style being used complies with PEP-8

I hope this is useful. Let me know if you prefer any changes or any additional functionality.

@Archer6621
Copy link
Contributor

Archer6621 commented Sep 18, 2023

Hello @Mikhail-Konstantinou , first of all thank you for your contribution, it's great to see contributions from the outside being made.

Overall the code looks good!

I have a couple of comments for you to take into consideration:

  • I feel like the two methods have significant overlap in functionality. It would make sense if get_top_n_columns would use get_top_n_columns_for_column somehow. Another option is to provide get_top_n_columns with a keyword argument that allows to you specify a list of specific columns of df1 to use for top n in df2 (and by default have it pick all columns), so you could get rid of the second method that has an overly long name :)

  • Maybe it's nice to have a list of dicts, with column name as key and score as value, instead of just a list of column names. Doing this gives insight into the distribution of the scores. I think this is also what you suggested with the "add float value" remark.

EDIT: After a second look I dropped some of my comments, so I've adjusted the post.

@michaelkonstantinou
Copy link
Author

@Archer6621

Hello and thanks for your input. I believe the final changes solve both of the issues/suggestions you mentioned

  1. Indeed, get_top_n_columns_for_column is long and not needed anymore. I refactored get_top_n_columns to accept a list of keys. If not, it returns all columns by default as you suggested. However, I changed it a bit and instead of choosing which columns from df1 you want... you can choose which columns you want either from df1 or from df2. I believe the latter is stronger, more flexible and cleaner
  2. Yes, now it returns a list of dicts. The column name is the key and the score is the value

PS. I checked the conflicting files that github complains about, and they are not related to this function. I believe you can merge it easily by selecting the line of code you think is correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Top n matches
2 participants