Appreciation and Curiosity #2

Open
KevinDanikowski opened this issue Feb 26, 2021 · 1 comment

Comments

@KevinDanikowski

Just wanted to say that I think this is an amazing package you created. I'm really curious what sources you used for the pre-processing. I've found various resources that support everything you're doing, but I haven't found a single succinct approach like yours anywhere else.

@hhhhhhhhhn
Owner

First of all, sorry for the very late reply, and thanks for the appreciation.

For the pre-processing, stop words are removed, the remaining words are stemmed using Snowball stemmers, and the result is divided into n-grams. After that, n-grams that match in both texts are clustered together based on their Chebyshev distance, and each cluster is given a score equal to the match length times its density.
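
A rough Python sketch of those steps, for illustration only; it is not the package's actual code. Several things here are assumptions: the tiny `STOP_WORDS` list, the identity `stem` default (a real Snowball stemmer would be passed in instead), the `max_gap` clustering threshold, and the reading of "length" as the number of matched n-grams and "density" as that count divided by the cluster's span. All function names are hypothetical.

```python
from collections import defaultdict

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text, stem=lambda w: w):
    """Lowercase, strip punctuation, drop stop words, then stem.
    `stem` defaults to identity; plug in a Snowball stemmer here."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matches(a_tokens, b_tokens, n):
    """Pairs (i, j) where n-gram i of text A equals n-gram j of text B."""
    index = defaultdict(list)
    for j, gram in enumerate(ngrams(b_tokens, n)):
        index[gram].append(j)
    return [(i, j)
            for i, gram in enumerate(ngrams(a_tokens, n))
            for j in index.get(gram, ())]

def chebyshev(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def cluster(points, max_gap):
    """Group points that are chained together by Chebyshev distance <= max_gap."""
    unvisited = set(points)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            p = frontier.pop()
            near = [q for q in unvisited if chebyshev(p, q) <= max_gap]
            unvisited.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        clusters.append(sorted(group))
    return sorted(clusters)

def score(group):
    """Assumed scoring: matched n-gram count times (count / span)."""
    span = max(max(p[k] for p in group) - min(p[k] for p in group) + 1
               for k in (0, 1))
    return len(group) * (len(group) / span)

a = preprocess("The quick brown fox jumps over the lazy dog.")
b = preprocess("A quick brown fox leaps over a lazy dog.")
pairs = matches(a, b, n=2)
for group in cluster(pairs, max_gap=1):
    print(group, score(group))
```

Using the Chebyshev distance means two matches count as neighbours only when they are close in both texts at once, so a cluster corresponds to a passage that appears in roughly the same order in each document.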
