Investigate non-exact / fuzzy matches #11

Open
johann-petrak opened this issue Jun 18, 2018 · 4 comments


@johann-petrak
Collaborator

From johann-petrak/gateplugin-StringAnnotation#10
Try to at least support matches where certain characters, e.g. hyphens, are treated as equivalent to embedded whitespace.
This could perhaps be implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.
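
A minimal sketch of the idea, in Python rather than the plugin's own code and not based on its actual trie: entries are stored in normalized form, and the text is normalized on the fly while walking the trie, so that reported offsets still refer to the original text. The set of space-like characters and the names `SPACE_LIKE`, `build_trie`, `longest_match` and `find_all` are assumptions made for illustration.

```python
# Characters treated as equivalent to a space (an assumption for this sketch).
SPACE_LIKE = str.maketrans({"-": " ", "_": " ", "\u2010": " "})


def normalize(s):
    """Map space-like characters to spaces, collapse whitespace, lowercase."""
    return " ".join(s.translate(SPACE_LIKE).lower().split())


def build_trie(entries):
    """Build a character trie over the normalized entries."""
    root = {}
    for entry in entries:
        node = root
        for ch in normalize(entry):
            node = node.setdefault(ch, {})
        node["$end"] = entry  # multi-character key, cannot clash with a character
    return root


def longest_match(trie, text, start):
    """Return (start, end, entry) for the longest match at `start`, or None."""
    node, i, best, prev_space = trie, start, None, False
    while i < len(text):
        ch = text[i].translate(SPACE_LIKE).lower()
        if ch.isspace():
            if prev_space:  # collapse runs of space-like characters
                i += 1
                continue
            ch, prev_space = " ", True
        else:
            prev_space = False
        node = node.get(ch)
        if node is None:
            break
        i += 1
        if "$end" in node:
            best = (start, i, node["$end"])
    return best


def find_all(trie, text):
    """Scan the text left to right, keeping non-overlapping longest matches."""
    matches, i = [], 0
    while i < len(text):
        m = longest_match(trie, text, i)
        if m:
            matches.append(m)
            i = m[1]
        else:
            i += 1
    return matches


trie = build_trie(["New York", "follow-up"])
matches = find_all(trie, "a follow up in New-York")
```

The sketch does not check word boundaries, so a real implementation would still need to decide whether matches may start or end inside a token.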

Also, see if we can use gateplugin-ft-distance (GateNLP private so far)

@johann-petrak
Collaborator Author

From johann-petrak/gateplugin-StringAnnotation#13
There are various possible approaches, several of them based on tries.
See also http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste/index.html

@thomas-heitz

thomas-heitz commented Aug 26, 2018

Hello Johann,

I'm using your Extended Gazetteer with several millions of entries and it works very well, thanks for that!

To contribute to the ideas about non-exact matches, I can share my method:

  • create different versions of the gazetteer directory from a source gazetteer directory with a Java script
  • for example, one version with normalized punctuation, then others with normalized accents, stemmed forms, metaphone codes, etc.
  • point to the needed gazetteer in the .def file: for entities this will be the punctuation and/or accent version, for concepts the stemmed version
  • add a Jape grammar, using the same code as the Java script above, before the Extended Gazetteer to create one Token feature per gazetteer type, for example: normalized, punctuation, stemmed, etc.
  • set the feature in the parameters of the Extended Gazetteer

This method makes it possible to choose the best normalization for each gazetteer, and it is easy to change, since all gazetteer types are always generated by the Java script and the corresponding Token features by the Jape grammar.
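
The core of this approach is that one set of normalization functions is applied both to the gazetteer entries and to the tokens. The following is only a rough sketch of that idea in Python, not the Java script or Jape grammar described above; the names `NORMALIZERS`, `build_variants` and `token_features` are hypothetical, and only punctuation and accent normalization are shown because stemming and metaphone would need an external library.

```python
import string
import unicodedata


def strip_accents(s):
    """Remove combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_punctuation(s):
    """Replace ASCII punctuation with spaces and collapse whitespace."""
    table = str.maketrans({c: " " for c in string.punctuation})
    return " ".join(s.translate(table).split())


# One normalizer per gazetteer type; the same functions are used for tokens.
NORMALIZERS = {
    "punctuation": strip_punctuation,
    "accent": strip_accents,
    "punctuation_accent": lambda s: strip_accents(strip_punctuation(s)),
}


def build_variants(entries):
    """Map each gazetteer type to {normalized entry: original entry}.

    Entries that normalize to the same string overwrite each other here;
    a real version would have to keep all of them.
    """
    return {
        name: {fn(entry): entry for entry in entries}
        for name, fn in NORMALIZERS.items()
    }


def token_features(token_string):
    """Compute one feature per gazetteer type for a token string."""
    return {name: fn(token_string) for name, fn in NORMALIZERS.items()}


variants = build_variants(["Société Générale", "Jean-Paul"])
features = token_features("Générale")
```

Because both sides go through the same functions, switching a gazetteer to a different normalization only means pointing it at a different variant and the matching feature name.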

@johann-petrak
Collaborator Author

Thanks, Thomas! Yes, this is a good method when it is possible to automatically generate most or all of the alternatives one wants to match.

However, sometimes users want to match in a way that cannot be predicted in advance, e.g. based on Levenshtein distance, phonetic similarity, or similar measures. If there is a well-defined distance metric between the strings, this can be implemented as an extension to the trie matching algorithm, but doing so is not easy.
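
To illustrate what such an extension could look like for Levenshtein distance, here is a small Python sketch of a well-known technique: walk the trie depth-first while carrying one row of the edit-distance table per node, and prune a branch as soon as every value in its row exceeds the allowed distance. It looks up whole strings only, not matches inside running text, and the names `build_trie` and `fuzzy_lookup` are hypothetical rather than part of the plugin.

```python
def build_trie(words):
    """Build a plain character trie; "$end" marks a complete entry."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = word
    return root


def fuzzy_lookup(trie, query, max_dist):
    """Return (entry, distance) pairs with Levenshtein distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost)) # substitution or match
        if "$end" in node and row[-1] <= max_dist:
            results.append((node["$end"], row[-1]))
        # Prune: if every cell is already too large, no descendant can match.
        if min(row) <= max_dist:
            for nxt, child in node.items():
                if nxt != "$end":
                    walk(child, nxt, row)

    first_row = list(range(len(query) + 1))
    for ch, child in trie.items():
        if ch != "$end":
            walk(child, ch, first_row)
    return sorted(results, key=lambda r: (r[1], r[0]))


trie = build_trie(["london", "lisbon", "londonderry"])
hits = fuzzy_lookup(trie, "londn", 1)
```

Extending this to find approximate matches at arbitrary positions in a document, with correct offsets and longest-match semantics, is where most of the difficulty lies.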

@thomas-heitz

I also plan to use Double Metaphone followed by Levenshtein distance for phonetic matching:

  • convert the gazetteer entries into Double Metaphone values
  • keep the original value of each gazetteer entry as a feature "string"
  • add a feature Token.metaphone for Tokens in the text that are not in my dictionary
  • apply the metaphone gazetteer to these Token.metaphone values in the text
  • this yields several Lookup.string candidates for each Token
  • use Levenshtein distance to choose the best candidate for each Token by comparing Lookup.string and Token.string
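
The steps above could be sketched roughly as follows in Python. Note that this uses a simple Soundex key as a stand-in for Double Metaphone, which is considerably more involved and would normally come from a library; the function names and the dictionary-based index are assumptions for illustration, not GATE or plugin APIs.

```python
# Soundex digit for each consonant (stand-in for Double Metaphone).
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character Soundex key of a word, or "" if no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate equal codes
            prev = code
    return (key + "000")[:4]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_phonetic_index(entries):
    """Phonetic key -> original entries (the "string" feature)."""
    index = {}
    for entry in entries:
        index.setdefault(soundex(entry), []).append(entry)
    return index


def best_candidate(index, token):
    """Pick the candidate with the same key that is closest to the token."""
    candidates = index.get(soundex(token), [])
    if not candidates:
        return None
    return min(candidates, key=lambda c: levenshtein(token.lower(), c.lower()))


index = build_phonetic_index(["Robert", "Rupert", "Smith"])
best = best_candidate(index, "Robbert")
```

The phonetic key keeps the candidate set small, so the comparatively expensive edit distance is only computed for a handful of strings per token.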
