Skip to content
This repository has been archived by the owner on Sep 3, 2018. It is now read-only.

Rethink case sensitivity #4

Closed
johann-petrak opened this issue Jun 30, 2014 · 2 comments
Closed

Rethink case sensitivity #4

johann-petrak opened this issue Jun 30, 2014 · 2 comments

Comments

@johann-petrak
Copy link
Owner

Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)

@johann-petrak
Copy link
Owner Author

Maybe an alternate approach would be to add per-entry information about what the actual case of the entry is: for example we could always assume all-lower case and encode the most common deviations from that in just a few bits (e.g. all upper case, title case). Only if the deviations are bigger we would store all alternate case variations with the entry.
Another approach could be to have the normal trie with all-lowercase entries and just store the fact that alternate cases exist and that all-lowercase is NOT a match in a few bits. Then the alternate cases would be stored in a second compressed trie.
In other words we would always build the normal non-casesensitive trie, but in addition we would also build a case-sensitive trie for those entries which are not all lowercase.
If we match case sensitively, we would match the case-sensitive trie and the non-case sensitive one at the same time. If we get a non-lower case in the input, the non-case trie is abandoned. We are left we the case sensitive matches.
If we match non case sensitive, we do the same but do not abandon the lowercase trie but only set the non-sensitive lower match flag. We end up with either matching the lower case entry exactly, or matching an alternatavie exactly, or matching the lower case entry case insensitively which we can then record in the annotation.

@johann-petrak
Copy link
Owner Author

Moved to GateNLP/gateplugin-StringAnnotation#7

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant