Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink case sensitivity #7

Open
johann-petrak opened this issue Jun 18, 2018 · 1 comment
Open

Rethink case sensitivity #7

johann-petrak opened this issue Jun 18, 2018 · 1 comment

Comments

@johann-petrak
Copy link
Collaborator

Copied from: johann-petrak/gateplugin-StringAnnotation#4

Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)

@johann-petrak
Copy link
Collaborator Author

Maybe an alternate approach would be to add per-entry information about what the actual case of the entry is: for example we could always assume all-lower case and encode the most common deviations from that in just a few bits (e.g. all upper case, title case). Only if the deviations are bigger we would store all alternate case variations with the entry.
Another approach could be to have the normal trie with all-lowercase entries and just store the fact that alternate cases exist and that all-lowercase is NOT a match in a few bits. Then the alternate cases would be stored in a second compressed trie.
In other words we would always build the normal non-casesensitive trie, but in addition we would also build a case-sensitive trie for those entries which are not all lowercase.
If we match case sensitively, we would match the case-sensitive trie and the non-case sensitive one at the same time. If we get a non-lower case in the input, the non-case trie is abandoned. We are left we the case sensitive matches.
If we match non case sensitive, we do the same but do not abandon the lowercase trie but only set the non-sensitive lower match flag. We end up with either matching the lower case entry exactly, or matching an alternatavie exactly, or matching the lower case entry case insensitively which we can then record in the annotation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant