Rethink case sensitivity #4

johann-petrak · 2014-06-30T14:46:16Z

Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)

johann-petrak · 2015-06-30T14:52:51Z

Maybe an alternate approach would be to add per-entry information about what the actual case of the entry is: for example we could always assume all-lower case and encode the most common deviations from that in just a few bits (e.g. all upper case, title case). Only if the deviations are bigger we would store all alternate case variations with the entry.
Another approach could be to have the normal trie with all-lowercase entries and just store the fact that alternate cases exist and that all-lowercase is NOT a match in a few bits. Then the alternate cases would be stored in a second compressed trie.
In other words we would always build the normal non-casesensitive trie, but in addition we would also build a case-sensitive trie for those entries which are not all lowercase.
If we match case sensitively, we would match the case-sensitive trie and the non-case sensitive one at the same time. If we get a non-lower case in the input, the non-case trie is abandoned. We are left we the case sensitive matches.
If we match non case sensitive, we do the same but do not abandon the lowercase trie but only set the non-sensitive lower match flag. We end up with either matching the lower case entry exactly, or matching an alternatavie exactly, or matching the lower case entry case insensitively which we can then record in the annotation.

johann-petrak · 2018-06-18T14:59:13Z

Moved to GateNLP/gateplugin-StringAnnotation#7

johann-petrak mentioned this issue Jun 18, 2018

Rethink case sensitivity GateNLP/gateplugin-StringAnnotation#7

Open

johann-petrak closed this as completed Jun 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethink case sensitivity #4

Rethink case sensitivity #4

johann-petrak commented Jun 30, 2014

johann-petrak commented Jun 30, 2015

johann-petrak commented Jun 18, 2018

Rethink case sensitivity #4

Rethink case sensitivity #4

Comments

johann-petrak commented Jun 30, 2014

johann-petrak commented Jun 30, 2015

johann-petrak commented Jun 18, 2018