English only uses diacritics (like ´
, ^
and ¨
) for imported words — like rôle
, déjà and däis
— but usually they are optional. Other
languages require diacritics in order to be correct. Of course, just because
words are spelled correctly in your index doesn’t mean that the user will
search for the correct spelling.
It is often useful to strip diacritics from words, allowing rôle
to match
role
and vice versa. With Western languages, this can be done with the
asciifolding
character filter. Actually, it does more than just strip
diacritics. It tries to convert many Unicode characters into a simpler ASCII
representation, including:
-
ß
⇒ss
-
æ
⇒ae
-
ł
⇒l
-
ɰ
⇒m
-
⁇
⇒??
-
❷
⇒2
-
⁶
⇒6
Like the lowercase
filter, the asciifolding
filter doesn’t require any
configuration but can be included directly in a custom analyzer:
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}
}
GET /my_index?analyzer=folding
My œsophagus caused a débâcle (1)
-
Emits
my
,oesophagus
,caused
,a
,debacle
Of course, when you strip diacritical marks from a word, you lose meaning. For instance, consider these three Spanish words:
esta
|
Feminine form of the adjective |
ésta
|
An archaic form of |
está
|
The third person form of the verb |
While we would like to conflate the first two forms, they differ in meaning from the third form, which we would like to keep separate. Similarly:
sé
|
The first person form of the verb |
se
|
The third person reflexive pronoun used with many verbs, such as ``se sabe'' (it is known). |
Unfortunately, there is no easy way to separate out words that should have their diacritics removed and words that shouldn’t. And it is quite likely that your users won’t know either.
Instead, we index the text twice: once in the original form and once with diacritics removed:
PUT /my_index/_mapping/my_type
{
"properties": {
"title": { (1)
"type": "string",
"analyzer": "standard",
"fields": {
"folded": { (2)
"type": "string",
"analyzer": "folding"
}
}
}
}
}
-
The
title
field uses thestandard
analyzer and will contain the original word with diacritics in place. -
The
title.folded
field uses thefolding
analyzer, which strips the diacritical marks.
You can test out the field mappings using the analyze
API on the sentence
``Esta está loca'' (This woman is crazy):
GET /my_index/_analyze?field=title (1)
Esta está loca
GET /my_index/_analyze?field=title.folded (2)
Esta está loca
-
Emits
esta
,está
,loca
-
Emits
esta
,esta
,loca
Let’s index some documents to test it out:
PUT /my_index/my_type/1
{ "title": "Esta loca!" }
PUT /my_index/my_type/2
{ "title": "Está loca!" }
Now we can search across both fields, using the multi_match
query in
most_fields
mode to combine the scores from each field.
GET /my_index/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "esta loca",
"fields": [ "title", "title.folded" ]
}
}
}
Running this query through the validate-query
API helps to explain how the
query is executed:
GET /my_index/_validate/query?explain
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "está loca",
"fields": [ "title", "title.folded" ]
}
}
}
It searches for the original form of the word está
in the title
field,
and the form without diacritics esta
in the title.folded
field:
(title:está title:loca ) (title.folded:esta title.folded:loca)
It doesn’t matter whether the user searches for esta
or está
— both
documents will match because the form without diacritics exists in the the
title.folded
field. However, only the original form exists in the title
field. This extra match will push the document containing the original form of
the word to the top of the results list.
We use the title.folded
field to `widen the net'' in order to match more
documents, and use the original `title
field to push the most relevant
document to the top. This same technique can be used wherever an analyzer is
used to increase matches at the expense of meaning.
Tip
|
The Position 1 Position 2 -------------------------- (ésta,esta) loca -------------------------- While this appears to be a nice way to save space, it does mean that you have no way of saying ``Give me an exact match on the original word''. Mixing tokens with and without diacritics can also end up interfering with term frequency counts, resulting in less reliable relevance calcuations. As a rule, it is cleaner to index each field variant into a separate field, as we have done in this section. |