This repository was archived by the owner on Oct 4, 2022. It is now read-only.
Explanation
The currently used SentenceTokenizer generates wrong results when a punctuation mark such as !, ? or . is used within a word (e.g., in a company name).
Examples
Example 1
The following text
The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.
gets incorrectly parsed into the following sentences
0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."
Example 2
The same text as in Example 1 but with a . instead of the !
The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.
gets correctly parsed into one sentence
0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."
Example 3
The same text as in Example 2 but with the entire word FRITZ.APP capitalized
The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.
gets incorrectly parsed into the following sentences
0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."
Why does it happen?
The problem in Example 1 occurs because the SentenceTokenizer splits text on !, ?, ; and ... without checking whether the cut-off part begins as a proper sentence should (e.g., with a space and a capital letter). Here is the rule where this check should take place.
Note that such a check is implemented for the case where the text is split on a period (.). Specifically, the rule checks whether the second character of the cut-off remainder text is a capital letter, a number, etc.
However, the SentenceTokenizer does not check that the first character of the remainder text is a space, which is one reason why the problem in Example 3 occurs.
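The behavior described above, and one possible shape of a fix, can be sketched as follows. This is a simplified model written in Python for brevity, not the actual SentenceTokenizer code: the names `split_sentences`, `valid_start_current` and `valid_start_fixed` are hypothetical, and the "capital letter or number" test is a reduced version of whatever the real rule checks.

```python
import re

# Hypothetical model of the rules described in this issue.
# It is NOT the real SentenceTokenizer implementation.

# Delimiters named in the issue: '...', '.', '!', '?' and ';'.
DELIMITER = re.compile(r"\.\.\.|[.!?;]")


def valid_start_current(delimiter, rest):
    """Model of the current rule: only '.' is checked, and only the
    second character of the remainder is inspected."""
    if delimiter != ".":
        return True  # '!', '?', ';' and '...' always split
    return len(rest) > 1 and (rest[1].isupper() or rest[1].isdigit())


def valid_start_fixed(delimiter, rest):
    """Assumed fix: for every delimiter, require whitespace first and
    then a capital letter or a digit."""
    return (
        len(rest) > 1
        and rest[0].isspace()
        and (rest[1].isupper() or rest[1].isdigit())
    )


def split_sentences(text, is_valid_start):
    """Split text at each delimiter whose remainder passes the check."""
    sentences, start = [], 0
    for match in DELIMITER.finditer(text):
        rest = text[match.end():]
        # The end of the text always closes the current sentence.
        if rest == "" or is_valid_start(match.group(), rest):
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example_1 = ("The free App FRITZ!App WLAN helps to find the ideal "
             "locations when setting up a repeater.")
print(split_sentences(example_1, valid_start_current))  # two sentences
print(split_sentences(example_1, valid_start_fixed))    # one sentence
```

Under these assumptions, the stricter check keeps all three example texts in one piece while still splitting ordinary text such as "It works! Really." into two sentences. A real fix would also need to decide how to treat sentences that start with a quotation mark or a lowercase word, which this sketch does not split.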
Things to consider
A fix for both problems seems fairly straightforward to implement.
A few users have complained about these issues.
The currently used SentenceTokenizer will not be used in its current form once the tree-based text parser is implemented, because the tokenizer relies on HTML tags.
We will still need some variant of a sentence tokenizer to operate on sentences in researches, so the work on fixing the current sentence tokenizer will not necessarily be lost.
For the text in Example 1, see Yoast/wordpress-seo#13726.