SentenceTokenizer incorrectly processes punctuation marks within words

## Explanation

The currently used [SentenceTokenizer](https://github.com/Yoast/javascript/blob/develop/packages/yoastseo/src/stringProcessing/SentenceTokenizer.js) generates wrong results when a punctuation mark such as `!`, `?`, or `.` is used within a word (e.g., in a company name).

### Examples
#### Example 1
The following text (see https://github.com/Yoast/wordpress-seo/issues/13726)

> The free App FRITZ!App WLAN helps to find the ideal locations when setting up a repeater.

gets **incorrectly** parsed into the following sentences
```
0: "The free App FRITZ!"
1: "App WLAN helps to find the ideal locations when setting up a repeater."
```

#### Example 2
The same text as in Example 1 but with a `.` instead of the `!`

> The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater.

gets **correctly** parsed into one sentence
```
0: "The free App FRITZ.App WLAN helps to find the ideal locations when setting up a repeater."
```

#### Example 3
The same text as in Example 2 but with the entire word `FRITZ.APP` capitalized

> The free App FRITZ.APP WLAN helps to find the ideal locations when setting up a repeater.

gets **incorrectly** parsed into the following sentences
```
0: "The free App FRITZ."
1: "APP WLAN helps to find the ideal locations when setting up a repeater."
```

## Why does it happen?
The problem in Example 1 occurs because the SentenceTokenizer splits text on `!`, `?`, `;` and `...` without checking whether the remainder after the split begins as a proper sentence should (e.g., with a space and a capital letter). [Here](https://github.com/Yoast/javascript/blob/develop/packages/yoastseo/src/stringProcessing/SentenceTokenizer.js#L320:L327) is the rule where this check should take place.

Note that such a check [is implemented](https://github.com/Yoast/javascript/blob/develop/packages/yoastseo/src/stringProcessing/SentenceTokenizer.js#L329:L346) for the case where the text is split on a `.`. Specifically, the rule checks whether the **second** character of the cut-off remainder is a capital letter, a number, etc.
However, the SentenceTokenizer does not check that the **first** character of the remainder is a space, which is one reason why the problem in Example 3 occurs.
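
To make the missing check concrete, here is a minimal sketch of a splitter that applies the same "valid sentence beginning" test after every delimiter. It is not the actual SentenceTokenizer code: `isValidSentenceBeginning` and `splitSentences` are hypothetical names, and the rule (whitespace followed by a capital letter or a digit) is a simplified assumption that ignores quotes, brackets, HTML tags and abbreviations.

```javascript
// Hypothetical helper, NOT part of the real SentenceTokenizer API.
// Assumption: a new sentence starts with whitespace followed by
// an uppercase letter or a digit.
function isValidSentenceBeginning( remainder ) {
	return /^\s+[\p{Lu}\d]/u.test( remainder );
}

// Simplified splitter: only cuts at a delimiter when the remainder
// looks like a new sentence (or when nothing but whitespace is left).
function splitSentences( text ) {
	const sentences = [];
	const delimiter = /\.{3}|[.!?;]/g;
	let start = 0;
	let match;

	while ( ( match = delimiter.exec( text ) ) !== null ) {
		const end = match.index + match[ 0 ].length;
		const remainder = text.slice( end );

		if ( remainder.trim() === "" || isValidSentenceBeginning( remainder ) ) {
			sentences.push( text.slice( start, end ).trim() );
			start = end;
		}
	}

	// Keep any trailing text that did not end with a delimiter.
	const rest = text.slice( start ).trim();
	if ( rest !== "" ) {
		sentences.push( rest );
	}
	return sentences;
}

console.log( splitSentences( "The free App FRITZ!App WLAN helps to find the ideal locations." ) );
console.log( splitSentences( "The free App FRITZ.APP WLAN helps to find the ideal locations." ) );
```

With this rule, the texts from Example 1 and Example 3 each stay a single sentence, because the character directly after `!` or `.` is not whitespace. Ordinary text such as `Hello world! This is fine.` is still split in two.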

## Things to consider
1. A fix for both problems seems fairly straightforward to implement.
2. A few users complained about these issues.
3. The currently used SentenceTokenizer will not be used in its current form when the tree-based text parser is implemented, because the said tokenizer relies on HTML tags. 
4. We will still need a variant of a sentence tokenizer to be able to work with sentences in researches. Therefore, the work on fixing the current sentence tokenizer will not necessarily be lost.
