[DRAFT] Update docstrings and allow AbstractString #7

Merged · 4 commits · Jul 27, 2024
8 changes: 4 additions & 4 deletions README.md
@@ -2,11 +2,11 @@

[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://astariul.github.io/Sentencize.jl/stable/) [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://astariul.github.io/Sentencize.jl/dev/) [![Build Status](https://github.com/astariul/Sentencize.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/astariul/Sentencize.jl/actions/workflows/CI.yml?query=branch%3Amain) [![Coverage](https://codecov.io/gh/astariul/Sentencize.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/astariul/Sentencize.jl) [![Aqua](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)

**Text to sentence splitter using heuristic algorithm.**
**Text to sentence splitter using a heuristic algorithm.**

This module is a port of the [python package `sentence-splitter`](https://github.com/berkmancenter/mediacloud-sentence-splitter).
This module is a port of the [Python package `sentence-splitter`](https://github.com/berkmancenter/mediacloud-sentence-splitter).

The module allows splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the [Europarl corpus](http://www.statmt.org/europarl/).
The module allows the splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the [Europarl corpus](http://www.statmt.org/europarl/).

## Usage

@@ -31,7 +31,7 @@ println(sen)
You can specify your own non-breaking prefixes file:

```julia
sen = Sentencize.split_sentence("This is an example.", prefix_file="my_prefixes.txt", lang=missing)
sen = Sentencize.split_sentence("This is an example.", prefix_file="my_prefixes.txt", lang=nothing)
```
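The prefix-file format itself isn't shown in this diff, but judging from `_load_prefix_file` in `src/Sentencize.jl` below, each line holds one prefix, `#` starts a comment, blank lines are skipped, and a `#NUMERIC_ONLY#` marker flags a prefix that only suppresses a break before a number. A hypothetical `my_prefixes.txt` could look like:

```
# Honorifics (comment lines and blank lines are ignored)
Dr
Prof

No #NUMERIC_ONLY#
```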

Or even pass the prefixes as a dictionary:
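The example collapsed here presumably mirrors the one this PR adds to `docs/src/index.md`:

```julia
sen = Sentencize.split_sentence("This is another example. Another sentence.", prefixes=Dict("example" => Sentencize.default))
# ["This is another example. Another sentence."]
```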
3 changes: 2 additions & 1 deletion docs/make.jl
@@ -13,7 +13,8 @@ makedocs(;
assets = String[]
),
pages = [
"Home" => "index.md"
"Home" => "index.md",
"API Reference" => "api_reference.md"
]
)

6 changes: 6 additions & 0 deletions docs/src/api_reference.md
@@ -0,0 +1,6 @@
```@index
```

```@autodocs
Modules = [Sentencize]
```
67 changes: 64 additions & 3 deletions docs/src/index.md
@@ -6,9 +6,70 @@ CurrentModule = Sentencize

Documentation for [Sentencize](https://github.com/astariul/Sentencize.jl).

```@index
**Text to sentence splitter using a heuristic algorithm.**

This module is a port of the [Python package `sentence-splitter`](https://github.com/berkmancenter/mediacloud-sentence-splitter).

The module allows the splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the [Europarl corpus](http://www.statmt.org/europarl/).

## Usage

The module uses punctuation and capitalization clues to split plain text into a list of sentences:

```julia
import Sentencize

sen = Sentencize.split_sentence("This is a paragraph. It contains several sentences. \"But why,\" you ask?")
println(sen)
# ["This is a paragraph.", "It contains several sentences.", "\"But why,\" you ask?"]
```

You can specify a language other than English:

```julia
sen = Sentencize.split_sentence("Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement...", lang="fr")
println(sen)
# ["Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement..."]
```

You can specify your own non-breaking prefixes file:

```julia
sen = Sentencize.split_sentence("This is an example.", prefix_file="my_prefixes.txt", lang=nothing)
```

```@autodocs
Modules = [Sentencize]
Or even pass the prefixes as a dictionary:

```julia
sen = Sentencize.split_sentence("This is another example. Another sentence.", prefixes=Dict("example" => Sentencize.default))
# ["This is another example. Another sentence."]
```
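For contrast, a minimal sketch of the other `PrefixType` value, `numeric_only` (the sentence and the `lang=nothing` setting are illustrative, not taken from the package docs): such a prefix suppresses a break only when the next word starts with a digit.

```julia
sen = Sentencize.split_sentence("See No. 5 and No. Six.",
    prefixes=Dict("No" => Sentencize.numeric_only), lang=nothing)
# Presumably: ["See No. 5 and No.", "Six."]
```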

## Languages

Currently supported languages are:

- Catalan (`ca`)
- Czech (`cs`)
- Danish (`da`)
- Dutch (`nl`)
- English (`en`)
- Finnish (`fi`)
- French (`fr`)
- German (`de`)
- Greek (`el`)
- Hungarian (`hu`)
- Icelandic (`is`)
- Italian (`it`)
- Latvian (`lv`)
- Lithuanian (`lt`)
- Norwegian (Bokmål) (`no`)
- Polish (`pl`)
- Portuguese (`pt`)
- Romanian (`ro`)
- Russian (`ru`)
- Slovak (`sk`)
- Slovene (`sl`)
- Spanish (`es`)
- Swedish (`sv`)
- Turkish (`tr`)
160 changes: 112 additions & 48 deletions src/Sentencize.jl
@@ -5,59 +5,106 @@ export PrefixType
export Prefixes
export split_sentence

"""
SUPPORTED_LANG

Supported languages.

Contains:
- "ca"
- "cs"
- "da"
- "de"
- "el"
- "en"
- "es"
- "fi"
- "fr"
- "hu"
- "is"
- "it"
- "lt"
- "lv"
- "nl"
- "no"
- "pl"
- "pt"
- "ro"
- "ru"
- "sk"
- "sl"
- "sv"
- "tr"
"""
const SUPPORTED_LANG = ["ca", "cs", "da", "de", "el", "en", "es", "fi", "fr",
"hu", "is", "it", "lt", "lv", "nl", "no", "pl", "pt",
"ro", "ru", "sk", "sl", "sv", "tr"]
"hu", "is", "it", "lt", "lv", "nl", "no", "pl", "pt",
"ro", "ru", "sk", "sl", "sv", "tr"]

@enum PrefixType default numeric_only

struct Prefixes
non_breaking_prefixes::Dict{String, PrefixType}

function Prefixes(prefixes::Dict{String, PrefixType}=Dict{String, PrefixType}(); prefix_file=missing, lang="en")
function _load_prefix_file(pfile)
nb_prefixes = Dict{String, PrefixType}()
open(pfile) do file
for line in eachline(file)
if occursin("#NUMERIC_ONLY#", line)
prefix_type = numeric_only
else
prefix_type = default
end
function _load_prefix_file(T::Type{<:AbstractString}, pfile)
nb_prefixes = Dict{T,PrefixType}()
open(pfile) do file
for line in eachline(file)
if occursin("#NUMERIC_ONLY#", line)
prefix_type = numeric_only
else
prefix_type = default
end

line = replace(line, r"#.*" => "") # Remove comments
line = strip(line)

if isempty(line)
continue
end

nb_prefixes[line] = prefix_type
end
line = replace(line, r"#.*" => "") # Remove comments
line = strip(line)

if isempty(line)
continue
end
return nb_prefixes

nb_prefixes[line] = prefix_type
end

if !ismissing(lang)
end
return nb_prefixes
end


@enum PrefixType default numeric_only

"""
Prefixes(prefixes::Dict{T,PrefixType}=Dict{String,PrefixType}(); prefix_file::Union{String,Nothing}=nothing, lang::Union{String,Nothing}="en") where {T<:AbstractString}

Constructs `Prefixes`.

# Arguments
- `prefixes::Dict{T,PrefixType}=Dict{String,PrefixType}()`: Optional. A dictionary of non-breaking prefixes.
- `prefix_file::Union{String,Nothing}=nothing`: Optional. A path to a file containing non-breaking prefixes to add to the provided `prefixes`.
- `lang::Union{String,Nothing}="en"`: Optional. The language of the non-breaking prefixes (see `?SUPPORTED_LANG` for available languages) to be added to `prefixes`; pass `nothing` to skip loading a bundled list.
"""
struct Prefixes{T<:AbstractString}
non_breaking_prefixes::Dict{T,PrefixType}

function Prefixes(prefixes::Dict{T,PrefixType}=Dict{String,PrefixType}(); prefix_file::Union{String,Nothing}=nothing, lang::Union{String,Nothing}="en") where {T<:AbstractString}

if !isnothing(lang)
if !(lang in SUPPORTED_LANG)
throw(ArgumentError("Unsupported language. Use a supported language ($SUPPORTED_LANG). " *
throw(ArgumentError("Unsupported language. Use a supported language ($SUPPORTED_LANG). " *
"You can also provide your own non_breaking_prefixes file with the " *
"keyword argument `prefix_file`."))
else
merge!(prefixes, _load_prefix_file(joinpath(@__DIR__, "non_breaking_prefixes/$lang.txt")))
merge!(prefixes, _load_prefix_file(T, joinpath(@__DIR__, "non_breaking_prefixes/$lang.txt")))
end
end

if !ismissing(prefix_file)
merge!(prefixes, _load_prefix_file(prefix_file))
if !isnothing(prefix_file)
if isfile(prefix_file)
merge!(prefixes, _load_prefix_file(T, prefix_file))
else
throw(ArgumentError("File $prefix_file does not exist."))
end
end

new(prefixes)
new{T}(prefixes)
end
end
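# A hypothetical construction sketch (not part of this diff): the key type of
# `prefixes` now parameterizes the struct, so
#
#     pf = Prefixes(Dict(strip(" Dr ") => default); lang = nothing)
#
# would presumably yield a `Prefixes{SubString{String}}` holding only "Dr".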


function _basic_sentence_breaks(text::String)
function _basic_sentence_breaks(text::AbstractString)
# Non-period end of sentence markers (?!) followed by sentence starters
text = replace(text, r"([?!]) +(['\"([\u00bf\u00A1\p{Pi}]*[\p{Lu}\p{Lo}])" => s"\1\n\2")

@@ -76,12 +123,12 @@ function _basic_sentence_breaks(text::String)
end


function _is_prefix_honorific(prefix::SubString{String}, starting_punct::SubString{String}, non_breaking_prefixes::Dict{String,PrefixType})
function _is_prefix_honorific(prefix::AbstractString, starting_punct::AbstractString, non_breaking_prefixes::Dict{<:AbstractString,PrefixType})
# Check if \\1 is a known honorific and \\2 is empty.
if prefix != ""
if !isempty(prefix)
if prefix in keys(non_breaking_prefixes)
if non_breaking_prefixes[prefix] == default
if starting_punct == ""
if isempty(starting_punct)
return true
end
end
@@ -91,13 +138,13 @@ function _is_numeric(prefix::SubString{String}, starting_punct::SubStri
end


function _is_numeric(prefix::SubString{String}, starting_punct::SubString{String}, next_word::SubString{String}, non_breaking_prefixes::Dict{String,PrefixType})
function _is_numeric(prefix::AbstractString, starting_punct::AbstractString, next_word::AbstractString, non_breaking_prefixes::Dict{<:AbstractString,PrefixType})
# The next word has a bunch of initial quotes, maybe a space, then either upper case or a number.
if prefix != ""
if !isempty(prefix)
if prefix in keys(non_breaking_prefixes)
if non_breaking_prefixes[prefix] == numeric_only
if starting_punct == ""
if match(r"^[0-9]+", next_word) != nothing
if isempty(starting_punct)
if !isnothing(match(r"^[0-9]+", next_word))
return true
end
end
@@ -108,7 +155,24 @@ function _is_numeric(prefix::SubString{String}, starting_punct::SubString{String
end


function split_sentence(text::String; prefixes::Dict{String, PrefixType}=Dict{String, PrefixType}(), prefix_file=missing, lang="en")
"""
split_sentence(text::AbstractString; prefixes::Dict{<:AbstractString,PrefixType}=Dict{String,PrefixType}(), prefix_file::Union{String,Nothing}=nothing, lang::Union{String,Nothing}="en")

Splits `text` into sentences.

# Arguments
- `text::AbstractString`: The text to split into sentences.
- `prefixes::Dict{<:AbstractString,PrefixType}`: Optional. A dictionary of non-breaking prefixes.
- `prefix_file::Union{String,Nothing}`: Optional. A path to a file containing non-breaking prefixes to add to the provided `prefixes`.
- `lang::Union{String,Nothing}`: Optional. The language of the non-breaking prefixes (see `?SUPPORTED_LANG` for available languages) to be added to `prefixes`. Defaults to `"en"` (English); pass `nothing` to skip loading a bundled list.

# Examples
```julia
split_sentence("This is a paragraph. It contains several sentences. \"But why,\" you ask?")
# Output: ["This is a paragraph.", "It contains several sentences.", "\"But why,\" you ask?"]
```
"""
function split_sentence(text::AbstractString; prefixes::Dict{<:AbstractString,PrefixType}=Dict{String,PrefixType}(), prefix_file::Union{String,Nothing}=nothing, lang::Union{String,Nothing}="en")
if text == ""
return []
end
@@ -120,19 +184,19 @@ function split_sentence(text::String; prefixes::Dict{String, PrefixType}=Dict{St
# Special punctuation cases : check all remaining periods
words = split(text, r" +")
text = ""
for i in 1:length(words) - 1
for i in 1:length(words)-1
m = match(r"([\w\.\-]*)(['\"\)\]\%\p{Pf}]*)(\.+)$", words[i])

if m != nothing
if !isnothing(m)
prefix = m.captures[1]
starting_punct = m.captures[2]

if _is_prefix_honorific(prefix, starting_punct, pf.non_breaking_prefixes)
# Not breaking
elseif match(r"(\.)[\p{Lu}\p{Lo}\-]+(\.+)$", words[i]) != nothing
elseif !isnothing(match(r"(\.)[\p{Lu}\p{Lo}\-]+(\.+)$", words[i]))
# Not breaking - upper case acronym
elseif match(r"^([ ]*['\"([\u00bf\u00A1\p{Pi}]*[ ]*[\p{Lu}\p{Lo}0-9])", words[i + 1]) != nothing
if !_is_numeric(prefix, starting_punct, words[i + 1], pf.non_breaking_prefixes)
elseif !isnothing(match(r"^([ ]*['\"([\u00bf\u00A1\p{Pi}]*[ ]*[\p{Lu}\p{Lo}0-9])", words[i+1]))
if !_is_numeric(prefix, starting_punct, words[i+1], pf.non_breaking_prefixes)
words[i] = words[i] * "\n"
# We always add a return for these unless we have a numeric non-breaker and a number start
end
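The headline change — `split_sentence` now accepts any `AbstractString` — is exercised by the new tests below; a minimal sketch of what it enables (using a `SubString` as produced by `strip`):

```julia
import Sentencize

s = strip("  This is a paragraph. It contains several sentences.  ")  # a SubString{String}
sen = Sentencize.split_sentence(s)
# ["This is a paragraph.", "It contains several sentences."]
```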
17 changes: 14 additions & 3 deletions test/Sentencize.jl
@@ -10,15 +10,15 @@
@test "test-prefix" in keys(ss.non_breaking_prefixes) &&
"Apr" in keys(ss.non_breaking_prefixes) # English is also loaded

ss = Prefixes(Dict("test-prefix" => Sentencize.default), lang = missing)
ss = Prefixes(Dict("test-prefix" => Sentencize.default), lang = nothing)
@test "test-prefix" in keys(ss.non_breaking_prefixes) &&
!("Apr" in keys(ss.non_breaking_prefixes))

ss = Prefixes(prefix_file = "test.txt")
@test "another-test-prefix" in keys(ss.non_breaking_prefixes) &&
"Apr" in keys(ss.non_breaking_prefixes)

ss = Prefixes(prefix_file = "test.txt", lang = missing)
ss = Prefixes(prefix_file = "test.txt", lang = nothing)
@test "another-test-prefix" in keys(ss.non_breaking_prefixes) &&
!("Apr" in keys(ss.non_breaking_prefixes))

@@ -28,7 +28,7 @@
"test-prefix" in keys(ss.non_breaking_prefixes)

ss = Prefixes(
Dict("test-prefix" => Sentencize.default), prefix_file = "test.txt", lang = missing)
Dict("test-prefix" => Sentencize.default), prefix_file = "test.txt", lang = nothing)
@test "another-test-prefix" in keys(ss.non_breaking_prefixes) &&
!("Apr" in keys(ss.non_breaking_prefixes)) &&
"test-prefix" in keys(ss.non_breaking_prefixes)
@@ -40,6 +40,11 @@
"test-prefix" in keys(ss.non_breaking_prefixes)

@test_throws ArgumentError Prefixes(lang = "some-weird-language")

## Test a non-String prefix key type (SubString)
ss = Prefixes(Dict(strip(" test-prefix") => Sentencize.default), lang = nothing)
@test "test-prefix" in keys(ss.non_breaking_prefixes) &&
!("Apr" in keys(ss.non_breaking_prefixes))
end

@testset "split_sentence" begin
@@ -106,4 +111,10 @@ end
prefixes = Dict("Prefix1" => Sentencize.default, "#Hello" => Sentencize.default,
"Prefix2" => Sentencize.default)) ==
["Hello.", "Prefix1. Prefix2. Hello again.", "Good bye."]

## Different string types
@test split_sentence(strip(" Hello. Prefix1. Prefix2. Hello again. Good bye. "),
prefixes = Dict("Prefix1" => Sentencize.default, "#Hello" => Sentencize.default,
"Prefix2" => Sentencize.default)) ==
["Hello.", "Prefix1. Prefix2. Hello again.", "Good bye."]
end