Text to sentence splitter using a heuristic algorithm.
This module is a port of the Python package sentence-splitter.
The module splits text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the Europarl corpus.
The module uses punctuation and capitalization clues to split plain text into a list of sentences:
import Sentencize
sen = Sentencize.split_sentence("This is a paragraph. It contains several sentences. \"But why,\" you ask?")
println(sen)
# ["This is a paragraph.", "It contains several sentences.", "\"But why,\" you ask?"]
You can specify a language other than English:
sen = Sentencize.split_sentence("Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement...", lang="fr")
println(sen)
# ["Brookfield Office Properties Inc. (« BOPI »), dont les actifs liés aux immeubles directement..."]
You can specify your own non-breaking prefixes file:
sen = Sentencize.split_sentence("This is an example.", prefix_file="my_prefixes.txt", lang=nothing)
Or pass the non-breaking prefixes directly as a dictionary (here, registering "example" as a prefix prevents a split after "example."):
sen = Sentencize.split_sentence("This is another example. Another sentence.", prefixes=Dict("example" => Sentencize.default))
# ["This is another example. Another sentence."]
Currently supported languages are:
- Catalan (ca)
- Czech (cs)
- Danish (da)
- Dutch (nl)
- English (en)
- Finnish (fi)
- French (fr)
- German (de)
- Greek (el)
- Hungarian (hu)
- Icelandic (is)
- Italian (it)
- Latvian (lv)
- Lithuanian (lt)
- Norwegian (Bokmål) (no)
- Polish (pl)
- Portuguese (pt)
- Romanian (ro)
- Russian (ru)
- Slovak (sk)
- Slovene (sl)
- Spanish (es)
- Swedish (sv)
- Turkish (tr)
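Any of the codes above can be passed via the `lang` keyword shown earlier. A minimal sketch (the German input text is illustrative, not from the package's test suite):

```julia
import Sentencize

# Split a German paragraph; "de" is one of the supported codes listed above.
sen = Sentencize.split_sentence("Das ist ein Satz. Hier ist noch einer.", lang="de")
println(sen)
```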