Spellchecker that recursively fetches HTML pages, converts them to plain text, and spellchecks them.

suhlig/httpspell

httpspell

This is a spellchecker that recursively fetches HTML pages, converts them to plain text (using pandoc), and spellchecks them with hunspell. Unknown words will be printed to stdout, which makes the tool a good candidate for CI pipelines where you might want to take action when a spelling error is found on a web page.

Words that are not in the dictionary for the given language (inferred from the lang attribute of the HTML document's root element) can be added to a personal dictionary, which will mark the word as correctly spelled.
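As an illustration of how a language can be read from the root element, here is a minimal Python sketch using only the standard library. It is not httpspell's own code; the names `RootLangParser` and `document_language` are hypothetical, and the `override` parameter merely mirrors the idea of passing a language explicitly.

```python
from html.parser import HTMLParser


class RootLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self._seen_root = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and not self._seen_root:
            self._seen_root = True
            self.lang = dict(attrs).get("lang")


def document_language(html, override=None):
    """Return the explicit override if given, else the root lang (or None)."""
    if override:
        return override
    parser = RootLangParser()
    parser.feed(html)
    return parser.lang
```

A document without a lang attribute yields `None` here; what a real tool should do in that case is a design decision this sketch leaves open.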

Usage

  • The following command will retrieve the HTML document at https://example.com, spellcheck it, and not print anything because there are no errors:

    $ httpspell https://example.com

    The exit code is 0.

  • The following command will spellcheck the README of this project as rendered by GitHub, and print a list of unknown words. Note that we set the language to en_US because GitHub declares 'en' as the document language, but the installed dictionaries usually refer to a specific language variant like en_US:

    $ httpspell https://github.com/suhlig/httpspell/blob/master/README.markdown --language en_US
    suhlig
    Permalink
    httpspell
    sloc
    pandoc
    hunspell
    ...

    The exit code is 1.
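
Because unknown words go to stdout and the exit code signals whether any were found, a CI step can wrap the call in a few lines. The following Python sketch assumes only that behavior. The helper names `unknown_words` and `report` are hypothetical, and the command is passed in as a list so that nothing about httpspell's options is assumed beyond the `--language` flag shown above.

```python
import subprocess


def unknown_words(command):
    """Run a spellcheck command; return (exit_code, words printed to stdout)."""
    result = subprocess.run(command, capture_output=True, text=True)
    words = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return result.returncode, words


def report(command):
    """Print each unknown word with a prefix and hand back the exit code."""
    code, words = unknown_words(command)
    for word in words:
        print(f"unknown word: {word}")
    return code
```

In a pipeline, something like `sys.exit(report(["httpspell", "https://example.com", "--language", "en_US"]))` would fail the job whenever httpspell exits with a non-zero code, provided httpspell is on the PATH.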

What is not checked

  • When spidering a site, httpspell will skip all responses with a content-type header other than text/html (unless it is pointed at a file, in which case it accepts anything).
  • Before converting, httpspell removes the following nodes from the HTML DOM as they are not a good target for spellchecking:
    • code
    • pre
    • Elements with spellcheck='false' (this is how HTML5 allows tagging elements as being a target for spellchecking or not)
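
The two rules above can be sketched in Python with the standard library alone. This is an illustration of the filtering idea, not httpspell's implementation (which converts the remaining HTML with pandoc); the names `is_html`, `SpellcheckableText`, and `spellcheckable_text` are hypothetical.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"code", "pre"}
# Void elements never get an end tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_html(content_type):
    """True if a Content-Type header value names text/html (parameters ignored)."""
    return content_type.split(";")[0].strip().lower() == "text/html"


class SpellcheckableText(HTMLParser):
    """Collects text that is outside code, pre, and spellcheck='false' subtrees."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._open = []  # tags opened inside a skipped subtree, outermost first

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        flag = (dict(attrs).get("spellcheck") or "").lower()
        if self._open or tag in SKIPPED_TAGS or flag == "false":
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to the matching start tag, tolerating unclosed children.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._open:
            self.chunks.append(data)


def spellcheckable_text(html):
    """Return the remaining text with runs of whitespace collapsed."""
    parser = SpellcheckableText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Tracking a stack of open tags rather than a single flag keeps nested elements inside a skipped subtree from ending the skip too early.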

Misc

If you produce content with kramdown (e.g. using Jekyll), an Inline Attribute List can be used to set spellcheck='false' for an element by adding this line after the element (e.g. a heading):

{: spellcheck="false"}

Dictionaries

Hunspell uses the system dictionary paths; on the Mac this is ~/Library/Spelling/. Get some dictionaries as explained in the hunspell project:

$ wget -O ~/Library/Spelling/en_US.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
$ wget -O ~/Library/Spelling/en_US.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic

German:

$ wget -O ~/Library/Spelling/de_DE.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/de_DE_frami.dic
$ wget -O ~/Library/Spelling/de_DE.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/de_DE_frami.aff

Italian (for integration tests):

$ wget -O ~/Library/Spelling/it_IT.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/it_IT/it_IT.dic
$ wget -O ~/Library/Spelling/it_IT.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/it_IT/it_IT.aff
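
Hunspell needs both the .aff and the .dic file for a language, so after downloading it can help to confirm that each pair is complete. A small Python sketch follows; the function name `installed_dictionaries` is hypothetical, and the directory is whatever path you downloaded into (for example ~/Library/Spelling).

```python
from pathlib import Path


def installed_dictionaries(directory):
    """Return sorted names (e.g. 'en_US') that have both a .aff and a .dic file."""
    directory = Path(directory).expanduser()
    aff = {path.stem for path in directory.glob("*.aff")}
    dic = {path.stem for path in directory.glob("*.dic")}
    return sorted(aff & dic)
```

A language that appears in only one of the two sets is incomplete and is left out of the result.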
