KaTo: Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool

Overview

KaTo is a command-line interface (CLI) tool for three text preprocessing steps in Natural Language Processing (NLP): tokenization, normalization, and lemmatization. It is written entirely in Haskell in a functional style: it reads text typed at the terminal and prints the result of each step.


Logic Diagram

Below is a simplified logic diagram illustrating the flow of data in KaTo:

+------------------+
|                  |
|  User Input Text |
|                  |
+--------+---------+
         |
         v
+--------+---------+
|                  |
|    Tokenization  | <----+
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|   Normalization  |       |
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|   Lemmatization  |       |
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|  Display Results |       |
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+------------------+       |
|                  |       |
|  User Receives   |       |
| Processed Output |       |
|                  |       |
+------------------+       |

Components of the Logic Diagram

  1. User Input Text: The process begins with the user providing text input through the command line.

  2. Tokenization: The text is split into individual tokens or words for further processing.

  3. Normalization: Each token is standardized (e.g., lowercased, punctuation removed) to ensure consistency.

  4. Lemmatization: The normalized tokens are transformed into their base forms (lemmas).

  5. Display Results: The results of the tokenization, normalization, and lemmatization are formatted and prepared for display.

  6. User Receives Processed Output: Finally, the user sees the processed output in the terminal.
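
The three processing stages above can be sketched as pure Haskell functions. This is a minimal illustration and not KaTo's actual source: the function names (tokenize, normalize, lemmatize) and the tiny lemmaTable are hypothetical, and a real lemmatizer would need a much larger dictionary or a rule-based approach.

```haskell
module Main where

import Data.Char (isPunctuation, toLower)
import Data.Maybe (fromMaybe)

-- | Tokenization: split the input on whitespace.
-- Punctuation stays attached to its word at this stage.
tokenize :: String -> [String]
tokenize = words

-- | Normalization: strip punctuation characters, lowercase what is left,
-- and drop tokens that end up empty (e.g. a lone "-").
normalize :: [String] -> [String]
normalize = filter (not . null) . map (map toLower . filter (not . isPunctuation))

-- | Hypothetical toy lemma table, just large enough for the sample sentence.
lemmaTable :: [(String, String)]
lemmaTable =
  [ ("children", "child")
  , ("are", "be")
  , ("running", "run")
  , ("quickly", "quick")
  ]

-- | Lemmatization: look each token up in the table and
-- leave unknown tokens unchanged.
lemmatize :: [String] -> [String]
lemmatize = map (\t -> fromMaybe t (lookup t lemmaTable))

main :: IO ()
main = do
  let tokens     = tokenize "The children are running quickly."
      normalized = normalize tokens
      lemmas     = lemmatize normalized
  print tokens      -- ["The","children","are","running","quickly."]
  print normalized  -- ["the","children","are","running","quickly"]
  print lemmas      -- ["the","child","be","run","quick"]
```

Because each stage is a pure function from one token list to the next, the whole pipeline is just the composition lemmatize . normalize . tokenize. Note that stripping every punctuation character also removes apostrophes and hyphens inside words, so "don't" becomes "dont" in this sketch.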


Setting Up KaTo

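  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-name>

  2. Install Dependencies: Ensure you have GHC (the Haskell compiler) and Cabal installed, then run the commands below. If your Cabal version rejects the second command, cabal build --only-dependencies may be needed in its place.

    cabal update
    cabal install --only-dependencies

  3. Build the Project:

    cabal build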

Using KaTo

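  1. Run the Application: From the project directory, run the command below. The example session in the next section starts the program as kato instead, which assumes the built executable is on your PATH.

    cabal run

  2. Enter Text for Processing: When prompted, type or paste the text you want to analyze and press Enter.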

  3. View Results: After processing, KaTo will display the tokens, normalized tokens, and lemmatized tokens.

Example Usage

$ kato
Welcome to KaTo: A Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool!
Please enter the text you want to process:
> The children are running quickly.
Tokens: ["The", "children", "are", "running", "quickly."]
Normalized Tokens: ["the", "children", "are", "running", "quickly"]
Lemmatized Tokens: ["the", "child", "be", "run", "quick"]
Process completed successfully.
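
A session like the one above could be driven by a small IO wrapper such as the following. This is a sketch under assumptions, not KaTo's real entry point: renderTokens and report are hypothetical helpers, and only the tokenization line is printed here to keep the example self-contained. The normalized and lemmatized lists would be reported the same way.

```haskell
module Main where

import Data.List (intercalate)
import System.IO (hFlush, stdout)

-- | Render a token list in the style of the transcript: ["a", "b"].
-- Uses 'show' on each token, so non-ASCII characters are printed escaped.
renderTokens :: [String] -> String
renderTokens ts = "[" ++ intercalate ", " (map show ts) ++ "]"

-- | One labelled output line, e.g. Tokens: ["The", "children"].
report :: String -> [String] -> String
report label ts = label ++ ": " ++ renderTokens ts

main :: IO ()
main = do
  putStrLn "Welcome to KaTo: A Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool!"
  putStrLn "Please enter the text you want to process:"
  putStr "> "
  hFlush stdout                 -- make sure the prompt appears before reading
  line <- getLine
  putStrLn (report "Tokens" (words line))
  putStrLn "Process completed successfully."
```

The explicit hFlush is there because stdout is usually line-buffered, so without it the "> " prompt may not appear until after the user has typed a line.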

Contributing

We welcome contributions! Please refer to the Contribution Guidelines for details on how to get involved.


License

This project is licensed under the MIT License.



