Skip to content

Latest commit

 

History

History
142 lines (104 loc) · 3.85 KB

README.md

File metadata and controls

142 lines (104 loc) · 3.85 KB

KaTo: Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool

Overview

KaTo is a command-line interface (CLI) tool designed specifically for Tokenization, Normalization, and Lemmatization in Natural Language Processing (NLP). Built entirely in Haskell using functional programming principles, KaTo provides a straightforward and efficient solution for processing text input.


Logic Diagram

Below is a simplified logic diagram illustrating the flow of data in KaTo:

+------------------+
|                  |
|  User Input Text |
|                  |
+--------+---------+
         |
         v
+--------+---------+
|                  |
|    Tokenization  | <----+
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|   Normalization  |       |
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|   Lemmatization  |       |
|                  |       |
+--------+---------+       |
         |                 |
         v                 |
+--------+---------+       |
|                  |       |
|   Display Results |      |
|                  |       |
+------------------+       |
         |                 |
         v                 |
+------------------+       |
|                  |       |
|  User Receives    |      |
|  Processed Output |      |
|                  |       |
+------------------+       |

Components of the Logic Diagram

  1. User Input Text: The process begins with the user providing text input through the command line.

  2. Tokenization: The text is split into individual tokens or words for further processing.

  3. Normalization: Each token is standardized (e.g., lowercased, punctuation removed) to ensure consistency.

  4. Lemmatization: The normalized tokens are transformed into their base forms (lemmas).

  5. Display Results: The results of the tokenization, normalization, and lemmatization are formatted and prepared for display.

  6. User Receives Processed Output: Finally, the user sees the processed output in the terminal.


Setting Up KaTo

  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-name>
  2. Install Dependencies: Ensure you have Haskell and Cabal installed. Run:

    cabal update
    cabal install --only-dependencies
  3. Build the Project:

    cabal build

Using KaTo

  1. Run the Application:

    cabal run
  2. Enter Text for Processing: When prompted, type or paste the text you want to analyze and press Enter.

  3. View Results: After processing, KaTo will display the tokens, normalized tokens, and lemmatized tokens.

Example Usage

$ kato
Welcome to KaTo: A Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool!
Please enter the text you want to process:
> The children are running quickly.
Tokens: ["The", "children", "are", "running", "quickly."]
Normalized Tokens: ["the", "children", "are", "running", "quickly"]
Lemmatized Tokens: ["the", "child", "be", "run", "quick"]
Process completed successfully.

Contributing

We welcome contributions! Please refer to the Contribution Guidelines (replace with actual link) for details on how to get involved.


License

This project is licensed under the MIT License (replace with actual link).


Feel free to customize any sections, especially the links for contribution guidelines and licensing. If you need further adjustments or additional sections, let me know!

This project is licensed under the MIT License (replace with actual link).