KaTo is a command-line interface (CLI) tool for tokenization, normalization, and lemmatization in natural language processing (NLP). It is written entirely in Haskell using functional programming principles, and it processes text that the user enters at the command line.
Below is a simplified logic diagram illustrating the flow of data in KaTo:
    +------------------+
    |                  |
    | User Input Text  |
    |                  |
    +--------+---------+
             |
             v
    +--------+---------+
    |                  |
    |   Tokenization   | <----+
    |                  |      |
    +--------+---------+      |
             |                |
             v                |
    +--------+---------+      |
    |                  |      |
    |  Normalization   |      |
    |                  |      |
    +--------+---------+      |
             |                |
             v                |
    +--------+---------+      |
    |                  |      |
    |  Lemmatization   |      |
    |                  |      |
    +--------+---------+      |
             |                |
             v                |
    +--------+---------+      |
    |                  |      |
    | Display Results  |      |
    |                  |      |
    +------------------+      |
             |                |
             v                |
    +------------------+      |
    |                  |      |
    |  User Receives   |      |
    | Processed Output |      |
    |                  |      |
    +------------------+      |
- User Input Text: The process begins when the user provides text input through the command line.
- Tokenization: The text is split into individual tokens (words) for further processing.
- Normalization: Each token is standardized (e.g., lowercased, punctuation removed) to ensure consistency.
- Lemmatization: The normalized tokens are transformed into their base forms (lemmas).
- Display Results: The results of tokenization, normalization, and lemmatization are formatted for display.
- User Receives Processed Output: Finally, the user sees the processed output in the terminal.
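The three processing stages described above can be sketched in a few lines of Haskell. This is only an illustration under stated assumptions, not KaTo's actual source: the function names (`tokenize`, `normalize`, `lemmatize`, `process`) and the tiny `lemmaTable` are hypothetical. The sketch assumes tokens are separated by whitespace and that lemmas come from a simple lookup table. A real lemmatizer would need a much larger dictionary or morphological rules.

```haskell
import Data.Char (isPunctuation, toLower)
import Data.Maybe (fromMaybe)

-- | Split raw input into whitespace-separated tokens.
tokenize :: String -> [String]
tokenize = words

-- | Lowercase a single token and strip its punctuation characters.
normalizeToken :: String -> String
normalizeToken = map toLower . filter (not . isPunctuation)

-- | Normalize every token, dropping any that become empty
-- (for example, a token that consisted only of punctuation).
normalize :: [String] -> [String]
normalize = filter (not . null) . map normalizeToken

-- | Hypothetical lookup table used only for this illustration.
lemmaTable :: [(String, String)]
lemmaTable =
  [ ("children", "child")
  , ("are", "be")
  , ("running", "run")
  , ("quickly", "quick")
  ]

-- | Replace each token with its lemma, keeping unknown tokens unchanged.
lemmatize :: [String] -> [String]
lemmatize = map (\t -> fromMaybe t (lookup t lemmaTable))

-- | Run the full pipeline and return the output of every stage:
-- (tokens, normalized tokens, lemmatized tokens).
process :: String -> ([String], [String], [String])
process input = (tokens, normalized, lemmas)
  where
    tokens     = tokenize input
    normalized = normalize tokens
    lemmas     = lemmatize normalized
```

Keeping each stage a pure function makes the stages easy to test in isolation and to compose, which fits the functional style the project describes.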
- Clone the Repository:
    git clone <repository-url>
    cd <repository-name>
- Install Dependencies: Ensure you have Haskell and Cabal installed. Run:
    cabal update
    cabal install --only-dependencies
- Build the Project:
    cabal build
- Run the Application:
    cabal run
- Enter Text for Processing: When prompted, type or paste the text you want to analyze and press Enter.
- View Results: After processing, KaTo will display the tokens, normalized tokens, and lemmatized tokens.
$ kato
Welcome to KaTo: A Tokenization, Normalization, and Lemmatization CLI-Based NLP Tool!
Please enter the text you want to process:
> The children are running quickly.
Tokens: ["The", "children", "are", "running", "quickly."]
Normalized Tokens: ["the", "children", "are", "running", "quickly"]
Lemmatized Tokens: ["the", "child", "be", "run", "quick"]
Process completed successfully.
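The result lines in the session above separate list items with a comma and a space, whereas Haskell's default `show` for a list of strings prints no space after the commas. The sketch below shows one way such output could be produced. The names `renderList` and `formatReport` are hypothetical and are not taken from KaTo's source.

```haskell
import Data.List (intercalate)

-- | Render a list of strings as ["a", "b"], quoting each item with
-- `show` and joining the items with a comma followed by a space.
renderList :: [String] -> String
renderList xs = "[" ++ intercalate ", " (map show xs) ++ "]"

-- | Build the three labelled result lines, one per pipeline stage.
formatReport :: [String] -> [String] -> [String] -> String
formatReport tokens normalized lemmas = unlines
  [ "Tokens: " ++ renderList tokens
  , "Normalized Tokens: " ++ renderList normalized
  , "Lemmatized Tokens: " ++ renderList lemmas
  ]
```

Passing the resulting string to `putStr` would print the three lines as they appear in the example session.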
We welcome contributions! Please refer to the Contribution Guidelines (replace with actual link) for details on how to get involved.
This project is licensed under the MIT License (replace with actual link).