# EnText

EnText is a Rust-based tool that uses ONNX Runtime to perform Named Entity Recognition (NER) on text documents. It is specifically trained to identify and extract:
- Company Names
- Domain Names
- URLs
- IP Addresses
- Email Addresses
## Features

- Fast and efficient entity extraction using ONNX Runtime
- Pre-trained NER model optimized for technical and business documents
- Command-line interface for easy integration into workflows
- Detailed entity output with confidence scores and position information
## Prerequisites

- Rust 1.56 or higher
- ONNX Runtime
- Pre-trained NER model file (`ner_roberta.onnx`)
- RoBERTa tokenizer files (`roberta-vocab.json` and `roberta-merges.txt`)
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/entext.git
   cd entext
   ```

2. Build the project:

   ```bash
   cargo build --release
   ```

3. Make sure you have the required model files in the project directory:
   - `ner_roberta.onnx` (the ONNX model)
   - `roberta-vocab.json` (tokenizer vocabulary)
   - `roberta-merges.txt` (tokenizer merges)
## Usage

Process a text file to extract entities:

```bash
./target/release/entext --input path/to/your/document.txt
```

If no input is specified, the tool looks for an `input.txt` file in the current directory by default.
## Output Format

The tool prints recognized entities to the console in the following format:

```
Recognized Entities:
- Type: COMPANY, Score: 0.982, Start: 45, End: 58, Text: "Acme Corp"
- Type: DOMAIN, Score: 0.876, Start: 102, End: 116, Text: "example.com"
- Type: URL, Score: 0.945, Start: 203, End: 231, Text: "https://www.example.com/page"
- Type: IP_ADDR, Score: 0.991, Start: 300, End: 311, Text: "192.168.1.1"
- Type: EMAIL, Score: 0.989, Start: 400, End: 421, Text: "[email protected]"
```
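For downstream use, each output line maps naturally onto a small record type. The sketch below is illustrative only; the struct and field names are assumptions, not EnText's actual internals:

```rust
use std::fmt;

/// One recognized entity, mirroring a line of EnText's console output.
/// (Hypothetical type for illustration; not part of EnText's public API.)
struct Entity {
    kind: String, // entity type, e.g. "COMPANY", "URL"
    score: f32,   // model confidence in [0, 1]
    start: usize, // offset of the match in the input text
    end: usize,
    text: String, // the matched span itself
}

impl fmt::Display for Entity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "- Type: {}, Score: {:.3}, Start: {}, End: {}, Text: \"{}\"",
            self.kind, self.score, self.start, self.end, self.text
        )
    }
}

fn main() {
    let e = Entity {
        kind: "IP_ADDR".into(),
        score: 0.991,
        start: 300,
        end: 311,
        text: "192.168.1.1".into(),
    };
    // Prints one line in the same format as the console output above.
    println!("{e}");
}
```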
## How It Works

1. The tool reads the input text file
2. The text is tokenized using a RoBERTa-compatible tokenizer
3. The tokenized inputs are run through the ONNX model
4. The model outputs are post-processed to identify and extract entities
5. Entities are displayed with their type, confidence score, and position in the original text
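The post-processing step typically means collapsing per-token predictions into character spans. A minimal sketch of that idea, assuming a BIO-style labeling scheme (the label names, offsets, and function shape here are illustrative assumptions, not EnText's exact implementation):

```rust
/// Collapse per-token BIO labels into entity spans over the original text.
/// `labels` holds one predicted label per token (e.g. "B-URL", "I-URL", "O");
/// `offsets` holds each token's (start, end) byte offsets into `text`.
fn extract_entities(
    labels: &[&str],
    offsets: &[(usize, usize)],
    text: &str,
) -> Vec<(String, usize, usize, String)> {
    let mut entities = Vec::new();
    let mut current: Option<(String, usize, usize)> = None;

    for (label, &(start, end)) in labels.iter().zip(offsets) {
        if let Some(kind) = label.strip_prefix("B-") {
            // A new entity begins: flush any open one first.
            if let Some((k, s, e)) = current.take() {
                entities.push((k, s, e, text[s..e].to_string()));
            }
            current = Some((kind.to_string(), start, end));
        } else if let Some(kind) = label.strip_prefix("I-") {
            let continues = matches!(&current, Some((k, _, _)) if k == kind);
            if continues {
                // Extend the open entity to cover this token.
                if let Some((_, _, e)) = current.as_mut() {
                    *e = end;
                }
            } else {
                // Type mismatch or stray "I-": flush and start fresh.
                if let Some((k, s, e)) = current.take() {
                    entities.push((k, s, e, text[s..e].to_string()));
                }
                current = Some((kind.to_string(), start, end));
            }
        } else if let Some((k, s, e)) = current.take() {
            // An "O" label closes the open entity.
            entities.push((k, s, e, text[s..e].to_string()));
        }
    }
    if let Some((k, s, e)) = current {
        entities.push((k, s, e, text[s..e].to_string()));
    }
    entities
}

fn main() {
    let text = "Visit example.com today";
    let labels = ["O", "B-DOMAIN", "I-DOMAIN", "O"];
    // Token offsets for "Visit", "example", ".com", "today".
    let offsets = [(0, 5), (6, 13), (13, 17), (18, 23)];
    for (kind, start, end, span) in extract_entities(&labels, &offsets, text) {
        println!("{kind} [{start}..{end}] = {span:?}");
    }
}
```

Adjacent "I-" tokens of the same type are merged into one span, so subword pieces produced by the RoBERTa tokenizer come back out as whole entities.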
## Model

The NER model was trained on a custom dataset to recognize technical entities commonly found in business and technical documents. It uses a RoBERTa-based architecture fine-tuned for token classification tasks.
## Dependencies

- `anyhow`: error handling
- `onnxruntime`: ONNX model inference
- `tokenizers`: text tokenization
- `ndarray`: numerical operations
- `clap`: command-line argument parsing
- `serde` and `serde_json`: JSON serialization/deserialization
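In `Cargo.toml`, that dependency list would look roughly like the fragment below. The version numbers are illustrative placeholders, not pinned requirements of this project; match them to your environment:

```toml
[dependencies]
anyhow = "1"                                      # error handling
onnxruntime = "0.0.14"                            # ONNX model inference (placeholder version)
tokenizers = "0.15"                               # text tokenization (placeholder version)
ndarray = "0.15"                                  # numerical operations (placeholder version)
clap = { version = "4", features = ["derive"] }   # command-line argument parsing
serde = { version = "1", features = ["derive"] }  # serialization framework
serde_json = "1"                                  # JSON support
```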
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.