# EnText

EnText is a Rust-based tool that uses ONNX Runtime to perform Named Entity Recognition (NER) on text documents. It is specifically trained to identify and extract:
- Company Names
- Domain Names
- URLs
- IP Addresses
- Email Addresses
## Features

- Fast and efficient entity extraction using ONNX Runtime
- Pre-trained NER model optimized for technical and business documents
- Command-line interface for easy integration into workflows
- Detailed entity output with confidence scores and position information
## Prerequisites

- Rust 1.56 or higher
- ONNX Runtime
- Pre-trained NER model file (`ner_roberta.onnx`)
- RoBERTa tokenizer files (`roberta-vocab.json` and `roberta-merges.txt`)
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/entext.git
   cd entext
   ```

2. Build the project:

   ```bash
   cargo build --release
   ```

3. Make sure you have the required model files in the project directory:
   - `ner_roberta.onnx` (the ONNX model)
   - `roberta-vocab.json` (tokenizer vocabulary)
   - `roberta-merges.txt` (tokenizer merges)
## Usage

Process a text file to extract entities:

```bash
./target/release/entext --input path/to/your/document.txt
```

If no input is specified, the tool looks for an `input.txt` file in the current directory by default.
## Output Format

The tool prints recognized entities to the console in the following format:

```
Recognized Entities:
- Type: COMPANY, Score: 0.982, Start: 45, End: 58, Text: "Acme Corp"
- Type: DOMAIN, Score: 0.876, Start: 102, End: 116, Text: "example.com"
- Type: URL, Score: 0.945, Start: 203, End: 231, Text: "https://www.example.com/page"
- Type: IP_ADDR, Score: 0.991, Start: 300, End: 311, Text: "192.168.1.1"
- Type: EMAIL, Score: 0.989, Start: 400, End: 421, Text: "[email protected]"
```
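For downstream use, each output line maps naturally onto a small record type. The sketch below is illustrative only; the struct and field names are assumptions, not EnText's actual internals:

```rust
use std::fmt;

/// One recognized entity, mirroring a line of EnText's console output.
/// (Hypothetical type for illustration; not part of EnText's public API.)
struct Entity {
    kind: String, // entity type, e.g. "COMPANY", "URL"
    score: f32,   // model confidence in [0, 1]
    start: usize, // offset of the match in the input text
    end: usize,
    text: String, // the matched span itself
}

impl fmt::Display for Entity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "- Type: {}, Score: {:.3}, Start: {}, End: {}, Text: \"{}\"",
            self.kind, self.score, self.start, self.end, self.text
        )
    }
}

fn main() {
    let e = Entity {
        kind: "IP_ADDR".into(),
        score: 0.991,
        start: 300,
        end: 311,
        text: "192.168.1.1".into(),
    };
    // Prints one line in the same format as the console output above.
    println!("{e}");
}
```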
## How It Works

1. The tool reads the input text file
2. The text is tokenized using a RoBERTa-compatible tokenizer
3. The tokenized inputs are run through the ONNX model
4. The model outputs are post-processed to identify and extract entities
5. Entities are displayed with their type, confidence score, and position in the original text
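The post-processing step typically means collapsing per-token predictions into character spans. A minimal sketch of that idea, assuming a BIO-style labeling scheme (the label names, offsets, and function shape here are illustrative assumptions, not EnText's exact implementation):

```rust
/// Collapse per-token BIO labels into entity spans over the original text.
/// `labels` holds one predicted label per token (e.g. "B-URL", "I-URL", "O");
/// `offsets` holds each token's (start, end) byte offsets into `text`.
fn extract_entities(
    labels: &[&str],
    offsets: &[(usize, usize)],
    text: &str,
) -> Vec<(String, usize, usize, String)> {
    let mut entities = Vec::new();
    let mut current: Option<(String, usize, usize)> = None;

    for (label, &(start, end)) in labels.iter().zip(offsets) {
        if let Some(kind) = label.strip_prefix("B-") {
            // A new entity begins: flush any open one first.
            if let Some((k, s, e)) = current.take() {
                entities.push((k, s, e, text[s..e].to_string()));
            }
            current = Some((kind.to_string(), start, end));
        } else if let Some(kind) = label.strip_prefix("I-") {
            let continues = matches!(&current, Some((k, _, _)) if k == kind);
            if continues {
                // Extend the open entity to cover this token.
                if let Some((_, _, e)) = current.as_mut() {
                    *e = end;
                }
            } else {
                // Type mismatch or stray "I-": flush and start fresh.
                if let Some((k, s, e)) = current.take() {
                    entities.push((k, s, e, text[s..e].to_string()));
                }
                current = Some((kind.to_string(), start, end));
            }
        } else if let Some((k, s, e)) = current.take() {
            // An "O" label closes the open entity.
            entities.push((k, s, e, text[s..e].to_string()));
        }
    }
    if let Some((k, s, e)) = current {
        entities.push((k, s, e, text[s..e].to_string()));
    }
    entities
}

fn main() {
    let text = "Visit example.com today";
    let labels = ["O", "B-DOMAIN", "I-DOMAIN", "O"];
    // Token offsets for "Visit", "example", ".com", "today".
    let offsets = [(0, 5), (6, 13), (13, 17), (18, 23)];
    for (kind, start, end, span) in extract_entities(&labels, &offsets, text) {
        println!("{kind} [{start}..{end}] = {span:?}");
    }
}
```

Adjacent "I-" tokens of the same type are merged into one span, so subword pieces produced by the RoBERTa tokenizer come back out as whole entities.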
## Model

The NER model was trained on a custom dataset to recognize technical entities commonly found in business and technical documents. It uses a RoBERTa-based architecture fine-tuned for token classification tasks.
## Dependencies

- `anyhow`: error handling
- `onnxruntime`: ONNX model inference
- `tokenizers`: text tokenization
- `ndarray`: numerical operations
- `clap`: command-line argument parsing
- `serde` and `serde_json`: JSON serialization/deserialization
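In `Cargo.toml`, that dependency list would look roughly like the fragment below. The version numbers are illustrative placeholders, not pinned requirements of this project; match them to your environment:

```toml
[dependencies]
anyhow = "1"                                      # error handling
onnxruntime = "0.0.14"                            # ONNX model inference (placeholder version)
tokenizers = "0.15"                               # text tokenization (placeholder version)
ndarray = "0.15"                                  # numerical operations (placeholder version)
clap = { version = "4", features = ["derive"] }   # command-line argument parsing
serde = { version = "1", features = ["derive"] }  # serialization framework
serde_json = "1"                                  # JSON support
```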
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.