EnText is a Rust-based tool that uses ONNX Runtime to perform Named Entity Recognition (NER) on text documents

keathmilligan/entext

EnText: Entity Extraction Tool

EnText is a Rust-based tool that uses ONNX Runtime to perform Named Entity Recognition (NER) on text documents. Its model is trained to identify and extract:

  • Company Names
  • Domain Names
  • URLs
  • IP Addresses
  • Email Addresses

Features

  • Fast and efficient entity extraction using ONNX Runtime
  • Pre-trained NER model optimized for technical and business document processing
  • Command-line interface for easy integration into workflows
  • Detailed entity output with confidence scores and position information

Requirements

  • Rust 1.56 or higher
  • ONNX Runtime
  • Pre-trained NER model file (ner_roberta.onnx)
  • RoBERTa tokenizer files (roberta-vocab.json and roberta-merges.txt)

Installation

  1. Clone this repository:

    git clone https://github.com/keathmilligan/entext.git
    cd entext
    
  2. Build the project:

    cargo build --release
    
  3. Make sure you have the required model files in the project directory:

    • ner_roberta.onnx (the ONNX model)
    • roberta-vocab.json (tokenizer vocabulary)
    • roberta-merges.txt (tokenizer merges)
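A quick pre-flight check for these files can save a confusing runtime error. The sketch below is a minimal illustration using only the standard library; `missing_model_files` is a hypothetical helper and is not part of EnText.

```rust
use std::path::{Path, PathBuf};

/// File names listed in the requirements above.
const REQUIRED_FILES: [&str; 3] = [
    "ner_roberta.onnx",
    "roberta-vocab.json",
    "roberta-merges.txt",
];

/// Returns the required files that are missing from `dir`.
/// Hypothetical helper for illustration; not EnText's own code.
fn missing_model_files(dir: &Path) -> Vec<PathBuf> {
    REQUIRED_FILES
        .iter()
        .map(|name| dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}
```

Calling `missing_model_files(Path::new("."))` from the project directory returns an empty vector when all three files are in place.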

Usage

Process a text file to extract entities:

./target/release/entext --input path/to/your/document.txt

If --input is omitted, the tool reads input.txt from the current directory.
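The defaulting behaviour can be illustrated with a small sketch. EnText lists clap for argument parsing; the version below deliberately uses only the standard library, and `resolve_input` is a hypothetical name rather than a function from the project.

```rust
use std::path::PathBuf;

/// Resolves the input path from raw arguments (program name excluded).
/// Falls back to `input.txt` when `--input` is absent or has no value.
/// Hypothetical stand-in for the clap-based parsing the tool actually uses.
fn resolve_input<I>(args: I) -> PathBuf
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();
    while let Some(arg) = args.next() {
        if arg == "--input" {
            if let Some(value) = args.next() {
                return PathBuf::from(value);
            }
        }
    }
    PathBuf::from("input.txt")
}
```

In a binary this would typically be called as `resolve_input(std::env::args().skip(1))`.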

Output Format

The tool outputs recognized entities to the console in the following format:

Recognized Entities:
  - Type: COMPANY,  Score: 0.982, Start: 45, End: 54, Text: "Acme Corp"
  - Type: DOMAIN,   Score: 0.876, Start: 102, End: 113, Text: "example.com"
  - Type: URL,      Score: 0.945, Start: 203, End: 231, Text: "https://www.example.com/page"
  - Type: IP_ADDR,  Score: 0.991, Start: 300, End: 311, Text: "192.168.1.1"
  - Type: EMAIL,    Score: 0.989, Start: 400, End: 421, Text: "[email protected]"
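For readers who want to reproduce this layout in their own code, here is a minimal sketch of a record type whose `Display` implementation prints one line in the format shown. The `Entity` struct and its field names are assumptions made for illustration; EnText's internal types may differ.

```rust
use std::fmt;

/// One recognized entity. Field names are illustrative only.
struct Entity {
    label: String,
    score: f32,
    start: usize,
    end: usize,
    text: String,
}

impl fmt::Display for Entity {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Pad "LABEL," to 10 columns so the Score fields line up.
        let label = format!("{},", self.label);
        write!(
            f,
            "  - Type: {:<10}Score: {:.3}, Start: {}, End: {}, Text: \"{}\"",
            label, self.score, self.start, self.end, self.text
        )
    }
}
```

Printing a list of entities is then a matter of writing the `Recognized Entities:` header followed by `println!("{}", entity)` for each one.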

How It Works

  1. The tool reads the input text file
  2. Text is tokenized using a RoBERTa-compatible tokenizer
  3. Tokenized inputs are processed by the ONNX model
  4. The model outputs are post-processed to identify and extract entities
  5. Entities are displayed with their type, confidence score, and position in the original text
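Step 4 is where per-token predictions become entity spans. The sketch below shows one common way to do this, assuming the model emits BIO-style tags such as `B-COMPANY` and `I-COMPANY`; the README does not state the tagging scheme, so that assumption, the `TaggedToken` and `Span` types, and the choice to average token scores are all hypothetical.

```rust
/// A classified token: predicted tag (e.g. "B-URL", "I-URL", "O"),
/// confidence, and offsets into the original text.
struct TaggedToken<'a> {
    tag: &'a str,
    score: f32,
    start: usize,
    end: usize,
}

/// A merged entity span; `score` is the mean of its token scores.
#[derive(Debug, PartialEq)]
struct Span {
    label: String,
    score: f32,
    start: usize,
    end: usize,
}

/// Merges consecutive B-/I- tokens with the same label into spans.
/// An I- token with no matching open span starts a new span.
fn merge_bio(tokens: &[TaggedToken<'_>]) -> Vec<Span> {
    let mut spans = Vec::new();
    // (label, score sum, token count, start, end) of the span being built.
    let mut current: Option<(String, f32, usize, usize, usize)> = None;

    for tok in tokens {
        let (prefix, label) = tok.tag.split_once('-').unwrap_or(("O", ""));
        let continues =
            prefix == "I" && matches!(&current, Some((l, ..)) if l == label);
        if continues {
            if let Some((_, sum, n, _, end)) = current.as_mut() {
                *sum += tok.score;
                *n += 1;
                *end = tok.end;
            }
            continue;
        }
        // Close the open span, if any, before starting a new one.
        if let Some((label, sum, n, start, end)) = current.take() {
            spans.push(Span { label, score: sum / n as f32, start, end });
        }
        if prefix == "B" || prefix == "I" {
            current = Some((label.to_string(), tok.score, 1, tok.start, tok.end));
        }
    }
    if let Some((label, sum, n, start, end)) = current.take() {
        spans.push(Span { label, score: sum / n as f32, start, end });
    }
    spans
}
```

The token offsets would come from the tokenizer's offset mapping, which is what lets the final spans point back into the original text.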

Model Information

The NER model was trained on a custom dataset to recognize technical entities commonly found in business and technical documents. It uses a RoBERTa-based architecture fine-tuned for token classification tasks.
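In token classification, the model produces one vector of raw scores (logits) per token, with one entry per label. A confidence score like the ones in the output can be obtained by applying softmax and taking the largest probability. The sketch below illustrates that general technique with the standard library only; it is not taken from EnText's source, and `softmax_argmax` is a hypothetical name.

```rust
use std::cmp::Ordering;

/// Converts one token's logits into (best label index, confidence).
/// Returns `None` for an empty slice.
fn softmax_argmax(logits: &[f32]) -> Option<(usize, f32)> {
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
        .map(|(i, &e)| (i, e / sum))
}
```

The returned index would then be mapped to a label name through the model's label list, which the README does not document.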

Dependencies

  • anyhow: Error handling
  • onnxruntime: ONNX model inference
  • tokenizers: Text tokenization
  • ndarray: Numerical operations
  • clap: Command-line argument parsing
  • serde and serde_json: JSON serialization/deserialization

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
