author31/invoice-agent

Invoice Agent

A microservice and workflow for extracting invoice data via OCR/LLM, matching products against an internal catalog, and managing orders—including handling uncertain items for manual review.


Setup Instructions

invoice-agent standalone service setup

# Install the UVicorn server
make install-uv

# Install Python dependencies
make dep

# Run the service and reinitialize the index each time
make run

# Run in development mode (does not reinitialize index)
make dev

# Manually initialize or reinitialize the index
make init

Launching n8n

docker-compose up -d
Services

📚 Indexing Explanation

To perform product matching against our internal catalog, index each product name (and its aliases) into embedding vectors with the following metadata:

{
  "original_id": "<original_id_val>",
  "original_display": "<display_text_val>",
  "indexed_keyword": "<keyword_val>"
}

Because the raw product_list.xlsx contains multiple aliases per ID (e.g. "生花生\\花生仁"), first preprocess it into this JSON-ready structure:

[
  {
    "id": "S021490",
    "display_text": "炸薯(地瓜)片",
    "keywords": ["炸薯(地瓜)片"]
  },
  {
    "id": "S023200",
    "display_text": "熟花生",
    "keywords": ["熟花生"]
  },
  {
    "id": "S023220",
    "display_text": "生花生\\花生仁",
    "keywords": ["生花生", "花生仁"]
  }
]

Indexing each alias as its own entry gives every alias its own embedding, so an alias query (e.g. “花生仁”) can retrieve the correct product ID.
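The preprocessing step can be sketched as follows. This is a minimal illustration, not the repository's code: the function names are hypothetical, the rows are assumed to have already been read from product_list.xlsx as (id, display_text) pairs, and aliases are assumed to be separated by a backslash as in the example above.

```python
import json

def split_aliases(display_text: str, sep: str = "\\") -> list[str]:
    """Split a display string into its non-empty aliases (separator is an assumption)."""
    return [part.strip() for part in display_text.split(sep) if part.strip()]

def to_catalog_entries(rows):
    """Turn (id, display_text) pairs into the JSON-ready structure."""
    return [
        {"id": pid, "display_text": text, "keywords": split_aliases(text)}
        for pid, text in rows
    ]

def to_index_records(entries):
    """Flatten entries into one metadata record per alias, ready for embedding."""
    return [
        {
            "original_id": entry["id"],
            "original_display": entry["display_text"],
            "indexed_keyword": keyword,
        }
        for entry in entries
        for keyword in entry["keywords"]
    ]

rows = [("S023200", "熟花生"), ("S023220", "生花生\\花生仁")]
entries = to_catalog_entries(rows)
records = to_index_records(entries)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Each record's indexed_keyword is the text that would be embedded, while original_id and original_display travel along as metadata.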


💡 Idea & Implementation Decisions

RAG libraries — thought experiments

Goal: Keep the RAG stack simple and iteration-friendly for fast indexing & retrieval.

  • RAGatouille: not chosen. ColBERT’s raw score range makes uncertainty thresholds tricky.

  • txtai: chosen for its minimal API surface and straightforward indexing and search.

Quick test method: Use eval_1.png as a baseline for extraction → matching pipeline validation.

ColBERT (eliminated)

  • Retrieval task: fuzzy matching of extracted product names.

  • Theory: Not every top-K result is a true match.

  • Strategy: compute sim_gap = top1_score − top2_score; a large sim_gap indicates a confident match.

  • However:

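    • ColBERT sums, over the query tokens Qᵢ, the best token similarity sim(i, j) = Qᵢ · Dⱼ against the document tokens Dⱼ. With normalized token embeddings each term lies in [−1, 1], so the total score ranges over −|Q| … |Q|, where |Q| is the number of query tokens.

    • Because the range scales with query length, a static threshold is hard to set.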

    • Solutions considered:

      1. Normalize by query length
      2. Use relative gap = (top1 − top2) / top1, which falls in [0, 1] when both scores are non-negative
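The score-range issue can be illustrated with a small pure-Python sketch of late-interaction (MaxSim) scoring. It does not use RAGatouille or any ColBERT model; the toy 2-dimensional token vectors and the function names are made up for illustration only.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qi, dj)) for dj in d)
        for qi in q
    )

def length_normalized_score(query_tokens, doc_tokens):
    """Option 1 from the list above: divide by the number of query tokens."""
    return maxsim_score(query_tokens, doc_tokens) / len(query_tokens)

# Toy data: every query token has an identical token in the document.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
short_q = [[1.0, 0.0]]
long_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(maxsim_score(short_q, doc), maxsim_score(long_q, doc))
print(length_normalized_score(short_q, doc), length_normalized_score(long_q, doc))
```

Both queries match the document perfectly, yet the raw score grows with the number of query tokens, while the length-normalized score stays on a comparable scale.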

🤖 Comparing OCR vs. LLM Approaches

EasyOCR (out‐of‐the‐box)

  • Flow: image → EasyOCR → raw text → LLM parse → structured data
  • Cost: Free, runs locally

extracted_texts = [
  (65.0,  "幅塔6兩",  0.0897),
  (81.5,  "#7",      0.1738),
  (174.5, "?23付",   0.0079),
  (189.5, "契枇并",   0.00007),
  (258.0, "酯.把",   0.0110),
  (289.0, "3絲",     0.00027),
  (334.5, "嵯之?",   0.00123),
  (348.5, "(-|!32&,,)5", 0.00084)
]

LLM Version

extracted_texts = [
  {'name':'九層塔','price':'6雨','quantity':'6','unit':'颗'},
  {'name':'熟花生','price':'3斤','quantity':'1','unit':'斤'},
  {'name':'腰果','price':'3件','quantity':'1','unit':'件'},
  {'name':'海帶絲','price':'3斤','quantity':'1','unit':'斤'},
  {'name':'醋','price':'1','quantity':'1','unit':'锅'},
  {'name':'韭黃','price':'1','quantity':'1','unit':'包'},
  {'name':'不明食材','price':'1','quantity':'1','unit':'包'}
]

Decision: In this phase, the OCR module is a swappable component. The service uses cloud-hosted LLMs for now, with room to pivot later.
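One way to keep the extraction step swappable is to code against a small interface. The sketch below is an assumption about how that could look, not the repository's actual design: InvoiceExtractor, the two stub classes, and extract_items are hypothetical names, and the stubs return canned data instead of calling EasyOCR or a cloud LLM.

```python
from typing import Protocol

class InvoiceExtractor(Protocol):
    """Anything that turns raw file bytes into a list of item dicts."""
    def extract(self, file_bytes: bytes) -> list[dict]: ...

class StubLLMExtractor:
    """Stand-in for a cloud LLM backend; returns a canned item."""
    def extract(self, file_bytes: bytes) -> list[dict]:
        return [{"name": "熟花生", "quantity": "1", "unit": "斤"}]

class StubOCRExtractor:
    """Stand-in for a local OCR backend; returns nothing in this sketch."""
    def extract(self, file_bytes: bytes) -> list[dict]:
        return []

def extract_items(extractor: InvoiceExtractor, file_bytes: bytes) -> list[dict]:
    """The caller depends only on the interface, so backends can be swapped."""
    return extractor.extract(file_bytes)

llm_items = extract_items(StubLLMExtractor(), b"fake image bytes")
ocr_items = extract_items(StubOCRExtractor(), b"fake image bytes")
print(llm_items, ocr_items)
```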


🔍 Fuzzy Matching Logic

  • score — raw similarity score from embedding search
  • relative_sim_gap = (highest_score − second_highest_score) / highest_score

relative_sim_gap serves as the uncertainty metric: a small gap means the top two candidates are nearly tied, so the item is flagged for manual review.
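A minimal sketch of the gap check follows. The threshold value, the function names, and the handling of edge cases (a single candidate, a non-positive top score) are assumptions made for illustration; the source does not specify them.

```python
GAP_THRESHOLD = 0.1  # hypothetical cut-off, to be tuned on real data

def relative_sim_gap(scores):
    """(highest - second_highest) / highest for a list of similarity scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return 1.0  # assumption: a lone candidate has no rival
    top1, top2 = ranked[0], ranked[1]
    if top1 <= 0:
        return 0.0  # assumption: treat a non-positive top score as uncertain
    return (top1 - top2) / top1

def needs_review(scores, threshold=GAP_THRESHOLD):
    """Flag the item for manual review when the top two scores are close."""
    return relative_sim_gap(scores) < threshold

print(needs_review([0.80, 0.78, 0.40]))  # top two nearly tied
print(needs_review([0.90, 0.30, 0.20]))  # clear winner
```

The result stays within [0, 1] only when the top two scores are non-negative; a negative runner-up would push the gap above 1.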


📐 Project Design Pattern

📁 invoice_agent/
├── tools/      # External integrations (Excel, OpenRouter, OCR, DB)
├── services/   # Core business logic (init, extract, match, order)
└── api/        # FastAPI routes & CLI entrypoint

  1. tools: low-level I/O, embedding index, DB schema
  2. services: orchestrates indexing, extraction, matching, order creation
  3. api: HTTP endpoints (FastAPI) & CLI (typer)

test_ocr_llm Evaluation

This section outlines how to evaluate the invoice agent pipelines for each candidate solution. The test_ocr_llm.py pytest script:

  1. Sets up a temporary environment, dummy product list, and initializes the service.

  2. Runs services.extract_texts_from_input(...) against sample files (eval_1.png, eval_2.png, eval_3.pdf).

  3. Compares extracted+matched results to ground truth (tests/gt.json), computing:

    • Total ground-truth items
    • Matched count
    • Correctly matched count & accuracy
    • Uncertain item count
  4. Asserts overall accuracy > 0.0 to catch breaking changes.

  5. Outputs a timestamped CSV in tests/evaluation_reports/ for deeper analysis.
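The metrics in step 3 and the report in step 5 can be sketched as below. This is not the code in test_ocr_llm.py: the record fields (input, matched_id, uncertain), the definition of accuracy as correct matches divided by ground-truth items, and the report filename pattern are all assumptions for illustration.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

def evaluate(predictions, ground_truth):
    """Compare predicted product IDs against a {input_text: product_id} mapping."""
    total = len(ground_truth)
    matched = sum(1 for p in predictions if p["matched_id"] is not None)
    correct = sum(
        1 for p in predictions
        if p["matched_id"] is not None
        and ground_truth.get(p["input"]) == p["matched_id"]
    )
    uncertain = sum(1 for p in predictions if p["uncertain"])
    return {
        "total_gt_items": total,
        "matched": matched,
        "correct": correct,
        "accuracy": correct / total if total else 0.0,  # assumed definition
        "uncertain": uncertain,
    }

def write_report(metrics, out_dir):
    """Write the metrics as a one-row CSV with a timestamped filename."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"eval_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    return path

ground_truth = {"熟花生": "S023200", "花生仁": "S023220"}
predictions = [
    {"input": "熟花生", "matched_id": "S023200", "uncertain": False},
    {"input": "花生仁", "matched_id": "S021490", "uncertain": True},
]
metrics = evaluate(predictions, ground_truth)
report_path = write_report(metrics, tempfile.mkdtemp())
assert metrics["accuracy"] > 0.0  # mirrors the smoke assertion in step 4
print(metrics, report_path)
```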

Use this test harness to benchmark and compare future OCR/LLM or pure-OCR approaches before merging into main.


For full code examples and tests, see the ./tests folder and individual modules under ./src/invoice_agent/.

n8n Workflow Explanation

This section describes the InvoiceAgent n8n workflow (n8n_workflow/InvoiceAgent.json), outlining the end-to-end process from form submission to Slack notifications:

  1. On form submission (formTrigger)

    • Presents a form with Name, File (image/PDF), and Date fields.
    • Triggers the workflow when a user submits.
  2. Check OCR Readability (HTTP Request)

    • POSTs the uploaded file to /check-ocr-readability.
    • Branches via If1: only proceeds if the image is deemed readable.
  3. Extract Order (HTTP Request)

    • POSTs customer_name, order_date, and the invoice file to /extract-order, kicking off the extraction and matching process.
  4. Get Uncertain Items (HTTP Request)

    • Queries /uncertain-items to retrieve any items that the service flagged as uncertain.
  5. Decision (If node)

    • Routes based on the count of uncertain items:

      • > 0 → handle uncertain items.
  6. Read/Write Files from Disk

    • Fetches the saved invoice file (in the service’s .artifacts/uncertain_invoices directory).
  7. Slack Upload Image (Slack file upload)

    • Uploads the uncertain invoice file to Slack and retrieves a permalink.
  8. Slack Send Message (Slack message)

    • Posts to #all-invoice-agent with:

      • New Uncertain Invoices header.
      • From/Date metadata.
      • List of uncertain item details (ID, input, quantity, unit).
      • Download link to the uploaded invoice image.
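For readers who want to prototype steps 5 to 8 outside n8n, the routing and message assembly can be sketched in plain Python. The item field names (id, input, quantity, unit), the message layout, and the example URL are assumptions; the actual workflow builds the message inside the Slack node.

```python
def build_slack_message(customer_name, order_date, uncertain_items, permalink):
    """Return the notification text, or None when there is nothing to review."""
    if len(uncertain_items) == 0:
        return None  # mirrors the If node: only a count > 0 is handled
    lines = [
        "New Uncertain Invoices",
        f"From: {customer_name}",
        f"Date: {order_date}",
        "",
    ]
    for item in uncertain_items:
        lines.append(
            f"- ID {item['id']}: {item['input']} x {item['quantity']} {item['unit']}"
        )
    lines.append("")
    lines.append(f"Download: {permalink}")
    return "\n".join(lines)

items = [{"id": 1, "input": "不明食材", "quantity": "1", "unit": "包"}]
message = build_slack_message(
    "Example Customer", "2024-01-01", items, "https://example.com/invoice.png"
)
print(message)
```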
