# Supermat

A novel data representation framework for the AI era, offering structured annotations, granular traceability, and enhanced evaluation metrics to tackle hallucinations and compliance challenges.
- Overview
- Features
- Installation
- Quick Start
- Hugging Face Spaces Demo
- Code Overview
- Evaluation & Metrics
- Conclusion
- Contributing
## Overview

Supermat introduces a structured approach to data processing and retrieval for Large Language Models (LLMs). It preserves annotations even after an LLM is trained, enabling clear traceability from any LLM output back to the original source text. This is critical for:
- Hallucination Prevention: Identify and mitigate fabricated answers
- Compliance & Auditing: Ensure regulatory standards are met by tracing outputs
- Legal & Security: Quickly verify authenticity and control sensitive content
By leveraging Structure IDs (e.g., `2.1.4.8` for document/section/paragraph/sentence), Supermat maintains a transparent map between raw data and tokenized text, thereby reducing hallucinations and offering granular document-level context.
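To illustrate the numbering scheme (this helper is only a sketch and is not part of the Supermat API), a Structure ID can be decomposed into its hierarchy levels:

```python
# Illustrative sketch only: decomposing a dotted Structure ID of the form
# document.section.paragraph.sentence. Not part of the Supermat API.
from typing import NamedTuple


class StructureID(NamedTuple):
    document: int
    section: int
    paragraph: int
    sentence: int


def parse_structure_id(raw: str) -> StructureID:
    """Split a Structure ID such as "2.1.4.8" into its hierarchy levels."""
    document, section, paragraph, sentence = (int(part) for part in raw.split("."))
    return StructureID(document, section, paragraph, sentence)


print(parse_structure_id("2.1.4.8"))
# StructureID(document=2, section=1, paragraph=4, sentence=8)
```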
## Features

- **Persistent Annotations**: Supermat encodes unique identifiers at the sentence or paragraph level, so the lineage of any output text is never lost, even when building or fine-tuning LLMs.
- **Structure-Aware Data**: Parsed documents maintain hierarchical relationships between sections, paragraphs, and sentences, allowing for more informed chunking and retrieval strategies.
- **Traceability & Compliance**: Instantly link LLM outputs to their original references. Ideal for auditing, legal e-discovery, and policy enforcement.
- **Drop-In Retriever**: The `SupermatRetriever` class integrates seamlessly with LangChain's `VectorStore`, enabling structured queries with minimal refactoring.
- **Enhanced Evaluation Pipeline**: Built-in metrics (Faithfulness, Accuracy, ROUGE, Cosine Similarity, etc.) let you rigorously test and iterate on your retrieval-augmented generation (RAG) workflows.
## Installation

Supermat uses Poetry for dependency management:

```bash
# 1. Clone the repository
git clone https://github.com/supermatai/supermat.git
cd supermat

# 2. Install Poetry (if not already installed)
#    Follow the official Poetry docs for your environment

# 3. Install dependencies
poetry install --with=frontend --all-extras

# 4. Activate your virtual environment
poetry shell
```
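As an optional sanity check (assuming the environment created above), confirm the package imports cleanly:

```bash
# Optional sanity check: the import should succeed without errors
poetry run python -c "import supermat"
```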
For additional instructions or troubleshooting, check our Documentation.
## Quick Start

Parse a document with `FileProcessor`:

```python
from pathlib import Path

from supermat import FileProcessor

pdf_path = Path("sample_document.pdf")
parsed_document = FileProcessor.parse_file(pdf_path)
```
Then build a vector store and wrap it with `SupermatRetriever`:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

from supermat.langchain.bindings import SupermatRetriever

# Suppose you have multiple parsed documents
documents = [parsed_document]  # or a list of them

embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
vector_store = Chroma(
    embedding_function=embedding_model,
    collection_name="PDFS_SUPERMAT_DEMO",
    persist_directory="./chromadb",
)

retriever = SupermatRetriever(parsed_docs=documents, vector_store=vector_store)
```
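Once constructed, the retriever can be queried like any other LangChain retriever. The snippet below is a minimal sketch that assumes `SupermatRetriever` exposes the standard `BaseRetriever` interface; the query string is illustrative:

```python
# Minimal usage sketch, assuming SupermatRetriever follows LangChain's
# standard BaseRetriever interface. The query is illustrative.
results = retriever.invoke("What does the sample document say about compliance?")

for doc in results:
    # Each result is a LangChain Document; inspect its metadata to trace
    # the chunk back to the original source text.
    print(doc.page_content[:200])
    print(doc.metadata)
```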
You can also launch the local Gradio demo:

```bash
python -m supermat.gradio
```

Open the provided local URL to see a live demo of how Supermat processes and retrieves text.
For a notebook walkthrough, run:

```bash
cd notebooks
poetry run jupyter notebook pdf_demo.ipynb
```

This end-to-end walkthrough demonstrates:

- Parsing and annotating PDF content
- Structuring data into the `ParsedDocument` model
- Using the retriever for queries and tracing outputs
## Hugging Face Spaces Demo

Try Supermat directly in your browser, no setup required:
## Code Overview

### `FileProcessor`

- **Purpose**: Converts files (PDF, DOCX, HTML, etc.) into a `ParsedDocument` model, preserving hierarchical structure.
- **Usage**:

  ```python
  from pathlib import Path

  from supermat import FileProcessor, ParsedDocument

  doc: ParsedDocument = FileProcessor.parse_file(Path("your_file.pdf"))
  ```

- **Handler Management**:

  ```python
  handlers = FileProcessor.get_handlers(Path("your_file.pdf"))
  doc_custom = FileProcessor.get_handler("some_handler").parse(Path("your_file.pdf"))
  ```
### `SupermatRetriever`

- **Goal**: Serve as a drop-in replacement for LangChain's standard retrievers, adding structure-aware indexing and traceability.
- **Usage**:

  ```python
  from langchain.vectorstores import Chroma

  from supermat.langchain.bindings import SupermatRetriever

  retriever = SupermatRetriever(parsed_docs=[doc1, doc2], vector_store=Chroma(...))
  ```

- **Advantages**:
  - Retains hierarchical references (Structure IDs)
  - Easily integrates into RAG workflows
  - Minimizes hallucination risk by enabling direct text tracebacks (see the sketch below)
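Because each retrieved chunk keeps its Structure ID, a passage can be traced back to its exact position in the source document. The sketch below is illustrative only: the metadata key name (`structure`) is an assumption rather than the documented schema, so inspect `doc.metadata` on your own results to see the actual fields.

```python
# Hypothetical traceback sketch; the "structure" metadata key is an assumption.
results = retriever.invoke("Which section defines the audit policy?")

for doc in results:
    structure_id = doc.metadata.get("structure")  # e.g. "2.1.4.8" (assumed key name)
    print(f"{structure_id}: {doc.page_content[:120]}")
```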
## Evaluation & Metrics

Supermat includes an evaluation module aligned with LangChain's frameworks to measure the quality of LLM outputs. Key metrics include the following (a small worked sketch of two of them appears after the list):
- Faithfulness: Checks if the generated response accurately reflects the source documents (i.e., no made-up facts).
- Accuracy: Measures correctness against reference answers or ground truth.
- Cosine Similarity: Quantifies semantic closeness between the generated response and reference text.
- ROUGE (1, 2, L): Assesses textual overlap at unigram, bigram, and longest common subsequence levels.
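To make the last two concrete, here is a minimal, self-contained sketch of cosine similarity over embedding vectors and ROUGE-1 recall over token overlap. It is a simplified reference for intuition only, not Supermat's built-in evaluation code:

```python
# Simplified reference implementations for intuition only; Supermat's
# evaluation module has its own implementations of these metrics.
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rouge1_recall(reference: str, generated: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams covered by the generation."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum(min(count, gen_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


print(cosine_similarity([0.1, 0.3, 0.6], [0.2, 0.25, 0.55]))  # ~0.98
print(rouge1_recall("the audit policy is defined in section 2",
                    "section 2 defines the audit policy"))     # 0.625
```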
Highlights (vs. standard chunking & semantic chunking strategies):
- +12.5% improvement in faithfulness
- +15.6% improvement in accuracy
- +33% ROUGE-1 recall lift
- Slightly faster or comparable runtime performance
Such gains emphasize Supermat’s focus on preserving structural context and annotated references, which reduces hallucinations and improves overall LLM response quality.
## Conclusion

Supermat is more than just another chunking library. By embedding structured annotations into the document processing pipeline, it ensures every piece of information remains traceable, an essential component for building trustworthy AI systems. Whether you need robust compliance checks, advanced RAG pipelines, or improved user confidence, Supermat delivers a scalable and adaptable solution for AI-driven data workflows.
## Contributing

We welcome your contributions! You can help by:
- Forking the repository
- Creating a feature branch
- Submitting a pull request
For guidelines, please see CONTRIBUTING.md (coming soon).
Thanks for trying Supermat!
Find more details and advanced guides at our Documentation. Feel free to open an issue or a pull request if you have any suggestions or improvements!