vector-embedder is a flexible, language-agnostic document ingestion and embedding pipeline. It transforms structured and unstructured content from multiple sources into vector embeddings and stores them in your vector database of choice.
It supports Git repositories, web URLs, and file types such as Markdown, PDF, and HTML. It is designed to run locally, in a container, or as an OpenShift/Kubernetes job.
- ✅ Multi-DB support:
  - Redis (RediSearch)
  - Elasticsearch
  - PGVector (PostgreSQL)
  - SQL Server (preview)
  - Qdrant
  - Dry Run (no DB required; logs to console)
- ✅ Flexible input sources:
  - Git repositories via glob patterns (`**/*.pdf`, `*.md`, etc.)
  - Web pages via configurable URL lists
- ✅ Smart chunking with configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`
- ✅ Embeddings via `sentence-transformers`
- ✅ Parsing via LangChain + Unstructured
- ✅ UBI-compatible container, OpenShift-ready
- ✅ Fully configurable via `.env` or `-e` environment flags
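To make the interaction between `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: the function name `chunk_text` is hypothetical, it counts characters, and it is not the splitter the project actually uses, which may split on document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at a boundary still appears whole in at least one neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters
    for piece in chunk_text(sample, chunk_size=12, chunk_overlap=4):
        print(len(piece), piece)
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.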
Set your configuration in a `.env` file at the project root.
# Temporary working directory
TEMP_DIR=/tmp
# Logging
LOG_LEVEL=info
# Sources
REPO_SOURCES=[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]
WEB_SOURCES=["https://example.com/docs/", "https://example.com/report.pdf"]
# Chunking
CHUNK_SIZE=2048
CHUNK_OVERLAP=200
# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
# Vector DB
DB_TYPE=DRYRUN
🧪 `DB_TYPE=DRYRUN` logs chunks to stdout and skips database indexing, which is useful during development.
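`REPO_SOURCES` and `WEB_SOURCES` hold JSON, so they need decoding before use. The sketch below shows one way to parse them and to resolve a repository's globs against a local checkout. The helper names `load_sources` and `matching_files` are hypothetical, and the real `config.py` may do this differently.

```python
import json
import os
from pathlib import Path


def load_sources(env=os.environ):
    """Parse REPO_SOURCES and WEB_SOURCES from JSON-encoded values."""
    repo_sources = json.loads(env.get("REPO_SOURCES", "[]"))
    web_sources = json.loads(env.get("WEB_SOURCES", "[]"))
    for entry in repo_sources:
        if "repo" not in entry or "globs" not in entry:
            raise ValueError(f"repo source needs 'repo' and 'globs': {entry!r}")
    return repo_sources, web_sources


def matching_files(repo_dir, globs):
    """Return files under repo_dir matched by any glob, sorted, without duplicates."""
    root = Path(repo_dir)
    found = set()
    for pattern in globs:
        # Path.glob treats "**" as "this directory and all subdirectories".
        found.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(found)
```

With the sample configuration, `docs/**/*.md` would match Markdown files directly in `docs/` as well as in any nested folder.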
Run the script directly:
./embed_documents.py
Or build the image and run it in a container:
podman build -t embed-job .
podman run --rm --env-file .env embed-job
You can also pass inline vars:
podman run --rm \
-e DB_TYPE=REDIS \
-e REDIS_URL=redis://localhost:6379 \
embed-job
Dry run skips vector DB upload and prints chunk metadata and content to the terminal.
DB_TYPE=DRYRUN
Run it:
./embed_documents.py
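As a rough picture of what a dry run does, the sketch below prints each chunk's metadata and content instead of uploading it. The function name `dry_run_index` and the `(metadata, text)` pair shape are assumptions for illustration, not the project's internal types or its actual output format.

```python
import sys


def dry_run_index(chunks, out=sys.stdout):
    """Print each chunk's metadata and content instead of indexing it.

    `chunks` is an iterable of (metadata, text) pairs (assumed shape).
    Returns the number of chunks printed.
    """
    count = 0
    for count, (metadata, text) in enumerate(chunks, start=1):
        print(f"--- chunk {count} ---", file=out)
        for key, value in sorted(metadata.items()):
            print(f"{key}: {value}", file=out)
        print(text, file=out)
    return count


if __name__ == "__main__":
    dry_run_index([({"source": "docs/intro.md"}, "Hello, embeddings.")])
```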
This project keeps two dependency files under version control:
| File | Purpose | Edited by |
|---|---|---|
| requirements.in | Short, human-readable list of top-level libraries (no pins) | You |
| requirements.txt | Fully resolved, pinned lock file, including hashes, for exact, reproducible builds | pip-compile |
Install pip-tools once:
python -m pip install --upgrade pip-tools
Then, whenever dependencies change:
- Edit requirements.in, for example:
      - sentence-transformers
      + sentence-transformers>=4.1
      + llama-index
- Re-lock the environment:
      pip-compile --upgrade
- Synchronise your virtual-env:
      pip-sync
.
├── embed_documents.py # Main entrypoint script
├── config.py # Config loader from env
├── loaders/ # Git, web, PDF, and text loaders
├── vector_db/ # Pluggable DB providers
├── requirements.txt # Python dependencies
├── redis_schema.yaml # Redis index schema (if used)
└── .env # Default runtime config
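The `vector_db/` directory is described as holding pluggable providers selected by `DB_TYPE`. One common way to structure that is a small registry keyed by the variable's value, sketched below. Every name here (`VectorDB`, `DryRunDB`, `PROVIDERS`, `get_provider`, `add_documents`) is hypothetical and not taken from the repository.

```python
import os
from typing import Callable, Protocol


class VectorDB(Protocol):
    """Minimal interface each provider is assumed to implement."""

    def add_documents(self, texts: list[str]) -> None: ...


class DryRunDB:
    """Stand-in provider that only reports what it would index."""

    def add_documents(self, texts):
        for text in texts:
            print(f"[dry run] would index {len(text)} characters")


# Real providers (Redis, PGVector, ...) would register themselves here.
PROVIDERS: dict[str, Callable[[], VectorDB]] = {"DRYRUN": DryRunDB}


def get_provider(db_type=None):
    """Instantiate the provider named by db_type or the DB_TYPE variable."""
    name = (db_type or os.environ.get("DB_TYPE", "DRYRUN")).upper()
    try:
        return PROVIDERS[name]()
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"unsupported DB_TYPE {name!r}; known: {known}") from None
```

Keeping the lookup in one place means a new backend needs only a class and a registry entry, while the entrypoint script stays unchanged.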
Run a compatible DB locally to test full ingestion + indexing.
PGVector (PostgreSQL):
podman run --rm -d \
--name pgvector \
-e POSTGRES_USER=user \
-e POSTGRES_PASSWORD=pass \
-e POSTGRES_DB=mydb \
-p 5432:5432 \
docker.io/ankane/pgvector
DB_TYPE=PGVECTOR ./embed_documents.py
Elasticsearch:
podman run --rm -d \
--name elasticsearch \
-p 9200:9200 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=true" \
-e "ELASTIC_PASSWORD=changeme" \
-e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
docker.io/elastic/elasticsearch:8.11.1
DB_TYPE=ELASTIC ./embed_documents.py
Redis (RediSearch):
podman run --rm -d \
--name redis-stack \
-p 6379:6379 \
docker.io/redis/redis-stack-server:6.2.6-v19
DB_TYPE=REDIS ./embed_documents.py
Qdrant:
podman run -d \
-p 6333:6333 \
--name qdrant \
docker.io/qdrant/qdrant
DB_TYPE=QDRANT ./embed_documents.py
SQL Server (preview):
podman run --rm -d \
--name mssql \
-e ACCEPT_EULA=Y \
-e SA_PASSWORD=StrongPassword! \
-p 1433:1433 \
mcr.microsoft.com/mssql/rhel/server:2025-latest
DB_TYPE=MSSQL ./embed_documents.py
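Containers can take a few seconds to start accepting connections, so it may help to check that the published port is reachable before starting an ingestion run. The sketch below is an optional helper, not part of the project. The port numbers come from the `-p` flags above; the names `DEFAULT_PORTS` and `port_is_open` are hypothetical.

```python
import socket

# Host ports published by the podman commands above, keyed by DB_TYPE.
DEFAULT_PORTS = {
    "PGVECTOR": 5432,
    "ELASTIC": 9200,
    "REDIS": 6379,
    "QDRANT": 6333,
    "MSSQL": 1433,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for db_type, port in DEFAULT_PORTS.items():
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{db_type}: port {port} is {state}")
```

An open port only shows that something is listening. It does not confirm that the database has finished initialising or that credentials are correct.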
Built with: