📚 vector-embedder

vector-embedder is a flexible, language-agnostic document ingestion and embedding pipeline. It transforms structured and unstructured content from multiple sources into vector embeddings and stores them in your vector database of choice.

It supports Git repositories, web URLs, and file types like Markdown, PDFs, and HTML. It is designed to run locally, in a container, or as an OpenShift/Kubernetes job.

⚙️ Features

✅ Multi-DB support:
- Redis (RediSearch)
- Elasticsearch
- PGVector (PostgreSQL)
- SQL Server (preview)
- Qdrant
- Dry Run (no DB required; logs to console)
✅ Flexible input sources:
- Git repositories via glob patterns (**/*.pdf, *.md, etc.)
- Web pages via configurable URL lists
✅ Smart chunking with configurable CHUNK_SIZE and CHUNK_OVERLAP
✅ Embeddings via sentence-transformers
✅ Parsing via LangChain + Unstructured
✅ UBI-compatible container, OpenShift-ready
✅ Fully configurable via .env or -e environment flags
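
To make the roles of CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a minimal character-based sketch of overlapping chunking. It is an illustration only: the project splits documents through LangChain, whose splitters typically also respect separators such as paragraphs, and `chunk_text` is a hypothetical helper name, not part of this repository.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # otherwise the tail would be emitted again as a tiny chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` is split into `"abcd"`, `"defg"`, and `"ghij"`: each chunk begins with the last character of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of storing more vectors.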

🚀 Quick Start

1. Configuration

Set your configuration in a .env file at the project root.

# Temporary working directory
TEMP_DIR=/tmp

# Logging
LOG_LEVEL=info

# Sources
REPO_SOURCES=[{"repo": "https://github.com/example/repo.git", "globs": ["docs/**/*.md"]}]
WEB_SOURCES=["https://example.com/docs/", "https://example.com/report.pdf"]

# Chunking
CHUNK_SIZE=2048
CHUNK_OVERLAP=200

# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# Vector DB
DB_TYPE=DRYRUN

🧪 DB_TYPE=DRYRUN logs chunks to stdout and skips database indexing—great for development!
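
Note that REPO_SOURCES and WEB_SOURCES in the example above are JSON values, while the chunking settings are integers. The following sketch shows one way such variables could be read and validated. The `Settings` class and `load_settings` function are hypothetical names, and the defaults simply mirror the example .env; the actual `config.py` may behave differently.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    temp_dir: str
    log_level: str
    repo_sources: list
    web_sources: list
    chunk_size: int
    chunk_overlap: int
    embedding_model: str
    db_type: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Hypothetical loader: defaults are assumptions taken from the example .env."""
    if env is None:
        env = os.environ
    settings = Settings(
        temp_dir=env.get("TEMP_DIR", "/tmp"),
        log_level=env.get("LOG_LEVEL", "info"),
        # The source lists are JSON-encoded strings.
        repo_sources=json.loads(env.get("REPO_SOURCES", "[]")),
        web_sources=json.loads(env.get("WEB_SOURCES", "[]")),
        chunk_size=int(env.get("CHUNK_SIZE", "2048")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-mpnet-base-v2"
        ),
        db_type=env.get("DB_TYPE", "DRYRUN").upper(),
    )
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dictionary instead of the process environment makes the loader easy to exercise in tests.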

2. Run Locally

./embed_documents.py

3. Or Run in a Container

podman build -t embed-job .

podman run --rm --env-file .env embed-job

You can also pass inline vars:

podman run --rm \
  -e DB_TYPE=REDIS \
  -e REDIS_URL=redis://localhost:6379 \
  embed-job

🧪 Dry Run Mode

Dry run skips vector DB upload and prints chunk metadata and content to the terminal.

DB_TYPE=DRYRUN

Run it:

./embed_documents.py
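
Conceptually, a dry run replaces the database write with a print statement. The sketch below illustrates that idea; the function names and the chunk shape (a dict with `metadata` and `content` keys) are assumptions made for the example, not the project's actual data model.

```python
from typing import Iterable


def format_chunk(index: int, chunk: dict) -> str:
    """Render one chunk's metadata and content as plain text."""
    metadata = chunk.get("metadata", {})
    content = chunk.get("content", "")
    return f"--- chunk {index} ---\nmetadata: {metadata}\n{content}"


def dry_run_index(chunks: Iterable[dict]) -> int:
    """Print every chunk instead of indexing it; return how many were seen."""
    count = 0
    for count, chunk in enumerate(chunks, start=1):
        print(format_chunk(count, chunk))
    return count
```

Because nothing is written anywhere, this mode is a quick way to check that sources are found and that chunk boundaries look sensible before pointing the job at a real database.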

📦 Dependency Management & Updates

This project keeps two dependency files under version control:

| File | Purpose | Edited by |
| --- | --- | --- |
| `requirements.in` | Short, human-readable list of top-level libraries (no pins) | You |
| `requirements.txt` | Fully resolved, pinned lock file (including hashes) for exact, reproducible builds | `pip-compile` |

🔧 Installing `pip-tools`

python -m pip install --upgrade pip-tools

➕ Adding / Updating a Package

Edit requirements.in

- sentence-transformers
+ sentence-transformers>=4.1
+ llama-index

Re-lock the environment
```
pip-compile --upgrade
```
Synchronise your virtual-env
```
pip-sync
```
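
After re-locking, it can be useful to confirm that every entry in the lock file is pinned exactly. The following is a rough sketch of such a check, assuming the usual `pip-compile` output layout (one `name==version` line per package, followed by optional `--hash=...` and `# via` lines). `unpinned_requirements` is a hypothetical helper, not something shipped with pip-tools or this project.

```python
def unpinned_requirements(lock_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' pin."""
    problems = []
    for raw in lock_text.splitlines():
        # Drop trailing comments, surrounding whitespace, and line continuations.
        line = raw.split(" #", 1)[0].strip().rstrip("\\").strip()
        # Skip blanks, comment lines, and pip options such as --hash=... or -r file.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        if "==" not in line:
            problems.append(line)
    return problems
```

An empty result means every requirement line carries an exact pin; anything returned is a candidate for another `pip-compile` run.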

🗂️ Project Layout

.
├── embed_documents.py      # Main entrypoint script
├── config.py               # Config loader from env
├── loaders/                # Git, web, PDF, and text loaders
├── vector_db/              # Pluggable DB providers
├── requirements.txt        # Python dependencies
├── redis_schema.yaml       # Redis index schema (if used)
└── .env                    # Default runtime config
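
The "pluggable DB providers" idea can be pictured as a registry that maps a DB_TYPE value to a provider class. The sketch below is one possible shape for that pattern; the `register` decorator, `get_provider` function, and `add_documents` method are illustrative names and are not taken from the `vector_db/` package.

```python
from typing import Callable, Dict, List

# Maps an upper-case DB_TYPE value to a factory that builds a provider.
PROVIDERS: Dict[str, Callable[[], object]] = {}


def register(db_type: str):
    """Class decorator that records a provider under the given DB_TYPE."""
    def decorator(factory):
        PROVIDERS[db_type.upper()] = factory
        return factory
    return decorator


@register("DRYRUN")
class DryRunProvider:
    """Prints chunks instead of storing them."""

    def add_documents(self, chunks: List[str]) -> int:
        for chunk in chunks:
            print(chunk)
        return len(chunks)


def get_provider(db_type: str):
    """Look up and instantiate the provider for a DB_TYPE value."""
    factory = PROVIDERS.get(db_type.upper())
    if factory is None:
        supported = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unsupported DB_TYPE {db_type!r}; supported: {supported}")
    return factory()
```

With this layout, supporting another database means adding one decorated class, and the entrypoint only ever calls `get_provider` with the configured DB_TYPE.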

🧪 Local DB Testing

Run a compatible DB locally to test full ingestion + indexing.
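
Containers started with `-d` return immediately, often before the database is accepting connections. A small helper like the one below can wait for the published port before the embedding job starts. This is a sketch using only the standard library, and `wait_for_port` is a hypothetical name; an open port does not guarantee the database has finished initialising, so treat it as a coarse readiness check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 5432)` could be called before running with DB_TYPE=PGVECTOR. The ports published by the commands in this section are 5432 (PGVector), 9200 (Elasticsearch), 6379 (Redis), 6333 (Qdrant), and 1433 (SQL Server).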

PGVector (PostgreSQL)

podman run --rm -d \
  --name pgvector \
  -e POSTGRES_USER=user \
  -e POSTGRES_PASSWORD=pass \
  -e POSTGRES_DB=mydb \
  -p 5432:5432 \
  docker.io/ankane/pgvector

DB_TYPE=PGVECTOR ./embed_documents.py

Elasticsearch

podman run --rm -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=true" \
  -e "ELASTIC_PASSWORD=changeme" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.io/elastic/elasticsearch:8.11.1

DB_TYPE=ELASTIC ./embed_documents.py

Redis (RediSearch)

podman run --rm -d \
  --name redis-stack \
  -p 6379:6379 \
  docker.io/redis/redis-stack-server:6.2.6-v19

DB_TYPE=REDIS ./embed_documents.py

Qdrant

podman run -d \
  -p 6333:6333 \
  --name qdrant \
  docker.io/qdrant/qdrant

DB_TYPE=QDRANT ./embed_documents.py

SQL Server (MSSQL)

podman run --rm -d \
  --name mssql \
  -e ACCEPT_EULA=Y \
  -e SA_PASSWORD=StrongPassword! \
  -p 1433:1433 \
  mcr.microsoft.com/mssql/rhel/server:2025-latest

DB_TYPE=MSSQL ./embed_documents.py

🙌 Acknowledgments

Built with:
