promptdiff — lint, diff, and score LLM system prompts (works with any provider) #3018
HadiFrt20 started this conversation in Show and tell.
Built a CLI that applies static analysis to LLM system prompts.
If you manage system prompts for OpenAI models, promptdiff catches issues that silently degrade output.

What it catches:
- Semantic diff — not line-by-line. Tells you "word limit tightened 150→100, high impact" with behavioral annotations.
- Quality score — 0-100 across structure, specificity, examples, safety, completeness. Usable as a CI gate.
- A/B compare — run two prompt versions through GPT-4o (or Claude, Ollama) and score both outputs:
```
promptdiff compare v1.prompt v2.prompt --input "test query" --model gpt4o
```

Runs locally, 3 deps, 217 tests.
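To make the "semantic diff" idea concrete: instead of comparing lines textually, you can pair up the numeric constraints two prompt versions state (word limits, item counts) and report how each bound moved. The sketch below is a hypothetical illustration of that approach, not promptdiff's actual implementation — the regex, function names, and impact labels are all assumptions for the example.

```python
import re

# Match numeric constraints like "at most 150 words" or "limit of 3 items".
# (Hypothetical pattern for illustration, not promptdiff's internal logic.)
CONSTRAINT = re.compile(
    r"(?:at most|under|limit of|max(?:imum)? of)\s+(\d+)\s+(\w+)"
)

def numeric_constraints(prompt: str) -> dict:
    """Map each constrained unit (e.g. 'words') to its numeric bound."""
    return {unit: int(n) for n, unit in CONSTRAINT.findall(prompt)}

def semantic_diff(old: str, new: str) -> list:
    """Report how shared numeric constraints changed between versions."""
    before, after = numeric_constraints(old), numeric_constraints(new)
    notes = []
    for unit in sorted(before.keys() & after.keys()):
        a, b = before[unit], after[unit]
        if b < a:
            notes.append(f"{unit} limit tightened {a}\u2192{b}, high impact")
        elif b > a:
            notes.append(f"{unit} limit relaxed {a}\u2192{b}")
    return notes

v1 = "Answer helpfully. Keep responses to at most 150 words."
v2 = "Answer helpfully. Keep responses to at most 100 words."
print(semantic_diff(v1, v2))
# ['words limit tightened 150→100, high impact']
```

A real tool would need to handle paraphrases and implicit constraints, but the key design point survives even in this toy: the diff output is a behavioral annotation ("tightened, high impact") rather than a raw text delta.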
GitHub: https://github.com/HadiFrt20/promptdiff