A comparison of structured output performance among popular open and closed source large language models.
Install dependencies with `poetry install --no-root`.
Copy `.env.example` to `.env` and add your API keys.
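To confirm the keys are picked up, here is a minimal sketch assuming the project loads them with python-dotenv; the variable name below is illustrative, not necessarily what `.env.example` uses:

```python
# Minimal sanity check that .env was loaded, assuming python-dotenv.
# OPENAI_API_KEY is an illustrative name; match it to .env.example.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "API key missing; check your .env"
```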
- Measure tokens per second
- Add Anthropic models
- Update Groq to use `with_structured_output` (see the sketch after this list)
- Break out GPT-4 versions
- Ensure the returned class instance isn't empty
- Add ability to mix in different Pydantic objects and prompts
- Parallelize
- Create a version that doesn't use LangChain
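For context on the `with_structured_output` and empty-instance items above, here is a minimal sketch of the pattern, assuming a hypothetical `Person` schema and an arbitrary model name; the repo's actual Pydantic objects and prompts may differ:

```python
# Sketch: bind a Pydantic schema to a chat model via with_structured_output.
# `Person` and the model name are illustrative assumptions.
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


structured_llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(Person)
result = structured_llm.invoke("Describe a fictional person named Ada, age 36.")

# Guard against the empty-instance case flagged in the TODOs: a parse can
# succeed while every field holds an empty or default value.
assert any(result.model_dump().values()), "model returned an empty instance"
print(result)
```

Under the hood, `with_structured_output` typically binds the schema through the provider's tool-calling support, which is why the Groq item is a code-path switch rather than a prompt change.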
This analysis uses a single, fairly simple prompt template and just 10 samples per model. More complicated prompts and tasks will reduce consistency, while experimenting with different prompt strategies can improve it. Note also that output quality/accuracy is not evaluated here, only consistency and latency.
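As a rough sketch of how such a run could look (not necessarily how this repo implements it), assuming the same hypothetical `Person` schema: each model gets the prompt N times, a response that fails to parse counts against consistency, and wall-clock time per call gives latency:

```python
# Sketch: per-model consistency and latency over N samples.
# Schema, model name, and prompt are illustrative assumptions.
import time

from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


structured_llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(Person)
prompt = "Describe a fictional person named Ada, age 36."

N_SAMPLES = 10
successes, latencies = 0, []
for _ in range(N_SAMPLES):
    start = time.perf_counter()
    try:
        structured_llm.invoke(prompt)
        successes += 1
    except Exception:
        pass  # a malformed or unparseable response counts against consistency
    latencies.append(time.perf_counter() - start)

print(f"consistency: {successes / N_SAMPLES:.0%}")
print(f"mean latency: {sum(latencies) / len(latencies):.2f} s")
```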