Commit

Add schema restrictions to the README (#44)
neerajprad authored Jan 17, 2025
1 parent 30db63c commit 627a09c
Showing 1 changed file with 42 additions and 9 deletions.
@@ -68,13 +68,23 @@ schema = {
    "name": {"type": "string", "description": "Candidate name"},
    "experience": {
        "type": "array",
        "description": "Work history",
        "items": {
            "type": "object",
            "properties": {
                "company": {
                    "type": "string",
                    "description": "Company name",
                },
                "title": {"type": "string", "description": "Job title"},
                "start_date": {
                    "anyOf": [{"type": "string"}, {"type": "null"}],
                    "description": "Start date of employment",
                },
                "end_date": {
                    "anyOf": [{"type": "string"}, {"type": "null"}],
                    "description": "End date of employment",
                },
            },
        },
    },
@@ -84,6 +94,22 @@ schema = {
agent = extractor.create_agent(name="resume-parser", data_schema=schema)
```

### Important restrictions on JSON/Pydantic Schema

*LlamaExtract supports only a subset of the JSON Schema specification.* While limited, it should
be sufficient for a wide variety of use cases.

- All fields are required by default. Nullable fields must be explicitly marked as such
using `"anyOf"` with a `"null"` type; see the `"start_date"` field above.
- The root node must be of type `"object"`.
- Schema nesting is limited to 5 levels.
- The important fields are key names/titles, types, and descriptions. Fields for
formatting, default values, etc. are not supported.
- There are other restrictions on the number of keys, overall schema size, etc. that you may
hit for complex extraction use cases. In such cases, it is worth considering how to restructure
your extraction workflow to fit within these constraints, e.g. by extracting subsets of fields
and merging them together later.
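The root-type and nesting-depth restrictions can be sanity-checked locally before creating an agent. The helper below is a hypothetical sketch, not part of the LlamaExtract API; it counts nesting through `properties` and `items` only.

```python
# Hypothetical pre-flight check (not part of LlamaExtract): validates the
# root-type and nesting-depth restrictions described above.
def check_schema(schema: dict, max_depth: int = 5) -> None:
    if schema.get("type") != "object":
        raise ValueError('Root node must be of type "object"')

    def depth(node: dict) -> int:
        # Recurse through object properties and array items.
        children = list((node.get("properties") or {}).values())
        if isinstance(node.get("items"), dict):
            children.append(node["items"])
        children = [c for c in children if isinstance(c, dict)]
        if not children:
            return 1
        return 1 + max(depth(c) for c in children)

    if depth(schema) > max_depth:
        raise ValueError(f"Schema nesting exceeds {max_depth} levels")
```

Running this on the resume schema above passes, since it nests four levels deep (root object, `experience` array, its `items` object, and the leaf fields).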

## Other Extraction APIs

### Batch Processing
@@ -137,13 +163,20 @@ pip install llama-extract==0.1.0
## Tips & Best Practices

1. **Schema Design**:
- Make fields optional when data might not always be present. Having required fields may force the model
to hallucinate when these fields are not present in the documents.
- When you want to extract a variable number of entities, use an `array` type. Note that you cannot use
an `array` type for the root node.
- Use descriptive field names and detailed descriptions. Use descriptions to pass formatting
instructions or few-shot examples.
- Start simple and iteratively build your schema to incorporate requirements.

2. **Running Extractions**:
- Note that resetting `agent.schema` will not save the schema to the database
until you call `agent.save`, but it will still be used for running extractions.
- Check job status before accessing results. Any extraction error should be available in
the `job.error` or `extraction_run.error` fields for debugging.
- Consider async operations (`queue_extraction`) for large-scale extraction once you have finalized your schema.
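The schema-design tip about variable numbers of entities can be sketched concretely: since the root must be an object, the array lives in a property of the root. The field names below are illustrative, not taken from the README.

```python
# Illustrative schema (names are hypothetical): a variable number of
# line items is modeled as an array property under an object root,
# since an array cannot be the root node.
invoice_schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "description": "All line items on the invoice",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Item description"},
                    # Optional field: marked nullable with anyOf + null so the
                    # model is not forced to hallucinate a value.
                    "discount": {
                        "anyOf": [{"type": "number"}, {"type": "null"}],
                        "description": "Discount amount, if any",
                    },
                },
            },
        },
    },
}
```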

## Additional Resources
