Commit c98239c (parent: ec9b733). Showing 5 changed files with 120 additions and 162 deletions.
@@ -1,165 +1,51 @@
# LlamaParse

[](https://pypi.org/project/llama-parse/)
[](https://github.com/run-llama/llama_parse/graphs/contributors)
[](https://pypi.org/project/llama-cloud-services/)
[](https://github.com/run-llama/llama_cloud_services/graphs/contributors)
[](https://discord.gg/dGcwcsnxhU)

LlamaParse is a **GenAI-native document parser** that can parse complex document data for any downstream LLM use case (RAG, agents).

It is really good at the following:

- ✅ **Broad file type support**: Parsing a variety of unstructured file types (.pdf, .pptx, .docx, .xlsx, .html) with text, tables, visual elements, weird layouts, and more.
- ✅ **Table recognition**: Parsing embedded tables accurately into text and semi-structured representations.
- ✅ **Multimodal parsing and chunking**: Extracting visual elements (images/diagrams) into structured formats and returning image chunks using the latest multimodal models.
- ✅ **Custom parsing**: Input custom prompt instructions to customize the output the way you want it.

# Llama Cloud Services

LlamaParse directly integrates with [LlamaIndex](https://github.com/run-llama/llama_index).
This repository contains the code for hand-written SDKs and clients for interacting with LlamaCloud.

The free plan covers up to 1,000 pages a day. The paid plan includes 7,000 free pages per week, plus 0.3 cents per additional page by default. There is a sandbox available to test the API at [**https://cloud.llamaindex.ai/parse ↗**](https://cloud.llamaindex.ai/parse).
This includes:

Read below for some quickstart information, or see the [full documentation](https://docs.cloud.llamaindex.ai/).

If you're a company interested in enterprise RAG solutions, and/or high-volume/on-prem usage of LlamaParse, come [talk to us](https://www.llamaindex.ai/contact).

- [LlamaParse](./parse.md) - A GenAI-native document parser that can parse complex document data for any downstream LLM use case (agents, RAG, data processing, etc.).
- [LlamaReport (beta/invite-only)](./report.md) - A prebuilt agentic report builder that can be used to build reports from a variety of data sources.
- [LlamaExtract (beta/invite-only)](./extract.md) - A prebuilt agentic data extractor that can be used to transform data into a structured JSON representation.

## Getting Started

First, log in and get an API key from [**https://cloud.llamaindex.ai/api-key ↗**](https://cloud.llamaindex.ai/api-key).

Then, make sure you have the latest LlamaIndex version installed.

**NOTE:** If you are upgrading from v0.9.X, we recommend following our [migration guide](https://pretty-sodium-5e0.notion.site/v0-10-0-Migration-Guide-6ede431dcb8841b09ea171e7f133bd77), as well as uninstalling your previous version first.

```
pip uninstall llama-index  # run this if upgrading from v0.9.x or older
pip install -U llama-index --upgrade --no-cache-dir --force-reinstall
```

Lastly, install the package:

`pip install llama-parse`

Now you can parse your first PDF file using the command line interface. Use the command `llama-parse [file_paths]`. See the help text with `llama-parse --help`.
Install the package:

```bash
export LLAMA_CLOUD_API_KEY='llx-...'

# output as text
llama-parse my_file.pdf --result-type text --output-file output.txt

# output as markdown
llama-parse my_file.pdf --result-type markdown --output-file output.md

# output as raw json
llama-parse my_file.pdf --output-raw-json --output-file output.json
pip install llama-cloud-services
```

You can also create simple scripts:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # optionally you can define a language, default=en
)

# sync
documents = parser.load_data("./my_file.pdf")

# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])
Then, get your API key from [LlamaCloud](https://cloud.llamaindex.ai/).

# async
documents = await parser.aload_data("./my_file.pdf")

# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
```

## Using with file object

You can parse a file object directly:
Then, you can use the services in your code:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_cloud_services import LlamaParse, LlamaReport, LlamaExtract

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # optionally you can define a language, default=en
)

file_name = "my_file1.pdf"
extra_info = {"file_name": file_name}

with open(f"./{file_name}", "rb") as f:
    # must provide extra_info with a file_name key when passing a file object
    documents = parser.load_data(f, extra_info=extra_info)

# you can also pass file bytes directly
with open(f"./{file_name}", "rb") as f:
    file_bytes = f.read()
    # must provide extra_info with a file_name key when passing file bytes
    documents = parser.load_data(file_bytes, extra_info=extra_info)
parser = LlamaParse(api_key="YOUR_API_KEY")
report = LlamaReport(api_key="YOUR_API_KEY")
extractor = LlamaExtract(api_key="YOUR_API_KEY")
```

## Using with `SimpleDirectoryReader`
See the quickstart guides for each service for more information:

You can also integrate the parser as the default PDF loader in `SimpleDirectoryReader`:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key="llx-...",  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
```

Full documentation for `SimpleDirectoryReader` can be found in the [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html).

## Examples

Several end-to-end indexing examples can be found in the examples folder:

- [Getting Started](examples/demo_basic.ipynb)
- [Advanced RAG Example](examples/demo_advanced.ipynb)
- [Raw API Usage](examples/demo_api.ipynb)
- [LlamaParse](./parse.md)
- [LlamaReport (beta/invite-only)](./report.md)
- [LlamaExtract (beta/invite-only)](./extract.md)

## Documentation

[https://docs.cloud.llamaindex.ai/](https://docs.cloud.llamaindex.ai/)
You can see complete SDK and API documentation for each service on [our official docs](https://docs.cloud.llamaindex.ai/).

## Terms of Service

See the [Terms of Service here](./TOS.pdf).

## Get in Touch (LlamaCloud)

LlamaParse is part of LlamaCloud, our e2e enterprise RAG platform that provides out-of-the-box, production-ready connectors, indexing, and retrieval over your complex data sources. We offer SaaS and VPC options.

LlamaCloud is currently available via waitlist (join by [creating an account](https://cloud.llamaindex.ai/)). If you're interested in state-of-the-art quality and in centralizing your RAG efforts, come [get in touch with us](https://www.llamaindex.ai/contact).
You can get in touch with us by following our [contact link](https://www.llamaindex.ai/contact).
@@ -1,4 +1,4 @@
# LlamaExtract
# LlamaExtract (beta/invite-only)

> **⚠️ EXPERIMENTAL**
> This library is under active development with frequent breaking changes. APIs and functionality may change significantly between versions. If you're interested in being an early adopter, please contact us at [[email protected]](mailto:[email protected]) or join our [Discord](https://discord.com/invite/eN6D2HQ4aX).

@@ -7,6 +7,10 @@ LlamaExtract provides a simple API for extracting structured data from unstructu

## Quick Start

```bash
pip install llama-cloud-services
```

```python
from llama_extract import LlamaExtract
from pydantic import BaseModel, Field

@@ -154,12 +158,6 @@ agent = extractor.get_agent(name="resume-parser")
extractor.delete_agent(agent.id)
```
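The Quick Start above pairs LlamaExtract with a Pydantic schema. As a minimal local sketch of what that schema might look like for the `resume-parser` agent referenced above (the `Resume` fields here are illustrative assumptions, not from the original):

```python
from pydantic import BaseModel, Field


# Hypothetical extraction schema for a "resume-parser" agent;
# the field names are illustrative, not the library's own.
class Resume(BaseModel):
    name: str = Field(description="Candidate's full name")
    years_experience: int = Field(description="Total years of professional experience")
    skills: list[str] = Field(description="List of technical skills")


# Validate a sample of the structured JSON an extraction would return.
sample = Resume(name="Jane Doe", years_experience=5, skills=["Python", "SQL"])
print(sample.skills)  # → ['Python', 'SQL']
```

A schema like this would be registered when creating an agent, and the extractor returns data conforming to it.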

## Installation

```bash
pip install llama-extract==0.1.0
```

## Tips & Best Practices

1. **Schema Design**:

@@ -182,5 +180,5 @@ pip install llama-extract==0.1.0

## Additional Resources

- [Example Notebook](examples/resume_screening.ipynb) - Detailed walkthrough of resume parsing
- [Example Notebook](examples/extract/resume_screening.ipynb) - Detailed walkthrough of resume parsing
- [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback
Empty file.

@@ -0,0 +1,88 @@
# LlamaReport (beta/invite-only)

LlamaReport is a prebuilt agentic report builder that can be used to build reports from a variety of data sources.

This is the Python SDK for interacting with the LlamaReport API. The SDK provides two main classes:

- `LlamaReport`: For managing reports (create, list, delete)
- `ReportClient`: For working with a specific report (editing, approving, etc.)

## Quickstart

```bash
pip install llama-cloud-services
```

```python
from llama_report import LlamaReport

# Initialize the client
client = LlamaReport(
    api_key="your-api-key",
    # Optional: specify project_id, organization_id, async_httpx_client
)

# Create a new report
report = client.create_report(
    "My Report",
    # must have one of template_text or template_instructions
    template_text="Your template text",
    template_instructions="Instructions for the template",
    # must have one of input_files or retriever_id
    input_files=["data1.pdf", "data2.pdf"],
    retriever_id="retriever-id",
)
```

## Working with Reports

The typical workflow for a report involves:

1. Creating the report
2. Waiting for and approving the plan
3. Waiting for report generation
4. Making edits to the report

Here's a complete example:

```python
# Create a report
report = client.create_report(
    "Quarterly Analysis", input_files=["q1_data.pdf", "q2_data.pdf"]
)

# Wait for the plan to be ready
plan = report.wait_for_plan()

# Option 1: Directly approve the plan
report.update_plan(action="approve")

# Option 2: Suggest and review edits to the plan
suggestions = report.suggest_edits(
    "Can you add a section about market trends?"
)
for suggestion in suggestions:
    print(suggestion)

    # Accept or reject the suggestion
    if input("Accept? (y/n): ").lower() == "y":
        report.accept_edit(suggestion)
    else:
        report.reject_edit(suggestion)

# Wait for the report to complete
report = report.wait_for_completion()

# Make edits to the final report
suggestions = report.suggest_edits("Make the executive summary more concise")

# Review and accept/reject suggestions as above
...
```

## Additional Features

- **Async Support**: All methods have async counterparts: `create_report` -> `acreate_report`, `wait_for_plan` -> `await_for_plan`, etc.
- **Automatic Chat History**: The SDK automatically keeps track of chat history for each suggestion, unless you specify `auto_history=False` in `suggest_edits`.
- **Custom HTTP Client**: You can provide your own `httpx.AsyncClient` to the `LlamaReport` class.
- **Project and Organization IDs**: You can specify `project_id` and `organization_id` to use a specific project or organization.