Merge pull request #255 from bespokelabsai/mahesh/refactor_llm
Add a SimpleLLM interface, and update documentation.
madiator authored Dec 14, 2024
2 parents 092d2d2 + c14233d commit 3096a29
Showing 9 changed files with 155 additions and 122 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.venv
.DS_Store
__pycache__
.vscode

76 changes: 61 additions & 15 deletions README.md
@@ -29,7 +29,7 @@
</a>
</p>

### Overview
## Overview

Bespoke Curator makes it very easy to create high-quality synthetic data at scale, which you can use to finetune models or use for structured data extraction at scale.

@@ -38,56 +38,99 @@ Bespoke Curator is an open-source project:
* A Curator Viewer which makes it easy to view the datasets, thus aiding in the dataset creation.
* We will also be releasing high-quality datasets that should move the needle on post-training.

### Key Features
## Key Features

1. **Programmability and Structured Outputs**: Synthetic data generation is a lot more than just using a single prompt -- it involves calling LLMs multiple times and orchestrating control flow. Curator treats structured outputs as first-class citizens and helps you design complex pipelines (a short sketch follows this list).
2. **Built-in Performance Optimization**: We often see users calling LLMs in loops or implementing multi-threading inefficiently. We have baked in performance optimizations so that you don't need to worry about those!
3. **Intelligent Caching and Fault Recovery**: Since LLM calls can add up in cost and time, failures are undesirable but sometimes unavoidable. We cache LLM requests and responses so that it is easy to recover from a failure. Moreover, when working on a multi-stage pipeline, per-stage caching makes it easy to iterate.
4. **Native HuggingFace Dataset Integration**: Work directly on HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!
5. **Interactive Curator Viewer**: Improve and iterate on your prompts using our built-in viewer. Inspect LLM requests and responses in real time, and refine your data generation strategy with immediate feedback.
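
To make points 1 and 3 concrete, here is a minimal sketch of what a two-stage pipeline can look like. The prompts, class names, and topic count here are illustrative (loosely modeled on the poem example later in this README); because requests and responses are cached, re-running the script reuses completed stages:

```python
from typing import List
from pydantic import BaseModel, Field
from bespokelabs import curator

class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")

# Stage 1: generate a small dataset of topics as structured output.
topic_generator = curator.LLM(
    prompt_func=lambda: "Generate 5 diverse topics suitable for short poems.",
    model_name="gpt-4o-mini",
    response_format=Topics,
    parse_func=lambda _, topics: [{"topic": t} for t in topics.topics_list],
)

# Stage 2: write a poem for each topic produced by stage 1.
poet = curator.LLM(
    prompt_func=lambda row: f"Write a poem about {row['topic']}.",
    model_name="gpt-4o-mini",
)

topics = topic_generator()  # cached after the first run
poems = poet(topics)        # consumes stage 1's output dataset
print(poems.to_pandas())
```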

### Installation
## Installation

```bash
pip install bespokelabs-curator
```

### Usage
## Usage
To run the examples below, make sure to set your OpenAI API key in
the environment variable `OPENAI_API_KEY` by running `export OPENAI_API_KEY=sk-...` in your terminal.

### Hello World with `SimpleLLM`: A simple interface for calling LLMs

```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
# Or you can pass a list of prompts to generate multiple responses.
poems = llm(["Write a poem about the importance of data in AI.",
"Write a haiku about the importance of data in AI."])
print(poems)
```
Note that retries and caching are enabled by default, so if you run the same prompt again, you will get the same response almost instantly.
You can delete the cache at `~/.cache/curator`.
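For example, to start from a clean slate you can remove the cache directory:

```bash
rm -rf ~/.cache/curator
```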

#### Use the LiteLLM backend for calling other models
You can use the [LiteLLM](https://docs.litellm.ai/docs/providers) backend to call models from other providers.

```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="claude-3-5-sonnet-20240620", backend="litellm")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
```
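
As with the OpenAI examples, the provider's API key needs to be available in your environment; for Anthropic models served via LiteLLM this is typically the `ANTHROPIC_API_KEY` variable (see the LiteLLM provider docs for other providers):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```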

### Visualize in Curator Viewer
Run `curator-viewer` on the command line to see the dataset in the viewer.

You can click on a run and then click on a specific row to see the LLM request and response.
![Curator Responses](docs/curator-responses.png)
More examples below.

### `LLM`: A more powerful interface for synthetic data generation

Let's use structured outputs to generate poems.
```python
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import List

# Create a dataset object for the topics you want to create poems about.
topics = Dataset.from_dict({"topic": [
    "Urban loneliness in a bustling city",
    "Beauty of Bespoke Labs's Curator library"
]})
```

# Define a class to encapsulate a list of poems.
Define a class to encapsulate a list of poems.
```python
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")
```


# We define an `LLM` object that generates poems which gets applied to the topics dataset.
We define an `LLM` object that generates poems which gets applied to the topics dataset.
```python
poet = curator.LLM(
    # `prompt_func` takes a row of the dataset as input.
    # `row` is a dictionary with a single key 'topic' in this case.
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    # `row` is the input row, and `poems` is the `Poems` class which
    # is parsed from the structured output from the LLM.
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
    ],
)
```
Here:
* `prompt_func` takes a row of the dataset as input and returns the prompt for the LLM.
* `response_format` is the structured output class we defined above.
* `parse_func` takes the input (`row`) and the structured output (`poems`) and converts them to a list of dictionaries, so that the output can easily be converted to a HuggingFace Dataset object.
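
The lambdas above are just for brevity; `prompt_func` and `parse_func` can be ordinary named functions, which is often clearer once the parsing logic grows. A small sketch of the same poet written that way (function names here are illustrative):

```python
def poem_prompt(row):
    # `row` is a dictionary with a single key 'topic'.
    return f"Write two poems about {row['topic']}."

def poem_parse(row, poems):
    # Flatten the structured `Poems` output into one dictionary per poem.
    return [{"topic": row["topic"], "poem": p.poem} for p in poems.poems_list]

poet = curator.LLM(
    prompt_func=poem_prompt,
    model_name="gpt-4o-mini",
    response_format=Poems,
    parse_func=poem_parse,
)
```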

Now we can apply the `LLM` object to the dataset, which reads like idiomatic Python.
```python
poem = poet(topics)
print(poem.to_pandas())
# Example output:
@@ -102,9 +145,6 @@ and we can scale this up to create tens of thousands of diverse poems.
You can see a more detailed example in the [examples/poem.py](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples/poem.py) file,
and other examples in the [examples](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples) directory.

To run the examples, make sure to set your OpenAI API key in
the environment variable `OPENAI_API_KEY` by running `export OPENAI_API_KEY=sk-...` in your terminal.

See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.

@@ -118,6 +158,12 @@ curator-viewer

This will pop up a browser window with the viewer running on `127.0.0.1:3000` by default if you haven't specified a different host and port.

The dataset viewer shows all the different runs you have made.
![Curator Runs](docs/curator-runs.png)

You can also see the dataset and the responses from the LLM.
![Curator Dataset](docs/curator-dataset.png)


Optional parameters to run the viewer on a different host and port:
```bash
Binary file added docs/curator-dataset.png
Binary file added docs/curator-responses.png
Binary file added docs/curator-runs.png
25 changes: 25 additions & 0 deletions examples/simple_poem.py
@@ -0,0 +1,25 @@
"""Curator example that uses `SimpleLLM` to generate poems.
Please see the poem.py for more complex use cases.
"""

from bespokelabs import curator

# Use GPT-4o-mini for this example.
llm = curator.SimpleLLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)

# Use Claude 3.5 Sonnet for this example.
llm = curator.SimpleLLM(model_name="claude-3-5-sonnet-20240620", backend="litellm")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)

# Note that we can also pass a list of prompts to generate multiple responses.
poems = llm(
    [
        "Write a sonnet about the importance of data in AI.",
        "Write a haiku about the importance of data in AI.",
    ]
)
print(poems)
1 change: 1 addition & 0 deletions src/bespokelabs/curator/__init__.py
@@ -1,2 +1,3 @@
from .dataset import Dataset
from .llm.llm import LLM
from .llm.simple_llm import SimpleLLM
141 changes: 34 additions & 107 deletions src/bespokelabs/curator/llm/llm.py
@@ -37,113 +37,6 @@
class LLM:
"""Interface for prompting LLMs."""

    def __init__(
        self,
        model_name: str,
        prompt_func: Callable[[Union[Dict[str, Any], BaseModel]], Dict[str, str]],
        parse_func: Optional[
            Callable[
                [
                    _DictOrBaseModel,
                    _DictOrBaseModel,
                ],
                T,
            ]
        ] = None,
        response_format: Optional[Type[BaseModel]] = None,
        backend: Optional[str] = None,
        max_requests_per_minute: Optional[int] = None,
        max_tokens_per_minute: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        presence_penalty: Optional[float] = None,
        frequency_penalty: Optional[float] = None,
        max_retries: Optional[int] = None,
        require_all_responses: Optional[bool] = None,
    ):
        """Initialize a LLM.
        Args:
            model_name: The name of the LLM to use
            prompt_func: A function that takes a single row
                and returns either a string (assumed to be a user prompt) or messages list
            parse_func: A function that takes the input row and
                response object and returns the parsed output
            response_format: A Pydantic model specifying the
                response format from the LLM.
            backend: The backend to use ("openai" or "litellm"). If None, will be auto-determined
            max_requests_per_minute: Maximum requests per minute (not supported in batch mode)
            max_tokens_per_minute: Maximum tokens per minute (not supported in batch mode)
            temperature: The temperature to use for the LLM
            top_p: The top_p to use for the LLM
            presence_penalty: The presence_penalty to use for the LLM
            frequency_penalty: The frequency_penalty to use for the LLM
            max_retries: The maximum number of retries to use for the LLM. If 0, will only try a request once.
            require_all_responses: Whether to require all responses
        """
        self.prompt_formatter = PromptFormatter(
            model_name, prompt_func, parse_func, response_format
        )

        # Initialize context manager state
        self._batch_config = None
        self._original_request_processor = None

        # Store model parameters
        self.temperature = temperature
        self.top_p = top_p
        self.presence_penalty = presence_penalty
        self.frequency_penalty = frequency_penalty
        self.model_name = model_name

        # Auto-determine backend if not specified
        if backend is not None:
            self.backend = backend
        else:
            self.backend = self._determine_backend(model_name, response_format)

        # Initialize request processor
        self._setup_request_processor(
            max_requests_per_minute=max_requests_per_minute,
            max_tokens_per_minute=max_tokens_per_minute,
            max_retries=max_retries,
            require_all_responses=require_all_responses,
        )

    @staticmethod
    def _determine_backend(
        model_name: str, response_format: Optional[Type[BaseModel]] = None
    ) -> str:
        """Determine which backend to use based on model name and response format.
        Args:
            model_name (str): Name of the model
            response_format (Optional[Type[BaseModel]]): Response format if specified
        Returns:
            str: Backend to use ("openai" or "litellm")
        """
        model_name = model_name.lower()

        # GPT-4o models with response format should use OpenAI
        if (
            response_format
            and OpenAIOnlineRequestProcessor(model_name).check_structured_output_support()
        ):
            logger.info(f"Requesting structured output from {model_name}, using OpenAI backend")
            return "openai"

        # GPT models and O1 models without response format should use OpenAI
        if not response_format and any(x in model_name for x in ["gpt-", "o1-preview", "o1-mini"]):
            logger.info(f"Requesting text output from {model_name}, using OpenAI backend")
            return "openai"

        # Default to LiteLLM for all other cases
        logger.info(
            f"Requesting {f'structured' if response_format else 'text'} output from {model_name}, using LiteLLM backend"
        )
        return "litellm"

    def __init__(
        self,
        model_name: str,
@@ -262,6 +155,40 @@ def __init__(
        else:
            raise ValueError(f"Unknown backend: {self.backend}")

    @staticmethod
    def _determine_backend(
        model_name: str, response_format: Optional[Type[BaseModel]] = None
    ) -> str:
        """Determine which backend to use based on model name and response format.
        Args:
            model_name (str): Name of the model
            response_format (Optional[Type[BaseModel]]): Response format if specified
        Returns:
            str: Backend to use ("openai" or "litellm")
        """
        model_name = model_name.lower()

        # GPT-4o models with response format should use OpenAI
        if (
            response_format
            and OpenAIOnlineRequestProcessor(model_name).check_structured_output_support()
        ):
            logger.info(f"Requesting structured output from {model_name}, using OpenAI backend")
            return "openai"

        # GPT models and O1 models without response format should use OpenAI
        if not response_format and any(x in model_name for x in ["gpt-", "o1-preview", "o1-mini"]):
            logger.info(f"Requesting text output from {model_name}, using OpenAI backend")
            return "openai"

        # Default to LiteLLM for all other cases
        logger.info(
            f"Requesting {f'structured' if response_format else 'text'} output from {model_name}, using LiteLLM backend"
        )
        return "litellm"

    def __call__(
        self,
        dataset: Optional[Iterable] = None,
33 changes: 33 additions & 0 deletions src/bespokelabs/curator/llm/simple_llm.py
@@ -0,0 +1,33 @@
from bespokelabs.curator.llm.llm import LLM
from datasets import Dataset
from typing import Union, List


class SimpleLLM:
    """A simpler interface for the LLM class.
    Usage:
        llm = SimpleLLM(model_name="gpt-4o-mini")
        llm("Do you know about the bitter lesson?")
        llm(["What is the capital of France?", "What is the capital of Germany?"])
    For more complex use cases (e.g. structured outputs and custom prompt functions), see the LLM class.
    """

    def __init__(self, model_name: str, backend: str = "openai"):
        self._model_name = model_name
        self._backend = backend

    def __call__(self, prompt: Union[str, List[str]]) -> Union[str, List[str]]:
        prompt_list = [prompt] if isinstance(prompt, str) else prompt
        dataset: Dataset = Dataset.from_dict({"prompt": prompt_list})

        llm = LLM(
            prompt_func=lambda row: row["prompt"],
            model_name=self._model_name,
            response_format=None,
            backend=self._backend,
        )
        response = llm(dataset)
        if isinstance(prompt, str):
            return response["response"][0]
        return response["response"]
