
Commit

Merge pull request #272 from bespokelabsai/dev
0.1.12 Release
RyanMarten authored Dec 17, 2024
2 parents 450e934 + 5013908 commit 5dbb913
Showing 50 changed files with 1,326 additions and 517 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.venv
.DS_Store
__pycache__
.vscode

85 changes: 67 additions & 18 deletions README.md
@@ -24,9 +24,12 @@
<a href="https://discord.gg/KqpXvpzVBS">
<img alt="Discord" src="https://img.shields.io/discord/1230990265867698186">
</a>
<a href="https://github.com/psf/black">
<img alt="Code style: black" src="https://img.shields.io/badge/Code%20style-black-000000.svg">
</a>
</p>

### Overview
## Overview

Bespoke Curator makes it easy to create high-quality synthetic data at scale, which you can use to fine-tune models or perform structured data extraction.

@@ -35,56 +38,99 @@ Bespoke Curator is an open-source project:
* A Curator Viewer, which makes it easy to view the datasets, thus aiding dataset creation.
* We will also be releasing high-quality datasets that should move the needle on post-training.

### Key Features
## Key Features

1. **Programmability and Structured Outputs**: Synthetic data generation is a lot more than just using a single prompt -- it involves calling LLMs multiple times and orchestrating control flow. Curator treats structured outputs as first-class citizens and helps you design complex pipelines.
2. **Built-in Performance Optimization**: We often see LLMs being called in loops, or inefficient multi-threading implementations. We have baked in performance optimizations so that you don't need to worry about those!
3. **Intelligent Caching and Fault Recovery**: Given that LLM calls can add up in cost and time, failures are undesirable but sometimes unavoidable. We cache the LLM requests and responses so that it is easy to recover from a failure. Moreover, when working on a multi-stage pipeline, caching of stages makes it easy to iterate.
4. **Native HuggingFace Dataset Integration**: Work directly on HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!
5. **Interactive Curator Viewer**: Improve and iterate on your prompts using our built-in viewer. Inspect LLM requests and responses in real-time, allowing you to iterate and refine your data generation strategy with immediate feedback.

### Installation
## Installation

```bash
pip install bespokelabs-curator
```

### Usage
## Usage
To run the examples below, make sure to set your OpenAI API key in
the environment variable `OPENAI_API_KEY` by running `export OPENAI_API_KEY=sk-...` in your terminal.

### Hello World with `SimpleLLM`: A simple interface for calling LLMs

```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
# Or you can pass a list of prompts to generate multiple responses.
poems = llm(["Write a poem about the importance of data in AI.",
             "Write a haiku about the importance of data in AI."])
print(poems)
```
Note that retries and caching are enabled by default,
so if you run the same prompt again, you will get the cached response almost instantly.
You can delete the cache at `~/.cache/curator`.
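Under the hood, caching of this kind amounts to keying each response by the model and prompt. Here is a minimal in-memory sketch of the idea (hypothetical and simplified, not Curator's actual implementation -- Curator persists its cache to `~/.cache/curator` rather than holding it in memory):

```python
import hashlib
import json

# Simplified illustration of prompt-level caching (not Curator's real code).
_cache: dict = {}

def cache_key(model_name: str, prompt: str) -> str:
    # Hash the model and prompt together so the same request maps
    # to the same cache entry across runs.
    payload = json.dumps({"model": model_name, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model_name: str, prompt: str, call_fn):
    # Only pay for the LLM call once; later identical requests hit the cache.
    key = cache_key(model_name, prompt)
    if key not in _cache:
        _cache[key] = call_fn(prompt)
    return _cache[key]
```

The same keying scheme explains why rerunning an identical prompt returns instantly, and why changing either the prompt or the model triggers a fresh request.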

#### Use LiteLLM backend for calling other models
You can use the [LiteLLM](https://docs.litellm.ai/docs/providers) backend for calling other models.

```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="claude-3-5-sonnet-20240620", backend="litellm")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
```

### Visualize in Curator Viewer
Run `curator-viewer` on the command line to see the dataset in the viewer.

You can click on a run and then click on a specific row to see the LLM request and response.
![Curator Responses](docs/curator-responses.png)
More examples below.

### `LLM`: A more powerful interface for synthetic data generation

Let's use structured outputs to generate poems.
```python
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import List

# Create a dataset object for the topics you want to create poems about.
topics = Dataset.from_dict({"topic": [
    "Urban loneliness in a bustling city",
    "Beauty of Bespoke Labs's Curator library"
]})
```

# Define a class to encapsulate a list of poems.
Define a class to encapsulate a list of poems.
```python
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")
```


# We define a Prompter that generates poems which gets applied to the topics dataset.
poet = curator.Prompter(
# `prompt_func` takes a row of the dataset as input.
# `row` is a dictionary with a single key 'topic' in this case.
We define an `LLM` object that generates poems which gets applied to the topics dataset.
```python
poet = curator.LLM(
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    # `row` is the input row, and `poems` is the `Poems` class which
    # is parsed from the structured output from the LLM.
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
    ],
)
```
Here:
* `prompt_func` takes a row of the dataset as input and returns the prompt for the LLM.
* `response_format` is the structured output class we defined above.
* `parse_func` takes the input (`row`) and the structured output (`poems`) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.
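To make the data flow concrete, here is a standalone sketch of what `parse_func` does, with plain dataclasses standing in for the pydantic models above (names and sample values are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Poem:
    poem: str

@dataclass
class Poems:
    poems_list: List[Poem]

# Same shape as the lambda above: one input row plus one structured
# response fan out into multiple flat rows.
def parse_func(row: dict, poems: Poems) -> List[dict]:
    return [{"topic": row["topic"], "poem": p.poem} for p in poems.poems_list]

rows = parse_func(
    {"topic": "Urban loneliness in a bustling city"},
    Poems(poems_list=[Poem("First poem..."), Poem("Second poem...")]),
)
# `rows` now holds two dictionaries, one per poem.
```

Because each dictionary is a flat row, the combined output across all input rows drops straight into a HuggingFace Dataset.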

Now we can apply the `LLM` object to the dataset, which reads very Pythonically.
```python
poem = poet(topics)
print(poem.to_pandas())
# Example output:
@@ -94,14 +140,11 @@ print(poem.to_pandas())
# 2 Beauty of Bespoke Labs's Curator library In whispers of design and crafted grace,\nBesp...
# 3 Beauty of Bespoke Labs's Curator library In the hushed breath of parchment and ink,\nBe...
```
Note that `topics` can be created with `curator.Prompter` as well,
Note that `topics` can be created with `curator.LLM` as well,
and we can scale this up to create tens of thousands of diverse poems.
You can see a more detailed example in the [examples/poem.py](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples/poem.py) file,
and other examples in the [examples](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples) directory.


See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.

@@ -115,6 +158,12 @@ curator-viewer

This will pop up a browser window with the viewer running on `127.0.0.1:3000` by default if you haven't specified a different host and port.

The dataset viewer shows all the different runs you have made.
![Curator Runs](docs/curator-runs.png)

You can also see the dataset and the responses from the LLM.
![Curator Dataset](docs/curator-dataset.png)


Optional parameters to run the viewer on a different host and port:
```bash
@@ -152,4 +201,4 @@ npm -v # should print `10.9.0`
```

## Contributing
Contributions are welcome!
Contributions are welcome!
9 changes: 2 additions & 7 deletions bespoke-dataset-viewer/app/dataset/[runHash]/page.tsx
@@ -10,11 +10,6 @@ export default async function DatasetPage({
  const { runHash } = await params
  const { batchMode } = await searchParams
  const isBatchMode = batchMode === '1'
  return (
    <html lang="en" suppressHydrationWarning>
      <body>
        <DatasetViewer runHash={runHash} batchMode={isBatchMode} />
      </body>
    </html>
  )

  return <DatasetViewer runHash={runHash} batchMode={isBatchMode} />
}
9 changes: 5 additions & 4 deletions bespoke-dataset-viewer/app/layout.tsx
@@ -1,6 +1,6 @@
import type { Metadata } from "next";
import "./globals.css";

import { Toaster } from "@/components/ui/toaster"

export const metadata: Metadata = {
title: "Curator Viewer",
@@ -13,10 +13,11 @@ export default function RootLayout({
children: React.ReactNode
}) {
return (
<html lang="en" suppressHydrationWarning>
<body suppressHydrationWarning>
<html lang="en" suppressHydrationWarning>
<body suppressHydrationWarning>
{children}
<Toaster />
</body>
</html>
)
}
}
@@ -7,22 +7,34 @@ import { Copy } from "lucide-react"
import { DataItem } from "@/types/dataset"
import { useCallback } from "react"
import { Sheet, SheetContent } from "@/components/ui/sheet"
import { useToast } from "@/components/ui/use-toast"

interface DetailsSidebarProps {
  item: DataItem | null
  onClose: () => void
}

export function DetailsSidebar({ item, onClose }: DetailsSidebarProps) {
  const { toast } = useToast()

  const copyToClipboard = useCallback(async (text: string) => {
    try {
      await navigator.clipboard.writeText(text)
      alert("Copied to clipboard!")
      toast({
        title: "Success",
        description: "Copied to clipboard!",
        duration: 2000,
      })
    } catch (err) {
      console.error("Failed to copy:", err)
      alert("Failed to copy to clipboard")
      toast({
        variant: "destructive",
        title: "Error",
        description: "Failed to copy to clipboard",
        duration: 2000,
      })
    }
  }, [])
  }, [toast])

  if (!item) return null

@@ -39,8 +39,8 @@ class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")
# We define a Prompter that generates poems which gets applied to the topics dataset.
poet = curator.Prompter(
# We define an LLM object that generates poems which gets applied to the topics dataset.
poet = curator.LLM(
    # prompt_func takes a row of the dataset as input.
    # row is a dictionary with a single key 'topic' in this case.
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
Expand Down
2 changes: 1 addition & 1 deletion bespoke-dataset-viewer/components/ui/use-toast.ts
@@ -6,7 +6,7 @@ import type {
} from "@/components/ui/toast"

const TOAST_LIMIT = 1
const TOAST_REMOVE_DELAY = 1000000
const TOAST_REMOVE_DELAY = 3000

type ToasterToast = ToastProps & {
id: string
8 changes: 8 additions & 0 deletions bespoke-dataset-viewer/package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions bespoke-dataset-viewer/package.json
@@ -37,6 +37,7 @@
  },
  "devDependencies": {
    "@types/node": "^20",
    "@types/prismjs": "^1.26.5",
    "@types/react": "^18",
    "@types/react-dom": "^18",
    "eslint": "^8",
2 changes: 1 addition & 1 deletion build_pkg.py
@@ -81,7 +81,7 @@ def nextjs_build():
def run_pytest():
    print("Running pytest")
    try:
        run_command("pytest", cwd="tests")
        run_command("pytest")
    except subprocess.CalledProcessError:
        print("Pytest failed. Aborting build.")
        sys.exit(1)
Binary file added docs/curator-dataset.png
Binary file added docs/curator-responses.png
Binary file added docs/curator-runs.png
6 changes: 3 additions & 3 deletions examples/camel.py
@@ -22,14 +22,14 @@ class QAs(BaseModel):
    qas: List[QA] = Field(description="A list of QAs")


subject_prompter = curator.Prompter(
subject_prompter = curator.LLM(
    prompt_func=lambda: f"Generate a diverse list of 3 subjects. Keep it high-level (e.g. Math, Science).",
    parse_func=lambda _, subjects: [subject for subject in subjects.subjects],
    model_name="gpt-4o-mini",
    response_format=Subjects,
)
subject_dataset = subject_prompter()
subsubject_prompter = curator.Prompter(
subsubject_prompter = curator.LLM(
    prompt_func=lambda subject: f"For the given subject {subject}. Generate 3 diverse subsubjects. No explanation.",
    parse_func=lambda subject, subsubjects: [
        {"subject": subject["subject"], "subsubject": subsubject.subject}
@@ -40,7 +40,7 @@ class QAs(BaseModel):
)
subsubject_dataset = subsubject_prompter(subject_dataset)

qa_prompter = curator.Prompter(
qa_prompter = curator.LLM(
    prompt_func=lambda subsubject: f"For the given subsubject {subsubject}. Generate 3 diverse questions and answers. No explanation.",
    model_name="gpt-4o-mini",
    response_format=QAs,
2 changes: 1 addition & 1 deletion examples/distill.py
@@ -21,7 +21,7 @@ def parse_func(row, response):
    return {"instruction": instruction, "new_response": response}


distill_prompter = curator.Prompter(
distill_prompter = curator.LLM(
    prompt_func=prompt_func,
    parse_func=parse_func,
    model_name="gpt-4o-mini",
2 changes: 1 addition & 1 deletion examples/litellm_recipe_prompting.py
@@ -31,7 +31,7 @@ def main():
    # 3. Set environment variable: GEMINI_API_KEY
    #############################################

    recipe_prompter = curator.Prompter(
    recipe_prompter = curator.LLM(
        model_name="gemini/gemini-1.5-flash",
        prompt_func=lambda row: f"Generate a random {row['cuisine']} recipe. Be creative but keep it realistic.",
        parse_func=lambda row, response: {
4 changes: 2 additions & 2 deletions examples/litellm_recipe_structured_output.py
@@ -28,7 +28,7 @@ def main():
    # 2. Generate an API key or use an existing API key
    # 3. Set environment variable: ANTHROPIC_API_KEY
    #############################################
    cuisines_generator = curator.Prompter(
    cuisines_generator = curator.LLM(
        prompt_func=lambda: f"Generate 10 diverse cuisines.",
        model_name="claude-3-5-haiku-20241022",
        response_format=Cuisines,
@@ -44,7 +44,7 @@
    # 2. Generate an API key or use an existing API key
    # 3. Set environment variable: GEMINI_API_KEY
    #############################################
    recipe_prompter = curator.Prompter(
    recipe_prompter = curator.LLM(
        model_name="gemini/gemini-1.5-flash",
        prompt_func=lambda row: f"Generate a random {row['cuisine']} recipe. Be creative but keep it realistic.",
        parse_func=lambda row, response: {
2 changes: 1 addition & 1 deletion examples/persona-hub/synthesize.py
@@ -31,7 +31,7 @@ def get_generator(template):
    def prompt_func(row):
        return template.format(persona=row["persona"])

    generator = curator.Prompter(
    generator = curator.LLM(
        prompt_func=prompt_func,
        model_name="gpt-4o",
        temperature=0.7,
6 changes: 3 additions & 3 deletions examples/poem.py
@@ -17,7 +17,7 @@ class Topics(BaseModel):


# We define a prompter that generates topics.
topic_generator = curator.Prompter(
topic_generator = curator.LLM(
    prompt_func=lambda: "Generate 10 diverse topics that are suitable for writing poems about.",
    model_name="gpt-4o-mini",
    response_format=Topics,
@@ -35,8 +35,8 @@ class Poems(BaseModel):
    poems_list: List[str] = Field(description="A list of poems.")


# We define a prompter that generates poems which gets applied to the topics dataset.
poet = curator.Prompter(
# We define an `LLM` object that generates poems which gets applied to the topics dataset.
poet = curator.LLM(
    # The prompt_func takes a row of the dataset as input.
    # The row is a dictionary with a single key 'topic' in this case.
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",