Commit 9bac631

docs(template): add readme documentation and fix bot comments
1 parent d12bf7f commit 9bac631

3 files changed: +119 −51 lines changed
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
# Kernel Python Template - Lead Scraper

This is a Kernel application that scrapes lead data from any website using Anthropic Computer Use with Kernel's Computer Controls API.

The application navigates to a target website, follows user instructions to find leads, and extracts structured data into JSON and CSV formats.

## Setup

1. Get your API keys:
   - **Kernel**: [dashboard.onkernel.com](https://dashboard.onkernel.com)
   - **Anthropic**: [console.anthropic.com](https://console.anthropic.com)
2. Deploy the app:

```bash
kernel login
cp .env.example .env  # Add your ANTHROPIC_API_KEY
kernel deploy main.py --env-file .env
```

## Usage

Scrape leads from any website by providing a URL and extraction instructions:

```bash
# Scrape attorneys from a bar association directory
kernel invoke lead-scraper scrape-leads --payload '{
  "url": "https://www.osbar.org/members/membersearch_start.asp",
  "instructions": "Find all active attorney members in Portland. Extract name, email, phone, and firm name.",
  "max_results": 10
}'

# Scrape business listings with session recording
kernel invoke lead-scraper scrape-leads --payload '{
  "url": "https://example-directory.com/restaurants",
  "instructions": "For each restaurant, get the name, address, phone number, and website URL.",
  "max_results": 15,
  "record_replay": true
}'

# Scrape team members from a company page
kernel invoke lead-scraper scrape-leads --payload '{
  "url": "https://example.com/about/team",
  "instructions": "Extract all team members with their name, title, and email address.",
  "max_results": 20
}'
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | string | Yes | The website URL to scrape leads from |
| `instructions` | string | Yes | Natural language description of what data to extract |
| `max_results` | integer | No | Maximum number of leads to extract (1-100). Defaults to 3. |
| `record_replay` | boolean | No | Set to `true` to record a video replay of the browser session. |

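The constraints in the table above can be sketched as a small payload builder. This is a hypothetical helper for callers scripting `kernel invoke`, not part of the template itself:

```python
import json


def build_payload(url: str, instructions: str, max_results: int = 3,
                  record_replay: bool = False) -> str:
    """Build the JSON payload string, enforcing the documented constraints."""
    if not url or not instructions:
        raise ValueError("url and instructions are required")
    if not 1 <= max_results <= 100:
        raise ValueError("max_results must be between 1 and 100")
    payload = {"url": url, "instructions": instructions, "max_results": max_results}
    if record_replay:
        # Optional flag; only include it when a replay recording is wanted.
        payload["record_replay"] = True
    return json.dumps(payload)
```

The returned string can be passed directly as the `--payload` argument.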
### Response

The response includes:

- `leads`: Array of extracted lead objects with dynamic fields based on the data found
- `total_found`: Number of leads successfully extracted
- `csv_data`: CSV-formatted string of all leads for download

Example response:

```json
{
  "leads": [
    {
      "name": "John Smith",
      "email": "john@smithlaw.com",
      "phone": "(503) 555-1234",
      "company": "Smith & Associates",
      "address": "123 Main St, Portland, OR",
      "website": "https://smithlaw.com"
    }
  ],
  "total_found": 1,
  "csv_data": "address,company,email,name,phone,website\n\"123 Main St, Portland, OR\",Smith & Associates,john@smithlaw.com,John Smith,(503) 555-1234,https://smithlaw.com\n"
}
```

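Since `csv_data` arrives as a single string, it can be parsed with Python's standard `csv` module. A minimal sketch using the example value above (note that `DictReader` correctly handles the quoted address field, which contains commas):

```python
import csv
import io

# The example `csv_data` string from the response shown above.
csv_data = (
    "address,company,email,name,phone,website\n"
    '"123 Main St, Portland, OR",Smith & Associates,'
    "john@smithlaw.com,John Smith,(503) 555-1234,https://smithlaw.com\n"
)

# Wrap the string in a file-like object so csv.DictReader can consume it.
rows = list(csv.DictReader(io.StringIO(csv_data)))
```

Each element of `rows` is a dict keyed by the CSV header, so `rows[0]["name"]` yields `"John Smith"`.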
## Recording Replays

> **Note:** Replay recording is only available to Kernel users on paid plans.

Add `"record_replay": true` to your payload to capture a video of the browser session:

```bash
kernel invoke lead-scraper scrape-leads --payload '{"url": "https://example.com", "instructions": "...", "record_replay": true}'
```

## How It Works

This application uses Anthropic's Computer Use capability to visually interact with websites:

1. **Browser Session**: Creates a Kernel browser session with stealth mode enabled
2. **Visual Navigation**: Uses Anthropic Claude to visually navigate the target website
3. **Lead Discovery**: Follows user instructions to find and identify leads on list pages
4. **Detail Enrichment**: Opens individual profile pages to extract additional fields
5. **Progressive Collection**: Accumulates data without overwriting previously found values
6. **Data Export**: Formats results as JSON and generates CSV with dynamic columns

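Steps 5 and 6 can be sketched as follows. The function names are illustrative, not the template's actual code: the idea is that detail-page data only fills in fields that are still empty, and the CSV header is the union of every field seen across all leads:

```python
import csv
import io


def merge_lead(existing: dict, update: dict) -> dict:
    """Progressive collection (step 5): add newly found fields without
    overwriting values that were already collected."""
    merged = dict(existing)
    for key, value in update.items():
        if value and not merged.get(key):
            merged[key] = value
    return merged


def leads_to_csv(leads: list[dict]) -> str:
    """Data export (step 6): CSV with dynamic columns. The header is the
    sorted union of every field present on any lead; missing fields are
    written as empty cells."""
    columns = sorted({key for lead in leads for key in lead})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    writer.writerows(leads)
    return buf.getvalue()
```

`csv.DictWriter` fills absent keys with its `restval` default (an empty string), which is what makes the dynamic-column export safe when different leads expose different fields.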
## Known Limitations

### Site-Specific Challenges

- **CAPTCHAs**: Some sites may present CAPTCHAs that block automated access
- **Login Walls**: Sites requiring authentication cannot be scraped without additional setup
- **Rate Limiting**: Aggressive scraping may trigger rate limits or blocks

### Dynamic Content

Modern websites may have dynamic content, popups, or cookie banners. The model attempts to handle these automatically but may occasionally need more specific instructions.

## Resources

- [Anthropic Computer Use Documentation](https://docs.anthropic.com/en/docs/build-with-claude/computer-use)
- [Kernel Documentation](https://www.kernel.sh/docs/quickstart)

pkg/templates/python/lead-scraper/loop.py

Lines changed: 0 additions & 50 deletions
```diff
@@ -167,56 +167,6 @@ async def sampling_loop(

         messages.append({"content": tool_result_content, "role": "user"})

-
-def _maybe_filter_to_n_most_recent_images(
-    messages: list[BetaMessageParam],
-    images_to_keep: int,
-    min_removal_threshold: int,
-):
-    """
-    With the assumption that images are screenshots that are of diminishing value as
-    the conversation progresses, remove all but the final `images_to_keep` tool_result
-    images in place, with a chunk of min_removal_threshold to reduce the amount we
-    break the implicit prompt cache.
-    """
-    if images_to_keep is None:
-        return messages
-
-    tool_result_blocks = cast(
-        list[BetaToolResultBlockParam],
-        [
-            item
-            for message in messages
-            for item in (
-                message["content"] if isinstance(message["content"], list) else []
-            )
-            if isinstance(item, dict) and item.get("type") == "tool_result"
-        ],
-    )
-
-    total_images = sum(
-        1
-        for tool_result in tool_result_blocks
-        for content in tool_result.get("content", [])
-        if isinstance(content, dict) and content.get("type") == "image"
-    )
-
-    images_to_remove = total_images - images_to_keep
-    # for better cache behavior, we want to remove in chunks
-    images_to_remove -= images_to_remove % min_removal_threshold
-
-    for tool_result in tool_result_blocks:
-        if isinstance(tool_result.get("content"), list):
-            new_content = []
-            for content in tool_result.get("content", []):
-                if isinstance(content, dict) and content.get("type") == "image":
-                    if images_to_remove > 0:
-                        images_to_remove -= 1
-                        continue
-                new_content.append(content)
-            tool_result["content"] = new_content
-
-
 def _response_to_params(
     response: BetaMessage,
 ) -> list[BetaContentBlockParam]:
```
pkg/templates/python/lead-scraper/main.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@
     "url": "https://www.osbar.org/members/membersearch_start.asp",
     "instructions": "Find all active attorney members in Portland. For each, extract as many information as possible.",
     "max_results": 3,
-    "record_play": false
+    "record_replay": false
 }'
 """
```
