</h1>
<p align="center">
<strong>Using LLMs to Robustly Extract Web Data</strong>
</p>
<div align="center">
</p>
</div>
## Overview
Lightfeed is a robust LLM-based web extraction library written in TypeScript. Use natural language prompts to navigate web pages and extract structured data. Looking to create pipelines or databases based on web data? Go to [lightfeed.ai](https://lightfeed.ai) and start for free!
### Features
- 🤖 [**Browser Automation**](#browser-automation) - Run Playwright browsers locally, serverless in the cloud, or connect to a remote browser server. Avoid detection with built-in anti-bot patches and proxy configuration.
- 🧹 [**LLM-ready Markdown**](#html-to-markdown-conversion) - Convert HTML to LLM-ready markdown, with options to extract only main content and clean URLs by removing tracking parameters.
- ⚡️ [**LLM Extraction**](#llm-extraction-function) - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking included.
- 🛠️ [**JSON Recovery**](#json-recovery) - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
## Usage
### E-commerce Product Extraction
This example demonstrates extracting structured product data from a real e-commerce website using a local headed Playwright browser. For production environments, you can use a Playwright browser in [serverless](#serverless-browser) or [remote](#remote-browser) mode.
> Run `npm run test:browser` to execute this example, or view the complete code in [testBrowserExtraction.ts](src/dev/testBrowserExtraction.ts).
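In outline, the example follows this flow — a sketch assuming the `Browser` class and `extract` function documented below; the schema fields, option values, and target URL are illustrative, and the real example in the repository may differ:

```typescript
import { z } from "zod";
import { Browser, extract } from "lightfeed"; // import path assumed

// Hypothetical product schema -- adjust fields to the site you target
const productSchema = z.object({
  products: z.array(
    z.object({
      name: z.string(),
      price: z.string().describe("Price as displayed, including currency"),
      url: z.string().describe("Link to the product detail page"),
    })
  ),
});

const browser = new Browser(); // local browser configuration
await browser.start();
const context = await browser.newContext();
const page = await context.newPage();
await page.goto("https://example-shop.com/catalog"); // hypothetical URL
const html = await page.content();

const result = await extract({
  content: html,
  format: "html", // assumed format value
  schema: productSchema,
  sourceUrl: page.url(), // required for HTML to resolve relative URLs
  googleApiKey: process.env.GOOGLE_API_KEY,
});
console.log(result.data.products);
await context.close();
```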
### Extracting from Markdown or Plain Text
You can also extract structured data directly from an HTML, Markdown, or plain text string:
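For example (a sketch — the `format` value and result shape are assumptions, not confirmed API):

```typescript
import { z } from "zod";
import { extract } from "lightfeed"; // import path assumed

const markdown = `
# Upcoming Talks
- [Scaling LLM Extraction](https://example.com/talks/1), Jane Doe, May 3
- [Robust Web Data](https://example.com/talks/2), John Roe, May 4
`;

const result = await extract({
  content: markdown,
  format: "markdown", // assumed value
  schema: z.object({
    talks: z.array(
      z.object({ title: z.string(), speaker: z.string(), url: z.string() })
    ),
  }),
  googleApiKey: process.env.GOOGLE_API_KEY,
});
```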
> [!NOTE]
> Currently, URL cleaning supports Amazon product URLs (amazon.com, amazon.ca) by removing `/ref=` parameters and everything after. The feature is designed to be extensible for other e-commerce platforms in the future.
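The documented rule can be pictured as a small standalone function (a sketch of the behavior, not the library's implementation):

```typescript
// Strip `/ref=` and everything after it from supported Amazon product URLs.
// Sketch of the documented cleaning rule; other URLs pass through unchanged.
function cleanAmazonUrl(url: string): string {
  const host = new URL(url).hostname;
  if (!/(^|\.)amazon\.(com|ca)$/.test(host)) return url;
  return url.replace(/\/ref=.*$/, "");
}

cleanAmazonUrl("https://www.amazon.com/dp/B08N5WRWNW/ref=sr_1_1?keywords=tv");
// → "https://www.amazon.com/dp/B08N5WRWNW"
```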
## LLM Extraction Function
### LLM API Keys
The library will check for LLM API keys in the following order:
1. Directly provided API key parameter (`googleApiKey` or `openaiApiKey`)
2. Environment variables (`GOOGLE_API_KEY` or `OPENAI_API_KEY`)
While the library can use environment variables, it's recommended to explicitly provide API keys in production code for better control and transparency.
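This precedence amounts to a simple two-step fallback, sketched here for clarity (not the library's actual code):

```typescript
// Explicitly provided key wins; otherwise fall back to the environment variable.
function resolveApiKey(
  explicitKey: string | undefined,
  envVar: string
): string | undefined {
  return explicitKey ?? process.env[envVar];
}

// e.g. resolveApiKey(options.googleApiKey, "GOOGLE_API_KEY")
// (`options` is a hypothetical name for the extract call's parameters)
```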
### extract()

The `extract` function is the main entry point for extracting structured data from content.
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `googleApiKey` | `string` | Google Gemini API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
| `openaiApiKey` | `string` | OpenAI API key (if using OpenAI provider) | From env `OPENAI_API_KEY` |
| `temperature` | `number` | Temperature for the LLM (0-1) | `0` |
| `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction ([see below](#htmlextractionoptions)) | `{}` |
| `sourceUrl` | `string` | URL of the HTML content, required when format is HTML to properly handle relative URLs | Required for HTML format |
| `maxInputTokens` | `number` | Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. When specified, content will be truncated if the total prompt size exceeds this limit. | `undefined` |
| `extractionContext` | `Record<string, any>` | Extraction context that provides additional information for the extraction process. Can include partial data objects to enrich, metadata like URLs/locations, or any contextual information relevant to the extraction task. | `undefined` |
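The `maxInputTokens` behavior can be sketched with the documented 4-characters-per-token heuristic (this mirrors the described behavior, not the library's actual code):

```typescript
// Truncate content so the prompt stays within a rough token budget,
// using ~4 characters per token as documented.
function truncateToTokenBudget(content: string, maxInputTokens: number): string {
  const maxChars = maxInputTokens * 4;
  return content.length > maxChars ? content.slice(0, maxChars) : content;
}
```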
## Browser Automation

The `Browser` class provides a clean interface for loading web pages with Playwright. Use it with direct Playwright calls to load HTML content before extracting structured data.
**Constructor**
```typescript
const browser = new Browser(config?: BrowserConfig)
```
**Methods**
| Method | Description | Returns |
|--------|-------------|---------|
| `start()` | Start the browser instance | `Promise<void>` |
| `newContext()` | Create a new browser context (browser must be started) | `Promise<BrowserContext>` |
| `isStarted()` | Check if the browser is currently running | `boolean` |
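Combined with direct Playwright calls, a typical lifecycle looks like this (a sketch; `newPage()`, `goto()`, `content()`, and `close()` are standard Playwright APIs on the returned context):

```typescript
const browser = new Browser();
await browser.start();
browser.isStarted(); // → true

const context = await browser.newContext(); // a Playwright BrowserContext
const page = await context.newPage();       // standard Playwright from here on
await page.goto("https://example.com");
const html = await page.content();
await context.close();
```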
### Local Browser
Use your local Chrome browser for development and testing. Perfect for:
- Local development and debugging
```typescript
const browser = new Browser({
  // ...
});
```
### Serverless Browser
Perfect for AWS Lambda and other serverless environments. Uses [@sparticuz/chromium](https://github.com/Sparticuz/chromium) to run Chrome with minimal cold start times and memory usage. Supports proxy configuration for geo-targeting and unblocking.
```typescript
const browser = new Browser({
  // ...
});
```
### Remote Browser
Connect to any remote browser instance via WebSocket. Great for:
- Brightdata's Scraping Browser
```typescript
const browser = new Browser({
  // ...
});
```
## HTML to Markdown Conversion
The `convertHtmlToMarkdown` utility function allows you to convert HTML content to markdown without performing extraction.
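For instance (a sketch — the option names are assumptions based on the features described above; check `HTMLExtractionOptions` for the real ones):

```typescript
import { convertHtmlToMarkdown } from "lightfeed"; // import path assumed

const html = `
<html><body>
  <nav>Site menu</nav>
  <article>
    <h1>Hello</h1>
    <p>Read the <a href="https://example.com/docs?utm_source=newsletter">docs</a>.</p>
  </article>
</body></html>`;

const markdown = convertHtmlToMarkdown(html, {
  extractMainContent: true, // assumed option: keep only the main content
  cleanUrls: true,          // assumed option: strip tracking parameters
});
```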