Commit 7ef282c

Update README.md (#32)
1 parent b3c4647 commit 7ef282c

File tree: 1 file changed (+44, -47 lines)


README.md

Lines changed: 44 additions & 47 deletions
@@ -4,7 +4,7 @@
 </h1>
 
 <p align="center">
-<strong>Use LLMs to robustly extract structured data from HTML and markdown</strong>
+<strong>Using LLMs to Robustly Extract Web Data</strong>
 </p>
 
 <div align="center">
@@ -29,47 +29,32 @@
 </p>
 </div>
 
-## How It Works
+## Overview
+Lightfeed is a robust LLM-based web extraction library written in TypeScript. Use natural language prompts to navigate web pages and extract structured data. Looking to create pipelines or databases based on web data? Go to [lightfeed.ai](https://lightfeed.ai) and start for free!
 
-1. **Browser Loading (New!)**: Use the new `Browser` class to load web pages with Stealth Playwright, handling JavaScript-rendered content with built-in anti-bot patches. Choose between local, serverless, or remote browser configurations for maximum flexibility.
+### Features
 
-2. **HTML to Markdown Conversion**: HTML content (either from a browser or direct HTML string) is converted to clean, LLM-friendly markdown. This step can optionally extract only the main content, include images, and clean URLs by removing tracking parameters. See [HTML to Markdown Conversion](#html-to-markdown-conversion) section for details. The `convertHtmlToMarkdown` function can also be used standalone.
+- 🤖 [**Browser Automation**](#browser-automation) - Run Playwright browsers locally, serverless in the cloud, or connect to a remote browser server. Avoid detection with built-in anti-bot patches and proxy configuration.
 
-3. **LLM Processing**: The markdown is sent to an LLM in JSON mode (Google Gemini 2.5 flash or OpenAI GPT-4o mini by default) with a prompt to extract structured data according to your Zod schema or enrich existing data objects. You can set a maximum input token limit to control costs or avoid exceeding the model's context window, and the function will return token usage metrics for each LLM call.
+- 🧹 [**LLM-ready Markdown**](#html-to-markdown-conversion) - Convert HTML to LLM-ready markdown, with options to extract only the main content and clean URLs by removing tracking parameters.
 
-4. **JSON Sanitization**: If the LLM structured output fails or doesn't fully match your schema, a sanitization process attempts to recover and fix the data. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays. See [JSON Sanitization](#json-sanitization) for details.
+- ⚡️ [**LLM Extraction**](#llm-extraction-function) - Use LLMs in JSON mode to extract structured data according to an input Zod schema. Token usage limits and tracking are included.
 
-5. **URL Validation**: All extracted URLs are validated - handling relative URLs, removing invalid ones, and repairing markdown-escaped links. See [URL Validation](#url-validation) section for details.
+- 🛠️ [**JSON Recovery**](#json-recovery) - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
 
-## Why use an LLM extractor?
-- 💡 Understands natural language criteria and context to extract the data you need, not just raw content as displayed
-- 🚀 One solution works across all websites — no need to build custom scrapers for each site
-- 🔁 Resilient to website changes, e.g., HTML structure, CSS selectors, or page layout
-- ✅ LLMs are becoming more accurate and cost-effective
+- 🔗 [**URL Validation**](#url-validation) - Handle relative URLs, remove invalid ones, and repair markdown-escaped links.
 
 ## Installation
 
 ```bash
 npm install @lightfeed/extractor
 ```
 
-## Hosted Version
-
-While this library provides a robust foundation for data extraction, you might want to consider [lightfeed.ai](https://lightfeed.ai) if you need:
-
-- ⚡️ **Database with API**: Manage data in a production-ready vector database with real-time API
-- 📊 **Deduplication and Value History**: Maintain consistent data with automatic change tracking
-- 🤖 **AI Enrichment**: Enrich any data point — contact info, product details, company intelligence, and more
-- **Workflow Automation**: Set up intelligent data pipelines that run automatically on your schedule
-- 📍 **Geolocation Targeting**: Capture region-specific price, inventory and campaign data for competitive intelligence
-
 ## Usage
 
-### E-commerce Product Extraction with Stealth Browser
+### E-commerce Product Extraction
 
-This example demonstrates extracting structured product data from a real e-commerce website using a stealth Playwright browser that handles JavaScript rendering and bypasses anti-bot detection. We use a local browser configuration here, but you can also use [serverless or remote browsers](#browser-loading) for production deployments.
-
-> **💡 Try it yourself:** Run `npm run test:browser` to execute this example, or view the complete code in `src/dev/testBrowserExtraction.ts`
+This example demonstrates extracting structured product data from a real e-commerce website using a local headed Playwright browser. For production environments, you can use a Playwright browser in [serverless](#serverless-browser) or [remote](#remote-browser) mode.
 
 ```typescript
 import { extract, ContentFormat, LLMProvider, Browser } from "@lightfeed/extractor";
@@ -170,6 +155,9 @@ try {
 */
 ```
 
+> [!TIP]
+> Run `npm run test:browser` to execute this example, or view the complete code in [testBrowserExtraction.ts](src/dev/testBrowserExtraction.ts).
+
 ### Extracting from Markdown or Plain Text
 
 You can also extract structured data directly from an HTML, Markdown, or plain-text string:
@@ -343,17 +331,17 @@
 > [!NOTE]
 > Currently, URL cleaning supports Amazon product URLs (amazon.com, amazon.ca) by removing `/ref=` parameters and everything after. The feature is designed to be extensible for other e-commerce platforms in the future.
 
-## API Keys
+## LLM Extraction Function
 
-The library will check for API keys in the following order:
+### LLM API Keys
+
+The library will check for LLM API keys in the following order:
 
 1. Directly provided API key parameter (`googleApiKey` or `openaiApiKey`)
 2. Environment variables (`GOOGLE_API_KEY` or `OPENAI_API_KEY`)
 
 While the library can use environment variables, it's recommended to explicitly provide API keys in production code for better control and transparency.
 
-## API Reference
-
 ### `extract<T>(options: ExtractorOptions<T>): Promise<ExtractorResult<T>>`
 
 Main function to extract structured data from content.
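The documented lookup order (explicit parameter first, then environment variable) can be sketched as below. The helper name `resolveGoogleApiKey` is illustrative only, not part of the library's API:

```typescript
// Illustrative sketch of the documented precedence: an explicitly passed
// key wins over the GOOGLE_API_KEY environment variable.
function resolveGoogleApiKey(explicitKey?: string): string | undefined {
  return explicitKey ?? process.env.GOOGLE_API_KEY;
}

process.env.GOOGLE_API_KEY = "key-from-env";
console.log(resolveGoogleApiKey("key-from-args")); // explicit key wins
console.log(resolveGoogleApiKey()); // falls back to the environment variable
```

The same order applies to `openaiApiKey` and `OPENAI_API_KEY` when using the OpenAI provider.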
@@ -371,12 +359,12 @@
 | `googleApiKey` | `string` | Google Gemini API key (if using Google Gemini provider) | From env `GOOGLE_API_KEY` |
 | `openaiApiKey` | `string` | OpenAI API key (if using OpenAI provider) | From env `OPENAI_API_KEY` |
 | `temperature` | `number` | Temperature for the LLM (0-1) | `0` |
-| `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction (see below) | `{}` |
+| `htmlExtractionOptions` | `HTMLExtractionOptions` | HTML-specific options for content extraction ([see below](#htmlextractionoptions)) | `{}` |
 | `sourceUrl` | `string` | URL of the HTML content, required when format is HTML to properly handle relative URLs | Required for HTML format |
 | `maxInputTokens` | `number` | Maximum number of input tokens to send to the LLM. Uses a rough conversion of 4 characters per token. When specified, content will be truncated if the total prompt size exceeds this limit. | `undefined` |
 | `extractionContext` | `Record<string, any>` | Extraction context that provides additional information for the extraction process. Can include partial data objects to enrich, metadata like URLs/locations, or any contextual information relevant to the extraction task. | `undefined` |
 
-#### HTML Extraction Options
+#### htmlExtractionOptions
 
 | Option | Type | Description | Default |
 |--------|------|-------------|---------|
@@ -399,18 +387,16 @@
 }
 ```
 
-### Browser Loading
+## Browser Automation
 
 The `Browser` class provides a clean interface for loading web pages with Playwright. Use it with direct Playwright calls to load HTML content before extracting structured data.
 
-#### Constructor
-
+**Constructor**
 ```typescript
 const browser = new Browser(config?: BrowserConfig)
 ```
 
-#### Methods
-
+**Methods**
 | Method | Description | Returns |
 |--------|-------------|---------|
 | `start()` | Start the browser instance | `Promise<void>` |
@@ -419,7 +405,7 @@
 | `newContext()` | Create a new browser context (browser must be started) | `Promise<BrowserContext>` |
 | `isStarted()` | Check if the browser is currently running | `boolean` |
 
-#### Local Browser
+### Local Browser
 
 Use your local Chrome browser for development and testing. Perfect for:
 - Local development and debugging
@@ -435,7 +421,7 @@
 });
 ```
 
-#### Serverless
+### Serverless Browser
 
 Perfect for AWS Lambda and other serverless environments. Uses [@sparticuz/chromium](https://github.com/Sparticuz/chromium) to run Chrome in serverless environments with minimal cold start times and memory usage. Supports proxy configuration for geo-targeting and unblocking.
 
@@ -466,7 +452,7 @@
 });
 ```
 
-#### Remote Browser
+### Remote Browser
 
 Connect to any remote browser instance via WebSocket. Great for:
 - Brightdata's Scraping Browser
@@ -485,7 +471,7 @@
 });
 ```
 
-### HTML to Markdown Conversion
+## HTML to Markdown Conversion
 
 The `convertHtmlToMarkdown` utility function allows you to convert HTML content to markdown without performing extraction.
 
@@ -494,19 +480,19 @@ The `convertHtmlToMarkdown` utility function allows you to convert HTML content
 convertHtmlToMarkdown(html: string, options?: HTMLExtractionOptions, sourceUrl?: string): string
 ```
 
-#### Parameters
+### Parameters
 
 | Parameter | Type | Description | Default |
 |-----------|------|-------------|---------|
 | `html` | `string` | HTML content to convert to markdown | Required |
 | `options` | `HTMLExtractionOptions` | See [HTML Extraction Options](#html-extraction-options) | `undefined` |
 | `sourceUrl` | `string` | URL of the HTML content, used to properly convert relative URLs to absolute URLs | `undefined` |
 
-#### Return Value
+### Return Value
 
 The function returns a string containing the markdown conversion of the HTML content.
 
-#### Example
+### Example
 
 ```typescript
 import { convertHtmlToMarkdown, HTMLExtractionOptions } from "@lightfeed/extractor";
@@ -543,7 +529,7 @@ console.log(markdownWithOptions);
 // Output: "![Logo](https://example.com/images/logo.png)[About](https://example.com/about)[Amazon Product](https://www.amazon.com/product/dp/B123)"
 ```
 
-### JSON Sanitization
+## JSON Recovery
 
 The `safeSanitizedParser` utility function helps sanitize and recover partial data from LLM outputs that may not perfectly conform to your schema.
 
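To make the recovery idea concrete, here is a minimal, dependency-free sketch of the concept. It is not the library's implementation (which works against arbitrary Zod schemas): invalid array items are dropped so the rest of the output survives, instead of one bad item failing the whole parse.

```typescript
// Conceptual sketch only: recover the valid portion of an LLM output
// instead of rejecting everything when one item is malformed.
interface Product {
  name: string;
  price: number;
}

function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.price === "number";
}

function recoverProducts(raw: unknown): Product[] {
  if (!Array.isArray(raw)) return [];
  // Keep items that satisfy the expected shape; silently drop the rest
  return raw.filter(isProduct);
}

const llmOutput: unknown = [
  { name: "Widget", price: 9.99 },
  { name: "Gadget", price: "N/A" }, // wrong type: dropped, not fatal
];
console.log(recoverProducts(llmOutput)); // [{ name: "Widget", price: 9.99 }]
```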
@@ -648,7 +634,7 @@ This utility is especially useful when:
 - Objects contain invalid values that don't match constraints
 - You want to recover as much valid data as possible while safely removing problematic parts
 
-### URL Validation
+## URL Validation
 
 The library provides robust URL validation and handling through Zod's `z.string().url()` validator:
 
@@ -667,7 +653,7 @@
 });
 ```
 
-#### How URL Validation Works
+### How URL Validation Works
 
 Our URL validation system provides several key benefits:
 
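As a rough illustration of what relative-URL handling involves, the sketch below uses the built-in WHATWG `URL` class rather than the library's internals; `normalizeUrl` is a hypothetical helper, not an exported function:

```typescript
// Illustrative only: resolve relative URLs against the source page and
// drop values that cannot be parsed at all.
function normalizeUrl(raw: string, sourceUrl?: string): string | undefined {
  try {
    return new URL(raw, sourceUrl).href;
  } catch {
    return undefined; // invalid URL: removed from the result
  }
}

console.log(normalizeUrl("/about", "https://example.com/products"));
// → "https://example.com/about"
console.log(normalizeUrl("not a url")); // → undefined (cannot be parsed)
```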
@@ -767,6 +753,17 @@ npm run test -- -t "should convert forum/tech-0 to markdown"
 
 The `-t` flag uses pattern matching, so you can be as specific or general as needed to select the tests you want to run.
 
+## Hosted Version
+
+While this library provides a robust foundation for data extraction, you might want to consider [lightfeed.ai](https://lightfeed.ai) if you need:
+
+- 🤖 **AI Enrichment** - Enrich any data point: contact info, product details, company intelligence, and more
+- 📍 **Geolocation Targeting** - Capture region-specific price, inventory, and campaign data for competitive intelligence
+- **Workflow Automation** - Set up intelligent data pipelines that run automatically on your schedule
+- 📊 **Deduplication and Value History** - Maintain consistent data with automatic change tracking
+- ⚡️ **Database with API** - Manage data in a production-ready vector database with real-time API
+- 🥷 **Premium Proxies and Anti-bot** - Automatically handle CAPTCHAs and proxy rotation without intervention
+
 ## Support
 
 If you need direct assistance with your implementation: