
Commit f942f19

feat: add local html parser (#1351)
* feat: add local html parser
* format with ruff
* format with ruff
* fix package name
* run ruff
* add comments to explain regex
* add localhtmlparser flag option to ps1 py sh
* add tests
* add beautiful soup types
* run black again
* docs: add HTML parser to low cost
* docs: add local parsers info
* docs: remove spaces and add to TOC
* Update docs/deploy_lowcost.md
* Update tests/test_htmlparser.py
* add more tests to cover all cases
* Move to deploy features branch
* Reformat the args
* Reformat the args
* Add output to indicate parser used
* Coverage for verbose

Co-authored-by: Pamela Fox <[email protected]>
1 parent e3de788 commit f942f19

13 files changed: +201 / -18 lines

README.md

Lines changed: 0 additions & 2 deletions
@@ -33,7 +33,6 @@ urlFragment: azure-search-openai-demo
 - [VS Code Dev Containers](#vs-code-dev-containers)
 - [Local environment](#local-environment)
 - [Deploying](#deploying)
-- [Using existing Azure resources](#using-existing-azure-resources)
 - [Deploying again](#deploying-again)
 - [Sharing environments](#sharing-environments)
 - [Enabling optional features](#enabling-optional-features)
@@ -176,7 +175,6 @@ It will look like the following:

 > NOTE: It may take 5-10 minutes after you see 'SUCCESS' for the application to be fully deployed. If you see a "Python Developer" welcome screen or an error page, then wait a bit and refresh the page. See [guide on debugging App Service deployments](docs/appservice.md).

-
 ### Deploying again

 If you've only changed the backend/frontend code in the `app` folder, then you don't need to re-provision the Azure resources. You can just run:

docs/deploy_features.md

Lines changed: 8 additions & 0 deletions
@@ -10,6 +10,7 @@ You should typically enable these features before running `azd up`. Once you've
 * [Enabling authentication](#enabling-authentication)
 * [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
 * [Enabling CORS for an alternate frontend](#enabling-cors-for-an-alternate-frontend)
+* [Using local parsers](#using-local-parsers)

 ## Using GPT-4

@@ -56,3 +57,10 @@ For an alternate frontend that's written in Web Components and deployed to Stati
 [azure-search-openai-javascript](https://github.com/Azure-Samples/azure-search-openai-javascript) and its guide
 on [using a different backend](https://github.com/Azure-Samples/azure-search-openai-javascript#using-a-different-backend).
 Both these repositories adhere to the same [HTTP protocol for RAG chat apps](https://github.com/Azure-Samples/ai-chat-app-protocol).
+
+## Using local parsers
+
+If you want to decrease the charges by using local parsers instead of Azure Document Intelligence, you can set environment variables before running the [data ingestion script](/docs/data_ingestion.md). Note that local parsers will generally not be as sophisticated.
+
+1. Run `azd env set USE_LOCAL_PDF_PARSER true` to use the local PDF parser.
+1. Run `azd env set USE_LOCAL_HTML_PARSER true` to use the local HTML parser.

docs/deploy_lowcost.md

Lines changed: 23 additions & 9 deletions
@@ -29,21 +29,35 @@ However, if your goal is to minimize costs while prototyping your application, f
 3. The free tier does not support Managed Identity (keyless API access),
    so the Bicep will use Azure Key Vault to securely store the key instead.

-4. Use the free tier of Azure Document Intelligence (used in analyzing PDFs):
+4. Use the free tier of Azure Document Intelligence (used in analyzing files):

+
    ```shell
    azd env set AZURE_DOCUMENTINTELLIGENCE_SKU F0
    ```

-   Limitation: The free tier will only scan the first two pages of each PDF.
-   In our sample documents, those first two pages are just title pages,
-   so you won't be able to get answers from the documents.
-   You can either use your own documents that are only 2-pages long,
-   or you can use a local Python package for PDF parsing by setting:
+   **Limitation for PDF files:**

-   ```shell
-   azd env set USE_LOCAL_PDF_PARSER true
-   ```
+   The free tier will only scan the first two pages of each PDF.
+   In our sample documents, those first two pages are just title pages,
+   so you won't be able to get answers from the documents.
+   You can either use your own documents that are only 2-pages long,
+   or you can use a local Python package for PDF parsing by setting:
+
+   ```shell
+   azd env set USE_LOCAL_PDF_PARSER true
+   ```
+
+   **Limitation for HTML files:**
+
+   The free tier will only scan the first two pages of each HTML file.
+   So, you might not get very accurate answers from the files.
+   You can either use your own files that are only 2-pages long,
+   or you can use a local Python package for HTML parsing by setting:
+
+   ```shell
+   azd env set USE_LOCAL_HTML_PARSER true
+   ```

 5. Turn off Azure Monitor (Application Insights):

scripts/prepdocs.ps1

Lines changed: 5 additions & 1 deletion
@@ -57,6 +57,10 @@ if ($env:USE_LOCAL_PDF_PARSER -eq $true) {
     $localPdfParserArg = "--localpdfparser"
 }

+if ($env:USE_LOCAL_HTML_PARSER -eq $true) {
+    $localHtmlParserArg = "--localhtmlparser"
+}
+
 if ($env:AZURE_TENANT_ID) {
     $tenantArg = "--tenantid $env:AZURE_TENANT_ID"
 }
@@ -80,7 +84,7 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
   "$searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg " + `
   "$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg " + `
   "$tenantArg $aclArg " + `
-  "$disableVectorsArg $localPdfParserArg " + `
+  "$disableVectorsArg $localPdfParserArg $localHtmlParserArg " + `
   "$keyVaultName " + `
   "$integratedVectorizationArg "

scripts/prepdocs.py

Lines changed: 13 additions & 2 deletions
@@ -16,6 +16,7 @@
 )
 from prepdocslib.fileprocessor import FileProcessor
 from prepdocslib.filestrategy import FileStrategy
+from prepdocslib.htmlparser import LocalHTMLParser
 from prepdocslib.integratedvectorizerstrategy import (
     IntegratedVectorizerStrategy,
 )
@@ -48,6 +49,7 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         verbose=args.verbose,
     )

+    html_parser: Parser
     pdf_parser: Parser
     doc_int_parser: DocumentAnalysisParser

@@ -64,13 +66,17 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         verbose=args.verbose,
     )
     if args.localpdfparser or args.documentintelligenceservice is None:
-        pdf_parser = LocalPdfParser()
+        pdf_parser = LocalPdfParser(verbose=args.verbose)
     else:
         pdf_parser = doc_int_parser
+    if args.localhtmlparser or args.documentintelligenceservice is None:
+        html_parser = LocalHTMLParser(verbose=args.verbose)
+    else:
+        html_parser = doc_int_parser
     sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=args.searchimages)
     file_processors = {
         ".pdf": FileProcessor(pdf_parser, sentence_text_splitter),
-        ".html": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".html": FileProcessor(html_parser, sentence_text_splitter),
         ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
         ".docx": FileProcessor(doc_int_parser, sentence_text_splitter),
         ".pptx": FileProcessor(doc_int_parser, sentence_text_splitter),
@@ -353,6 +359,11 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
         action="store_true",
         help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
     )
+    parser.add_argument(
+        "--localhtmlparser",
+        action="store_true",
+        help="Use Beautiful Soup local HTML parser instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
+    )
     parser.add_argument(
         "--documentintelligenceservice",
         required=False,
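
The net effect of the prepdocs.py change is that the `.html` extension is routed to whichever parser the fallback logic picks: the local BeautifulSoup parser when `--localhtmlparser` is passed or when no Document Intelligence service is configured, and `DocumentAnalysisParser` otherwise. A minimal standalone sketch of that selection rule, using a hypothetical `args` namespace in place of the real parsed CLI arguments:

```python
from argparse import Namespace

# Hypothetical stand-in for the parsed command-line arguments seen by setup_file_strategy.
args = Namespace(localhtmlparser=False, documentintelligenceservice=None, verbose=True)

# Mirrors the fallback above: the local parser is chosen when it is explicitly
# requested, or when there is no Document Intelligence service to fall back to.
use_local_html = args.localhtmlparser or args.documentintelligenceservice is None
print("HTML parser:", "LocalHTMLParser (BeautifulSoup)" if use_local_html else "DocumentAnalysisParser")
```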

scripts/prepdocs.sh

Lines changed: 5 additions & 1 deletion
@@ -62,6 +62,10 @@ if [ "$USE_LOCAL_PDF_PARSER" = true ]; then
   localPdfParserArg="--localpdfparser"
 fi

+if [ "$USE_LOCAL_HTML_PARSER" = true ]; then
+  localHtmlParserArg="--localhtmlparser"
+fi
+
 if [ -n "$AZURE_TENANT_ID" ]; then
   tenantArg="--tenantid $AZURE_TENANT_ID"
 fi
@@ -82,6 +86,6 @@ $searchAnalyzerNameArg $searchSecretNameArg \
 $searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
 $adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
 $tenantArg $aclArg \
-$disableVectorsArg $localPdfParserArg \
+$disableVectorsArg $localPdfParserArg $localHtmlParserArg \
 $keyVaultName \
 $integratedVectorizationArg

scripts/prepdocslib/htmlparser.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+import re
+from typing import IO, AsyncGenerator
+
+from bs4 import BeautifulSoup
+
+from .page import Page
+from .parser import Parser
+
+
+def cleanup_data(data: str) -> str:
+    """Cleans up the given content using regexes
+    Args:
+        data: (str): The data to clean up.
+    Returns:
+        str: The cleaned up data.
+    """
+    # match two or more newlines and replace them with one new line
+    output = re.sub(r"\n{2,}", "\n", data)
+    # match two or more spaces that are not newlines and replace them with one space
+    output = re.sub(r"[^\S\n]{2,}", " ", output)
+    # match two or more hyphens and replace them with two hyphens
+    output = re.sub(r"-{2,}", "--", output)
+
+    return output.strip()
+
+
+class LocalHTMLParser(Parser):
+    """Parses HTML text into Page objects."""
+
+    async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
+        """Parses the given content.
+        To learn more, please visit https://pypi.org/project/beautifulsoup4/
+        Args:
+            content (IO): The content to parse.
+        Returns:
+            Page: The parsed html Page.
+        """
+        if self.verbose:
+            print(f"\tExtracting text from '{content.name}' using local HTML parser (BeautifulSoup)")
+
+        data = content.read()
+        soup = BeautifulSoup(data, "html.parser")
+
+        # Get text only from html file
+        result = soup.get_text()
+
+        yield Page(0, 0, text=cleanup_data(result))
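
As a rough illustration of what the `cleanup_data` helper above does to the text BeautifulSoup extracts (a self-contained sketch that only repeats the regexes from the new module, with made-up sample input):

```python
import re


def cleanup_data(data: str) -> str:
    # Collapse runs of blank lines, runs of non-newline whitespace, and runs of hyphens.
    output = re.sub(r"\n{2,}", "\n", data)
    output = re.sub(r"[^\S\n]{2,}", " ", output)
    output = re.sub(r"-{2,}", "--", output)
    return output.strip()


raw = "Title\n\n\n  Some    text ---- more\n\n"
print(repr(cleanup_data(raw)))  # 'Title\n Some text -- more'
```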

scripts/prepdocslib/parser.py

Lines changed: 6 additions & 0 deletions
@@ -9,6 +9,12 @@ class Parser(ABC):
     Abstract parser that parses content into Page objects
     """

+    def __init__(
+        self,
+        verbose: bool = False,
+    ):
+        self.verbose = verbose
+
     async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
         if False:
             yield  # pragma: no cover - this is necessary for mypy to type check

scripts/prepdocslib/pdfparser.py

Lines changed: 4 additions & 1 deletion
@@ -18,6 +18,9 @@ class LocalPdfParser(Parser):
     """

     async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
+        if self.verbose:
+            print(f"\tExtracting text from '{content.name}' using local PDF parser (pypdf)")
+
         reader = PdfReader(content)
         pages = reader.pages
         offset = 0
@@ -29,7 +32,7 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:

 class DocumentAnalysisParser(Parser):
     """
-    Concrete parser backed by Azure AI Document Intelligence that can parse PDFS into pages
+    Concrete parser backed by Azure AI Document Intelligence that can parse many document formats into pages
     To learn more, please visit https://learn.microsoft.com/azure/ai-services/document-intelligence/overview
     """

scripts/requirements.in

Lines changed: 2 additions & 0 deletions
@@ -5,12 +5,14 @@ azure-search-documents==11.6.0b1
 azure-ai-documentintelligence
 azure-storage-blob
 azure-storage-file-datalake
+beautifulsoup4
 openai[datalib]>=1.3.5
 tiktoken
 tenacity
 azure-keyvault-secrets
 Pillow
 PyMuPDF
+types-beautifulsoup4
 types-Pillow
 cryptography
 python-jose[cryptography]
