
Commit f942f19

feat: add local html parser (#1351)
* feat: add local html parser
* format with ruff
* format with ruff
* fix package name
* run ruff
* add comments to explain regex
* add localhtmlparser flag option to ps1 py sh
* add tests
* add beautiful soup types
* run black again
* docs: add HTML parser to low cost
* docs: add local parsers info
* docs: remove spaces and add to TOC
* Update docs/deploy_lowcost.md
* Update tests/test_htmlparser.py
* add more tests to cover all cases
* Move to deploy features branch
* Reformat the args
* Reformat the args
* Add output to indicate parser used
* Coverage for verbose

Co-authored-by: Pamela Fox <[email protected]>
1 parent e3de788 commit f942f19

13 files changed: +201 / -18 lines

README.md

Lines changed: 0 additions & 2 deletions
@@ -33,7 +33,6 @@ urlFragment: azure-search-openai-demo
 - [VS Code Dev Containers](#vs-code-dev-containers)
 - [Local environment](#local-environment)
 - [Deploying](#deploying)
-- [Using existing Azure resources](#using-existing-azure-resources)
 - [Deploying again](#deploying-again)
 - [Sharing environments](#sharing-environments)
 - [Enabling optional features](#enabling-optional-features)
@@ -176,7 +175,6 @@ It will look like the following:

 > NOTE: It may take 5-10 minutes after you see 'SUCCESS' for the application to be fully deployed. If you see a "Python Developer" welcome screen or an error page, then wait a bit and refresh the page. See [guide on debugging App Service deployments](docs/appservice.md).

-
 ### Deploying again

 If you've only changed the backend/frontend code in the `app` folder, then you don't need to re-provision the Azure resources. You can just run:

docs/deploy_features.md

Lines changed: 8 additions & 0 deletions
@@ -10,6 +10,7 @@ You should typically enable these features before running `azd up`. Once you've
 * [Enabling authentication](#enabling-authentication)
 * [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
 * [Enabling CORS for an alternate frontend](#enabling-cors-for-an-alternate-frontend)
+* [Using local parsers](#using-local-parsers)

 ## Using GPT-4

@@ -56,3 +57,10 @@ For an alternate frontend that's written in Web Components and deployed to Stati
 [azure-search-openai-javascript](https://github.com/Azure-Samples/azure-search-openai-javascript) and its guide
 on [using a different backend](https://github.com/Azure-Samples/azure-search-openai-javascript#using-a-different-backend).
 Both these repositories adhere to the same [HTTP protocol for RAG chat apps](https://github.com/Azure-Samples/ai-chat-app-protocol).
+
+## Using local parsers
+
+If you want to decrease the charges by using local parsers instead of Azure Document Intelligence, you can set environment variables before running the [data ingestion script](/docs/data_ingestion.md). Note that local parsers will generally not be as sophisticated.
+
+1. Run `azd env set USE_LOCAL_PDF_PARSER true` to use the local PDF parser.
+1. Run `azd env set USE_LOCAL_HTML_PARSER true` to use the local HTML parser.

docs/deploy_lowcost.md

Lines changed: 23 additions & 9 deletions
@@ -29,21 +29,35 @@ However, if your goal is to minimize costs while prototyping your application, f
 3. The free tier does not support Managed Identity (keyless API access),
    so the Bicep will use Azure Key Vault to securely store the key instead.

-4. Use the free tier of Azure Document Intelligence (used in analyzing PDFs):
+4. Use the free tier of Azure Document Intelligence (used in analyzing files):

+
    ```shell
    azd env set AZURE_DOCUMENTINTELLIGENCE_SKU F0
    ```

-   Limitation: The free tier will only scan the first two pages of each PDF.
-   In our sample documents, those first two pages are just title pages,
-   so you won't be able to get answers from the documents.
-   You can either use your own documents that are only 2-pages long,
-   or you can use a local Python package for PDF parsing by setting:
+   **Limitation for PDF files:**

-   ```shell
-   azd env set USE_LOCAL_PDF_PARSER true
-   ```
+   The free tier will only scan the first two pages of each PDF.
+   In our sample documents, those first two pages are just title pages,
+   so you won't be able to get answers from the documents.
+   You can either use your own documents that are only 2-pages long,
+   or you can use a local Python package for PDF parsing by setting:
+
+   ```shell
+   azd env set USE_LOCAL_PDF_PARSER true
+   ```
+
+   **Limitation for HTML files:**
+
+   The free tier will only scan the first two pages of each HTML file.
+   So, you might not get very accurate answers from the files.
+   You can either use your own files that are only 2-pages long,
+   or you can use a local Python package for HTML parsing by setting:
+
+   ```shell
+   azd env set USE_LOCAL_HTML_PARSER true
+   ```

 5. Turn off Azure Monitor (Application Insights):

scripts/prepdocs.ps1

Lines changed: 5 additions & 1 deletion
@@ -57,6 +57,10 @@ if ($env:USE_LOCAL_PDF_PARSER -eq $true) {
     $localPdfParserArg = "--localpdfparser"
 }

+if ($env:USE_LOCAL_HTML_PARSER -eq $true) {
+    $localHtmlParserArg = "--localhtmlparser"
+}
+
 if ($env:AZURE_TENANT_ID) {
     $tenantArg = "--tenantid $env:AZURE_TENANT_ID"
 }
@@ -80,7 +84,7 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
   "$searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg " + `
   "$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg " + `
   "$tenantArg $aclArg " + `
-  "$disableVectorsArg $localPdfParserArg " + `
+  "$disableVectorsArg $localPdfParserArg $localHtmlParserArg " + `
   "$keyVaultName " + `
   "$integratedVectorizationArg "

scripts/prepdocs.py

Lines changed: 13 additions & 2 deletions
@@ -16,6 +16,7 @@
 )
 from prepdocslib.fileprocessor import FileProcessor
 from prepdocslib.filestrategy import FileStrategy
+from prepdocslib.htmlparser import LocalHTMLParser
 from prepdocslib.integratedvectorizerstrategy import (
     IntegratedVectorizerStrategy,
 )
@@ -48,6 +49,7 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         verbose=args.verbose,
     )

+    html_parser: Parser
     pdf_parser: Parser
     doc_int_parser: DocumentAnalysisParser

@@ -64,13 +66,17 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         verbose=args.verbose,
     )
     if args.localpdfparser or args.documentintelligenceservice is None:
-        pdf_parser = LocalPdfParser()
+        pdf_parser = LocalPdfParser(verbose=args.verbose)
     else:
         pdf_parser = doc_int_parser
+    if args.localhtmlparser or args.documentintelligenceservice is None:
+        html_parser = LocalHTMLParser(verbose=args.verbose)
+    else:
+        html_parser = doc_int_parser
     sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=args.searchimages)
     file_processors = {
         ".pdf": FileProcessor(pdf_parser, sentence_text_splitter),
-        ".html": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".html": FileProcessor(html_parser, sentence_text_splitter),
         ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
         ".docx": FileProcessor(doc_int_parser, sentence_text_splitter),
         ".pptx": FileProcessor(doc_int_parser, sentence_text_splitter),
@@ -353,6 +359,11 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
         action="store_true",
         help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
     )
+    parser.add_argument(
+        "--localhtmlparser",
+        action="store_true",
+        help="Use Beautiful Soup local HTML parser instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
+    )
     parser.add_argument(
         "--documentintelligenceservice",
         required=False,
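
The net effect of the prepdocs.py change is that the `.html` extension is routed to whichever parser the fallback logic picks: the local BeautifulSoup parser when `--localhtmlparser` is passed or when no Document Intelligence service is configured, and `DocumentAnalysisParser` otherwise. A minimal standalone sketch of that selection rule, using a hypothetical `args` namespace in place of the real parsed CLI arguments:

```python
from argparse import Namespace

# Hypothetical stand-in for the parsed command-line arguments seen by setup_file_strategy.
args = Namespace(localhtmlparser=False, documentintelligenceservice=None, verbose=True)

# Mirrors the fallback above: the local parser is chosen when it is explicitly
# requested, or when there is no Document Intelligence service to fall back to.
use_local_html = args.localhtmlparser or args.documentintelligenceservice is None
print("HTML parser:", "LocalHTMLParser (BeautifulSoup)" if use_local_html else "DocumentAnalysisParser")
```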

scripts/prepdocs.sh

Lines changed: 5 additions & 1 deletion
@@ -62,6 +62,10 @@ if [ "$USE_LOCAL_PDF_PARSER" = true ]; then
   localPdfParserArg="--localpdfparser"
 fi

+if [ "$USE_LOCAL_HTML_PARSER" = true ]; then
+  localHtmlParserArg="--localhtmlparser"
+fi
+
 if [ -n "$AZURE_TENANT_ID" ]; then
   tenantArg="--tenantid $AZURE_TENANT_ID"
 fi
@@ -82,6 +86,6 @@ $searchAnalyzerNameArg $searchSecretNameArg \
 $searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
 $adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
 $tenantArg $aclArg \
-$disableVectorsArg $localPdfParserArg \
+$disableVectorsArg $localPdfParserArg $localHtmlParserArg \
 $keyVaultName \
 $integratedVectorizationArg

scripts/prepdocslib/htmlparser.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+import re
+from typing import IO, AsyncGenerator
+
+from bs4 import BeautifulSoup
+
+from .page import Page
+from .parser import Parser
+
+
+def cleanup_data(data: str) -> str:
+    """Cleans up the given content using regexes
+    Args:
+        data: (str): The data to clean up.
+    Returns:
+        str: The cleaned up data.
+    """
+    # match two or more newlines and replace them with one new line
+    output = re.sub(r"\n{2,}", "\n", data)
+    # match two or more spaces that are not newlines and replace them with one space
+    output = re.sub(r"[^\S\n]{2,}", " ", output)
+    # match two or more hyphens and replace them with two hyphens
+    output = re.sub(r"-{2,}", "--", output)
+
+    return output.strip()
+
+
+class LocalHTMLParser(Parser):
+    """Parses HTML text into Page objects."""
+
+    async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
+        """Parses the given content.
+        To learn more, please visit https://pypi.org/project/beautifulsoup4/
+        Args:
+            content (IO): The content to parse.
+        Returns:
+            Page: The parsed html Page.
+        """
+        if self.verbose:
+            print(f"\tExtracting text from '{content.name}' using local HTML parser (BeautifulSoup)")
+
+        data = content.read()
+        soup = BeautifulSoup(data, "html.parser")
+
+        # Get text only from html file
+        result = soup.get_text()
+
+        yield Page(0, 0, text=cleanup_data(result))
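
As a rough illustration of what the `cleanup_data` helper above does to the text BeautifulSoup extracts (a self-contained sketch that only repeats the regexes from the new module, with made-up sample input):

```python
import re


def cleanup_data(data: str) -> str:
    # Collapse runs of blank lines, runs of non-newline whitespace, and runs of hyphens.
    output = re.sub(r"\n{2,}", "\n", data)
    output = re.sub(r"[^\S\n]{2,}", " ", output)
    output = re.sub(r"-{2,}", "--", output)
    return output.strip()


raw = "Title\n\n\n  Some    text ---- more\n\n"
print(repr(cleanup_data(raw)))  # 'Title\n Some text -- more'
```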

scripts/prepdocslib/parser.py

Lines changed: 6 additions & 0 deletions
@@ -9,6 +9,12 @@ class Parser(ABC):
     Abstract parser that parses content into Page objects
     """

+    def __init__(
+        self,
+        verbose: bool = False,
+    ):
+        self.verbose = verbose
+
     async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
         if False:
             yield  # pragma: no cover - this is necessary for mypy to type check

scripts/prepdocslib/pdfparser.py

Lines changed: 4 additions & 1 deletion
@@ -18,6 +18,9 @@ class LocalPdfParser(Parser):
     """

     async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
+        if self.verbose:
+            print(f"\tExtracting text from '{content.name}' using local PDF parser (pypdf)")
+
         reader = PdfReader(content)
         pages = reader.pages
         offset = 0
@@ -29,7 +32,7 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:

 class DocumentAnalysisParser(Parser):
     """
-    Concrete parser backed by Azure AI Document Intelligence that can parse PDFS into pages
+    Concrete parser backed by Azure AI Document Intelligence that can parse many document formats into pages
     To learn more, please visit https://learn.microsoft.com/azure/ai-services/document-intelligence/overview
     """

scripts/requirements.in

Lines changed: 2 additions & 0 deletions
@@ -5,12 +5,14 @@ azure-search-documents==11.6.0b1
 azure-ai-documentintelligence
 azure-storage-blob
 azure-storage-file-datalake
+beautifulsoup4
 openai[datalib]>=1.3.5
 tiktoken
 tenacity
 azure-keyvault-secrets
 Pillow
 PyMuPDF
+types-beautifulsoup4
 types-Pillow
 cryptography
 python-jose[cryptography]
