Extract JSON from OCR #83

ZeeshanZulfiqarAli · 2024-11-01T21:21:07Z

This PR adds the ability to parse Markdown into a JSON array of elements.

For the following markdown:

# Deloitte.

## Quality System Audit for BioTech Innovations (Pty) Ltd  
### Opening Meeting Sign-in Sheet

**Audit Date:** 02 October 2024  
**Time:** 06h30  
**Supplier:** BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa.  
**Contact Person:** Kathy Margaret  
**Phone Number:** +14 22 045 4952  

**Opening Meeting Agenda:**
- Introductions
- Review of audit agenda
- Confirmation of availability for required persons

**Opening Meeting Attendees:**

| No. | Print Name     | Job Title  | Email                  | Signature |
|-----|----------------|------------|------------------------|-----------|
| 1   | Anna Pojanvis  | CTO        | [email protected]        | a p       |
| 2   | Tyler Maran    | CEO        | [email protected]       |           |
| 3   | Kathy Margaret | Associate  | [email protected] |           |
| 4   | Mark Ding      | Eng        | [email protected]        |           |
| 5   |                |            |                        |           |

**QAC Auditor:** David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals).

---

**Page 1 of 7**

**DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC**  
450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567  
Website: [www.qaconsultants.com](http://www.qaconsultants.com) - Email: [email protected]

This JSON is produced:

[
  {
    "id": "03qdyA5ROy5EQB-WaKEDo",
    "page": 1,
    "type": "heading",
    "value": "Deloitte."
  },
  {
    "id": "TaSFQxq1WE5c0Z6Ey5Crz",
    "page": 1,
    "parentId": "03qdyA5ROy5EQB-WaKEDo",
    "type": "heading",
    "value": "Quality System Audit for BioTech Innovations (Pty) Ltd"
  },
  {
    "id": "m8N1Hi_wXBPZa9sNMX_qc",
    "page": 1,
    "parentId": "TaSFQxq1WE5c0Z6Ey5Crz",
    "type": "heading",
    "value": "Opening Meeting Sign-in Sheet"
  },
  {
    "id": "aM1oWUjreyW7I7qMhgmqq",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Audit Date:  02 October 2024 Time:  06h30 Supplier:  BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa. Contact Person:  Kathy Margaret Phone Number:  +14 22 045 4952"
  },
  {
    "id": "8h3dkSXVdr8smZZc-4ttN",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Opening Meeting Agenda:"
  },
  {
    "id": "z5VvXnAdLsXF2MZYRotd5",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "list",
    "value": [
      {
        "id": "D3TpFf6bwXAO2DbT0nxtj",
        "page": 1,
        "type": "text",
        "value": "Introductions"
      },
      {
        "id": "XvZKI0Gs5uCpRDcyBQELn",
        "page": 1,
        "type": "text",
        "value": "Review of audit agenda"
      },
      {
        "id": "RidWUv1CjLSdk6XvTbgwU",
        "page": 1,
        "type": "text",
        "value": "Confirmation of availability for required persons"
      }
    ]
  },
  {
    "id": "SwXj9Gx98CgnizEURc5YY",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Opening Meeting Attendees:"
  },
  {
    "id": "2BreVoV6ptcefEADutqJV",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "table",
    "value": {
      "headers": [
        {
          "value": "No.",
          "id": "_wFjO1laIvhXpCVJLHBEG"
        },
        {
          "value": "Print Name",
          "id": "DdKaPNWWWC0vllcgllwqK"
        },
        {
          "value": "Job Title",
          "id": "81_M1ohyeRavAtldkiBx1"
        },
        {
          "value": "Email",
          "id": "9HWpmkk7Eskz0RSwsJAS1"
        },
        {
          "value": "Signature",
          "id": "pXm6dctuxr9GFm0YFecrU"
        }
      ],
      "rows": [
        {
          "_wFjO1laIvhXpCVJLHBEG": "1",
          "DdKaPNWWWC0vllcgllwqK": "Anna Pojanvis",
          "81_M1ohyeRavAtldkiBx1": "CTO",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": "a p"
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "2",
          "DdKaPNWWWC0vllcgllwqK": "Tyler Maran",
          "81_M1ohyeRavAtldkiBx1": "CEO",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "3",
          "DdKaPNWWWC0vllcgllwqK": "Kathy Margaret",
          "81_M1ohyeRavAtldkiBx1": "Associate",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "4",
          "DdKaPNWWWC0vllcgllwqK": "Mark Ding",
          "81_M1ohyeRavAtldkiBx1": "Eng",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "5",
          "DdKaPNWWWC0vllcgllwqK": "",
          "81_M1ohyeRavAtldkiBx1": "",
          "9HWpmkk7Eskz0RSwsJAS1": "",
          "pXm6dctuxr9GFm0YFecrU": ""
        }
      ]
    }
  },
  {
    "id": "KmwROFSUArfod3PQ1R0Km",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "QAC Auditor:  David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals)."
  },
  {
    "id": "jMM4sM-05M1G5E734rBUQ",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Page 1 of 7"
  },
  {
    "id": "zQrMLcCMrpjIKXRmGEwoA",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC 450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567 Website:  www.qaconsultants.com  - Email:  [email protected]"
  }
]
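
The `parentId` links in the output above follow the heading hierarchy: each heading points at the nearest shallower heading, and every other node points at the most recent heading. A minimal sketch of that bookkeeping is shown below. It is not the PR's implementation: `RawNode`, `assignParents`, and the counter-based ID generator (standing in for `nanoid`) are hypothetical names chosen for illustration, and the input is assumed to be a flat list of already-parsed nodes.

```typescript
// Hypothetical node shapes, modelled on the JSON output above.
type ChunkType = "heading" | "text" | "list" | "table";

interface RawNode {
  type: ChunkType;
  value: string;
  depth?: number; // heading level: 1 for "#", 2 for "##", ... (headings only)
}

interface Chunk {
  id: string;
  page: number;
  parentId?: string;
  type: ChunkType;
  value: string;
}

// Assumption: a simple counter stands in for nanoid so the sketch has no dependencies.
const makeIdGenerator = (): (() => string) => {
  let next = 0;
  return () => `node-${next++}`;
};

const assignParents = (
  nodes: RawNode[],
  page: number,
  newId: () => string
): Chunk[] => {
  // Stack of currently open headings, outermost first.
  const stack: { id: string; depth: number }[] = [];

  return nodes.map((node) => {
    const id = newId();

    if (node.type === "heading") {
      const depth = node.depth ?? 1;
      // A new heading closes every open heading at the same or a deeper level.
      let top = stack[stack.length - 1];
      while (top !== undefined && top.depth >= depth) {
        stack.pop();
        top = stack[stack.length - 1];
      }
    }

    // Whatever heading is still open becomes the parent (none at the top level).
    const parent = stack[stack.length - 1];
    const chunk: Chunk = { id, page, type: node.type, value: node.value };
    if (parent !== undefined) chunk.parentId = parent.id;

    if (node.type === "heading") stack.push({ id, depth: node.depth ?? 1 });
    return chunk;
  });
};
```

Passing the ID generator in explicitly lets one generator be shared across pages, so IDs stay unique for the whole document.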

tylermaran · 2024-11-01T23:09:38Z

node-zerox/src/openAI.ts

@@ -1,6 +1,76 @@
 import { CompletionArgs, CompletionResponse } from "./types";
 import { convertKeysToSnakeCase, encodeImageToBase64 } from "./utils";
 import axios from "axios";
+import { nanoid } from "nanoid";
+
+const markdownToJson = async (markdownString: string) => {


nit: I would move this to the utils folder

tylermaran · 2024-11-06T18:56:37Z

@ZeeshanZulfiqarAli can you make this optional, with something like a chunk parameter?

I think the response should probably always include the pages, and then have this as an optional output:

inputTokens: 25543,
outputTokens: 210,
pages: [],
chunks: [],

Also if you come up with a better name than chunks I'm game.

const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  cleanup: true, // Clear images from tmp after run.
  chunk: false, // Return JSON array of elements on each page
});
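
One way to read the request above is that `pages` is always present and `chunks` is attached only when the caller opts in. The sketch below illustrates that shape. It is an assumption, not zerox's actual types: `OutputSketch`, `PageSketch`, `buildOutput`, and the `toChunks` callback are hypothetical names, and only the field names `inputTokens`, `outputTokens`, `pages`, `chunks`, and the `chunk` option come from the comment.

```typescript
// Hypothetical shapes; only the field names mirror the comment above.
interface PageSketch {
  page: number;
  content: string; // Markdown for the page (assumed field name)
}

interface ChunkSketch {
  id: string;
  page: number;
  type: string;
  value: unknown;
  parentId?: string;
}

interface OutputSketch {
  inputTokens: number;
  outputTokens: number;
  pages: PageSketch[];
  chunks?: ChunkSketch[]; // present only when the chunk option is true
}

const buildOutput = (
  pages: PageSketch[],
  usage: { inputTokens: number; outputTokens: number },
  options: { chunk?: boolean },
  toChunks: (page: PageSketch) => ChunkSketch[] // stand-in for the Markdown parser
): OutputSketch => {
  const output: OutputSketch = {
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
    pages,
  };

  if (options.chunk === true) {
    // Parse each page and concatenate the results in page order.
    const chunks: ChunkSketch[] = [];
    for (const page of pages) {
      chunks.push(...toChunks(page));
    }
    output.chunks = chunks;
  }

  return output;
};
```

Leaving the key out entirely when `chunk` is false keeps the default response unchanged for existing callers.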

batmanscode · 2025-01-23T00:03:23Z

Why is this doing image>markdown>JSON instead of image>JSON?

tylermaran · 2025-01-23T00:41:43Z

@batmanscode this is specifically for chunking the OCR results into a JSON array, rather than running JSON extraction.

Primarily for chunking / indexing use cases. We're parsing the Markdown to chunk the page into elements (e.g. h1, table, etc.) and returning that as a JSON array.
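
For indexing, a consumer would likely want table rows keyed by column label rather than by header ID. The sketch below resolves the `headers`/`rows` shape from the PR description into plain records. `TableValue` and `tableToRecords` are hypothetical names, not part of the PR, and the sketch assumes header labels are unique within a table.

```typescript
// Shapes mirror the "table" element in the PR description's JSON output.
interface TableHeader {
  id: string;
  value: string;
}

interface TableValue {
  headers: TableHeader[];
  rows: Record<string, string>[]; // each row is keyed by header id
}

// Re-key every row by its header label; cells missing from a row become "".
const tableToRecords = (table: TableValue): Record<string, string>[] =>
  table.rows.map((row) => {
    const record: Record<string, string> = {};
    for (const header of table.headers) {
      record[header.value] = row[header.id] ?? "";
    }
    return record;
  });
```

Keying rows by generated IDs in the stored form keeps them stable even if two columns share a label; the lookup above is only a convenience for downstream use.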

Use unified to parse markdown

543ecbd

tylermaran reviewed Nov 1, 2024

ZeeshanZulfiqarAli added 2 commits November 4, 2024 19:20

Add support for lists

c608b96

Improve typing, add page number to nodes and a bit of clean up

fab05db

ZeeshanZulfiqarAli force-pushed the zeeshan/markdown-json branch from 8f2a9de to fab05db November 4, 2024 17:27

ZeeshanZulfiqarAli added 3 commits November 5, 2024 19:53

Handle more markdown node types and refactor

219887f

Merge branch 'main' into zeeshan/markdown-json

835078e

Handle tables

26b58f1

ZeeshanZulfiqarAli marked this pull request as ready for review November 5, 2024 23:14

ZeeshanZulfiqarAli changed the title ~~WIP - Extract JSON from OCR~~ Extract JSON from OCR Nov 5, 2024

Remove console logs

1fd18a2

Add chunk option that returns JSON array

dc890bf
