Extract JSON from OCR #83

Open. Wants to merge 8 commits into base: main



@ZeeshanZulfiqarAli ZeeshanZulfiqarAli commented Nov 1, 2024

This PR adds the ability to parse Markdown into a JSON object.

For the following markdown:

# Deloitte.

## Quality System Audit for BioTech Innovations (Pty) Ltd  
### Opening Meeting Sign-in Sheet

**Audit Date:** 02 October 2024  
**Time:** 06h30  
**Supplier:** BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa.  
**Contact Person:** Kathy Margaret  
**Phone Number:** +14 22 045 4952  

**Opening Meeting Agenda:**
- Introductions
- Review of audit agenda
- Confirmation of availability for required persons

**Opening Meeting Attendees:**

| No. | Print Name     | Job Title  | Email                  | Signature |
|-----|----------------|------------|------------------------|-----------|
| 1   | Anna Pojanvis  | CTO        | [email protected]        | a p       |
| 2   | Tyler Maran    | CEO        | [email protected]       |           |
| 3   | Kathy Margaret | Associate  | [email protected] |           |
| 4   | Mark Ding      | Eng        | [email protected]        |           |
| 5   |                |            |                        |           |

**QAC Auditor:** David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals).

---

**Page 1 of 7**

**DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC**  
450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567  
Website: [www.qaconsultants.com](http://www.qaconsultants.com) - Email: [email protected]

This JSON is produced:

[
  {
    "id": "03qdyA5ROy5EQB-WaKEDo",
    "page": 1,
    "type": "heading",
    "value": "Deloitte."
  },
  {
    "id": "TaSFQxq1WE5c0Z6Ey5Crz",
    "page": 1,
    "parentId": "03qdyA5ROy5EQB-WaKEDo",
    "type": "heading",
    "value": "Quality System Audit for BioTech Innovations (Pty) Ltd"
  },
  {
    "id": "m8N1Hi_wXBPZa9sNMX_qc",
    "page": 1,
    "parentId": "TaSFQxq1WE5c0Z6Ey5Crz",
    "type": "heading",
    "value": "Opening Meeting Sign-in Sheet"
  },
  {
    "id": "aM1oWUjreyW7I7qMhgmqq",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Audit Date:  02 October 2024 Time:  06h30 Supplier:  BioTech Innovations (Pty) Ltd; 67 River Rd, Kensington, Johannesburg, Gauteng, 2094 South Africa. Contact Person:  Kathy Margaret Phone Number:  +14 22 045 4952"
  },
  {
    "id": "8h3dkSXVdr8smZZc-4ttN",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Opening Meeting Agenda:"
  },
  {
    "id": "z5VvXnAdLsXF2MZYRotd5",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "list",
    "value": [
      {
        "id": "D3TpFf6bwXAO2DbT0nxtj",
        "page": 1,
        "type": "text",
        "value": "Introductions"
      },
      {
        "id": "XvZKI0Gs5uCpRDcyBQELn",
        "page": 1,
        "type": "text",
        "value": "Review of audit agenda"
      },
      {
        "id": "RidWUv1CjLSdk6XvTbgwU",
        "page": 1,
        "type": "text",
        "value": "Confirmation of availability for required persons"
      }
    ]
  },
  {
    "id": "SwXj9Gx98CgnizEURc5YY",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Opening Meeting Attendees:"
  },
  {
    "id": "2BreVoV6ptcefEADutqJV",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "table",
    "value": {
      "headers": [
        {
          "value": "No.",
          "id": "_wFjO1laIvhXpCVJLHBEG"
        },
        {
          "value": "Print Name",
          "id": "DdKaPNWWWC0vllcgllwqK"
        },
        {
          "value": "Job Title",
          "id": "81_M1ohyeRavAtldkiBx1"
        },
        {
          "value": "Email",
          "id": "9HWpmkk7Eskz0RSwsJAS1"
        },
        {
          "value": "Signature",
          "id": "pXm6dctuxr9GFm0YFecrU"
        }
      ],
      "rows": [
        {
          "_wFjO1laIvhXpCVJLHBEG": "1",
          "DdKaPNWWWC0vllcgllwqK": "Anna Pojanvis",
          "81_M1ohyeRavAtldkiBx1": "CTO",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": "a p"
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "2",
          "DdKaPNWWWC0vllcgllwqK": "Tyler Maran",
          "81_M1ohyeRavAtldkiBx1": "CEO",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "3",
          "DdKaPNWWWC0vllcgllwqK": "Kathy Margaret",
          "81_M1ohyeRavAtldkiBx1": "Associate",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "4",
          "DdKaPNWWWC0vllcgllwqK": "Mark Ding",
          "81_M1ohyeRavAtldkiBx1": "Eng",
          "9HWpmkk7Eskz0RSwsJAS1": "[email protected]",
          "pXm6dctuxr9GFm0YFecrU": ""
        },
        {
          "_wFjO1laIvhXpCVJLHBEG": "5",
          "DdKaPNWWWC0vllcgllwqK": "",
          "81_M1ohyeRavAtldkiBx1": "",
          "9HWpmkk7Eskz0RSwsJAS1": "",
          "pXm6dctuxr9GFm0YFecrU": ""
        }
      ]
    }
  },
  {
    "id": "KmwROFSUArfod3PQ1R0Km",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "QAC Auditor:  David Thompson, Lead Quality Auditor, NTA Services on behalf of BioTech Innovations (Biopharmaceuticals)."
  },
  {
    "id": "jMM4sM-05M1G5E734rBUQ",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "Page 1 of 7"
  },
  {
    "id": "zQrMLcCMrpjIKXRmGEwoA",
    "page": 1,
    "parentId": "m8N1Hi_wXBPZa9sNMX_qc",
    "type": "text",
    "value": "DELOITTE QUALITY ASSURANCE CONSULTANTS, LLC 450 Oceanview Drive, Suite 200 - Santa Monica, CA 90405 - PHONE (800) 555-1234 (310) 555-7890 - FAX (310) 555-4567 Website:  www.qaconsultants.com  - Email:  [email protected]"
  }
]
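
The `parentId` links in the output above can be read as a heading stack: each heading points to the nearest shallower heading, and other elements point to the most recent heading. The following is a minimal sketch of that idea, not the PR's implementation. `PageElement`, `nextId`, and `chunkHeadingsAndText` are hypothetical names, the counter stands in for `nanoid`, and each non-blank line becomes its own text element (the PR output merges adjacent lines and also handles lists and tables, which this sketch skips).

```typescript
// Hypothetical element shape inferred from the JSON above; not the PR's actual types.
interface PageElement {
  id: string;
  page: number;
  parentId?: string;
  type: "heading" | "text";
  value: string;
}

// Dependency-free stand-in for nanoid (assumption: any unique string works as an id).
let idCounter = 0;
const nextId = (): string => `el-${++idCounter}`;

function chunkHeadingsAndText(markdown: string, page: number): PageElement[] {
  const elements: PageElement[] = [];
  // Open headings, shallowest first; the last entry is the current parent.
  const stack: { depth: number; id: string }[] = [];

  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue;

    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const depth = (match[1] ?? "").length;
      // Pop headings at the same or a deeper level before attaching this one.
      let parent = stack[stack.length - 1];
      while (parent !== undefined && parent.depth >= depth) {
        stack.pop();
        parent = stack[stack.length - 1];
      }
      const heading: PageElement = {
        id: nextId(),
        page,
        type: "heading",
        value: match[2] ?? "",
      };
      if (parent !== undefined) heading.parentId = parent.id;
      elements.push(heading);
      stack.push({ depth, id: heading.id });
    } else {
      // Simplification: every non-heading line is a text element under the latest heading.
      const parent = stack[stack.length - 1];
      const text: PageElement = { id: nextId(), page, type: "text", value: line };
      if (parent !== undefined) text.parentId = parent.id;
      elements.push(text);
    }
  }
  return elements;
}
```

Keeping a stack rather than a single "last heading" variable is what lets a later `##` heading attach back to the top-level `#` heading instead of to a preceding `###`.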

@@ -1,6 +1,76 @@
import { CompletionArgs, CompletionResponse } from "./types";
import { convertKeysToSnakeCase, encodeImageToBase64 } from "./utils";
import axios from "axios";
import { nanoid } from "nanoid";

const markdownToJson = async (markdownString: string) => {

nit: I would move this to the utils folder.

@ZeeshanZulfiqarAli ZeeshanZulfiqarAli marked this pull request as ready for review November 5, 2024 23:14
@ZeeshanZulfiqarAli ZeeshanZulfiqarAli changed the title from "WIP - Extract JSON from OCR" to "Extract JSON from OCR" Nov 5, 2024
@tylermaran

@ZeeshanZulfiqarAli can you make this optional, with something like a chunk parameter?

I think the response should probably always include the pages, and then have this as an optional output:

inputTokens: 25543,
outputTokens: 210,
pages: [],
chunks: [],

Also if you come up with a better name than chunks I'm game.

const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  cleanup: true, // Clear images from tmp after run.
  chunk: false, // Return JSON array of elements on each page
});
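
One way to read the request above: `pages` is always returned, and `chunks` appears only when the caller opts in. The sketch below illustrates that shape under stated assumptions. Only the field names `inputTokens`, `outputTokens`, `pages`, `chunks`, and the `chunk` option come from this thread; `PageSketch`, `ChunkSketch`, `OutputSketch`, `buildOutput`, and the fields inside a page are hypothetical and are not the library's actual types.

```typescript
// Hypothetical page and chunk shapes, for illustration only.
interface PageSketch {
  page: number;
  content: string; // assumed: the Markdown produced by OCR for this page
}

interface ChunkSketch {
  id: string;
  page: number;
  type: string;
  value: unknown;
}

interface OutputSketch {
  inputTokens: number;
  outputTokens: number;
  pages: PageSketch[];
  chunks?: ChunkSketch[]; // present only when chunking was requested
}

function buildOutput(
  pages: PageSketch[],
  usage: { inputTokens: number; outputTokens: number },
  options: { chunk?: boolean },
  chunkPage: (page: PageSketch) => ChunkSketch[],
): OutputSketch {
  // Pages are always part of the response.
  const output: OutputSketch = { ...usage, pages };
  // Chunks are computed only on request, so the default path does no extra parsing.
  if (options.chunk === true) {
    const chunks: ChunkSketch[] = [];
    for (const page of pages) {
      chunks.push(...chunkPage(page));
    }
    output.chunks = chunks;
  }
  return output;
}
```

Leaving `chunks` undefined rather than returning an empty array lets callers tell "chunking was not requested" apart from "the page had no elements".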

@batmanscode

Why is this doing image>markdown>JSON instead of image>JSON?

@tylermaran

@batmanscode this is specifically for chunking the OCR results into a JSON array, rather than running JSON extraction.

Primarily for chunking / indexing use cases. We're parsing the Markdown to chunk the page into elements (e.g. h1, table, etc.) and returning that as a JSON array.
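
As an illustration of the indexing use case, a table element shaped like the one in the PR description (headers carrying ids, rows keyed by those ids) could be flattened into plain records keyed by column label before indexing. This is a sketch under that assumption; `TableHeaderSketch`, `TableValueSketch`, and `tableToRecords` are hypothetical names and not part of the PR.

```typescript
// Shapes mirror the "table" element in the example JSON; the names are hypothetical.
interface TableHeaderSketch {
  value: string;
  id: string;
}

interface TableValueSketch {
  headers: TableHeaderSketch[];
  rows: Record<string, string>[];
}

// Re-key each row from header id to header label, e.g. "Print Name".
function tableToRecords(table: TableValueSketch): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    for (const header of table.headers) {
      // A missing cell becomes an empty string, matching the empty cells in the example.
      record[header.value] = row[header.id] ?? "";
    }
    return record;
  });
}
```

If two columns share the same label, the later one overwrites the earlier one in this sketch, which is one reason the id-keyed form in the PR output is the safer storage format.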


3 participants