Extract JSON from OCR #83
node-zerox/src/openAI.ts
@@ -1,6 +1,76 @@
import { CompletionArgs, CompletionResponse } from "./types";
import { convertKeysToSnakeCase, encodeImageToBase64 } from "./utils";
import axios from "axios";
import { nanoid } from "nanoid";

const markdownToJson = async (markdownString: string) => {
nit: I would move this to the utils folder
8f2a9de to fab05db
@ZeeshanZulfiqarAli can you make this optional. with something like a I think the response should probably always include the pages, and then have this as an optional output:
Also if you come up with a better name than chunks I'm game.
Why is this doing image>markdown>JSON instead of image>JSON?
@batmanscode this is specifically for chunking the OCR results into a JSON array, rather than running JSON extraction. Primarily for chunking / indexing use cases. We're parsing the Markdown to chunk the page into elements (i.e. |
This PR adds the ability to parse Markdown into a JSON object.
For the following markdown:
This JSON is produced:
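As a rough illustration of the idea only, and not the PR's actual implementation or output, chunking Markdown into a JSON array of typed elements could be sketched as follows. The `Chunk` type, the `chunkMarkdown` function, and the set of element types are all assumptions made for this example; the real `markdownToJson` in the PR may work quite differently.

```typescript
// Hypothetical chunk shape; the PR's real output format may differ.
type Chunk = {
  type: "heading" | "paragraph" | "list" | "table";
  level?: number; // only set for headings (1-6)
  content: string;
};

// Splits Markdown into blank-line-separated blocks and classifies each one.
// Simplification: a heading directly followed by text with no blank line
// in between is treated as a single paragraph block.
function chunkMarkdown(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const blocks = markdown
    .split(/\n\s*\n/)
    .map((block) => block.trim())
    .filter((block) => block.length > 0);

  for (const block of blocks) {
    const lines = block.split("\n");
    // No "m" flag, so this only matches single-line blocks.
    const heading = /^(#{1,6})\s+(.*)$/.exec(block);

    if (heading) {
      const [, hashes = "", text = ""] = heading;
      chunks.push({ type: "heading", level: hashes.length, content: text });
    } else if (lines.every((line) => /^\s*([-*+]|\d+\.)\s+/.test(line))) {
      chunks.push({ type: "list", content: block });
    } else if (lines.every((line) => line.trim().startsWith("|"))) {
      chunks.push({ type: "table", content: block });
    } else {
      chunks.push({ type: "paragraph", content: block });
    }
  }
  return chunks;
}

// Example usage: serialize the chunks for indexing.
const example = chunkMarkdown("# Title\n\nSome text.\n\n- a\n- b");
console.log(JSON.stringify(example, null, 2));
```

A line-based splitter like this is enough to show the chunking / indexing use case discussed above; a production version would more likely rely on a proper Markdown parser to handle nested lists, fenced code, and similar constructs.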