Updates for docling-serve #12

Open
dolfim-ibm opened this issue Dec 9, 2024 · 5 comments

Comments

@dolfim-ibm
Contributor

Here are updates coming for docling-serve.

  1. Updates to the latest docling v2. New input and output formats, better processing, more options.

  2. Richer endpoints. The new options in docling allow for a richer API, for example:

    • One endpoint with multiple output formats. Which one is returned is controlled by the input payload options (see later)
    • One endpoint specialized for markdown which returns the markdown as plain text
    POST /convert
    {
        "document": {
            "markdown": "",
            "docling_document": {},
            "html": ""
        },
        "errors": [],
        "status": "enum",
        "timings": {}
    }
    
    POST /convert/markdown
    # text/markdown response
    
  3. Input options. Both endpoints will accept the following payload:

    {
        "file_source": {
            "base64_string": "string",
            "filename": "string"
        },
        "http_source": {
            "url": "string",
            "headers": {}
        },
    
        "options": {
            "output_markdown": "bool, default false",
            "output_html": "bool, default false",
            "output_docling_document": "bool, default true",
            "do_ocr": "bool, default true",
            "ocr_engine": "enum (easyocr, tesseract, rapidocr)",
            "ocr_lang": "optional[list[str]]",
            "do_table_structure": "bool, default true",
            "include_images": "bool, default ... Embedded page images in docling_document, embedded images in ",
            "images_scale": 2.0
        }
    }
  4. Persistent DocumentConverter. To avoid reloading models on every request, we should keep a global DocumentConverter initialized. However, initializing the class fixes options such as the OCR engine, so we want to add a cache of DocumentConverter instances, one for each common set of input options.
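
A minimal sketch of the cache idea in point 4, assuming the options that affect initialization can be collected into a hashable key. `ConverterOptions`, `FakeConverter`, and `get_converter` are hypothetical names used only for illustration; `FakeConverter` stands in for the real DocumentConverter, whose constructor is not shown here.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ConverterOptions:
    """Hypothetical, hashable subset of the options that affect converter setup."""
    do_ocr: bool = True
    ocr_engine: str = "easyocr"
    do_table_structure: bool = True


class FakeConverter:
    """Stand-in for an expensive-to-build converter (not the docling class)."""
    instances_built = 0

    def __init__(self, options: ConverterOptions) -> None:
        FakeConverter.instances_built += 1
        self.options = options


@lru_cache(maxsize=4)
def get_converter(options: ConverterOptions) -> FakeConverter:
    # Built once per distinct option set; later calls reuse the cached instance.
    return FakeConverter(options)


a = get_converter(ConverterOptions())
b = get_converter(ConverterOptions())
c = get_converter(ConverterOptions(ocr_engine="tesseract"))
```

Here `a` and `b` are the same object, while `c` is built separately because its options differ. List-valued options such as `ocr_lang` would need to be stored as tuples to stay hashable, and `maxsize` bounds how many loaded converters are kept in memory.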

@vishnoianil
Collaborator

@dolfim-ibm @guimou @nerdalert For the community UI use case, we will need these APIs to be async in the long term. We have two options:

  1. Expose only the async APIs to start with.
  2. Expose two versions of the API, one sync and one async (with a WebSocket or simple polling by job ID).

I am more inclined toward 1), so that we can avoid an API explosion, but I do see some use cases where a sync API would be helpful, such as writing a simple CLI client or simple demo scripts that use docling-serve for document conversion. What are your thoughts?

@guimou

guimou commented Dec 9, 2024

Async APIs will require some more thinking, depending on how you want them to work. There are two different avenues:

  1. The API call initially returns a token, and there is a WebSocket endpoint to connect to (authenticating with the token), from which the server can push a "conversion ready" message or the result directly.
  2. The API call returns a token, and the client then polls another endpoint regularly until the result (or an error) is sent back.

In both cases, you can implement a queue system, with feedback on where you are in the queue and, eventually, on the progress of the conversion (if/when docling provides such data).
Option 2 is cruder. Option 1 is more evolved and allows real-time feedback (such as when you advance in the queue). However, it puts more load on the client side, as not everyone knows how to work with WebSockets.
And of course the two can be offered simultaneously, since the tokens and the queue themselves are handled separately anyway.
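
To make the token-plus-queue idea concrete, here is a rough in-memory sketch. It assumes a single process and invents the state names (`queued`, `done`, `unknown`) and the `JobRegistry` class purely for illustration. Either a polling endpoint or a WebSocket handler could read from the same registry, which is why the two feedback channels can coexist.

```python
import uuid
from collections import OrderedDict


class JobRegistry:
    """Hypothetical in-memory queue of pending conversions, keyed by a token."""

    def __init__(self) -> None:
        self._pending = OrderedDict()  # token -> source, in submission order
        self._results = {}             # token -> conversion result

    def submit(self, source: str) -> str:
        """Enqueue a job and return the opaque token handed back to the client."""
        token = uuid.uuid4().hex
        self._pending[token] = source
        return token

    def process_next(self) -> None:
        """Pretend to convert the oldest pending job."""
        if not self._pending:
            return
        token, source = self._pending.popitem(last=False)
        self._results[token] = f"converted:{source}"

    def status(self, token: str) -> dict:
        """Report the result, the queue position, or an unknown token."""
        if token in self._results:
            return {"state": "done", "result": self._results[token]}
        if token in self._pending:
            return {"state": "queued", "position": list(self._pending).index(token)}
        return {"state": "unknown"}


registry = JobRegistry()
first = registry.submit("a.pdf")
second = registry.submit("b.pdf")
before = registry.status(second)   # second job waits behind the first
registry.process_next()
after = registry.status(second)    # second job is now at the head of the queue
```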

Anyway, directly to your questions @vishnoianil:

> Only expose the async APIs to start with.

That's not what I would start with, as it's more complicated to consume (from a client perspective). Let's start by offering a sync API, then add async endpoints.

> Expose two versions of the API, one sync and one async (with a WebSocket or simple polling by job ID).

That's where I would go, using both WebSockets and a "standard" status endpoint, as it's not much more to implement.

As we discussed, my current implementation offers two endpoints, for processing URL(s) and file(s). We seemed to agree to continue this way. So here is the API map I am proposing (I am adding a v1 prefix to allow for organized, non-breaking evolution in the future):

  • /health -> simple health check for probes
  • /v1/convert/url -> converts a URL or a set of URLs (either in a list or as a comma-separated string; that's my current behaviour)
  • /v1/convert/file -> converts a file or a set of files.
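
The "list or comma-separated string" input for the URL endpoint could be normalized along these lines. This is only a sketch: `normalize_urls` is a hypothetical helper, not the current implementation, and it assumes the URLs themselves contain no unencoded commas.

```python
from typing import List, Union


def normalize_urls(urls: Union[str, List[str]]) -> List[str]:
    """Accept a list or a comma-separated string and return a clean list of URLs."""
    parts = urls.split(",") if isinstance(urls, str) else urls
    # Drop surrounding whitespace and skip empty entries such as a trailing comma.
    return [p.strip() for p in parts if p.strip()]


from_string = normalize_urls("https://a.example/x.pdf, https://b.example/y.pdf")
from_list = normalize_urls(["https://a.example/x.pdf"])
```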

In both cases, the client can specify each option available in the docling CLI, plus some output options (direct markdown, files in a zip). Options are currently:

  • from_format (Optional[Union[List[str], str]]): Input format(s) to convert from. Allowed values: docx, pptx, html, image, pdf, asciidoc, md. Defaults to all formats.
  • to_format (Optional[Union[List[str], str]]): Output format(s) to convert to. Allowed values: md, json, text, doctags. Defaults to Markdown.
  • ocr (Optional[bool]): If enabled, the bitmap content will be processed using OCR. Defaults to true.
  • force_ocr (Optional[bool]): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to false.
  • ocr_engine (Optional[str]): OCR engine to use. Allowed values: easyocr, tesseract_cli, tesseract. Defaults to easyocr.
  • pdf_backend (Optional[str]): PDF backend to use. Allowed values: pypdfium2, dlparse_v1, dlparse_v2. Defaults to dlparse_v1.
  • table_mode (Optional[str]): Table mode to use. Allowed values: fast, accurate. Defaults to fast.
  • abort_on_error (Optional[bool]): If enabled, abort on error. Defaults to false.
  • return_as_file (Optional[bool]): If enabled, return the output as a file. Defaults to false.
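
As a rough illustration of how the options above might be validated server side, the sketch below mirrors the listed names, allowed values, and defaults in a plain dataclass. `ConvertOptions` and `_check` are hypothetical; the actual implementation may well use a different mechanism.

```python
from dataclasses import dataclass, field
from typing import List

FROM_FORMATS = ("docx", "pptx", "html", "image", "pdf", "asciidoc", "md")
TO_FORMATS = ("md", "json", "text", "doctags")
OCR_ENGINES = ("easyocr", "tesseract_cli", "tesseract")
PDF_BACKENDS = ("pypdfium2", "dlparse_v1", "dlparse_v2")
TABLE_MODES = ("fast", "accurate")


def _check(values, allowed, name):
    """Raise ValueError if any value is outside the allowed set."""
    bad = [v for v in values if v not in allowed]
    if bad:
        raise ValueError(f"{name}: unsupported value(s) {bad}")


@dataclass
class ConvertOptions:
    from_format: List[str] = field(default_factory=lambda: list(FROM_FORMATS))
    to_format: List[str] = field(default_factory=lambda: ["md"])
    ocr: bool = True
    force_ocr: bool = False
    ocr_engine: str = "easyocr"
    pdf_backend: str = "dlparse_v1"
    table_mode: str = "fast"
    abort_on_error: bool = False
    return_as_file: bool = False

    def __post_init__(self) -> None:
        # Accept a single string wherever a list of formats is expected.
        if isinstance(self.from_format, str):
            self.from_format = [self.from_format]
        if isinstance(self.to_format, str):
            self.to_format = [self.to_format]
        _check(self.from_format, FROM_FORMATS, "from_format")
        _check(self.to_format, TO_FORMATS, "to_format")
        _check([self.ocr_engine], OCR_ENGINES, "ocr_engine")
        _check([self.pdf_backend], PDF_BACKENDS, "pdf_backend")
        _check([self.table_mode], TABLE_MODES, "table_mode")


defaults = ConvertOptions()
json_only = ConvertOptions(to_format="json", table_mode="accurate")
```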

Above is doable almost immediately (give me 1-2 days after we agree, this project is unfortunately not on the top of my list).

In a second phase I would introduce the async endpoints:

  • /v1/async_convert/url -> converts a URL or a set of URLs (either in a list or as a comma-separated string; that's my current behaviour)
  • /v1/async_convert/file -> converts a file or a set of files.
  • /v1/status/{task_id} -> standard feedback (retries implemented client side)
  • /v1/ws_status/{task_id} -> websocket feedback (real-time feedback)
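
For the "retries implemented client side" variant, a client loop might look roughly like the sketch below. The status payload shape and the state names (`pending`, `done`, `error`) are assumptions, and `fetch_status` is injected so the example runs without a server; in practice it would wrap an HTTP GET against the status endpoint.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[str], Dict],
    task_id: str,
    interval: float = 0.5,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status(task_id) until it reports a terminal state."""
    for _ in range(max_attempts):
        status = fetch_status(task_id)
        if status.get("state") in ("done", "error"):
            return status
        sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} attempts")


def make_fake_server(ready_after: int) -> Callable[[str], Dict]:
    """Hypothetical stand-in for the status endpoint: done on the Nth call."""
    calls = {"n": 0}

    def fetch_status(task_id: str) -> Dict:
        calls["n"] += 1
        state = "done" if calls["n"] >= ready_after else "pending"
        return {"state": state, "task_id": task_id}

    return fetch_status


# No real waiting in this demo: the sleep function is replaced by a no-op.
result = poll_until_done(make_fake_server(3), "abc", sleep=lambda s: None)
```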

Waiting for comments/approval to go on.

@vishnoianil
Collaborator

vishnoianil commented Dec 10, 2024

> Async APIs will require some more thinking, depending on how you want them to work. There are two different avenues:
>
>   1. The API call initially returns a token, and there is a WebSocket endpoint to connect to (authenticating with the token), from which the server can push a "conversion ready" message or the result directly.
>   2. The API call returns a token, and the client then polls another endpoint regularly until the result (or an error) is sent back.
>
> In both cases, you can implement a queue system, with feedback on where you are in the queue and, eventually, on the progress of the conversion (if/when docling provides such data). Option 2 is cruder. Option 1 is more evolved and allows real-time feedback (such as when you advance in the queue). However, it puts more load on the client side, as not everyone knows how to work with WebSockets. And of course the two can be offered simultaneously, since the tokens and the queue themselves are handled separately anyway.

Makes sense to me. I think sync API -> async API (with client polling) -> async API (with WebSocket) seems like a reasonable evolution plan.

> Anyway, directly to your questions @vishnoianil:
>
> Only expose the async APIs to start with.
>
> That's not what I would start with, as it's more complicated to consume (from a client perspective). Let's start by offering a sync API, then add async endpoints.
>
> Expose two versions of the API, one sync and one async (with a WebSocket or simple polling by job ID).
>
> That's where I would go, using both WebSockets and a "standard" status endpoint, as it's not much more to implement.

Sounds good. I think the major issue with any async API is scaling. Running multiple instances of docling plus the API server might need some more work. I believe that's not a day-1 problem at this point, but it's something to keep in the back of our minds.

> As we discussed, my current implementation offers two endpoints, for processing URL(s) and file(s). We seemed to agree to continue this way. So here is the API map I am proposing (I am adding a v1 prefix to allow for organized, non-breaking evolution in the future):
>
>   • /health -> simple health check for probes
>   • /v1/convert/url -> converts a URL or a set of URLs (either in a list or as a comma-separated string; that's my current behaviour)
>   • /v1/convert/file -> converts a file or a set of files.

This looks good to me. Minor suggestion: can we use /v1alpha/ instead of /v1, given that these are first-cut APIs and will need some time to stabilize?

> In both cases, the client can specify each option available in the docling CLI, plus some output options (direct markdown, files in a zip). Options are currently:
>
>   • from_format (Optional[Union[List[str], str]]): Input format(s) to convert from. Allowed values: docx, pptx, html, image, pdf, asciidoc, md. Defaults to all formats.
>   • to_format (Optional[Union[List[str], str]]): Output format(s) to convert to. Allowed values: md, json, text, doctags. Defaults to Markdown.
>   • ocr (Optional[bool]): If enabled, the bitmap content will be processed using OCR. Defaults to true.
>   • force_ocr (Optional[bool]): If enabled, replace any existing text with OCR-generated text over the full content. Defaults to false.
>   • ocr_engine (Optional[str]): OCR engine to use. Allowed values: easyocr, tesseract_cli, tesseract. Defaults to easyocr.
>   • pdf_backend (Optional[str]): PDF backend to use. Allowed values: pypdfium2, dlparse_v1, dlparse_v2. Defaults to dlparse_v1.
>   • table_mode (Optional[str]): Table mode to use. Allowed values: fast, accurate. Defaults to fast.
>   • abort_on_error (Optional[bool]): If enabled, abort on error. Defaults to false.
>   • return_as_file (Optional[bool]): If enabled, return the output as a file. Defaults to false.
>
> Above is doable almost immediately (give me 1-2 days after we agree, this project is unfortunately not on the top of my list).
>
> In a second phase I would introduce the async endpoints:
>
>   • /v1/async_convert/url -> converts a URL or a set of URLs (either in a list or as a comma-separated string; that's my current behaviour)
>   • /v1/async_convert/file -> converts a file or a set of files.
>   • /v1/status/{task_id} -> standard feedback (retries implemented client side)
>   • /v1/ws_status/{task_id} -> websocket feedback (real-time feedback)

I think the following URLs might be cleaner:
/v1alpha/convert/url/async
/v1alpha/convert/file/async
/v1alpha/status/poll/{task_id}
/v1alpha/status/ws/{task_id}

But I think we can continue the async API discussion in separate issues.

> Waiting for comments/approval to go on.

The details for the sync APIs look good to me, so +1 from me.

@guimou

guimou commented Dec 10, 2024

@vishnoianil Thanks for the feedback. I'm going on with this plan then.

@willkara

Looking at the response options for /convert, is there any reason I couldn't add responses for other types, such as text, JSON, and the others already included in the export commands?
