Improve performance for single-page jobs #4

Open · Balearica opened this issue Aug 29, 2024 · 0 comments

Balearica (Contributor) commented Aug 29, 2024

Overview

Scribe.js was originally built for the use-case of processing documents or books. For example, one of the original use-cases was converting a scan of a book into a native-text document using scribeocr.com.

As all the development focus has been on processing large documents with dozens of pages, processing simple, single-page documents currently takes longer than it needs to. Some examples are below.

Simple Benchmark

Below are basic runtime measurements for the scribe.recognize function used with 3 different documents from the test corpus. All units are in milliseconds. The source images are included at the bottom of this issue.

  • Example 1: Trivial Single-Word Image
    • Total runtime: 342
      • Recognition: 98
      • Font optimization: 238
      • OCR Comparison: 2
        • This includes creating both Tesseract Combined Temp and Tesseract Combined.
        • No differences were actually compared as everything matched.
  • Example 2: Simple Full-Page Layout, Shorter Recognition
    • Total runtime: 3536
      • Recognition: 2583
      • Font optimization: 859
      • OCR Comparison: 84
  • Example 3: Complex Full-Page Layout, Longer Recognition
    • Total runtime: 10333
      • Recognition: 8797
      • Font optimization: 1076
      • OCR Comparison: 441
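
For reference, a minimal sketch of how the total runtime for scribe.recognize could be measured is below. The per-step timings above came from internal instrumentation; in this sketch the 'scribe.js-ocr' import path, the array argument to recognize(), and the terminate() cleanup call are assumptions, as only scribe.recognize itself is referenced in this issue.

```js
// Minimal end-to-end timing sketch (total runtime only). The import path,
// the argument shape for recognize(), and terminate() are assumptions.
import scribe from 'scribe.js-ocr';

const files = ['example1.png', 'example2.png', 'example3.png'];

for (const file of files) {
  const start = performance.now();
  await scribe.recognize([file]);
  console.log(`${file}: ${Math.round(performance.now() - start)} ms total`);
}

await scribe.terminate();
```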

Possible Changes

As the timings above demonstrate, runtime for different images is dominated by different steps. For the simplest image, the vast majority of the runtime is spent on font optimization, while for the most complex document, the vast majority is spent on recognition.

Change 1: Skip Font Optimization Entirely for Small Inputs

In the case of Example 1 above, there is no need to do anything after recognition (at least for .txt outputs), as there were no mismatches to judge between. While this is admittedly a special case, a broadly applicable change would be to skip the font optimization step (although not font detection) for small inputs. Font optimization relies on having a certain number of data points; data containing only a single word or sentence is not sufficient for generating custom fonts, even putting aside runtime considerations.
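
A minimal sketch of what such a guard could look like is below. The 50-word threshold, the page/word data shapes, and the optimizeFonts call are illustrative assumptions, not existing scribe.js internals.

```js
// Hypothetical guard: run font optimization (but not font detection) only
// when the recognized text provides enough glyph samples to be useful.
// The threshold and data shapes are illustrative assumptions.
const MIN_WORDS_FOR_FONT_OPT = 50;

function shouldOptimizeFonts(pages) {
  // Count recognized words across all pages in the job.
  const wordCount = pages.reduce((n, page) => n + page.words.length, 0);
  return wordCount >= MIN_WORDS_FOR_FONT_OPT;
}

// Later in the pipeline (sketch):
// if (shouldOptimizeFonts(ocrPages)) await optimizeFonts(ocrPages);
```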

Change 2: Add Option for Parallelizing Recognition at Job Level

For single-page jobs where runtime is driven by recognition (such as Example 3 above), we can likely achieve significant performance gains by adding an option for parallelizing steps within the same job. Note that this would make runtimes for large documents significantly slower, so it should only be an option and/or automatically enabled for single-page jobs. For large, multi-page jobs, the most efficient way of implementing parallel processing is at the "coarse-grained" level, where multiple pages are processed in parallel, which is what we do now.
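
One possible shape for this is sketched below, under the assumption that a page can be split into regions (e.g. horizontal strips) and each region recognized by its own worker. splitPage and recognizeRegion are illustrative callbacks, not existing scribe.js functions.

```js
// Hypothetical fine-grained parallelism for a single-page job: split the page
// into regions, recognize each region concurrently, and concatenate the
// partial results in reading order. splitPage and recognizeRegion are
// illustrative callbacks, not existing scribe.js APIs.
async function recognizeSinglePageParallel(image, splitPage, recognizeRegion, workerCount = 4) {
  const regions = await splitPage(image, workerCount);

  // One worker per region of the same page; for multi-page jobs the worker
  // pool is better spent on whole pages (the current coarse-grained approach).
  const partials = await Promise.all(regions.map((region) => recognizeRegion(region)));

  return partials.flat();
}
```

Any such splitting would likely need to respect detected text-line boundaries, as words or lines cut across region borders would be recognized incorrectly, and the merged result would need to preserve reading order.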

Example Images

(Source images for Examples 1, 2, and 3.)
