Improve performance for single-page jobs #4

Open · Balearica opened this issue Aug 29, 2024 · 0 comments

Balearica (Contributor) commented Aug 29, 2024

Overview

Scribe.js was originally built for the use-case of processing documents or books. For example, one of the original use-cases was converting a scan of a book into a native-text document using scribeocr.com.

As all the development focus has been on processing large documents with dozens of pages, processing simple, single-page documents currently takes longer than it needs to. Some examples are below.

Simple Benchmark

Below are basic runtime measurements for the scribe.recognize function used with 3 different documents from the test corpus. All units are in milliseconds. The source images are included at the bottom of this issue.

  • Example 1: Trivial Single-Word Image
    • Total runtime: 342
      • Recognition: 98
      • Font optimization: 238
      • OCR Comparison: 2
        • This includes creating both Tesseract Combined Temp and Tesseract Combined.
        • No differences were actually compared as everything matched.
  • Example 2: Simple Full-Page Layout, Shorter Recognition
    • Total runtime: 3536
      • Recognition: 2583
      • Font optimization: 859
      • OCR Comparison: 84
  • Example 3: Complex Full-Page Layout, Longer Recognition
    • Total runtime: 10333
      • Recognition: 8797
      • Font optimization: 1076
      • OCR Comparison: 441
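
For reference, a minimal sketch of how the total runtime for scribe.recognize could be measured is below. The per-step timings above came from internal instrumentation; in this sketch the 'scribe.js-ocr' import path, the array argument to recognize(), and the terminate() cleanup call are assumptions, as only scribe.recognize itself is referenced in this issue.

```js
// Minimal end-to-end timing sketch (total runtime only). The import path,
// the argument shape for recognize(), and terminate() are assumptions.
import scribe from 'scribe.js-ocr';

const files = ['example1.png', 'example2.png', 'example3.png'];

for (const file of files) {
  const start = performance.now();
  await scribe.recognize([file]);
  console.log(`${file}: ${Math.round(performance.now() - start)} ms total`);
}

await scribe.terminate();
```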

Possible Changes

As the timings above demonstrate, runtime for different images is dominated by different steps. For the simplest image, the vast majority of the runtime is spent on font optimization, while for the most complex document, the vast majority is spent on recognition.

Change 1: Skip Font Optimization Entirely for Small Inputs

In the case of Example 1 above, there is no need to do anything after recognition (at least for .txt outputs), as there were no mismatches to judge between. While this is admittedly a special case, a broadly applicable change would be to skip the font optimization step (although not font detection) for small inputs. Font optimization relies on having a certain number of data points; data containing only a single word or sentence is not sufficient for generating custom fonts, even putting aside runtime considerations.
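
A minimal sketch of what such a guard could look like is below. The 50-word threshold, the page/word data shapes, and the optimizeFonts call are illustrative assumptions, not existing scribe.js internals.

```js
// Hypothetical guard: run font optimization (but not font detection) only
// when the recognized text provides enough glyph samples to be useful.
// The threshold and data shapes are illustrative assumptions.
const MIN_WORDS_FOR_FONT_OPT = 50;

function shouldOptimizeFonts(pages) {
  // Count recognized words across all pages in the job.
  const wordCount = pages.reduce((n, page) => n + page.words.length, 0);
  return wordCount >= MIN_WORDS_FOR_FONT_OPT;
}

// Later in the pipeline (sketch):
// if (shouldOptimizeFonts(ocrPages)) await optimizeFonts(ocrPages);
```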

Change 2: Add Option for Parallelizing Recognition at Job Level

For single-page jobs where runtime is driven by recognition (such as Example 3 above), we can likely achieve significant performance gains by adding an option for parallelizing steps within the same job. Note that this would make runtimes for large documents significantly slower, so it should only be an option and/or automatically enabled for single-page jobs. For large, multi-page jobs, the most efficient way of implementing parallel processing is at the "coarse-grained" level, where multiple pages are processed in parallel, which is what we do now.
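
One possible shape for this is sketched below, under the assumption that a page can be split into regions (e.g. horizontal strips) and each region recognized by its own worker. splitPage and recognizeRegion are illustrative callbacks, not existing scribe.js functions.

```js
// Hypothetical fine-grained parallelism for a single-page job: split the page
// into regions, recognize each region concurrently, and concatenate the
// partial results in reading order. splitPage and recognizeRegion are
// illustrative callbacks, not existing scribe.js APIs.
async function recognizeSinglePageParallel(image, splitPage, recognizeRegion, workerCount = 4) {
  const regions = await splitPage(image, workerCount);

  // One worker per region of the same page; for multi-page jobs the worker
  // pool is better spent on whole pages (the current coarse-grained approach).
  const partials = await Promise.all(regions.map((region) => recognizeRegion(region)));

  return partials.flat();
}
```

Any such splitting would likely need to respect detected text-line boundaries, as words or lines cut across region borders would be recognized incorrectly, and the merged result would need to preserve reading order.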

Example Images

(Source images for Examples 1, 2, and 3.)
