Update to v5 (#830)
Balearica authored Sep 28, 2023
1 parent ccf7414 commit 6ebe92f
Showing 45 changed files with 532 additions and 777 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,8 +1,8 @@
.DS_Store
node_modules/*
yarn.lock
tesseract.dev.js
worker.dev.js
tesseract.min.js
worker.min.js
*.traineddata
*.traineddata.gz
.nyc_output
111 changes: 48 additions & 63 deletions README.md
@@ -31,82 +31,32 @@ Video Real-time Recognition


Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine.
It works in the browser using [webpack](https://webpack.js.org/) or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
It works in the browser using [webpack](https://webpack.js.org/), esm, or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
After you [install it](#installation), using it is as simple as:

```javascript
import Tesseract from 'tesseract.js';

Tesseract.recognize(
  'https://tesseract.projectnaptha.com/img/eng_bw.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
})
```

Or using workers (recommended for production use):

```javascript
import { createWorker } from 'tesseract.js';

const worker = await createWorker({
  logger: m => console.log(m)
});

(async () => {
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  const worker = await createWorker('eng');
  const data = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(data.text);
  await worker.terminate();
})();
```
When recognizing multiple images, users should create a worker once, run `worker.recognize` for each image, and then run `worker.terminate()` once at the end (rather than running the above snippet for every image).
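
As a minimal sketch of that advice, the hypothetical helper `recognizeAll` below creates one worker, reuses it for every image, and terminates it once at the end. It assumes the `createWorker('eng')` call shape shown above and that `worker.recognize` resolves to an object carrying `data.text`. `createWorker` is passed in as a parameter only for illustration, so the helper can also be run against a stub.

```javascript
// Sketch only: `recognizeAll` is a hypothetical helper, not part of tesseract.js.
// Typical use (assumes the package is installed):
//   const { createWorker } = require('tesseract.js');
//   const texts = await recognizeAll(createWorker, ['a.png', 'b.png']);
async function recognizeAll(createWorker, images, lang = 'eng') {
  // Create the worker once, not once per image.
  const worker = await createWorker(lang);
  const texts = [];
  try {
    for (const image of images) {
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
  } finally {
    // Terminate once, even if a recognition call throws.
    await worker.terminate();
  }
  return texts;
}
```

Processing the images sequentially keeps the sketch simple; the schedulers mentioned in the documentation are the route for parallel work.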

For a basic overview of the functions, including the pros/cons of different approaches, see the [intro](./docs/intro.md). [Check out the docs](#documentation) for a full explanation of the API.

## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.

- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
  - `createWorker` is now async
  - `getPDF` function replaced by `pdf` recognize option

## Major changes in v3
- Significantly faster performance
  - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
  - Node.js version 18
- Removed support:
  - ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
  - Node.js versions 10 and 12

## Major changes in v2
- Upgrade to tesseract v4.1.1 (using emscripten 1.39.10 upstream)
- Support multiple languages at the same time, eg: eng+chi\_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (fallback to ASM.js when browser doesn't support)
- Support Typescript

Read a story about v2: <a href="https://jeromewu.github.io/why-i-refactor-tesseract.js-v2/">Why I refactor tesseract.js v2?</a><br>
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1
## Installation
Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via `npm` and on Node.js with `npm/yarn`.


### CDN
```html
<!-- v4 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@4/dist/tesseract.min.js'></script>
<!-- v5 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>
```
After including the script the `Tesseract` variable will be globally available.
After including the script the `Tesseract` variable will be globally available and a worker can be created using `Tesseract.createWorker`.

Alternatively, an ESM build (used with `import` syntax) can be found at `https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.esm.min.js`.

### Node.js

@@ -122,16 +72,51 @@ npm install [email protected]
yarn add [email protected]
```


## Documentation

* [Intro](./docs/intro.md)
* [Workers vs. Schedulers](./docs/workers_vs_schedulers.md)
* [Examples](./docs/examples.md)
* [Image Format](./docs/image-format.md)
* [Supported Image Formats](./docs/image-format.md)
* [API](./docs/api.md)
* [Local Installation](./docs/local-installation.md)
* [FAQ](./docs/faq.md)

## Major changes in v5
Version 5 changes are documented in [this issue](https://github.com/naptha/tesseract.js/issues/820). Highlights are below.

- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese)
  - This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet)
- Significantly lower memory usage
- Compatible with iOS 17 (using default settings)
- Breaking changes:
  - `createWorker` arguments changed
    - Setting non-default language and OEM now happens in `createWorker`
      - E.g. `createWorker("chi_sim", 1)`
  - `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code
  - See [this issue](https://github.com/naptha/tesseract.js/issues/820) for full list
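
The breaking changes above could be applied roughly as follows. This is a sketch, not the library's own migration guide: `createWorkerFor` is a hypothetical wrapper, its default values are assumptions chosen for illustration, and the options object in third position follows the call shape used in the benchmark files of this commit.

```javascript
// v4 style, as in the removed README snippet. Per the notes above, the last
// two calls do nothing in v5 and can be deleted:
//   const worker = await createWorker();
//   await worker.loadLanguage('chi_sim');
//   await worker.initialize('chi_sim');

// v5 style: language and OEM are passed to createWorker directly.
// `createWorkerFor` is a hypothetical wrapper, not part of tesseract.js;
// the defaults ('eng', OEM 1) are illustrative assumptions.
async function createWorkerFor(createWorker, lang = 'eng', oem = 1, options = {}) {
  return createWorker(lang, oem, options);
}

// Typical use (assumes the package is installed):
//   const { createWorker } = require('tesseract.js');
//   const worker = await createWorkerFor(createWorker, 'chi_sim', 1);
```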

## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.

- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
  - `createWorker` is now async
  - `getPDF` function replaced by `pdf` recognize option

## Major changes in v3
- Significantly faster performance
  - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
  - Node.js version 18
- Removed support:
  - ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
  - Node.js versions 10 and 12


## Use tesseract.js the way you like!

- Electron Version: https://github.com/Balearica/tesseract.js-electron
@@ -167,7 +152,7 @@ npm start
```

The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser.
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder.
It will automatically rebuild `tesseract.min.js` and `worker.min.js` when you change files in the **src** folder.

### Online Setup with a single Click

11 changes: 3 additions & 8 deletions benchmarks/browser/auto-rotate-benchmark.html
@@ -1,7 +1,7 @@
<html>

<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
<style>
.column {
float: left;
@@ -37,15 +37,10 @@

const element = document.getElementById("imgRow");

const worker = await Tesseract.createWorker({
const worker = await Tesseract.createWorker('eng', 0, {
// corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js"
workerPath: "/dist/worker.min.js"
});
await worker.loadLanguage('eng');
await worker.initialize('eng');

await worker.initialize();


const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0;
24 changes: 14 additions & 10 deletions benchmarks/browser/speed-benchmark.html
@@ -1,6 +1,6 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
</head>
<body>
<textarea id="message">Working...</textarea>
@@ -13,20 +13,21 @@
const { createWorker } = Tesseract;

(async () => {
const worker = await createWorker({
// corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js"
const worker = await createWorker("eng", 1, {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.min.js",
});
await worker.loadLanguage('eng');
await worker.initialize('eng');

// The performance.measureUserAgentSpecificMemory function only runs under specific circumstances for security reasons.
// See: https://developer.mozilla.org/en-US/docs/Web/API/Performance/measureUserAgentSpecificMemory#security_requirements
// Launching a server using `npm start` and accessing via localhost on the same system should meet these conditions.
const debugMemory = true;
if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after initialization:");
console.log(await performance.measureUserAgentSpecificMemory());
const memObj = await performance.measureUserAgentSpecificMemory();
const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;

console.log(`Worker memory utilization after initialization: ${memMb} MB`);

} else {
console.log("Unable to run `performance.measureUserAgentSpecificMemory`: not crossOriginIsolated.")
}
@@ -45,8 +46,11 @@
}

if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after recognition:");
console.log(await performance.measureUserAgentSpecificMemory());
const memObj = await performance.measureUserAgentSpecificMemory();
const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;

console.log(`Worker memory utilization after recognition: ${memMb} MB`);

}

document.getElementById('message').innerHTML += "\nTotal runtime: " + timeTotal + "s";
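
The one-line `memMb` expression added in this file can be read as "sum the bytes of every breakdown entry whose first attribution is a dedicated worker, then convert to megabytes". A more explicit sketch of the same calculation is below; `workerMemoryMb` is a hypothetical name, and the input is assumed to have the shape returned by `performance.measureUserAgentSpecificMemory()` (a `breakdown` array of entries with `bytes` and `attribution`).

```javascript
// Sketch only: `workerMemoryMb` is a hypothetical helper mirroring the
// map/reduce expression in the benchmark above.
function workerMemoryMb(memObj) {
  const workerBytes = memObj.breakdown
    // Keep only entries attributed to a dedicated worker scope.
    .filter((entry) => entry.attribution?.[0]?.scope === 'DedicatedWorkerGlobalScope')
    // Sum their byte counts.
    .reduce((sum, entry) => sum + entry.bytes, 0);
  return workerBytes / 1e6; // bytes -> MB (decimal)
}
```

Filtering before reducing avoids the `undefined` values that the `map` version has to guard against with `|| 0`.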
2 changes: 0 additions & 2 deletions benchmarks/node/speed-benchmark.js
@@ -4,8 +4,6 @@ const { createWorker } = require('../../');

(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0;
for (let file of fileArr) {