Summary
Major New Features
54% smaller file sizes for English, 73% smaller for Chinese (see Reduce file sizes #806 for details)
This results in a ~50% decrease in runtime for first-time users (who do not yet have the data downloaded/cached)
Significantly lower memory usage
Worker memory utilization in the web benchmark is reduced from 311 MB to 164 MB (47% reduction)
The lower memory footprint makes it feasible to use more workers, significantly improving performance for projects that utilize schedulers for parallel processing
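To make the effect on worker counts concrete, the arithmetic can be sketched as below. The two per-worker figures are the benchmark numbers quoted above; the 1024 MB budget and the helper names are hypothetical, and the sketch assumes memory use scales linearly with the number of workers, which may not hold in practice.

```javascript
// Per-worker memory figures quoted above from the web benchmark (MB).
const V4_WORKER_MB = 311;
const V5_WORKER_MB = 164;

// Percentage reduction in memory going from `before` to `after`.
function percentReduction(before, after) {
  return ((before - after) / before) * 100;
}

// How many workers fit in a memory budget (hypothetical helper; assumes
// memory use scales linearly with the number of workers).
function workersWithinBudget(budgetMb, perWorkerMb) {
  return Math.floor(budgetMb / perWorkerMb);
}

console.log(Math.round(percentReduction(V4_WORKER_MB, V5_WORKER_MB))); // 47
console.log(workersWithinBudget(1024, V4_WORKER_MB)); // 3
console.log(workersWithinBudget(1024, V5_WORKER_MB)); // 6
```

Under these assumptions, the same budget holds roughly twice as many workers in v5 as in v4.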
Compatible with iOS 17 (using default settings)
iOS 17 broke compatibility with Tesseract.js v4; upgrading to v5 should resolve this
See discussion section below for details
Breaking Changes Impacting Many Users
createWorker arguments changed
Setting non-default language and OEM now happens in createWorker
E.g. createWorker("chi_sim", 1)
worker.initialize and worker.loadLanguage functions now do nothing and can be deleted from code
Loading the language and initialization now occurs in createWorker
Workers can be re-initialized with different settings using worker.reinitialize
In other words, code that previously called worker.loadLanguage and worker.initialize after createWorker should be reduced to this:
const worker = await Tesseract.createWorker("eng");
const ret = await worker.recognize(file);
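A slightly fuller sketch of the v5 pattern is shown below. It takes the `Tesseract` object as a parameter so that it makes no assumption about how the library is loaded. The function name `recognizeOnce` is hypothetical, and the sketch assumes the recognized text is available as `ret.data.text` and that the worker exposes `terminate`.

```javascript
// `Tesseract` is passed in so the sketch does not assume how the library is
// loaded (script tag, import, or require).
async function recognizeOnce(Tesseract, file) {
  // v5: the language (and optionally the OEM) is set here; the separate
  // loadLanguage and initialize calls are no longer needed.
  const worker = await Tesseract.createWorker("eng");
  try {
    const ret = await worker.recognize(file);
    return ret.data.text; // assumed result shape
  } finally {
    // Free the worker's resources once it is no longer needed.
    await worker.terminate();
  }
}
```

The `try`/`finally` ensures the worker is terminated even if recognition throws.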
Breaking Changes Impacting Fewer Users
Users who manually set corePath will need to update the contents of their corePath directory
corePath should point to a directory that contains all 4 of the files below from Tesseract.js-core v5:
tesseract-core.wasm.js
tesseract-core-simd.wasm.js
tesseract-core-lstm.wasm.js
tesseract-core-simd-lstm.wasm.js
Tesseract.js will automatically select the correct version to use
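To illustrate why all four files are needed, the sketch below maps two capability flags onto the four file names listed above. It is an illustration only, not the library's selection code: the function name and flag names are hypothetical, and it assumes the "simd" builds target environments with WebAssembly SIMD support and the "lstm" builds contain only the LSTM model.

```javascript
// Illustrative only: maps two capability flags onto the four file names
// listed above. The library's real selection logic may differ.
function pickCoreFile({ simd, lstmOnly }) {
  const parts = ["tesseract-core"];
  if (simd) parts.push("simd"); // assumed: build for SIMD-capable environments
  if (lstmOnly) parts.push("lstm"); // assumed: build without the Legacy model
  return parts.join("-") + ".wasm.js";
}

console.log(pickCoreFile({ simd: true, lstmOnly: true }));
console.log(pickCoreFile({ simd: false, lstmOnly: false }));
```

Because the choice depends on both the environment and the requested model, a corePath directory that is missing any of the four files could fail for some users and not others.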
worker.detect function disabled by default
Orientation + script detection is a function of the Legacy model only, which is no longer included by default
To enable, set arguments legacyCore: true and legacyLang: true in createWorker options
E.g. Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
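Putting the options above together with a call to worker.detect might look like the following sketch. The function name `detectOrientation` is hypothetical, `Tesseract` is passed in rather than imported, and the sketch assumes the detection result is returned under `ret.data`.

```javascript
async function detectOrientation(Tesseract, file) {
  // OEM 1 (LSTM) stays the recognition model; the legacy flags load the
  // extra code and data that detection relies on.
  const worker = await Tesseract.createWorker("eng", 1, {
    legacyCore: true,
    legacyLang: true,
  });
  try {
    const ret = await worker.detect(file);
    return ret.data; // assumed result shape
  } finally {
    await worker.terminate();
  }
}
```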
Language of progress logs standardized
This should only impact users who parse status logs (e.g. to update a loading bar)
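For projects that drive a loading bar from these logs, one way to stay robust to wording changes is to key on the numeric progress value and treat the status text as display-only. The sketch below is a hypothetical formatter; it assumes each log message is an object with a `status` string and a `progress` number between 0 and 1, and that the function would be wired up through a logger callback in the createWorker options.

```javascript
// Turns one log message into loading-bar text. The `status` and `progress`
// (0 to 1) field names are assumed here; check them against your version.
function formatProgress(m, width = 20) {
  // Clamp to [0, 1] and treat a missing or invalid value as 0.
  const fraction = Math.min(Math.max(Number(m.progress) || 0, 0), 1);
  const filled = Math.round(fraction * width);
  const bar = "#".repeat(filled) + "-".repeat(width - filled);
  return `[${bar}] ${Math.round(fraction * 100)}% ${m.status}`;
}

console.log(formatProgress({ status: "recognizing text", progress: 0.5 }));
```

Code that instead compares `status` against exact strings is the kind most likely to need updating for v5.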
Non-Breaking Changes
Language data loaded from jsdelivr by default (rather than GitHub pages)
This should result in improved performance and uptime
Separate "development" build (which produced tesseract.dev.js and worker.dev.js) removed
Documentation and examples were modified to prevent new users from using Tesseract.recognize and Tesseract.detect
Users who already use these functions are encouraged to modify their code to use worker.recognize and worker.detect instead
Discussion
How can file sizes be reduced by so much?
Tesseract contains 2 recognition models—LSTM and Legacy. The vast majority of users only use the LSTM model (the default). However, the Legacy model takes up more space, and previous versions of Tesseract.js loaded all of the resources required for both models. This resulted in significant wasteful network activity. For example, for Chinese (simplified) 73% of the size of the code and data was attributable to the (usually) unused Legacy model.
What justifies the breaking changes to createWorker/loadLanguage/initialize?
The primary reason is that these changes are necessary to facilitate the major improvement of v5—significantly reducing file sizes. How this reduction is achieved is described in the answer directly above. As Tesseract.js is a JavaScript library generally run in the browser, having reasonable file sizes is a high priority. This is especially true as use on mobile devices becomes more common. Making this improvement would have been impossible without combining createWorker/loadLanguage/initialize.
Previously, the user specified which recognition model (OEM) to use during initialize. As initialize was run after createWorker and loadLanguage (which load the code and language data required for each model), there was no way for these functions to load only the data required for the chosen model. By combining these functions, Tesseract.js knows which model is being used before it loads code or data, so it can load only the required resources.
In addition to this primary reason, combining these functions should simplify the process of creating a worker. The large number of functions required to create a new worker (4 in v3 and 3 in v4) was pushing some users towards using Tesseract.recognize instead (as this handles everything in a single function). Simplifying the process of creating a new worker will hopefully result in more users using workers, which is more efficient than Tesseract.recognize (which creates and destroys a worker every time it is used).
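The efficiency point can be illustrated with a sketch that creates one worker and reuses it for several images, rather than creating and destroying a worker per image. The function name `recognizeAll` is hypothetical, `Tesseract` is passed in rather than imported, and the result shape `ret.data.text` is assumed.

```javascript
async function recognizeAll(Tesseract, files) {
  // One worker is created up front and shared by every recognition job.
  const worker = await Tesseract.createWorker("eng");
  const texts = [];
  try {
    for (const file of files) {
      const ret = await worker.recognize(file);
      texts.push(ret.data.text); // assumed result shape
    }
  } finally {
    // The worker is terminated once, after all files are processed.
    await worker.terminate();
  }
  return texts;
}
```

With this structure the cost of loading code and language data is paid once, however many files are processed.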
How can I restore the old behavior (loading both LSTM + Legacy models)?
Within createWorker, if you set oem to 0 (Tesseract Legacy) or 2 (Tesseract Legacy + LSTM), code and language data for both the Legacy and LSTM models will be loaded automatically. You can force both models to be loaded regardless of oem by setting legacyCore: true and legacyLang: true in the createWorker options. For example:
Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
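The rule described above can be modeled as a small function. This is a sketch of the rule as stated in this answer, not the library's implementation; the function name `needsLegacy` is hypothetical.

```javascript
// Models the rule above: Legacy code and data are loaded when oem is 0 or 2,
// or when explicitly forced through the legacyCore / legacyLang options.
function needsLegacy(oem, { legacyCore = false, legacyLang = false } = {}) {
  const impliedByOem = oem === 0 || oem === 2;
  return {
    legacyCore: impliedByOem || legacyCore,
    legacyLang: impliedByOem || legacyLang,
  };
}

console.log(needsLegacy(1)); // { legacyCore: false, legacyLang: false }
console.log(needsLegacy(2)); // { legacyCore: true, legacyLang: true }
```

With the default oem of 1 and no options, only the LSTM resources are needed, which is where the file size savings come from.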
If your application re-initializes existing workers with a different language or OEM, this is now achieved using worker.reinitialize (rather than worker.loadLanguage and worker.initialize). For example, a worker can recognize file using the LSTM model, then be switched to the Legacy model with worker.reinitialize and re-run recognition on the same file.
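The re-initialization flow described above might look like the following sketch. The function name is hypothetical and `Tesseract` is passed in rather than imported. The sketch assumes worker.reinitialize accepts the language and OEM in the same order as createWorker, and it loads the Legacy code and data up front via legacyCore and legacyLang as a precaution.

```javascript
async function recognizeWithBothModels(Tesseract, file) {
  // Start with the LSTM model (oem 1), loading Legacy resources as well so
  // that switching models later does not depend on them being fetched.
  const worker = await Tesseract.createWorker("eng", 1, {
    legacyCore: true,
    legacyLang: true,
  });
  try {
    const lstm = await worker.recognize(file);
    // Switch the same worker to the Legacy model (oem 0); argument order
    // is assumed to match createWorker.
    await worker.reinitialize("eng", 0);
    const legacy = await worker.recognize(file);
    return { lstm: lstm.data.text, legacy: legacy.data.text };
  } finally {
    await worker.terminate();
  }
}
```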
How does this release impact iOS compatibility?
iOS v17.0 and v17.1 include a bug that causes the Legacy + LSTM build of Tesseract.js to crash. Apple patched this issue in iOS v17.2. This bug does not impact the LSTM-only build, which became the default in Tesseract.js v5. Therefore, developers who want their application to be compatible with iOS v17.0 and v17.1 are advised to upgrade to Tesseract.js v5. Discussion regarding this issue is documented in #804.
I am still having trouble upgrading my project, what should I do?
Start by reviewing the examples directory; most uses of Tesseract.js have a corresponding example. If you are struggling to upgrade your project after reviewing both this issue and the examples, feel free to open a new GitHub issue.
Balearica changed the title from "[Draft] Version 5 Development and Changes" to "Version 5 Development and Changes" on Sep 10, 2023
Just by chance, how do I invoke the new createWorker in a local environment? When I try it right now, I get an error saying createWorker is not defined. It is the only issue so far.
@luisalvarado I am assuming this is solely in reference to the new dev/v5 branch, and you are able to run the examples in the master branch. I updated the examples in the dev/v5 branch today, and confirmed they all run. Therefore, if you run (for example) this example code it should be clear how to run locally.