Delete invalid .traineddata files in cache #753
Summary of this changeTL;DR Setting ExplanationBy default, Tesseract.js caches Prior to Due to this bug, many developers using Tesseract.js started bypassing the caching feature entirely by setting Starting in |
One of the most common error messages reported is `Error opening data file ./eng.traineddata` (or the equivalent for other languages). This is due to our current caching behavior.

When a `.traineddata` file is downloaded, any fetch response reported as `ok` (which corresponds to a status of 200-299) is cached.

tesseract.js/src/worker-script/index.js, lines 108 to 111 in 7a087ca
The cached file is then used until the user manually deletes it, even if the file is invalid. The assumption this code makes is that an `ok` response indicates that some `.traineddata` file was successfully downloaded, and that if the file is somehow corrupted, it is because the developer uploaded a corrupted `.traineddata` file.

This does not appear to be the case. Some server configurations appear to return `200` responses even if the `langPath` value is invalid (see #714). Furthermore, given user reports, this may even happen when the default `langPath` value is used (see #521), although the mechanism for this is unclear.

We should edit the code so that tesseract.js deletes the saved `.traineddata` file when it detects that it is invalid. With this change, the next time the code is run it will again try to download the `.traineddata` file from `langPath`, rather than re-using the cached data that has already been determined to be invalid.
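To make the proposed behavior concrete, here is a minimal sketch, not the actual tesseract.js code. Everything in it is hypothetical: `loadLanguage`, the `Map` standing in for the real cache, and the `download` and `initEngine` callbacks are illustration-only names, and the flow is written synchronously for brevity where the real code is asynchronous.

```javascript
// Hypothetical sketch: evict the cached language data when the engine rejects it.
// `cache` is any Map-like store, `download` fetches the data for a language,
// and `initEngine` throws if the data is not a usable .traineddata file.
function loadLanguage(lang, cache, download, initEngine) {
  const wasCached = cache.has(lang);
  const data = wasCached ? cache.get(lang) : download(lang);

  // Current behavior: whatever an `ok` response returned gets cached.
  if (!wasCached) {
    cache.set(lang, data);
  }

  try {
    initEngine(data);
  } catch (err) {
    // Proposed behavior: drop the invalid entry so the next run downloads again.
    cache.delete(lang);
    throw err;
  }
  return data;
}

// Usage with stand-in callbacks (assumptions, for illustration only).
const demoCache = new Map();
const demoInit = (data) => {
  if (data !== 'valid-traineddata') {
    throw new Error('Error opening data file ./eng.traineddata');
  }
};

try {
  // A server that answers 200 with an HTML error page instead of the file.
  loadLanguage('eng', demoCache, () => '<html>Not found</html>', demoInit);
} catch (err) {
  // The bad entry was evicted, so a later attempt is free to download again.
}
loadLanguage('eng', demoCache, () => 'valid-traineddata', demoInit);
```

The first call still fails, but the failure no longer persists across runs: once the server or `langPath` is fixed, the next call downloads and caches a good file.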