A lot of garbage instead of dots (...) #873

Open
0xff00ff opened this issue Jan 15, 2024 · 1 comment

0xff00ff commented Jan 15, 2024

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
^5.0.4

Describe the bug
I am trying to recognize text that contains many runs of dots (......). Sometimes, in place of those dots (and sometimes after them), the output contains garbage made of random letters.

To Reproduce
I use this code:

const { createWorker } = require('tesseract.js');

(async () => {
  const worker = await createWorker("pol", 1, {
    logger: m => console.log(m),
  });
  const { data: { text } } = await worker.recognize("https://0x-tesseract.s3.amazonaws.com/Screenshot+2024-01-15+000829.png");
  console.log(text);
  await worker.terminate();
})();

Device Version:

  • OS + Version: Windows 11 + WSL2
  • Node v18

source image:
Screenshot 2024-01-15 000829

The result looks like:


W koszyku z „zakupami. (zakupy) Są ........eeenenoeneneenenene dzezzronoenonownenowneneeeee (OWOC, warzywo). Są tam
awanawnananowawnowowawnanew | eaenonawnanowawaenenawoenene | sesesaeneseeeeneseseeeeeee (dojrzały czerwony pomidor — I. mn.), — siesaroononoononowoononene
aśdzswśonisrwsawiośwwianeśs (ZfeloNy, Ogórek — /..mN;)  ... <truncated>

@Balearica (Collaborator) commented

I believe what you are describing is a limitation of the Tesseract recognition model(s) rather than something specific to Tesseract.js. Tesseract.js is the JavaScript/WebAssembly port of Tesseract, so making changes to the model is outside the scope of this repo.

From personal experience, I can confirm that the LSTM model (oem value 1) is prone to hallucinating text when given dots or squiggles. The Legacy model (oem value 0) is less prone to hallucinating words that are completely at odds with what is on the page, because it relies more on the shape of the individual letters, so you could try using that. However, in addition to being less accurate in general, the Legacy model is known to struggle with italics, which your image contains. Therefore, I would not expect either model to give excellent results for your image without pre- or post-processing to filter out the junk.
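
A rough sketch of the two ideas above (switching to the Legacy model and filtering the output afterwards) might look like the following. The helper names `filterWordsByConfidence` and `recognizeWithLegacy`, the injected `createWorker` argument, and the threshold of 60 are illustrative assumptions, not part of Tesseract.js. The sketch also assumes that the v5 worker options `legacyCore` and `legacyLang` are what is needed to load the Legacy model, and that the result exposes `data.words` with a per-word `confidence`, as v5 does by default. It has not been run against the image in this issue.

```javascript
// Keep only words whose confidence (0-100) meets the threshold and
// join them with single spaces. Line breaks are not preserved.
function filterWordsByConfidence(words, minConfidence) {
  return words
    .filter((word) => word.confidence >= minConfidence)
    .map((word) => word.text)
    .join(' ');
}

// Hypothetical helper. createWorker is passed in by the caller, e.g.
//   const { createWorker } = require('tesseract.js');
//   recognizeWithLegacy(createWorker, imageUrl).then(console.log);
async function recognizeWithLegacy(createWorker, image, minConfidence = 60) {
  // OEM 0 selects the Legacy engine. legacyCore / legacyLang request a
  // core build and language data that include the Legacy model.
  const worker = await createWorker('pol', 0, {
    legacyCore: true,
    legacyLang: true,
  });
  try {
    const { data } = await worker.recognize(image);
    // Assumption: data.words is populated; fall back to an empty list.
    return filterWordsByConfidence(data.words || [], minConfidence);
  } finally {
    // Always release the worker, even if recognition throws.
    await worker.terminate();
  }
}
```

The threshold is a guess and would need tuning per document; a filter like this can also drop correctly recognized words that happen to score low.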

Tesseract offers many different configuration settings that you can experiment with, so it is possible that there is some setting that would help in this case. These options would be documented in the main Tesseract documentation or repo, rather than here in the Tesseract.js repo. Every configuration setting for Tesseract can also be used in Tesseract.js.
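
For experimenting with such settings, one possible shape is a small wrapper around `worker.setParameters`. The helper name `recognizeWithParams` and the `EXPERIMENT_PARAMS` object are hypothetical. `tessedit_pageseg_mode` and `preserve_interword_spaces` are real Tesseract parameters, but they are chosen here only as examples; whether either one reduces the garbage on this particular image is untested.

```javascript
// Example parameter set to try; the values are strings, as Tesseract expects.
const EXPERIMENT_PARAMS = {
  tessedit_pageseg_mode: '6',     // PSM 6: treat the image as one uniform block of text
  preserve_interword_spaces: '1', // keep runs of spaces between words
};

// Hypothetical helper: apply parameters to an existing worker, then recognize.
// `worker` is expected to come from createWorker() in tesseract.js.
async function recognizeWithParams(worker, image, params = EXPERIMENT_PARAMS) {
  await worker.setParameters(params);
  const { data: { text } } = await worker.recognize(image);
  return text;
}
```

Because the worker is reused, several parameter sets can be tried in a loop against the same image before calling `worker.terminate()`.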
