A lot of garbage instead of dots (...) #873

0xff00ff · 2024-01-15T16:00:58Z

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
^5.0.4

Describe the bug
I am trying to recognize text that contains a lot of dots, and sometimes Tesseract outputs garbage of random letters instead of (and sometimes after) those dots.

To Reproduce
I use this code:

const { createWorker } = require('tesseract.js');

(async () => {
  const worker = await createWorker("pol", 1, {
    logger: m => console.log(m),
  });
  const { data: { text } } = await worker.recognize("https://0x-tesseract.s3.amazonaws.com/Screenshot+2024-01-15+000829.png");
  console.log(text);
  await worker.terminate();
})();

Expected behavior
A clear and concise description of what you expected to happen.

Device Version:

OS + Version: windows 11 + wsl2
Node v18

source image:

The result looks like:


W koszyku z „zakupami. (zakupy) Są ........eeenenoeneneenenene dzezzronoenonownenowneneeeee (OWOC, warzywo). Są tam
awanawnananowawnowowawnanew | eaenonawnanowawaenenawoenene | sesesaeneseeeeneseseeeeeee (dojrzały czerwony pomidor — I. mn.), — siesaroononoononowoononene
aśdzswśonisrwsawiośwwianeśs (ZfeloNy, Ogórek — /..mN;)  ... <cutted>

Balearica · 2024-01-19T08:12:03Z

I believe what you are describing is a limitation of the Tesseract recognition model(s) rather than something specific to Tesseract.js. Tesseract.js is the JavaScript/WebAssembly port of Tesseract, so making changes to the model is outside the scope of this repo.

From personal experience, I can confirm that the LSTM model (oem value 1) is prone to hallucinating text given dots or squiggles. The Legacy model (oem value 0) will be less prone to hallucinating words that are completely at odds with what you see on the page, as it relies more on the shape of the individual letters, so you could try using that. However, in addition to being less accurate in general, the Legacy model is known to struggle with italics, which your image contains. Therefore, I would not expect either model to give excellent results for your image without pre- or post-processing to filter out the junk.
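As a rough illustration of what such post-processing could look like, here is a minimal sketch. It is not part of Tesseract.js. The helper names `isLikelyJunk` and `replaceJunk`, the 18-letter threshold, and the rule itself (a long, letters-only token, optionally glued to leading dots, is treated as a hallucinated dot leader) are all assumptions made up for this example.

```javascript
// Hypothetical heuristic, not part of Tesseract.js: a token counts as junk when,
// after any leading dots are stripped, it is a run of at least minLength letters.
function isLikelyJunk(token, minLength = 18) {
  const core = token.replace(/^[.…]+/, '');
  return core.length >= minLength && /^\p{L}+$/u.test(core);
}

// Replace every junk token with a plain dot leader, leaving other tokens untouched.
function replaceJunk(text, placeholder = '...') {
  return text
    .split('\n')
    .map(line =>
      line.split(' ').map(t => (isLikelyJunk(t) ? placeholder : t)).join(' '))
    .join('\n');
}
```

This would be applied to the `text` returned by `worker.recognize(...)`, for example `console.log(replaceJunk(text))`. The threshold is a guess: genuine words longer than it would be removed as well, and shorter junk would survive, so it would need tuning against real output.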

Tesseract offers many different configuration settings that you can experiment with, so it is possible that there is some setting that would help in this case. These options would be documented in the main Tesseract documentation or repo, rather than here in the Tesseract.js repo. Every configuration setting for Tesseract can also be used in Tesseract.js.
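For reference, a minimal sketch of how such settings are passed in Tesseract.js v5: the engine mode is the second argument to `createWorker`, and other Tesseract parameters go through `worker.setParameters`. The function name `recognizeWithSettings` and the specific values are assumptions chosen for illustration, not a confirmed fix for this image. In v5 the Legacy engine additionally needs the `legacyCore` and `legacyLang` worker options, and whether Legacy data works well for `pol` is untested here.

```javascript
// Illustrative values only; none of these is a confirmed fix for the dot problem.
const LEGACY_OEM = 0; // Legacy engine (OEM.TESSERACT_ONLY in Tesseract.js)
const legacyWorkerOptions = { legacyCore: true, legacyLang: true }; // load Legacy-capable code and data
const experimentalParams = { tessedit_pageseg_mode: '6' }; // PSM 6: assume a single uniform block of text

async function recognizeWithSettings(image, lang = 'pol') {
  // Required lazily so the settings above can be inspected without tesseract.js installed.
  const { createWorker } = require('tesseract.js');
  const worker = await createWorker(lang, LEGACY_OEM, legacyWorkerOptions);
  try {
    // Any Tesseract parameter name can be passed here as a key.
    await worker.setParameters(experimentalParams);
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    // Always release the worker, even if recognition throws.
    await worker.terminate();
  }
}
```

It would be called with the same image URL as in the original report, for example `recognizeWithSettings(imageUrl).then(console.log)`. Some Tesseract parameters can only be set at initialization rather than through `setParameters`, so the main Tesseract documentation is the place to check for each setting.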
