【BUG】Cannot recognize text in four directions in part symbol images #861

EasyEDA2021 · 2023-12-22T12:37:11Z

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
v5.0.3
Describe the bug
See the image.

For example, page 1 of:
https://atta.szlcsc.com/upload/public/pdf/source/20151029/1457707509740.pdf

Missing text:

To Reproduce
Steps to reproduce the behavior:
Take a screenshot, then import it into Tesseract.

Please attach any input image required to replicate this behavior.

Expected behavior
Text in all four directions should be recognized correctly.

Device Version:

OS + Version: [Windows 10]
Browser [Chrome 122]

Additional context
None.

Thank you for the nice work.

Balearica · 2023-12-24T23:44:37Z

There are multiple intersecting reasons why these particular images perform poorly. However, all of them are issues with the Tesseract OCR engine rather than Tesseract.js, so fixing them is outside the scope of this repo.

- Tesseract is not capable of handling multiple text orientations within the same image.
  - Tesseract should be capable of recognizing "this entire image needs to be rotated 90 degrees"; however, it is not capable of recognizing "this word needs to be rotated 90 degrees".
- Tesseract often performs poorly when non-text elements are combined with text elements.
  - Underlining text, drawing boxes around text, etc. often throws Tesseract off.
- Tesseract often performs poorly when recognizing complex layouts.
  - Any layout more complex than a basic 1 or 2 column layout, including images where text is essentially scattered throughout, is likely to perform poorly.
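
Since the engine can rotate a whole image but not individual words, one possible workaround is to run OCR on the image once per orientation (0, 90, 180, 270 degrees) and merge the results. The sketch below covers only the merging step. The function name `mergeOrientationPasses`, the pass shape `{ angle, words }`, and the confidence threshold of 60 are all assumptions made for illustration; rotating the image and running each OCR pass are left to the caller, and this approach may still return spurious words from passes where the text is sideways.

```javascript
// Hypothetical helper: merge word results from several rotated OCR passes.
// Each pass is assumed to look like { angle, words: [{ text, confidence }] }.
// This shape is an illustration, not a fixed library API.
function mergeOrientationPasses(passes, minConfidence = 60) {
  const merged = [];
  for (const { angle, words } of passes) {
    for (const word of words) {
      const text = word.text.trim();
      // Keep only non-empty words the engine was reasonably sure about.
      if (text.length > 0 && word.confidence >= minConfidence) {
        merged.push({ text, angle, confidence: word.confidence });
      }
    }
  }
  // Highest-confidence readings first.
  return merged.sort((a, b) => b.confidence - a.confidence);
}

// Example with made-up data: upright text is read well at 0 degrees,
// vertical text only becomes readable in the 90 degree pass.
const examplePasses = [
  { angle: 0, words: [{ text: 'VCC', confidence: 92 }, { text: '|||', confidence: 20 }] },
  { angle: 90, words: [{ text: 'GND ', confidence: 88 }] },
];
const exampleMerged = mergeOrientationPasses(examplePasses);
```

Low-confidence noise from the "wrong" orientation is dropped by the threshold, which would likely need tuning per image set.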

For context, Tesseract.js is the JavaScript/WebAssembly port of Tesseract. We do not make any edits to the recognition engine, so any accuracy issues with the Tesseract engine are outside the scope of this project. Therefore, if you would like to pursue this further, you should consult the documentation and discussions for the main Tesseract project. You may find configuration settings that help achieve better results.
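
As one example of a setting worth experimenting with, Tesseract's page segmentation mode can be changed (mode `11` is "sparse text", `12` is "sparse text with OSD"). The sketch below tries several modes and ranks them by overall confidence. It assumes `worker` is a Tesseract.js worker and uses only its `setParameters()` and `recognize()` methods; the helper name `comparePageSegModes` and the default list of modes are assumptions, and there is no guarantee that any mode fixes the images in this issue.

```javascript
// Sketch: try several page segmentation modes and rank them by confidence.
// `worker` is assumed to be a Tesseract.js worker (e.g. from createWorker('eng')).
// Modes: '3' = automatic (default), '6' = single block,
//        '11' = sparse text, '12' = sparse text with OSD.
async function comparePageSegModes(worker, image, modes = ['3', '6', '11', '12']) {
  const results = [];
  for (const mode of modes) {
    await worker.setParameters({ tessedit_pageseg_mode: mode });
    const { data } = await worker.recognize(image);
    results.push({ mode, confidence: data.confidence, text: data.text });
  }
  // Best-scoring mode first.
  return results.sort((a, b) => b.confidence - a.confidence);
}
```

Overall confidence is only a rough proxy for accuracy, so the returned text should still be checked by eye.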

If you do not find settings that improve recognition, and you believe this constitutes a (previously unreported) bug, then you should replicate the issue using the main Tesseract CLI program and raise it with that project.

EasyEDA2021 · 2023-12-27T05:59:50Z

Hi Balearica,
Thank you for your reply. Understood; we will check whether the main Tesseract project has the same issue.
Thanks.

EasyEDA2021 closed this as completed Dec 27, 2023
