【BUG】Cannot recognize text in all four orientations in part symbol images #861

Closed
EasyEDA2021 opened this issue Dec 22, 2023 · 2 comments

EasyEDA2021 commented Dec 22, 2023

Tesseract.js version (version number for npm/GitHub release, or specific commit for repo)
v5.0.3
Describe the bug
See the attached image:
[image: img_v3_026c_030a1028-ef46-4555-9838-c291aaf3670g]
For example, page 1 of:
https://atta.szlcsc.com/upload/public/pdf/source/20151029/1457707509740.pdf

Missing text:
[image]

To Reproduce
Steps to reproduce the behavior:
Take a screenshot of the page, then pass it to Tesseract.js for recognition.

Please attach any input image required to replicate this behavior.
[image]
[image]
[image]

Expected behavior
Text in all four orientations should be recognized, and recognized correctly.

Device Version:

  • OS + Version: [Windows 10]
  • Browser [Chrome 122]

Additional context
None.

Thank you for the great work.

@Balearica
Collaborator

There are multiple intersecting reasons why these particular images perform poorly. However, all of them are limitations of the Tesseract OCR engine rather than Tesseract.js, so fixing them is outside the scope of this repo.

  1. Tesseract is not capable of handling multiple text orientations within the same image
    • Tesseract should be able to recognize "this entire image needs to be rotated 90 degrees", but it cannot recognize "this word needs to be rotated 90 degrees"
  2. Tesseract often performs poorly when non-text elements are combined with text elements
    • Underlining text, drawing boxes around text, etc. often throws Tesseract off
  3. Tesseract often performs poorly when recognizing complex layouts
    • Any layout more complex than a basic 1 or 2 column layout, including images where text is essentially scattered throughout, is likely to perform poorly
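
To illustrate the first point, one possible workaround (not something proposed in this thread, and untested on these images) is to run recognition on four rotated copies of the image and keep only the words recognized with high confidence in each pass. The sketch below assumes two caller-supplied, hypothetical helpers: `rotate(image, angle)`, which returns a copy rotated clockwise, and `recognizeWords(image)`, which returns words shaped like `{ text, confidence, bbox }`. The `minConfidence` threshold of 80 is an arbitrary starting value.

```javascript
// Sketch of a possible workaround: OCR four rotated copies of the image and
// keep only the words recognized with high confidence in each pass.
// ASSUMPTIONS (not from the thread): `rotate` and `recognizeWords` are
// caller-supplied helpers, and words look like { text, confidence, bbox }.

const ANGLES = [0, 90, 180, 270]; // clockwise rotations, in degrees

// Map a box found in the copy rotated clockwise by `angle` back to the
// coordinate system of the original image (size: width x height).
function unrotateBox(bbox, angle, width, height) {
  const { x0, y0, x1, y1 } = bbox;
  switch (angle) {
    case 0:
      return { x0, y0, x1, y1 };
    case 90:
      return { x0: y0, y0: height - x1, x1: y1, y1: height - x0 };
    case 180:
      return { x0: width - x1, y0: height - y1, x1: width - x0, y1: height - y0 };
    case 270:
      return { x0: width - y1, y0: x0, x1: width - y0, y1: x1 };
    default:
      throw new Error(`Unsupported angle: ${angle}`);
  }
}

// Run one recognition pass per orientation and collect confident words,
// with bounding boxes expressed in original-image coordinates.
async function recognizeAllOrientations(image, size, rotate, recognizeWords, minConfidence = 80) {
  const kept = [];
  for (const angle of ANGLES) {
    const rotated = angle === 0 ? image : await rotate(image, angle);
    const words = await recognizeWords(rotated);
    for (const word of words) {
      if (word.confidence < minConfidence) continue; // likely the wrong orientation
      kept.push({
        text: word.text,
        confidence: word.confidence,
        angle,
        bbox: unrotateBox(word.bbox, angle, size.width, size.height),
      });
    }
  }
  return kept;
}
```

With Tesseract.js v5, `recognizeWords` could plausibly be wired up as `async (img) => (await worker.recognize(img)).data.words`, and `rotate` could be implemented with a canvas or an image library. Confidence filtering is only a heuristic: text read in the wrong orientation can still produce confident garbage, and the same word may be kept from more than one pass, so the merged output would need review.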

For context, Tesseract.js is the JavaScript/WebAssembly port of Tesseract. We do not make any edits to the recognition engine, so any accuracy issues with the Tesseract engine are outside the scope of this project. If you would like to pursue this further, you should consult the documentation and discussions for the main Tesseract project. You may find configuration settings there that help achieve better results.
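
As one example of such a setting, the page segmentation mode can be changed from Tesseract.js. The sketch below selects the "sparse text" mode through the `tessedit_pageseg_mode` parameter and the `PSM` constants that Tesseract.js exposes. Whether this actually improves results on these schematic images is an assumption and has not been verified here. The function name `recognizeSparseText` and its injectable `tesseract` argument are illustrative only.

```javascript
// Sketch: run Tesseract.js with the "sparse text" page segmentation mode,
// which targets text scattered across the image rather than laid out in columns.
// ASSUMPTION: whether this helps on these particular images is untested.
// The `tesseract` parameter defaults to the real package; it is injectable
// only so the sketch can be exercised with a stub.
async function recognizeSparseText(image, tesseract = require('tesseract.js')) {
  const { createWorker, PSM } = tesseract;
  const worker = await createWorker('eng');
  try {
    // Applies to subsequent recognize() calls on this worker.
    await worker.setParameters({ tessedit_pageseg_mode: PSM.SPARSE_TEXT });
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if recognition fails.
    await worker.terminate();
  }
}
```

A call might look like `recognizeSparseText('schematic.png').then(console.log)`, where the file name is a placeholder. Changing the segmentation mode does not address mixed text orientations within one image, which is the first limitation listed above.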

If you do not find settings that improve recognition, and you believe this constitutes a (previously unreported) bug, you should reproduce the issue using the main Tesseract command-line program and raise it with that project.

@EasyEDA2021
Author

Hi Balearica,
Thank you for your reply. Understood. We will check whether the main Tesseract project has the same issue.
Thanks
