interpret currency symbols #838

Closed
guntercn opened this issue Oct 5, 2023 · 1 comment

guntercn commented Oct 5, 2023

With the Spanish ('spa') language data, Tesseract interprets the currency symbol ₡ as the number 2, that is, it reads a character as a digit.

With the English language data, Tesseract recognizes the symbol correctly as a character, but I need this to work with 'spa'.

Is there any way to improve or correct this "error"?

Additional context
Number with a currency symbol
Cajas2

thanks

@Balearica
Collaborator

The words and symbols that are seen as valid are determined by the language data loaded (spa.traineddata in this instance). By default, Tesseract.js loads .traineddata files provided by the main Tesseract project; we use integerized versions of the tessdata_best data. If you find or create a .traineddata file that does not have this issue, you can use it by setting the langPath argument.
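As a rough illustration of the langPath suggestion, here is a minimal sketch. It assumes the Tesseract.js v5-style `createWorker(langs, oem, options)` signature (older versions load and initialize the language in separate calls), and the URL, image name, and helper names are hypothetical placeholders. `createWorker` is passed in as a parameter so the snippet does not depend on how the library is imported.

```javascript
// Hypothetical folder that holds your own spa.traineddata file.
const CUSTOM_LANG_PATH = 'https://example.com/tessdata';

// Build worker options that point Tesseract.js at custom language data.
function buildWorkerOptions(langPath) {
  return {
    langPath,            // where <lang>.traineddata is fetched from
    gzip: false,         // assumption: the custom file is not gzipped
    cacheMethod: 'none', // assumption: skip any previously cached default spa data
  };
}

// Recognize an image with the Spanish data found at langPath.
// `createWorker` is the function exported by tesseract.js.
async function recognizeWithCustomData(createWorker, image, langPath) {
  const worker = await createWorker('spa', 1, buildWorkerOptions(langPath));
  try {
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if recognition fails.
    await worker.terminate();
  }
}
```

In a Node project this might be called as `recognizeWithCustomData(require('tesseract.js').createWorker, 'cajas.png', CUSTOM_LANG_PATH)`. Whether ₡ is then recognized depends entirely on the .traineddata file supplied; the code only changes where the data is loaded from.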

Issues related to language data are outside the scope of this repo. The goal of Tesseract.js is to bring the Tesseract OCR engine to the browser; we do not make any edits to the recognition engine or the .traineddata files provided by Tesseract. If you are interested in learning more about how language data works, and what tools exist to modify it, you should look at the documentation provided by the main Tesseract project. Their website is here and repo is here.
