GetPDF() with Scheduler returns the same PDF file #488

saxelsen · 2020-10-07T12:50:45Z

First of all, great work with this project! It's an impressive feat so far, despite the performance difference between this and native Tesseract.

Describe the bug
When using a Scheduler with multiple Workers, creating a GetPDF() job returns the same PDF document repeatedly, even though the Scheduler processes multiple different files.

To Reproduce
Steps to reproduce the behavior:

const {createWorker, createScheduler} = require('tesseract.js');
const fs = require('fs');

const scheduler = createScheduler();
const nWorkers = 4;

async function initialize() {
    for (let i = 0; i < nWorkers; i++) {
        let worker = createWorker({
            cachePath: 'langs',
            langPath: 'langs',
            logger: m => console.log(m)
        });
        await worker.load();
        await worker.loadLanguage('dan-fast');
        await worker.initialize('dan-fast');
        scheduler.addWorker(worker);
    }
    console.log('OCR initialized');
}

async function recognize(imagePath) {
    const filePath = imagePath.split('/').slice(-1).pop();
    const result = await scheduler.addJob('recognize', imagePath);
    const {data} = await scheduler.addJob('getPDF', filePath);
    fs.writeFileSync(`images/ocr-${filePath.replace('.png', '.pdf')}`, Buffer.from(data));
    return result.data.text
}

initialize()
    .then(() => {
        const promises = [];
        for (let i = 0; i < 3; i++) {
            const promise = recognize(`images/page-${i+1}.png`)
            promises.push(promise);
        }

        return Promise.all(promises)
    })
    .then((results) => {
        results.forEach((res) => {
            console.log('-----------');
            console.log(res);
        })
    })
    .then(() => {
        return scheduler.terminate();
    });

Expected behavior
GetPDF() should be able to produce the PDF file (or its byte representation) associated with a given recognition job, or the PDF byte representation should be part of the result from the recognition job.

The current operating model disables OCR-PDF rendering when using a scheduler, making processing a document with several pages more time-consuming.
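Until something like that exists, one possible workaround is to keep the two calls for a page on the same worker instead of routing them through the scheduler. The sketch below is hypothetical: `recognizeAllToPdf` is not part of Tesseract.js, and it assumes only that each worker exposes `recognize(image)` and `getPDF(title)` as in the snippet above, and that `getPDF` returns the PDF for that worker's most recent `recognize` call.

```javascript
// Hypothetical helper, not a Tesseract.js API.
// Each worker repeatedly takes the next unprocessed image and runs
// recognize + getPDF back to back, so the PDF always belongs to the
// image that the same worker just recognized.
async function recognizeAllToPdf(workers, imagePaths) {
  const results = new Array(imagePaths.length);
  let next = 0; // index of the next image nobody has claimed yet

  async function drain(worker) {
    while (next < imagePaths.length) {
      const index = next++; // safe: no await between the check and the claim
      const imagePath = imagePaths[index];
      const { data: { text } } = await worker.recognize(imagePath);
      const { data: pdf } = await worker.getPDF(imagePath);
      results[index] = { imagePath, text, pdf };
    }
  }

  // Run all workers in parallel; each one drains the shared list.
  await Promise.all(workers.map(drain));
  return results;
}
```

This still uses all workers in parallel, but it gives up the scheduler's queueing, so it is a stopgap rather than a fix.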

Versions:

OS: Mac OS 10.13.6
Node: v12.18.1
Tesseract.js: 2.1.3

Balearica · 2022-09-17T23:18:39Z

Thanks for reporting. I think the issue is inherent to getPDF being its own function, and agree with your second proposed solution ("the PDF byte representation should be part of the result from the recognition job.").

In general, using a scheduler assumes that all workers are fungible, so any job can be sent to any worker. This makes sense for detect and recognize, but not other functions. E.g. if you used loadLanguage through a scheduler, it would just set the language of a random worker. I think this limitation is fine in the case of loadLanguage, but it clearly does not make sense for getPDF.
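To make the mismatch concrete, here is a toy model of the situation. None of this is Tesseract.js code: `createToyWorker` and `createToyScheduler` are made-up names, and round-robin dispatch is a simplification (the real scheduler hands jobs to whichever worker is free). The only point is that each worker remembers its own last image, while the scheduler does not track which worker handled which image.

```javascript
// Toy model only; it imitates the behaviour described above and is
// not the real Tesseract.js worker or scheduler.
function createToyWorker(id) {
  let lastImage = null; // a worker only remembers its most recent recognize call
  return {
    recognize(image) {
      lastImage = image;
      return { workerId: id, text: `text of ${image}` };
    },
    getPDF() {
      return { workerId: id, pdf: `PDF of ${lastImage}` };
    },
  };
}

function createToyScheduler(workers) {
  let next = 0;
  return {
    // Every job goes to the next worker in line, regardless of which
    // worker handled related jobs earlier.
    addJob(action, ...args) {
      const worker = workers[next];
      next = (next + 1) % workers.length;
      return worker[action](...args);
    },
  };
}

function runToyDemo() {
  const scheduler = createToyScheduler([createToyWorker(0), createToyWorker(1)]);
  const pages = ['page-1', 'page-2', 'page-3'];
  pages.forEach((page) => scheduler.addJob('recognize', page));
  // One getPDF job per page, as in the reproduction script.
  return pages.map(() => scheduler.addJob('getPDF').pdf);
}

console.log(runToyDemo());
// → [ 'PDF of page-2', 'PDF of page-3', 'PDF of page-2' ]
```

In this model page-1 never gets a PDF and page-2 gets two, simply because the getPDF jobs land on workers without regard to what those workers recognized last.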

As getPDF is currently the only output format that requires a separate function call (we don't have separate functions for getting the raw text, hocr, etc.), I think the most intuitive and straightforward solution is to deprecate that function and add an option for recognize to return a PDF.

Balearica · 2022-09-18T03:38:23Z

I added an option called pdf which, when set to true, includes the PDF in the recognition results. The change is on the dev/v4 branch and will be included with the next major release (v4). I also updated the PDF examples for both browser and Node in that branch.
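As a rough sketch of how that can look through a scheduler: the helper below is hypothetical (`recognizeWithPdf` is not a library function), and it assumes that output flags such as `pdf` are passed as a separate object after the recognition options and that the PDF comes back as byte data on `data.pdf`. Check the updated examples in the branch for the exact call.

```javascript
// Hypothetical helper, not a Tesseract.js API.
// `scheduler` is anything with an addJob(action, ...args) method, for
// example a v4 scheduler. Assumption: the output flags ({ pdf: true })
// go in the argument after the (here empty) recognition options.
async function recognizeWithPdf(scheduler, imagePath) {
  const { data } = await scheduler.addJob('recognize', imagePath, {}, { pdf: true });
  // Text and PDF come from the same job, so they always belong together.
  return { text: data.text, pdf: Buffer.from(data.pdf) };
}
```

Because the PDF is part of the recognize result, it no longer matters which worker picks up the job. The returned buffer can be written out with `fs.writeFileSync` as in the original script.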

To learn more about changes in v4 and/or try out the changes see Issue #662.

saxelsen · 2022-09-19T07:09:16Z

Great, thanks a lot for spending time on it! I'll mark this issue as closed and migrate my current solution over once v4 is released :)

See #662 for explanation of Tesseract.js Version 4 changes. List below is auto-generated from commits.

* Added image preprocessing functions (rotate + save images)
* Updated createWorker to be async
* Reworked createWorker to be async and throw errors per #654
* Reworked createWorker to be async and throw errors per #654
* Edited detect to return null when detection fails rather than throwing error per #526
* Updated types per #606 and #580 (#663) (#664)
* Removed unused files
* Added savePDF option to recognize per #488; cleaned up code for linter
* Updated download-pdf example for node to use new savePDF option
* Added OutputFormats option/interface for setting output
* Allowed for Tesseract parameters to be set through recognition options per #665
* Updated docs
* Edited loadLanguage to no longer overwrite cache with data from cache per #666
* Added interface for setting 'init only' options per #613
* Wrapped caching in try block per #609
* Fixed unit tests
* Updated setImage to resolve memory leak per #678
* Added debug output option per #681
* Fixed bug with saving images per #588
* Updated examples
* Updated readme and Tesseract.js-core version

Balearica mentioned this issue Sep 18, 2022

Allow for setting parameters for single recognize job when using scheduler #665

Closed

Balearica pushed a commit that referenced this issue Sep 18, 2022

Added savePDF option to recognize per #488; cleaned up code for linter

622c841

Balearica mentioned this issue Sep 18, 2022

Version 4 Development and Changes #662

Closed

saxelsen closed this as completed Sep 19, 2022

Balearica mentioned this issue May 29, 2023

Upgrading from v2 to v5 Guide #771

Open
