bt5/erp5_dms/TestTemplateItem/portal_components/test.erp5.testDms.py · 9e375b8e9f14c3eb0b8fe6b3ac5a17a7997b7935 · Eteri / erp5

Lighter processing for OCR activities · 9e375b8e

Jérome Perrin authored Jun 04, 2021

When running OCR, we sometimes have issues because processing is "too heavy":
 - [x] it uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert PDF -> PNG -> TIFF before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use that if it's available.
 - [x] it uses 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR on images; OCR embedded in Ghostscript does not seem to need it.
 - [x] ... and it often crashes, so it is restarted. This is fixed by updating tesseract.
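
As a rough illustration of the `OMP_THREAD_LIMIT` point above, here is a minimal sketch of capping tesseract's OpenMP threads from Python. The helper names `tesseract_env` and `ocr_image` and the default limit of 1 are assumptions made for this example; they are not taken from the actual change.

```python
import os
import shutil
import subprocess


def tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    thread_limit=1 is an assumed default for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def ocr_image(image_path, lang="eng", thread_limit=1):
    """Run tesseract on an image file and return the recognised text.

    Hypothetical helper: it only shows where the environment is passed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found in PATH")
    result = subprocess.run(
        # "stdout" as the output base makes tesseract print the text.
        ["tesseract", image_path, "stdout", "-l", lang],
        env=tesseract_env(thread_limit),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

The variable is set only in the child process environment, so the calling process and other subprocesses are not affected.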

Updates of ghostscript and tesseract are part of slapos!985

See merge request !1420


test.erp5.testDms.py 134 KB
