bt5/erp5_dms/TestTemplateItem/portal_components/test.erp5.testDms.py · 9e375b8e9f14c3eb0b8fe6b3ac5a17a7997b7935 · Lu Xu / erp5

Lighter processing for OCR activities · 9e375b8e

Jérome Perrin authored Jun 04, 2021

When running OCR, we sometimes have issues because processing is "too heavy":
 - [x] it uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use it if it's available.
 - [x] it uses 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR on images; OCR embedded in Ghostscript does not seem to need this.
 - [x] ... and it often crashes, so it gets restarted. This is fixed by updating tesseract.
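
The first two points above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names `tesseract_env`, `ghostscript_lists_ocr_device` and `run_tesseract` are hypothetical, the default thread limit of 1 is an assumption, and the device check assumes that the output of `gs -h` lists device names as whitespace-separated tokens. It also assumes Python 3.7+ for `capture_output`.

```python
import os
import subprocess


def tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    OMP_THREAD_LIMIT is the standard OpenMP variable capping the number
    of threads; the default of 1 here is an assumption for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def ghostscript_lists_ocr_device(help_text):
    """Return True if the given `gs -h` output lists an `ocr` device.

    Assumes device names appear as whitespace-separated tokens.
    """
    return "ocr" in help_text.split()


def run_tesseract(image_path, thread_limit=1):
    """Run `tesseract <image> stdout` with a capped thread count.

    Hypothetical helper: returns the recognized text as a string.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        env=tesseract_env(thread_limit),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")
```

A caller could first pass the output of `gs -h` to `ghostscript_lists_ocr_device` and only fall back to `run_tesseract` on a rendered image when no OCR device is available.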

Updates of ghostscript and tesseract are part of nexedi/slapos!985

See merge request nexedi/erp5!1420
