Pipeline · nexedi / erp5

Lighter processing for OCR activities

When running OCR, we sometimes have issues because processing is "too heavy":
 - [x] uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use that when it is available.
 - [x] uses 300% of CPU. Fixed by setting `OMP_THREAD_LIMIT` when running tesseract. This only applies when running OCR on images; the OCR embedded in Ghostscript does not seem to need it.
 - [x] ... and often crashes, so it is restarted. This is fixed by updating tesseract.
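
As an illustration of the `OMP_THREAD_LIMIT` point, here is a minimal sketch of capping tesseract's OpenMP threads from Python. The helper names `build_tesseract_env` and `ocr_image`, and the default limit of 1, are assumptions made for this example; they are not taken from the ERP5 code.

```python
import os
import subprocess


def build_tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    Hypothetical helper: the default limit of 1 is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    # OpenMP reads this variable to cap the number of threads it may use.
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def ocr_image(image_path, output_base, thread_limit=1):
    """Run tesseract on an image; text is written to output_base + '.txt'."""
    return subprocess.run(
        ["tesseract", image_path, output_base],
        env=build_tesseract_env(thread_limit),
        check=True,  # raise CalledProcessError if tesseract fails
    )
```

Passing a copied environment rather than modifying `os.environ` keeps the limit scoped to the tesseract subprocess.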

Updates of ghostscript and tesseract are part of slapos!985

See merge request !1420

5 jobs for master in 0 seconds (queued for 5 seconds)

9e375b8e

Status	Job ID	Name	Duration, date
External
passed	#233265 external	ERP5.CodingStyleTest-Master	00:55:50 Jun 04, 2021
failed	#233272 external	ERP5.UnitTest-Master	02:22:07 Jun 04, 2021
failed	#233270 external	ERP5.UnitTest-Master.Medusa	02:22:46 Jun 04, 2021
passed	#233259 external	SlapOS.Eggs.UnitTest-Master.Python2	00:08:41 Jun 04, 2021
passed	#233261 external	SlapOS.Eggs.UnitTest-Master.Python3	00:33:20 Jun 04, 2021