Commit d74981c3 authored by Jérome Perrin

PortalTransforms/tiff_to_text: run tesseract with OMP_THREAD_LIMIT=1

By default, tesseract uses up to 4 CPUs. Setting OMP_THREAD_LIMIT=1
restricts it to a single CPU (as documented on
https://tesseract-ocr.github.io/tessdoc/FAQ.html)

In ERP5, we tend to use one Zope node per CPU, so we don't want each
of these Zope nodes to spawn a process which will run on 4 CPUs.

In a quick benchmark, disabling threads is not slower. It is even a bit
faster in wall-clock time and uses less than half the user CPU time:

    ## a big image in French (a picture of an invoice)
    $ time ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   14.41 secs   fish           external
      usr time   27.88 secs  1002.00 micros   27.88 secs
      sys time    0.74 secs    0.00 micros    0.74 secs

    $ time OMP_THREAD_LIMIT=1 ./bin/tesseract /tmp/input.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1
    Error in pixClipBoxToForeground: box not within image
    Error in pixClipBoxToForeground: box not within image

    ________________________________________________________
    Executed in   12.58 secs   fish           external
      usr time   11.84 secs  955.00 micros   11.84 secs
      sys time    0.52 secs  503.00 micros    0.52 secs

    ## a small Japanese image

    $ time ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.16 secs   fish           external
      usr time    3.77 secs  590.00 micros    3.77 secs
      sys time    0.27 secs  209.00 micros    0.27 secs

    $ time OMP_THREAD_LIMIT=1 ./tesseract -l jpn+eng /tmp/inputjp.tiff /tmp/out.txt
    Tesseract Open Source OCR Engine v4.1.1 with Leptonica
    Page 1

    ________________________________________________________
    Executed in    2.02 secs   fish           external
      usr time  1766.07 millis  1437.00 micros  1764.63 millis
      sys time  214.06 millis  522.00 micros  213.54 millis
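
The comparison above could also be scripted. The following is a minimal sketch, not part of this commit: `time_command` is a hypothetical helper, and the child CPU time comes from `os.times()`, which only reports it on Unix-like systems. The command to time is taken from the command line, so nothing here assumes a particular tesseract path.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd and return (wall_seconds, child_user_seconds, returncode).

    Hypothetical helper for illustration only.
    """
    # Copy the current environment and overlay the extra variables.
    env = dict(os.environ, **(extra_env or {}))
    before = os.times()
    start = time.perf_counter()
    process = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user accumulates the user CPU time of waited-for children.
    child_user = after.children_user - before.children_user
    return wall, child_user, process.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    cmd = sys.argv[1:]
    for label, extra_env in (
            ("default", None),
            ("OMP_THREAD_LIMIT=1", {"OMP_THREAD_LIMIT": "1"})):
        wall, user, returncode = time_command(cmd, extra_env)
        print("%-20s wall=%.2fs user=%.2fs rc=%d" % (label, wall, user, returncode))
```

Assuming the script is saved as `bench.py` (a made-up name), it would be called with the command to time as its arguments, for example `python bench.py ./bin/tesseract /tmp/input.tiff /tmp/out.txt`.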
parent 2cbd5640
@@ -34,9 +34,11 @@ class tiff_to_text(commandtransform):
     try:
       output_file_path = os.path.join(tmp_dir, 'output')
       cmd = self.binary, input_file, output_file_path
-      process = subprocess.Popen(cmd,
-        stdout=subprocess.PIPE,
-        stderr=subprocess.STDOUT,)
+      process = subprocess.Popen(
+        cmd,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        env=dict(os.environ, OMP_THREAD_LIMIT='1'))
       stdout = process.communicate()[0]
       err = process.returncode
       if err:
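
The environment handling in this change can be tried in isolation. The sketch below is an illustration under stated assumptions rather than the PortalTransforms code: `run_single_threaded` is a hypothetical helper, and a child Python process stands in for tesseract so the example runs without it installed. It shows that `dict(os.environ, OMP_THREAD_LIMIT='1')` builds a copy, so the variable reaches only the spawned process and the parent's environment is unchanged.

```python
import os
import subprocess
import sys


def run_single_threaded(cmd):
    """Run cmd with OMP_THREAD_LIMIT=1 and return (returncode, output).

    Hypothetical helper, not part of PortalTransforms.
    """
    # dict(os.environ, KEY=value) returns a new dict, so os.environ
    # in the parent process is not modified.
    env = dict(os.environ, OMP_THREAD_LIMIT='1')
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        env=env)
    output = process.communicate()[0]
    return process.returncode, output


# Stand-in for tesseract: a child process that prints the value it sees.
parent_value_before = os.environ.get('OMP_THREAD_LIMIT')
child = [sys.executable, '-c',
         'import os; print(os.environ["OMP_THREAD_LIMIT"])']
returncode, output = run_single_threaded(child)
parent_value_after = os.environ.get('OMP_THREAD_LIMIT')
```

Because the override is passed per call through `env=`, other subprocesses started by the same node keep whatever threading settings they already had.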