Personally, for my English PDF files I run the command:

ocrmypdf --tesseract-timeout 600 --rotate-pages --deskew --pdf-renderer tesseract --output-type pdf -l eng --clean --skip-text input.pdf output.pdf

This avoids unnecessarily running OCR on pages that already contain text, while still OCR-ing the image-only pages, rotating and deskewing them, and cleaning scanning artifacts from each page before OCR.
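If you run this often, the command can be wrapped in a small shell function. The following is only a sketch: the function name `ocr_english_pdf` and the `DRY_RUN` variable are made up for illustration and are not part of ocrmypdf. Note that `--pdf-renderer tesseract` appears only in the first usage listing further down; the second listing offers auto, hocr and sandwich instead, so that value may need changing depending on your installed version.

```shell
# Hypothetical helper: OCR one English PDF with the same flags as the
# command above. $1 = input PDF, $2 = output PDF.
# Set DRY_RUN=1 to print the command instead of running it.
ocr_english_pdf() {
    # Rebuild the positional parameters as the full ocrmypdf command line.
    set -- ocrmypdf \
        --tesseract-timeout 600 \
        --rotate-pages \
        --deskew \
        --pdf-renderer tesseract \
        --output-type pdf \
        -l eng \
        --clean \
        --skip-text \
        "$1" "$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling `ocr_english_pdf input.pdf output.pdf` then runs the same command as above; with `DRY_RUN=1` set it only prints the command, which is a cheap way to check the flags first.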

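If a page is left unrotated with the message "confidence too low to rotate", add the flag --rotate-pages-threshold with a lower CONFIDENCE value. Pages are rotated only when Tesseract's reported confidence is above that threshold.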
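As a sketch of how that flag might be passed, the helper below combines --rotate-pages with an explicit threshold. The function name `ocr_with_rotate_threshold` and the `DRY_RUN` variable are assumptions for illustration, and the threshold is left as a parameter because a suitable value depends on your documents (the help text describes it as arbitrary units reported by Tesseract).

```shell
# Hypothetical helper: rotate pages using an explicit confidence threshold.
# $1 = threshold, $2 = input PDF, $3 = output PDF.
# Set DRY_RUN=1 to print the command instead of running it.
ocr_with_rotate_threshold() {
    set -- ocrmypdf --rotate-pages --rotate-pages-threshold "$1" "$2" "$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, `ocr_with_rotate_threshold 5 input.pdf output.pdf` would try a threshold of 5; that number is an arbitrary starting point, not a recommendation.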

Here is the list of the most common options you can pass to ocrmypdf.

usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-n] [--flowchart FILE]
                [-l LANGUAGE] [-j N] [--image-dpi DPI]
                [--output-type {pdfa,pdf}] [--title TITLE] [--author AUTHOR]
                [--subject SUBJECT] [--keywords KEYWORDS] [-r]
                [--remove-background] [-d] [-c] [-i] [--oversample DPI] [-f]
                [-s] [--skip-big MPixels] [--tesseract-config CFG]
                [--tesseract-pagesegmode PSM]
                [--pdf-renderer {auto,tesseract,hocr}]
                [--tesseract-timeout SECONDS]
                [--rotate-pages-threshold CONFIDENCE] [-k] [-g]
                input_file output_file

The full option list is as follows:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
                [--output-type {pdfa,pdf,pdfa-1,pdfa-2,pdfa-3}]
                [--sidecar [FILE]] [--version] [-j N] [-q] [-v [VERBOSE]]
                [--title TITLE] [--author AUTHOR] [--subject SUBJECT]
                [--keywords KEYWORDS] [-r] [--remove-background] [-d] [-c]
                [-i] [--oversample DPI] [-f] [-s] [--skip-big MPixels]
                [-O {0,1,2,3}] [--jpeg-quality Q] [--png-quality Q]
                [--max-image-mpixels MPixels] [--tesseract-config CFG]
                [--tesseract-pagesegmode PSM] [--tesseract-oem MODE]
                [--pdf-renderer {auto,hocr,sandwich}]
                [--tesseract-timeout SECONDS]
                [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}]
                [--user-words FILE] [--user-patterns FILE] [-k]
                [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf

Generates a searchable PDF or PDF/A from a regular PDF.

OCRmyPDF rasterizes each page of the input PDF, optionally corrects page
rotation and performs image processing, runs the Tesseract OCR engine on the
image, and then creates a PDF from the OCR information.

positional arguments:
  input_pdf_or_image    PDF file containing the images to be OCRed (or '-' to
                        read from standard input)
  output_pdf            Output searchable PDF file (or '-' to write to
                        standard output). Existing files will be overwritten.
                        If same as input file, the input file will be updated
                        only if processing is successful.

optional arguments:
  -h, --help            show this help message and exit
  -l LANGUAGE, --language LANGUAGE
                        Language(s) of the file to be OCRed (see tesseract
                        --list-langs for all language packs installed in your
                        system). Use -l eng+deu for multiple languages.
  --image-dpi DPI       For input image instead of PDF, use this DPI instead
                        of file's.
  --output-type {pdfa,pdf,pdfa-1,pdfa-2,pdfa-3}
                        Choose output type. 'pdfa' creates a PDF/A-2b
                        compliant file for long term archiving (default,
                        recommended) but may not be suitable for users who
                        want their file altered as little as possible. 'pdfa'
                        also has problems with full Unicode text. 'pdf'
                        attempts to preserve file contents as much as
                        possible. 'pdfa-1' creates a PDF/A-1b file. 'pdfa-2'
                        is equivalent to 'pdfa'. 'pdfa-3' creates a PDF/A-3b
                        file.
  --sidecar [FILE]      Generate sidecar text files that contain the same text
                        recognized by Tesseract. This may be useful for
                        building an OCR text database. If FILE is omitted, the
                        sidecar file will be named {output_file}.txt. If FILE
                        is set to '-', the sidecar is written to stdout (a
                        convenient way to preview OCR quality). The output
                        file and sidecar may not both use stdout at the same
                        time.
  --version             Print program version and exit

Job control options:
  -j N, --jobs N        Use up to N CPU cores simultaneously (default: use
                        all).
  -q, --quiet           Suppress INFO messages
  -v [VERBOSE], --verbose [VERBOSE]
                        Print more verbose messages for each additional
                        verbose level

Metadata options:
  Set output PDF/A metadata (default: copy input document's metadata)

  --title TITLE         Set document title (place multiple words in quotes)
  --author AUTHOR       Set document author
  --subject SUBJECT     Set document subject description
  --keywords KEYWORDS   Set document keywords

Image preprocessing options:
  Options to improve the quality of the final PDF and OCR

  -r, --rotate-pages    Automatically rotate pages based on detected text
                        orientation
  --remove-background   Attempt to remove background from gray or color pages,
                        setting it to white
  -d, --deskew          Deskew each page before performing OCR
  -c, --clean           Clean pages from scanning artifacts before performing
                        OCR, and send the cleaned page to OCR, but do not
                        include the cleaned page in the output
  -i, --clean-final     Clean page as above, and incorporate the cleaned image
                        in the final PDF. Might remove desired content.
  --oversample DPI      Oversample images to at least the specified DPI, to
                        improve OCR results slightly

OCR options:
  Control how OCR is applied

  -f, --force-ocr       Rasterize any fonts or vector objects on each page,
                        apply OCR, and save the rastered output (this rewrites
                        the PDF)
  -s, --skip-text       Skip OCR on any pages that already contain text, but
                        include the page in final output; useful for PDFs that
                        contain a mix of images, text pages, and/or previously
                        OCRed pages
  --skip-big MPixels    Skip OCR on pages larger than the specified amount of
                        megapixels, but include skipped pages in final output

Optimization options:
  Control how the PDF is optimized after OCR

  -O {0,1,2,3}, --optimize {0,1,2,3}
                        Control how PDF is optimized after processing: 0 - do
                        not optimize; 1 - do safe, lossless optimizations
                        (default); 2 - do lossy optimizations; 3 - do
                        aggressive lossy optimizations
  --jpeg-quality Q      Adjust JPEG quality level for JPEG optimization. 100
                        is best quality and largest output size; 1 is lowest
                        quality and smallest output; 0 uses the default.
  --png-quality Q       Adjust PNG quality level to use when quantizing PNGs.
                        Values have same meaning as with --jpeg-quality

Advanced:
  Advanced options to control Tesseract's OCR behavior

  --max-image-mpixels MPixels
                        Set maximum number of pixels to unpack before treating
                        an image as a decompression bomb
  --tesseract-config CFG
                        Additional Tesseract configuration files -- see
                        documentation
  --tesseract-pagesegmode PSM
                        Set Tesseract page segmentation mode (see tesseract
                        --help)
  --tesseract-oem MODE  Set Tesseract 4.0 OCR engine mode: 0 - original
                        Tesseract only; 1 - neural nets LSTM only; 2 -
                        Tesseract + LSTM; 3 - default.
  --pdf-renderer {auto,hocr,sandwich}
                        Choose OCR PDF renderer - the default option is to let
                        OCRmyPDF choose. See documentation for discussion.
  --tesseract-timeout SECONDS
                        Give up on OCR after the timeout, but copy the
                        preprocessed page into the final output
  --rotate-pages-threshold CONFIDENCE
                        Only rotate pages when confidence is above this value
                        (arbitrary units reported by tesseract)
  --pdfa-image-compression {auto,jpeg,lossless}
                        Specify how to compress images in the output PDF/A.
                        'auto' lets OCRmyPDF decide. 'jpeg' changes all
                        grayscale and color images to JPEG compression.
                        'lossless' uses PNG-style lossless compression for all
                        images. Monochrome images are always compressed using
                        a lossless codec. Compression settings are applied to
                        all pages, including those for which OCR was skipped.
                        Not supported for --output-type=pdf ; that setting
                        preserves the original compression of all images.
  --user-words FILE     Specify the location of the Tesseract user words file.
                        This is a list of words Tesseract should consider
                        while performing OCR in addition to its standard
                        language dictionaries. This can improve OCR quality
                        especially for specialized and technical documents.
  --user-patterns FILE  Specify the location of the Tesseract user patterns
                        file.

Debugging:
  Arguments to help with troubleshooting and debugging

  -k, --keep-temporary-files
                        Keep temporary files (helpful for debugging)
  --flowchart FLOWCHART
                        Generate the pipeline execution flowchart

OCRmyPDF attempts to keep the output file at about the same size.  If a file
contains losslessly compressed images, the output file will be losslessly
compressed as well.

PDF is a page description file that attempts to preserve a layout exactly.
A PDF can contain vector objects (such as text or lines) and raster objects
(images).  A page might have multiple images.  OCRmyPDF is prepared to deal
with the wide variety of PDFs that exist in the wild.

When a PDF page contains text, OCRmyPDF assumes that the page has already
been OCRed or is a "born digital" page that should not be OCRed.  The default
behavior is to exit in this case without producing a file.  You can use the
option --skip-text to ignore pages with text, or --force-ocr to rasterize
all objects on the page and produce an image-only PDF as output.

    ocrmypdf --skip-text file_with_some_text_pages.pdf output.pdf

    ocrmypdf --force-ocr word_document.pdf output.pdf

If you are concerned about long-term archiving of PDFs, use the default option
--output-type pdfa which converts the PDF to a standardized PDF/A-2b.  This
converts images to sRGB colorspace and removes some features from the PDF such
as Javascript or forms. If you want to minimize the number of changes made to
your PDF, use --output-type pdf.

If OCRmyPDF is given an image file as input, it will attempt to convert the
image to a PDF before processing.  For more control over the conversion of
images to PDF, use the Python package img2pdf or other image to PDF software.

For example, this command uses img2pdf to convert all .png files beginning
with the 'page' prefix to a PDF, fitting each image on A4-sized paper, and
sending the result to OCRmyPDF through a pipe.  img2pdf is a dependency of
ocrmypdf so it is already installed.

    img2pdf --pagesize A4 page*.png | ocrmypdf - myfile.pdf

Online documentation is located at:
    https://ocrmypdf.readthedocs.io/en/latest/introduction.html
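
The --sidecar description in the help text above says that setting FILE to '-' writes the recognized text to stdout as a way to preview OCR quality. A minimal sketch of that usage follows; the function name `ocr_with_sidecar_preview` and the `DRY_RUN` variable are hypothetical, and this applies only to versions whose usage listing includes --sidecar.

```shell
# Hypothetical helper: OCR a PDF and write the recognized text to stdout.
# $1 = input PDF, $2 = output PDF (must be a real file, not '-', because
# the output file and the sidecar may not both use stdout).
# Set DRY_RUN=1 to print the command instead of running it.
ocr_with_sidecar_preview() {
    set -- ocrmypdf --sidecar - "$1" "$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Piping the result through a pager or `head`, for example `ocr_with_sidecar_preview input.pdf output.pdf | head -n 20`, shows the first lines of recognized text without opening the PDF.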

Here is an example of the verbose output from a simple ocrmypdf command.

  DEBUG - ocrmypdf 4.3.5
  DEBUG - os.symlink(644909.pdf, /tmp/com.github.ocrmypdf.k5jlmwt7/origin)

________________________________________
Tasks which will be run:


Task enters queue = 'ocrmypdf.triage'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/origin, /tmp/com.github.ocrmypdf.k5jlmwt7/origin.pdf)
Completed Task = 'ocrmypdf.triage'
Task enters queue = 'ocrmypdf.repair_pdf'
  DEBUG - [{'pageno': 0, 'images': [{'name': '/Im0', 'width': 2200, 'height': 1700, 'bpc': 1, 'type': 'image', 'enc': 'ccitt', 'color': 'gray', 'comp': 1, 'dpi_w': Decimal('200.000'), 'dpi_h': Decimal('200.000'), 'dpi': Decimal('200.000')}], 'has_text': False, 'width_inches': Decimal('11.0'), 'height_inches': Decimal('8.5'), 'rotate': 270, 'xres': Decimal('200.000'), 'yres': Decimal('200.000'), 'width_pixels': 2200, 'height_pixels': 1700}, {'pageno': 1, 'images': [{'name': '/Im0', 'width': 2200, 'height': 1700, 'bpc': 1, 'type': 'image', 'enc': 'ccitt', 'color': 'gray', 'comp': 1, 'dpi_w': Decimal('200.000'), 'dpi_h': Decimal('200.000'), 'dpi': Decimal('200.000')}], 'has_text': False, 'width_inches': Decimal('11.0'), 'height_inches': Decimal('8.5'), 'rotate': 270, 'xres': Decimal('200.000'), 'yres': Decimal('200.000'), 'width_pixels': 2200, 'height_pixels': 1700}, {'pageno': 2, 'images': [{'name': '/Im0', 'width': 2200, 'height': 1700, 'bpc': 1, 'type': 'image', 'enc': 'ccitt', 'color': 'gray', 'comp': 1, 'dpi_w': Decimal('200.000'), 'dpi_h': Decimal('200.000'), 'dpi': Decimal('200.000')}], 'has_text': False, 'width_inches': Decimal('11.0'), 'height_inches': Decimal('8.5'), 'rotate': 270, 'xres': Decimal('200.000'), 'yres': Decimal('200.000'), 'width_pixels': 2200, 'height_pixels': 1700}]
Completed Task = 'ocrmypdf.repair_pdf'
Task enters queue = 'ocrmypdf.split_pages'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.page.pdf, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.split_pages'
Task enters queue = 'ocrmypdf.orient_page'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf.orient_page'
Task enters queue = 'ocrmypdf.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.skip_page'
Uptodate Task = 'ocrmypdf.skip_page'


WARNING:
        In Task 'ocrmypdf.skip_page':
        No jobs were run because no file names matched.
        Please make sure that the regular expression is correctly specified.

  DEBUG - Rasterize 000001.ocr.oriented.pdf with pngmono
  DEBUG -
  DEBUG - Rasterize 000002.ocr.oriented.pdf with pngmono
  DEBUG - Rasterize 000003.ocr.oriented.pdf with pngmono
  DEBUG -
  DEBUG -
Completed Task = 'ocrmypdf.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.preprocess_remove_background'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.page.png, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.pp-background.png)
Completed Task = 'ocrmypdf.preprocess_remove_background'
Task enters queue = 'ocrmypdf.preprocess_deskew'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.pp-background.png, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.pp-deskew.png)
Task enters queue = 'ocrmypdf.preprocess_clean'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.pp-clean.png)
Completed Task = 'ocrmypdf.preprocess_clean'
Task enters queue = 'ocrmypdf.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.select_image_for_pdf'
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.page.png, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.image)
Completed Task = 'ocrmypdf.select_image_for_pdf'
Task enters queue = 'ocrmypdf.select_image_layer'
  DEBUG -    1: page eligible for lossless reconstruction
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.k5jlmwt7/000001.ocr.oriented.pdf, /tmp/com.github.ocrmypdf.k5jlmwt7/000001.image-layer.pdf)
Completed Task = 'ocrmypdf.select_image_layer'
Completed Task = 'ocrmypdf.ocr_tesseract_hocr'
Task enters queue = 'ocrmypdf.render_hocr_page'
Completed Task = 'ocrmypdf.render_hocr_page'
Task enters queue = 'ocrmypdf.add_text_layer'
   INFO -    1: rotating image layer 90 degrees
Completed Task = 'ocrmypdf.add_text_layer'
Task enters queue = 'ocrmypdf.merge_pages_qpdf'
  DEBUG - Final pages: /tmp/com.github.ocrmypdf.k5jlmwt7/000001.rendered.pdf
/tmp/com.github.ocrmypdf.k5jlmwt7/000002.rendered.pdf
/tmp/com.github.ocrmypdf.k5jlmwt7/000003.rendered.pdf
Completed Task = 'ocrmypdf.merge_pages_qpdf'
Task enters queue = 'ocrmypdf.copy_final'
Completed Task = 'ocrmypdf.copy_final'
  DEBUG - [{'has_text': False,
  'height_inches': Decimal('8.5'),
  'height_pixels': 1700,
  'images': [{'bpc': 1,
              'color': 'gray',
              'comp': 1,
              'dpi': Decimal('200.000'),
              'dpi_h': Decimal('200.000'),
              'dpi_w': Decimal('200.000'),
              'enc': 'ccitt',
              'height': 1700,
              'name': '/Im0',
              'type': 'image',
              'width': 2200}],
  'pageno': 0,
  'rotate': 270,
  'width_inches': Decimal('11.0'),
  'width_pixels': 2200,
  'xres': Decimal('200.000'),
  'yres': Decimal('200.000')}]
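
A log like this is long, so it can help to pull out just the pipeline stages that finished. The sketch below assumes the log has been saved to a file and that completed stages are reported on lines of the form `Completed Task = '...'`, as in the output above; other ocrmypdf versions may log differently. The function name `completed_tasks` is made up for illustration.

```shell
# Hypothetical helper: read a saved verbose log on stdin and print the
# name of each task that has a "Completed Task = '...'" line.
completed_tasks() {
    sed -n "s/^Completed Task = '\(.*\)'\$/\1/p"
}
```

With the log saved as, say, run.log (a hypothetical file name), `completed_tasks < run.log` prints one task name per line. Any task that shows up as "Task enters queue" but is missing from that list is a reasonable place to start looking when troubleshooting.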
