OCR for PDF documents

Note

If modifying the files (or a copy of them) is acceptable, using OCRmyPDF to add a text layer to the PDF itself is a better solution than the Recoll OCR feature. For example, it allows Recoll to position the PDF viewer on the search target when opening the document, and it permits secondary searches in the native tool.

Recoll OCR is enabled by the pdfocr configuration variable. It is only executed if the processed file has no text content.

Example configuration fragment in recoll.conf:

pdfocr = 1
ocrprogs = tesseract
tesseractlang = eng

The pdfocr variable can be set globally or for specific subtrees.
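
As an illustration of the subtree case, the following sketch leaves OCR off globally and enables it for a single directory. It assumes the usual recoll.conf convention of a bracketed directory path introducing the settings for that subtree; the path /home/me/scans is hypothetical and only stands in for a directory of scanned documents.

```
# Global section: no OCR by default
pdfocr = 0
ocrprogs = tesseract
tesseractlang = eng

# Hypothetical subtree holding scanned PDFs: OCR enabled here only
[/home/me/scans]
pdfocr = 1
```

Restricting OCR to the directories that actually hold scans can keep indexing time down, since OCR is much slower than plain text extraction.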

Under Windows, the recoll.conf configuration file is found by default in C:/Users/[you]/AppData/Local/Recoll/recoll.conf. You will probably need to indicate the actual path of the tesseract command by setting the tesseractcmd variable, for example:

tesseractcmd = C:/Program Files/Tesseract-OCR/tesseract.exe