How to make scanned PDFs searchable (OCR) using pdfocr

What pdfocr is for

Suppose you have a PDF document that was produced by a scanner, or that otherwise consists of image data with no text data. PDF readers and desktop search applications cannot search such a file. pdfocr is a simple utility I made that takes a PDF file and generates a new one with a text layer added, so your PDF reader can search it and your desktop search application can index it, while the printed output stays identical.

What pdfocr is not for

pdfocr is only useful if your PDF was made from a scanned source. If you exported the PDF from OpenOffice or a similar application, it already has a text layer, so running pdfocr is unnecessary.

Compatibility

This guide applies to Ubuntu Karmic (9.10) and Lucid (10.04).

Install pdfocr in Ubuntu Karmic (9.10) and Lucid (10.04)

Open a terminal and run the following commands:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
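
To confirm that the installation worked before moving on, you can check that the pdfocr command is on your PATH. This is only a minimal sketch; have_cmd is a helper name made up for this example, not something pdfocr provides.

```shell
# Report whether a command is available on the PATH.
# have_cmd is an illustrative helper name, not part of pdfocr.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdfocr; then
    echo "pdfocr is installed"
else
    echo "pdfocr was not found; check the output of the apt-get commands above"
fi
```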

Using pdfocr to add a text layer to your scanned PDF file

Open a terminal, change to the directory that contains the PDF file you want to convert, and enter the following command, replacing input.pdf with the name of your input PDF file and output.pdf with the name of the output PDF file to create:

pdfocr -i input.pdf -o output.pdf

Now wait while OCR is performed on the PDF file page by page and the output file is generated. This should take a few seconds per page, depending on the resolution of your PDF file (high-resolution files give better accuracy but take longer). When it finishes, you should have a searchable PDF at output.pdf.
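
If you have a whole folder of scans, the same command can be wrapped in a loop. The following is a sketch, not a feature of pdfocr itself. The function name ocr_all, the two directory arguments, and the -ocr suffix on output names are all made up for illustration. It relies only on the -i and -o options used in this guide, and it assumes pdfocr accepts file paths that point outside the current directory.

```shell
# Run pdfocr on every PDF in one directory and write the results to another.
# ocr_all, its arguments, and the "-ocr" suffix are illustrative choices.
ocr_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for src in "$in_dir"/*.pdf; do
        # If no file matches, the pattern is left unexpanded; skip it.
        [ -e "$src" ] || continue
        name=$(basename "$src" .pdf)
        # Same invocation as the single-file command in this guide.
        pdfocr -i "$src" -o "$out_dir/$name-ocr.pdf" || echo "failed: $src" >&2
    done
}
```

For example, running ocr_all ~/scans ~/scans-ocr (both paths hypothetical) would leave the originals untouched and place a searchable copy of each file in ~/scans-ocr. Writing to a separate directory means a failed run never overwrites a source scan.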


6 Responses

  1. Tal says:

    Do you happen to have any information about where this software is from, specifically the source code?

  2. jkl says:

    Software from PPAs is always free, as stated here: https://help.launchpad.net/PPATermsofUse

  3. darrask says:

    This OCR Software is impressive!
    Free, and efficient, as compared to commercial software.
    It lacks a GUI and easier customization options, but really, it’s amazing!

  4. Jan Greeff says:

    This guide has lost me: how do you open a terminal, go to the directory that has the PDF file you want to convert, and enter (substituting input.pdf with the input PDF file, and output.pdf with the output PDF file) – after all, a directory cannot be opened in a terminal? Or am I missing something here?

  5. Bob Cosack says:

    I added the repository and everything is installed; however, when I use this on my scanned notes (the words are in capital letters), it results in no search terms being available. Any ideas?
