Howto Make scanned PDFs searchable (OCR) using pdfocr

April 18, 2010 · General

What pdfocr is for

Suppose you have a PDF document that was produced by a scanner, or that otherwise consists only of image data with no text data. PDF readers and desktop search applications cannot search such a file. pdfocr is a simple utility I made that takes a PDF file and generates a new one with a text layer added. The new file can be searched in your PDF reader and indexed by your desktop search application, yet it still looks identical when printed.

What pdfocr is not for

pdfocr is only useful if your PDF was made from a scanned source. A PDF exported from OpenOffice or a similar application already has a text layer, so running pdfocr on it is unnecessary.
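If you are not sure whether a PDF already has a text layer, one way to check is to try extracting text from it. The sketch below is not part of pdfocr; it assumes the pdftotext tool from the poppler-utils package is installed, and the function name has_text_layer is just an illustrative choice.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the PDF contains any extractable letters or digits.
# Assumption: pdftotext (from poppler-utils) is installed; it is not part of pdfocr.
has_text_layer() {
    # "-" tells pdftotext to write the extracted text to standard output.
    # grep -q succeeds as soon as it sees one alphanumeric character.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example use:
#   if has_text_layer input.pdf; then
#       echo "input.pdf already has text; pdfocr is probably not needed"
#   else
#       echo "input.pdf has no extractable text; it is a candidate for pdfocr"
#   fi
```

A scanned PDF typically yields nothing but page-break characters here, so the check fails and the file is worth running through pdfocr. Note that the check also fails if pdftotext itself is missing.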

Compatibility

This guide works on Ubuntu Karmic (9.10) and Lucid (10.04).

Install pdfocr in Ubuntu Karmic (9.10) and Lucid (10.04)

Open a terminal and run the following commands:

sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr
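To confirm that the install worked, you can check that the pdfocr command is now on your PATH. This is a small sketch using only the shell's built-in command -v; the helper name check_installed is made up for this example.

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# check_installed is a hypothetical helper name, not something pdfocr provides.
check_installed() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found; check the install steps above" >&2
        return 1
    fi
}
```

After the three commands above finish, running check_installed pdfocr should print the location of the program. If it reports that pdfocr was not found, the package was probably not installed.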

Using pdfocr to add a text layer to your scanned PDF file

Open a terminal, go to the directory that contains the PDF file you want to convert, and enter the following command, replacing input.pdf with the name of your input PDF file and output.pdf with the name of the output PDF file:

pdfocr -i input.pdf -o output.pdf

Now wait while OCR is performed on the PDF file page by page and the output file is generated. This should take a few seconds per page, depending on the resolution of your PDF file (high-resolution files give better accuracy but take longer). Once it finishes, you should have a searchable PDF at output.pdf.
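If you have a whole directory of scans, the same command can be wrapped in a loop. The following is a minimal sketch that uses only the -i and -o options shown above; the function name ocr_all and the -ocr.pdf output suffix are arbitrary choices for this example, not pdfocr conventions.

```shell
#!/bin/sh
# Run pdfocr on every PDF in the current directory,
# writing NAME-ocr.pdf next to each NAME.pdf.
# Usage: cd to the directory that holds your scans, then run: ocr_all
ocr_all() {
    for input in *.pdf; do
        [ -e "$input" ] || continue      # the glob matched nothing
        case $input in
            *-ocr.pdf) continue ;;       # skip output left by an earlier run
        esac
        output="${input%.pdf}-ocr.pdf"
        echo "Processing $input -> $output"
        pdfocr -i "$input" -o "$output" || echo "pdfocr failed on $input" >&2
    done
}
```

Writing to a separate output name keeps the original scans untouched, so you can compare the results or rerun a file if the OCR goes wrong.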

Credit goes here

6 Comments to “Howto Make scanned PDFs searchable (OCR) using pdfocr”

Tal says:

April 19, 2010 at 5:14 am

Do you happen to have any information about where this software is from, specifically the source code?

joeYao says:

April 19, 2010 at 5:18 pm

@Tal
http://ubuntuforums.org/showthread.php?p=9136558

jkl says:

April 20, 2010 at 4:44 pm

Software from PPAs is always free, as stated here: https://help.launchpad.net/PPATermsofUse

darrask says:

May 6, 2012 at 5:37 pm

This OCR Software is impressive!
Free, and efficient, as compared to commercial software.
It lacks a GUI and easier customization options, but really, it’s amazing!

Jan Greeff says:

July 4, 2012 at 6:58 pm

This guide has lost me: how do you open a terminal, go to the directory that has the PDF file you want to convert, and enter (substituting input.pdf with the input PDF file, and output.pdf with the output PDF file) – after all, a directory cannot be opened in a terminal? Or am I missing something here?

Bob Cosack says:

March 8, 2014 at 7:58 am

I added the repository and everything is installed. However, when I use this on my scanned notes (the words are in capital letters), the result has no searchable text. Any ideas?

Ubuntu Linux Tutorials,Howtos,Tips & News | Oracular Oriole , Plucky Puffin
