PdfMasher – Tool to convert PDF files containing text into ready-for-ebook HTML files

PdfMasher is a tool that converts PDF files containing text into ready-for-ebook HTML files. Most ebook readers support PDF files natively, but reading those documents is often painful because we have no control over font size, as we do with native ebooks. In many cases we have to fall back on the zoom feature, which is tedious. Another drawback of PDFs on ebook readers is that annotations are not supported.

Enter PdfMasher. PdfMasher asks the user about the role of each piece of text, and does it in an efficient manner. Does your PDF have a header on each page that you don't want littering your text? Sort text elements by Y-position (thus grouping them all together), shift-select the elements, and flag them as ignored. They will not appear in your final HTML. Does your PDF have footnotes on many pages? Sort your elements by text content (thus grouping together all elements whose text starts with a number) and flag them as footnotes. They will be moved to the end of the document, and PdfMasher will try to create hyperlinks to footnote references.
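
To make the idea concrete, here is a minimal Python sketch of the flagging workflow described above. It is not PdfMasher's actual code: the TextElement class, its fields, and the three helper functions are hypothetical names invented for illustration, and the footnote test (text starting with digits followed by whitespace) is only an assumed approximation of what a user would pick out by eye after sorting.

```python
import re
from dataclasses import dataclass


@dataclass
class TextElement:
    """Hypothetical stand-in for one piece of text extracted from a PDF page."""
    page: int
    y: float               # vertical position on the page (units are an assumption)
    text: str
    state: str = "normal"  # "normal", "ignored", or "footnote"


def flag_ignored_at_y(elements, y, tolerance=1.0):
    """Flag every element whose Y-position is within `tolerance` of `y`.

    This mimics sorting by Y-position and shift-selecting the page headers,
    which all sit at the same height on their respective pages.
    """
    for el in elements:
        if abs(el.y - y) <= tolerance:
            el.state = "ignored"


def flag_footnotes(elements):
    """Flag still-normal elements whose text starts with a number.

    This mimics sorting by text content, which groups numbered lines together.
    """
    for el in elements:
        if el.state == "normal" and re.match(r"\d+\s", el.text):
            el.state = "footnote"


def build_output(elements):
    """Return body text in order, with footnotes moved to the end.

    Ignored elements are dropped entirely.
    """
    body = [el.text for el in elements if el.state == "normal"]
    notes = [el.text for el in elements if el.state == "footnote"]
    return body + notes


elements = [
    TextElement(1, 780.0, "My Book Title"),
    TextElement(1, 700.0, "Chapter one begins here."),
    TextElement(1, 40.0, "1 A footnote on page one."),
    TextElement(2, 780.0, "My Book Title"),
    TextElement(2, 700.0, "The story continues."),
]
flag_ignored_at_y(elements, 780.0)
flag_footnotes(elements)
print(build_output(elements))
```

In the real application the user makes these choices interactively, so a body paragraph that happens to start with a number would not be flagged unless the user selected it; the automatic regular expression here is only a shortcut for the sketch.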

Install PdfMasher on Ubuntu 11.04 (Natty)

Download the .deb package from here. Once you have the .deb package, you can install it by double-clicking it.



2 Responses

  1. Dan Serban says:

    I have tried to use PDFmasher as it is delivered on the author’s website.
    I failed for one simple reason: version 0.1.1 of PDFmasher is broken!!! … and I believe the author knows it’s broken.
    I filed a bug report but have not received a response.
    In the meantime I have figured out what’s wrong and was able to make it work.
    PDFmasher consists of two components:
    1) the PDFminer library, which is the software component that does all the heavy lifting. PDFminer is listed on PyPI, but with an explicit mention that Python 3 is not supported!!!
    2) a layer of user-friendly Qt sugar-coating written in Python 3.
    There is a fork of PDFminer for Python 3, called pdfminer3k, but it is not listed on PyPI. It’s on BitBucket.
    The way PDFmasher has (inadvertently or not) been packaged is a toxic combination of PDFminer for Python 2 and a Python 3 based GUI layer.
    What you need to do (as root) is remove the pdfminer directory under /usr/local/share/pdfmasher, and replace it with PDFminer for Python 3 from BitBucket. Something like this:
    cd /usr/local/share/pdfmasher
    rm -r pdfminer
    hg clone https://bitbucket.org/hsoft/pdfminer3k
    mv pdfminer3k/pdfminer .
    rm -r pdfminer3k

    Hopefully this fix will be incorporated into the next release of PDFmasher.

  2. pdfminer3k has been on PyPI since the first release of pdfmasher.

    http://pypi.python.org/pypi/pdfminer3k/

    Also, pdfmasher is a bit more than sugar-coating of pdfminer. Yes, pdfminer does a big part of the work, but no, pdfmasher is not a simple gui wrapper.
