Extract text from a PDF

Arshad Khan left a comment on my post on the less and more utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”

Apparently this is an undocumented feature of GNU less. It works, but I don’t see anything about it in the man page documentation [1].

Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.

A more portable way to extract text from a PDF would be to use something like the pypdf Python module:

    from pypdf import PdfReader

    # Open the PDF and print the extracted text of each page in order
    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())

The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?
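
As a rough sketch of how those choices show up in code: pypdf's extract_text accepts an extraction_mode argument ("plain" is the default, "layout" tries to preserve the visual arrangement of the page). The drop_page_numbers helper below is hypothetical, not part of pypdf, and it assumes a page number appears as a line containing nothing but digits, which will not hold for every document.

```python
import re


def extract_pages(path, layout=False):
    """Return one string per page, using pypdf's plain or layout mode."""
    # Third-party import kept local so the helper below works without pypdf.
    from pypdf import PdfReader

    mode = "layout" if layout else "plain"
    reader = PdfReader(path)
    return [page.extract_text(extraction_mode=mode) for page in reader.pages]


def drop_page_numbers(text):
    """Hypothetical helper: remove lines that consist only of digits."""
    kept = [
        line for line in text.splitlines()
        if not re.fullmatch(r"\s*\d+\s*", line)
    ]
    return "\n".join(kept)
```

Calling extract_pages("myfile.pdf", layout=True) and passing each page through drop_page_numbers is one possible answer to the questions above, not the only reasonable one.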

PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.

Related post: Your PDF may reveal more than you intend

[1] Update: Several people have responded saying that less isn’t extracting the text from a PDF, but lesspipe is. That would explain why it’s not a documented feature of less. But it’s not clear how lesspipe is implicitly inserting itself.

Further update: Thanks to Konrad Hinsen for pointing me to this explanation. less reads an environment variable LESSOPEN for a preprocessor to run on its arguments, and that variable is, on some systems, set to lesspipe.
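
To make the mechanism concrete, here is a minimal Python sketch of how a LESSOPEN value of the input-pipe form gets turned into a command. It assumes a value such as "| /usr/bin/lesspipe %s", where the leading vertical bar marks an input pipe and %s stands for the file name. The function name preprocessor_command is made up for illustration, and the quoting here only approximates what less itself does.

```python
import os
import re
import shlex


def preprocessor_command(lessopen, filename):
    """Return the shell command an input-pipe LESSOPEN implies, or None."""
    # Accept the "|", "||", "|-" and "||-" prefixes; anything else is
    # treated as "no input pipe" in this sketch.
    match = re.match(r"\|\|?-?\s*(.+)", lessopen)
    if match is None:
        return None
    # Substitute the first %s with the (shell-quoted) file name.
    return match.group(1).replace("%s", shlex.quote(filename), 1)


# Show what the current environment would run for a given file, if anything.
print(preprocessor_command(os.environ.get("LESSOPEN", ""), "myfile.pdf"))
```

On a system where LESSOPEN is unset this prints None, which corresponds to less reading the PDF directly, as on the Mac described above.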

5 thoughts on “Extract text from a PDF”

  1. The poppler-utils library is also great for this. It provides a CLI `pdftotext` command on Linux.

    One of my favorite shell aliases is for grepping through all the PDFs in a directory:

    find . -name '*.pdf' -exec sh -c "pdftotext \"{}\" - | grep -i -A 2 --with-filename --label=\"{}\" --color \"$@\"" \;

  2. Stefan Schmiedl

    man less, then search for LESSOPEN.
    less supports input preprocessors to extract the text from “non-text” files, which is then fed into the “normal” pager.

  3. Regarding Garrett’s comment on grepping PDF contents, I strongly recommend the “pdfgrep” tool for that.

    I used it all the time when I was studying at university, and I am still using it now for digital forensics.

  4. On OSX, there is also pdfgrep, which one can install via homebrew. To grep all text, use a command like:

    pdfgrep . some.pdf | less
