Extract text from a PDF

Arshad Khan left a comment on my post on the less and more utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”

Apparently this is an undocumented feature of GNU less. It works, but I don’t see anything about it in the man page documentation [1].

Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.

A more portable way to extract text from a PDF would be to use something like the pypdf Python module:

    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())

The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?
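As a rough illustration of one of those options, the sketch below switches between pypdf’s default extraction and its layout-preserving mode. It assumes a recent pypdf release in which `extract_text` accepts the `extraction_mode` keyword. The helper names `extract_text_by_page` and `read_pdf` are made up for this example and are not part of pypdf.

```python
def extract_text_by_page(pages, layout=False):
    """Return one string per page.

    `pages` is any iterable of objects with an extract_text method,
    such as PdfReader(...).pages. With layout=True, pypdf is asked to
    approximate the visual layout of the page (assumes a pypdf version
    that supports the extraction_mode keyword).
    """
    mode = "layout" if layout else "plain"
    # extract_text may return an empty result for image-only pages
    return [page.extract_text(extraction_mode=mode) or "" for page in pages]


def read_pdf(path, layout=False):
    """Hypothetical convenience wrapper: open a PDF and extract its text."""
    from pypdf import PdfReader  # imported here so the helper above has no hard dependency

    reader = PdfReader(path)
    return extract_text_by_page(reader.pages, layout=layout)
```

Calling `read_pdf("myfile.pdf", layout=True)` would return a list of page strings. Whether the plain or layout output is more useful depends on the document, which is the ambiguity the pypdf documentation discusses.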

PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.

Related post: Your PDF may reveal more than you intend

[1] Update: Several people have responded saying that less isn’t extracting the text from a PDF, but lesspipe is. That would explain why it’s not a documented feature of less. But it’s not clear how lesspipe is implicitly inserting itself.

Further update: Thanks to Konrad Hinsen for pointing me to this explanation. less reads an environment variable LESSOPEN for a preprocessor to run on its arguments, and that variable is, on some systems, set to lesspipe.
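To make the mechanism concrete, here is a simplified sketch of how the value of LESSOPEN turns into a command. It is not how less is implemented. It only mimics the documented convention that `%s` in the variable is replaced by the file name and that a leading `|` means the preprocessor writes the converted text to standard output. The function name `lessopen_command` and the sample value of LESSOPEN are assumptions for illustration; check `echo $LESSOPEN` to see what your own system sets.

```python
import os
import shlex


def lessopen_command(filename, env=None):
    """Return (is_pipe, command) for the preprocessor less would run, or None.

    Simplified: real less has more rules (for example the LESSCLOSE
    postprocessor), which this sketch ignores.
    """
    env = os.environ if env is None else env
    template = env.get("LESSOPEN", "")
    if not template:
        return None  # no preprocessor configured; less reads the file directly

    # A leading "|" marks an input pipe: the command's stdout is what gets paged.
    is_pipe = template.startswith("|")
    command = template.lstrip("|")
    # A "-" right after the bars asks less to preprocess standard input too.
    if command.startswith("-"):
        command = command[1:]
    # %s stands for the name of the file being opened.
    command = command.strip().replace("%s", shlex.quote(filename))
    return is_pipe, command


# Example value, similar to what some Linux distributions set (an assumption).
print(lessopen_command("myfile.pdf", {"LESSOPEN": "| /usr/bin/lesspipe %s"}))
```

On a system where LESSOPEN is unset, as appears to be the case on the Mac described above, the function returns None, which corresponds to less showing the raw bytes of the PDF.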

5 thoughts on “Extract text from a PDF”

Garrett Mills

20 April 2024 at 08:59

The poppler-utils package is also great for this. It provides a CLI `pdftotext` command on Linux.

One of my favorite shell aliases is for grepping through all the PDFs in a directory:

find . -name '*.pdf' -exec sh -c "pdftotext \"{}\" - | grep -i -A 2 --with-filename --label=\"{}\" --color \"$@\"" \;

Stefan Schmiedl

20 April 2024 at 16:04

man less, then search for LESSOPEN.
less supports input preprocessors to extract the text from “non-text” files, which is then fed into the “normal” pager.

Dave Pawson

21 April 2024 at 01:50

https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/1/introduction Uses Python packages to extract text from PDF etc.

Andrea Lazzarotto

21 April 2024 at 04:10

Regarding Garrett’s comment on grepping PDF contents, I strongly recommend the “pdfgrep” tool for that.

I used it all the time when I was studying at university, and I am still using it now for digital forensics.

Hanson Char

22 April 2024 at 10:52

On OSX, there is also pdfgrep, which one can install via homebrew. To grep all text, use a command like:

pdfgrep . some.pdf | less