Arshad Khan left a comment on my post on the `less` and `more` utilities saying “on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf.”
Apparently this is an undocumented feature of GNU `less`. It works, but I don’t see anything about it in the man page documentation [1].
Not all versions of `less` do this. On my Mac, `less` applied to a PDF gives a warning saying “… may be a binary file. See it anyway?” If you insist, it will dump gibberish to the command line.
A more portable way to extract text from a PDF would be to use something like the `pypdf` Python module:

    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())
The `pypdf` documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it’s not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?
PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you’re likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.
Related post: Your PDF may reveal more than you intend
[1] Update: Several people have responded saying that `less` isn’t extracting the text from a PDF, but `lesspipe` is. That would explain why it’s not a documented feature of `less`. But it’s not clear how `lesspipe` is implicitly inserting itself.
Further update: Thanks to Konrad Hinsen for pointing me to this explanation. `less` reads an environment variable `LESSOPEN` for a preprocessor to run on its arguments, and that variable is, on some systems, set to `lesspipe`.
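To check whether a given machine is set up this way, it is enough to look at `LESSOPEN`. Here is a minimal sketch in Python. The function name `describe_lessopen` is made up for illustration. It relies on the convention in the `less` man page that a value starting with `|` is a pipe preprocessor, and that `%s` stands for the file name.

```python
import os


def describe_lessopen(env=None):
    """Describe how less is configured to preprocess its input files.

    `env` defaults to the real environment; passing a dict makes the
    function easy to try out with hypothetical values.
    """
    env = os.environ if env is None else env
    value = env.get("LESSOPEN")
    if not value:
        return "LESSOPEN is not set; less reads files as-is."
    # A leading "|" tells less to read the preprocessor's standard output
    # through a pipe. Otherwise the preprocessor is expected to print the
    # name of a replacement file for less to open.
    kind = "pipe" if value.startswith("|") else "replacement file"
    return f"LESSOPEN={value!r} ({kind} preprocessor)"


if __name__ == "__main__":
    print(describe_lessopen())
```

On a system where the variable points at `lesspipe`, the output should name that script. On a system where it is unset, such as my Mac apparently, the first message is printed.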
The poppler-utils library is also great for this. It provides a CLI `pdftotext` command on Linux.
One of my favorite shell aliases is for grepping through all the PDFs in a directory:
find . -name '*.pdf' -exec sh -c "pdftotext \"{}\" - | grep -i -A 2 --with-filename --label=\"{}\" --color \"$@\"" \;
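For anyone who would rather not fight with shell quoting, here is a rough Python sketch of the same idea: walk a directory, run `pdftotext FILE -` on each PDF, and report case-insensitive matches with two lines of trailing context. The names `grep_pdfs` and `pdftotext` (the Python function) are made up for illustration, and the sketch assumes the `pdftotext` command is on the PATH. It is not a drop-in replacement for the alias: there is no color, and nothing marks the boundary between separate matches.

```python
import re
import subprocess
from pathlib import Path


def pdftotext(path):
    """Extract text by running `pdftotext FILE -`, which writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def grep_pdfs(root, pattern, extract=pdftotext, after=2):
    """Yield (path, line_number, line) for matches and trailing context."""
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).rglob("*.pdf")):
        remaining = 0  # context lines still owed after the latest match
        for number, line in enumerate(extract(path).splitlines(), start=1):
            if regex.search(line):
                remaining = after
                yield path, number, line
            elif remaining > 0:
                remaining -= 1
                yield path, number, line
```

Something like `for path, n, line in grep_pdfs(".", "eigenvalue"): print(f"{path}:{n}:{line}")` would then print the hits. The `extract` parameter is there so a different extractor, `pypdf` for instance, could be swapped in.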
man less, then search for LESSOPEN.
less supports input preprocessors to extract the text from “non-text” files, which is then fed into the “normal” pager.
https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/1/introduction Uses Python packages to extract text from PDF etc.
Regarding Garrett’s comment on grepping PDF contents, I strongly recommend the “pdfgrep” tool for that.
I used it all the time when I was studying at university, and I am still using it now for digital forensics.
On OSX, there is also pdfgrep, which one can install via homebrew. To grep all text, use a command like:
pdfgrep . some.pdf | less