Confidential OCR

A client emailed me a screenshot of a table rather than pasting the table as text into an email.

I thought about using an LLM to convert it to text, but the table is confidential client information and so I shouldn’t upload it anywhere.

I searched for a command-line utility to do OCR and found tesseract. I installed it with

    sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

and ran it with the default settings

    tesseract screenshot.png textfile

It worked remarkably well. I had to change a C to a U and delete a few extraneous parentheses that the software generated, but otherwise I didn’t have to add or change any text.
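For a table, a non-default page segmentation mode may be worth a try. What follows is a minimal sketch of a variation on the command above, not the invocation used for this post. The function name `ocr_table` is made up for illustration, and whether `--psm 6` (treat the image as a single uniform block of text) actually reduces stray characters depends on the image. Passing `stdout` as the output name prints the text to the terminal instead of writing a file.

```shell
# Hypothetical helper: OCR an image locally and print the text to stdout.
# -l eng   selects the English language data installed above.
# --psm 6  asks tesseract to assume a single uniform block of text,
#          which is only a guess at what suits a screenshot of a table.
ocr_table() {
    tesseract "$1" stdout -l eng --psm 6
}
```

Running `ocr_table screenshot.png > textfile.txt` would then save the result locally, much as the default invocation did.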

I work locally in part out of habit; it was the only way to work when I started using a computer. It has numerous advantages, such as being able to keep working when a hurricane knocks out my internet connection, but above all it is private.

I pay more attention to privacy than is convenient because I work in data privacy. And aside from my privacy, I have to protect our clients’ privacy.

Update: According to the comments, ChatGPT uses tesseract. Assuming that’s true, using tesseract directly is better than going through ChatGPT because it does exactly what you want. There is no ambiguity about what to expect, and no potential for anything to tinker with your results before you see them.

4 thoughts on “Confidential OCR”

xpil

20 November 2024 at 15:40

Tesseract is what ChatGPT uses under the hood when presented with a large amount of text in a graphical attachment.

Duane Murphy

21 November 2024 at 08:11

Mac OS Preview will OCR most any image document that it opens.

Gregory Engel

21 November 2024 at 10:32

There is a Linux GUI to Tesseract: gImageReader. I’ve found it easier to use than the command line on occasion.

njo

23 November 2024 at 12:51

My favorite way of doing this is with the “OCR – Image Reader” browser extension. It is an open source extension developed by Brian Girko, and it does OCR in the local browser, without sending the data elsewhere. I believe it also uses Tesseract under the hood.
