This is a guest post by Ondřej Čertík. Ondřej formerly worked at Los Alamos National Labs and now works for GSI Technologies. He is known in the Python community for starting the SymPy project and in the Fortran community for starting LFortran. — John
I finally got to experiment a bit with archiving data on a regular A4 or US Letter page, using a regular printer to write it and a phone camera to read it. One question has been bothering me for about 10 years: what is the maximum amount of data that we can store on a sheet of paper and reliably retrieve?
My setup
It seems the capacity is limited by my camera, not my printer. This image stores 30KB of data. I printed it, took a photo with my iPhone, decoded the photo, and unpacked the result using tar and bzip2. It correctly unpacks into 360KB of C++ files, 6305 lines.
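The packing step can be sketched in Python with the standard `tarfile` module. This is only an illustration of the tar plus bzip2 round trip, not the exact commands I ran: the 30,000-byte page budget is an assumption taken from the 30KB figure above, and the function names are made up for this sketch.

```python
import io
import math
import tarfile

# Assumed per-page budget, based on the ~30KB figure from the experiment.
PAGE_BYTES = 30_000


def pack_tar_bz2(files):
    """Pack a {name: bytes} mapping into an in-memory .tar.bz2 archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_tar_bz2(blob):
    """Unpack an in-memory .tar.bz2 archive back into a {name: bytes} mapping."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


def pages_needed(blob, page_bytes=PAGE_BYTES):
    """Number of printed pages needed to hold the archive."""
    return math.ceil(len(blob) / page_bytes)
```

Calling `pages_needed(pack_tar_bz2(files))` before printing tells you whether a project fits on one page at this density.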
I am using the Twibright Optar software.
It uses Golay codes (also used by Voyager), which can correct up to 3 bad bits in each 24-bit codeword (12 payload bits plus 12 parity bits). If there are 4 bad bits, the error is detected but cannot be corrected. In my experiment, there were 6 codewords with 3 bad bits and no codewords with more, so there was no data loss. It seems most of the bad bits were in the area where my phone cast a shadow on the paper, so retaking the picture in broad daylight might help.
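To make the "corrects 3, detects 4" behavior concrete, here is a minimal sketch of an extended Golay (24,12) code in Python. It is not Optar's implementation: the generator polynomial is one of the two standard choices, the bit layout is my own, and the decoder is a brute-force nearest-codeword search chosen for clarity rather than speed.

```python
# One standard generator polynomial of the (23,12) Golay code:
# x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1
GEN = 0xC75


def golay23_parity(data12):
    """Remainder of data12 * x^11 divided by GEN over GF(2) (11 bits)."""
    reg = data12 << 11
    for shift in range(11, -1, -1):
        if reg & (1 << (shift + 11)):
            reg ^= GEN << shift
    return reg


def encode(data12):
    """Encode 12 payload bits into a 24-bit extended Golay codeword."""
    cw23 = (data12 << 11) | golay23_parity(data12)
    overall_parity = bin(cw23).count("1") & 1  # makes every weight even
    return (cw23 << 1) | overall_parity


# All 4096 codewords; small enough to search exhaustively.
CODEBOOK = [encode(d) for d in range(4096)]


def decode(word24):
    """Return (payload, bad_bits); payload is None if more than 3 bits are bad."""
    best = min(range(4096), key=lambda d: bin(CODEBOOK[d] ^ word24).count("1"))
    bad_bits = bin(CODEBOOK[best] ^ word24).count("1")
    if bad_bits <= 3:
        return best, bad_bits
    return None, bad_bits  # detected, but not correctable
```

Because any two codewords differ in at least 8 bits, a word with up to 3 flipped bits is still closest to the original codeword, while a word with exactly 4 flipped bits sits at distance 4 or more from every codeword and can only be flagged.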
Bad bits
Here are the stats from the optar code:
7305 bits bad from 483840, bit error rate 1.5098%. 49.1855% black dirt, 50.8145% white dirt and 0 (0%) irreparable.
Golay stats
===========
0 bad bits 13164
1 bad bit 6693
2 bad bits 297
3 bad bits 6
4 bad bits 0
total codewords 20160
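The totals in these stats follow directly from the per-codeword histogram, which is a handy sanity check. A small Python sketch, assuming 24-bit codewords with 12 payload bits each as described above (the function and variable names are made up for this example):

```python
CODEWORD_BITS = 24  # extended Golay codeword length
PAYLOAD_BITS = 12   # data bits carried by each codeword


def golay_summary(histogram):
    """Summarize a {bad_bits_per_codeword: codeword_count} histogram."""
    codewords = sum(histogram.values())
    total_bits = codewords * CODEWORD_BITS
    bad_bits = sum(k * n for k, n in histogram.items())
    return {
        "codewords": codewords,
        "total_bits": total_bits,
        "bad_bits": bad_bits,
        "ber_percent": 100.0 * bad_bits / total_bits,
        "payload_bytes": codewords * PAYLOAD_BITS // 8,
    }


# Histogram copied from the stats above (photo with the phone's shadow).
shadow_run = {0: 13164, 1: 6693, 2: 297, 3: 6, 4: 0}
summary = golay_summary(shadow_run)
```

The payload works out to 30,240 bytes, which is where the 30KB figure comes from.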
The original Optar setup can store 200KB of data per page. I tried it, and it seems to print fine, but the pixels are really tiny and my iPhone camera can’t read them well enough, so nothing is recovered. So I used larger pixels. 30KB is what I was able to store and retrieve, and from the stats above, it seems this is the limit for my setup.
Other possibilities
Competing products, such as this one, only store 3-4KB/page, so my experiment above is roughly 10x that.
One idea is to use a different error correction scheme. The Golay code above uses only 50% of the bits to store data. If we could use 75% of the bits for data, that would give us 50% more capacity.
Another idea is to use colors: 8 colors encode 3 bits per pixel instead of 1, for 3x the capacity, and 16 colors encode 4 bits, for 4x. Starting from 30KB, that would give us roughly 90 to 120KB/page. Can we do better?
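The arithmetic behind these two ideas can be written down as a quick estimate. This sketch assumes the 30KB black-and-white baseline at a code rate of 50%, and it assumes (untested) that the camera could distinguish colored pixels as reliably as black and white ones. The function name is made up for this example.

```python
import math

BASE_KB = 30      # measured black-and-white capacity per page
BASE_RATE = 0.5   # Golay (24,12): half of the bits carry data


def estimated_capacity_kb(colors=2, code_rate=BASE_RATE):
    """Scale the measured capacity by bits per pixel and by code rate."""
    bits_per_pixel = math.log2(colors)
    return BASE_KB * bits_per_pixel * (code_rate / BASE_RATE)


for colors in (2, 8, 16):
    print(colors, "colors:", estimated_capacity_kb(colors), "KB")
print("75% code rate:", estimated_capacity_kb(code_rate=0.75), "KB")
```

A higher code rate also means weaker error correction, so the 75% figure only holds if the bit error rate is low enough for the weaker code to cope.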
Comparison with other storage techniques
I think the closest comparison is floppy disks. The 5¼-inch floppy disk that I used as a kid could store 180 KB on a single side or 360 KB on both sides. The 3½-inch floppy disk could store 720 KB on a single side or 1.44 MB on both sides. We can also print on both sides of the paper, so let’s just compare a single side for both:
- Optar original (requires a good scanner): 200 KB
- My experiment above (iPhone): 30 KB
- Estimate with 8 colors: 90 KB (120 KB with 16 colors)
- 5¼-inch floppy disk: 180 KB
- 3½-inch floppy disk: 720 KB
It seems that the 5¼-inch floppy disk is a good target.
Applications
One application is to store this at the end of a book, so that you don’t need to distribute CDs or floppy disks (as used to be the habit). You just put 10-20 pages in the appendix, and this should be possible to decode quite reliably even 100 years from now.
Another application is just archiving any projects that I care about. I still have some floppy disks in my parents’ attic, and I am quite sure they are completely unreadable by now, while the paper printouts I have from the same era, 30 years ago, are perfectly readable.
I think the requirement to use a regular iPhone is good (mine is a few years old, so perhaps the newest one can do better). If we allow scanners, then of course we can do better, but not many people have a high-quality scanner at home, and there is no limit to how far that can go: GitHub’s Arctic Code Vault uses microfilm and a high-quality scanner. I want something that can be put into a book or a paper article, something that anyone can decode with any phone, and that doesn’t require any special treatment to store.
Fun! I worked for a company long ago that stored 35K — 50K on postcard-sized paper or cardstock that could be printed on a printing press for mass distribution, and then read using a device that connected by serial port. This was before CD-ROM.
I also tried taking a photo in direct New Mexico morning sunlight, which left no shadows on the paper, and I got the following error correction statistics.
130 bits bad from 483840, bit error rate 0.0268684%. 0% black dirt, 100% white dirt and 0 (0%) irreparable.
Golay stats
===========
0 bad bits 20030
1 bad bit 130
2 bad bits 0
3 bad bits 0
4 bad bits 0
total codewords 20160
Writing debug image into a_0001_debug.pgm.
Interesting! For comparison, QR codes can store up to ~3 kbytes, according to Wikipedia: https://en.wikipedia.org/wiki/QR_code. They use Reed-Solomon codes, which are a little more advanced. Maybe that technology can be extended to even larger capacity?