There’s a little program called strings that searches for what appear to be strings inside a binary file. I’ll refer to it as strings(1) to distinguish the program name from the common English word strings. [1]
What does strings(1) consider to be a string? By default it is a sequence of four or more bytes that correspond to printable ASCII characters. There are command line options to change the minimum sequence length and the character encoding.
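To make the default rule concrete, here is a minimal Python sketch of the idea. It is not the actual implementation of strings(1). The function name find_strings is made up for illustration, and it assumes "printable" means the byte range 0x20 through 0x7E, which may not match exactly what a given version of strings(1) accepts.

```python
import re


def find_strings(data: bytes, min_len: int = 4) -> list[bytes]:
    """Return every run of at least min_len printable ASCII bytes.

    Assumption: "printable" means 0x20 (space) through 0x7E (~).
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return pattern.findall(data)


# "abc" is only three bytes long, so it falls below the default minimum.
sample = b"\x00\x01abc\x02hello world\xff\x10key=42\x00"
print(find_strings(sample))  # [b'hello world', b'key=42']
```

Passing a different min_len plays the same role as changing the sequence length on the command line.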
There are 98 printable ASCII characters [2] and 256 possible values for an 8-bit byte, so the probability of a byte being a printable character is
p = 98/256 = 0.3828125.
This implies that the probability of strings(1) flagging a sequence of four random bytes as a string is p^4 or about 2%.
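The arithmetic is easy to check. The snippet below takes the count of 98 printable characters from above as given and assumes the four bytes are independent and uniformly distributed.

```python
# Probability that one uniformly random byte is printable,
# using the count of 98 printable characters from the text.
p = 98 / 256

# Probability that four independent random bytes are all printable.
p4 = p ** 4

print(p)             # 0.3828125
print(f"{p4:.4f}")   # 0.0215
```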
How long a string might you find inside a binary file?
I ran strings(1) on a photograph and found a 46-character string:
.IEC 61966-2.1 Default RGB colour space - sRGB
Of course this isn’t random. This string is part of the metadata stored inside JPEG images.
Next I encrypted the file so that none of the strings found would be metadata and all of them would be false positives. The longest string was 12 characters:
Z<Bq{7fH}~G9
How does this compare to what we might expect from a random file? I wrote about the probability of long runs a dozen years ago. In a file of n bytes, the expected length of the longest run of printable characters is approximately
log_{1/p}(n(1 - p)).
In my case, the file had n = 203,308 bytes. The expected length of the longest run of printable characters would then be 12.2 characters, and so the actual length of the longest run is in line with what theory would have predicted.
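Here is a short sketch that evaluates the approximation log_{1/p}(n(1 - p)) for this file size and compares it with one simulated random file. The function names are made up for illustration, and the simulation models each byte as an independent trial that is printable with probability p = 98/256 rather than generating actual bytes. A single simulated run will vary around the expected value.

```python
import math
import random


def expected_longest_run(n: int, p: float) -> float:
    """Approximate expected longest run of successes in n trials."""
    return math.log(n * (1 - p), 1 / p)


def longest_run(flags) -> int:
    """Length of the longest consecutive run of True values."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


n = 203_308
p = 98 / 256

print(round(expected_longest_run(n, p), 1))  # 12.2

# One simulated "file": each trial is printable with probability p.
rng = random.Random(0)  # fixed seed so the run is repeatable
print(longest_run(rng.random() < p for _ in range(n)))
```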
[1] Unix documentation is separated into sections, and parentheses after a name specify the documentation section. Section 1 is for programs, Section 2 is for system calls, etc. So, for example, chmod(1) is the command line utility named chmod, and chmod(2) is the system call by the same name. Since command line utilities often have names that are common words, tacking (1) on the end helps distinguish program names from English words.
[2] p could be a little larger if you consider some of the control characters to be printable.