Resolving a mysterious problem with find

Suppose you want to write a shell script that searches the current directory for files that have a keyword in the name of the file or in its contents. Here’s a first attempt.

find . -name '*.py' -type f -print0 | grep -i "$1"
find . -name '*.py' -type f -print0 | xargs -0 grep -il "$1"

This works well for searching file contents but behaves unexpectedly when searching for file names.

If I have a file named frodo.py in the directory, the script will return

grep: (standard input): binary file matches

Binary file matches?! I wasn’t searching binary files. I was searching files with names consisting entirely of ASCII characters. Where is a binary file coming from?

If we cut off the pipe at the end of the first line of the script and run

find . -name '*.py' -type f -print0

we get something like

./elwing.py./frodo.py./gandalf.py

with no apparent non-ASCII characters. But if we pipe the output through xxd to see a hex dump, we see that there are invisible null characters after each file name.
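Here is a minimal sketch of that check. It uses od -c rather than xxd, on the assumption that od is more widely installed, and the scratch directory and file names are made up for illustration.

```shell
# Create a scratch directory with two hypothetical files.
demo=$(mktemp -d)
touch "$demo/elwing.py" "$demo/frodo.py"

# Keep only the NUL bytes from find's output and count them.
# There should be one per file name found.
nul_count=$(find "$demo" -name '*.py' -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')
echo "NUL bytes: $nul_count"

# Dump the output byte by byte; od -c shows each NUL as \0.
find "$demo" -name '*.py' -type f -print0 | od -c

# Clean up the scratch directory.
rm -r "$demo"
```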

One way to try to fix our script would be to add the -a option to the call to grep, telling it to treat the input as text. But then grep returns the same run-together output as above. The output of find is treated as one long (ASCII) line, and the whole line matches the regular expression.

Another possibility would be to add a -o flag to direct grep to return just the match. But this is less than ideal as well. If you were looking for file names containing a Q, for example, you’d get Q as your output, which doesn’t tell you the full file name.
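Both behaviors are easy to see on simulated input. The sketch below assumes three hypothetical file names and uses printf to stand in for the output of find with -print0.

```shell
# Simulated output of find -print0 for three hypothetical files.
names() { printf './elwing.py\0./frodo.py\0./gandalf.py\0'; }

# With -a, grep sees a single "line" holding all three names,
# so -c reports one matching line rather than one matching file.
count=$(names | grep -a -i -c 'frodo')
echo "matching lines with -a: $count"

# With -o, grep prints only the matched text, not the name around it.
only=$(names | grep -a -i -o 'frodo')
echo "output with -o: $only"
```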

There may be better solutions [1], but my solution was to insert a call to strings in the pipeline:

find . -name '*.py' -type f -print0 | strings | grep -i "$1"

This will extract the ASCII strings out of the input it receives, which has the effect of splitting the string of file names into individual names.

By default the strings command defines an ASCII string to be a string of 4 or more consecutive printable characters. A file with anything before the .py extension will necessarily have at least four characters in its name. The analogous script to search C source files could in principle overlook a file named x.c, though as written find prints it as ./x.c, which is five characters long and so is still caught. If you feed strings bare names without the ./ prefix, you could fix this by using strings -n 3 to find sequences of three or more characters.

If you don’t have the strings command installed, you could use sed to replace the null characters with newlines.

find . -name '*.py' -type f -print0 | sed 's/\x0/\n/g' | grep -i "$1"

Note that the null character is denoted \x0 rather than simply \0.
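As far as I know, the \x0 and \n escapes in that sed command are GNU extensions. A variant that swaps sed for tr should behave the same way on more systems. This is only a sketch, and the function name search_names is made up for illustration.

```shell
# Hypothetical helper: list .py files under the current directory
# whose path matches the pattern given as the first argument.
search_names() {
  # tr turns each NUL separator into a newline so that grep sees
  # one file name per line; -- protects patterns that begin with a dash.
  find . -name '*.py' -type f -print0 | tr '\0' '\n' | grep -i -- "$1"
}
```

Calling search_names frodo in a directory containing frodo.py would then print the path of that file on its own line.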

[1] See the comments for better solutions. I really appreciate your feedback. I’ve learned a lot over the years from reader comments.

8 thoughts on “Resolving a mysterious problem with find”

  1. Use `-print0` for `xargs` to pass file names with whitespace to other programs, but not for `grep`, which simply searches each newline-terminated line.

    It is odd that `-0` for `xargs` has the behavior of not replacing the NUL byte. I’m not able to quite reproduce this.

    `od -c` is your friend here.

  2. The -z option for grep tells it to treat “lines” as nul-character delimited, which is what you want if you are using the -print0 option to find.

  3. The easiest way to search filenames is probably just to use `find` itself:

    find . -iname "*$1*.py" -type f

  4. You could use find directly with any regular expression instead …
    find [path] -regex [regular_expression]

  5. Instead of `find`, in most circumstances, I prefer using `tree`. For example:

    `tree -afiQ [path] | grep -iE [pattern]`

    where:

    `-a` includes hidden files in the output,
    `-f` prints the full path prefix for each file,
    `-i` removes indentations in the returned result, and
    `-Q` encloses each result in double-quotes.

    These options are really useful if you’re using the results in a script.

    For symbolic links, the results also indicate the files they refer to.

  6. I’ve moved on from the days of poring through the find (or other utilities) man pages. Now, I just outline my problem to chatgpt/etc and ask for the answer. I do have the background to understand the suggested command line generated by the AI tool; from there it’s easy for me to tweak any final options.

  7. Piping find into xargs is a very old technique. Originally find had only -exec … {} \; which was slow because it execs one command per file and xargs could be faster. Modern find has -exec … {} + which can be more efficient than piping into xargs and requires less typing.

    find . -name '*.py' -type f -exec grep -i "$1" {} +

    There are still some cases where you might want to use xargs with find. You can use xargs to precisely control the number of arguments processed at a time with the -n flag. A better example is if you want parallel execution. This might also use -n with something such as xargs -0 -n1 -P4.
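The suggestions in comments 2 and 3 can be sketched as follows. This assumes GNU grep for the -z option, and the function names are hypothetical. Note that with -z grep also terminates its output lines with NUL, so the sketch converts them to newlines for display.

```shell
# Comment 2: have grep treat NUL-delimited names as its "lines".
# grep -z also writes NUL-terminated output, so tr makes it readable.
search_names_z() {
  find . -name '*.py' -type f -print0 | grep -z -i -- "$1" | tr '\0' '\n'
}

# Comment 3: let find do the case-insensitive name matching itself.
# Here the argument is a glob fragment, not a regular expression.
search_names_find() {
  find . -iname "*$1*.py" -type f
}
```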
