Sort and remove duplicates

A common idiom in command line processing of text files is

    ... | sort | uniq | ...

Some process produces lines of text. You want to pipe that text through sort to sort the lines in alphabetical order, then pass it to uniq to filter out all but the unique lines. The uniq utility only removes adjacent duplicates, and so it will not remove all duplicates unless the input is sorted. (At least the duplicate lines need to be grouped together; the groups need not be in any particular order.)

When given the -u flag, sort will sort and remove duplicates. This says the idiom above could be shortened to

    ... | sort -u | ...

Aside from saving a few keystrokes, is there any advantage to the latter? There could be, depending on how sort -u is implemented. If internally it simply sorts its input and then removes duplicates, then there is no advantage. But if the code simultaneously sorts and removes duplicates, it could save memory and time, depending on the input. If the code discarded duplicates as their were encountered, the code would need working memory proportional to the amount of unique input rather than the total amount of input.

I had a project recently that makes a good test case for this. The Gutenberg text corpus contains a list of words for 55,000 different sources, each in a separate text file. There are a lot of files, and there is a lot of redundancy between files. The combined file is 3.4 GB.

Running sort | uniq on the combined file took 610 seconds.

Running sort -u on the file took 394 seconds.

So in this example, sort -u not only saved a few keystrokes, it took about 35% off the time.

I expected it would save even more time. Maybe a custom program optimized for large files with a lot of redundancy could be even faster.

Update: awk

Bob Lyon’s comment made me think about using awk without concatenating the files together. Each of the 55,000 text files contains a word and a count. I didn’t concatenate the files per se but piped each through cut -f 1 to select the first column.

Using awk I created an associative array (a hash table, but awk calls them associative arrays) for each word, then printed out the keys of the array.

    awk '{count[$1]++}; 
        END {for (key in count) {print key}}' | 
        sort > out

This ran in 254 seconds, 58% faster than sort | uniq and 36% faster than sort -u.

There is a lot of redundancy between the files—the list of unique words is 320 times smaller than the union of all the input files—and the awk script takes advantage of this by maintaining an array roughly the size of the output rather than the size of the input.

Tim Chase suggested a more elegant solution:

    awk '!count[$1]++ {print $1}' *.txt | sort > out

It seems this should be more efficient as well since awk only goes through the data once, immediately printing new values as they are encountered. However, it actually took slightly longer, 263 seconds.

4 thoughts on “Sort and remove duplicates

  1. Using `sort -u` is definitely the correct idiom in this case. However, `uniq` can count the number of duplicates too, which I don’t think `sort` can do on itself (not fundamentally, it simply isn’t implemented). One still needs to use `sort | uniq -c` in such cases.

  2. A topic near and dear to my heart! The password-cracking community has worked in this space for some time, for its own specific ends. A couple of tools stand out:

    1. The hashcat-utils suite includes “rli” (assumed unsorted input, memory bound) and “rli2” (assumes sorted input, not memory bound) for merge/dedupe activities – specifically, for removing all of the contents of one file from another file.

    https://hashcat.net/wiki/doku.php?id=hashcat_utils#rli
    https://hashcat.net/wiki/doku.php?id=hashcat_utils#rli2

    2. For more complex use cases, ‘Waffle’ of the Cynosure Prime team as author (with myself as the primary “corner tenant” use-case stakeholder during development) produced ‘rling’ (“RLI next gen”). It is multithreaded, provides estimates of memory usage and other rich feedback, can replace files in place, can produce lists by frequency, and has a number of other features driven by our use cases.

    https://github.com/Cynosureprime/rling

  3. The following two commands will remove duplicate lines if the input file can fit into an in-memory hash table:

    awk ‘!count[$0]++’ file.txt

    perl -ne ‘print if ! $count{$_}++’ file.txt

  4. I think I have “learned” about sort -u several times, and for whatever reason it never sticks. I have been typing sort | uniq for so many years I think it is nearly muscle memory now. Perhaps these data regarding the relative performance of each will be compelling enough to inspire change…

    I do agree with Omid though that a nontrivial amount of time I end up using uniq -c, sometimes on the second go-round, so it’s quite nice to have the pipeline all ready to go and have a -c added at the end. A small thing, granted.

Leave a Reply

Your email address will not be published. Required fields are marked *