Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. For example,

    shuf -n 10 foo.txt

would select 10 lines from foo.txt.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with less than 10 lines. If you ask for an impossible number of lines, the -n option is ignored.

You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

    shuf -r -n 10 foo.txt

to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing

    alpha
    beta
    gamma

I got the output

    beta
    gamma
    gamma
    beta
    alpha
    alpha
    gamma
    gamma
    beta

I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.

Related

5 thoughts on “Random sampling from a file

  1. Back in the ’80’s we used shuf to generate both our login MOTD and finger responses.

    Ah, those were the days.

  2. Apropos the seed: GNU shuf has a flag –random-source=FILE to set a custom random source. I am not sure that I would use it to generate secure passwords, but still, it is a start.

Leave a Reply

Your email address will not be published. Required fields are marked *