Exploring bad passwords

If your password is in the file rockyou.txt then it’s a bad password. Password cracking software will find it instantly. (Use long, randomly generated passwords; staying off the list of worst passwords is necessary but not sufficient for security.)

The rockyou.txt file currently contains 14,344,394 bad passwords. I poked around in the file and this post reports some things I found.

To make things more interesting, I made myself a rule that I could only use command line utilities.

Pure numeric passwords

I was curious how many of these passwords consisted only of digits so I ran the following.

    grep -P '^\d+$' rockyou.txt | wc -l

This says 2,346,744 of the passwords only contain digits, about 1 in 6.

Digit distribution

I made a file of digits appearing in the passwords

    grep -o -P '\d' rockyou.txt > digits

and looked at the frequency of digits.

    for i in 0 1 2 3 4 5 6 7 8 9
        grep -c $i digits

This is what I got:


The digits are distributed more evenly than I would have expected. 1’s are more common than other digits, but only about twice as common as the least common digits.

Longest bad passwords

How long is the longest bad password? The command

    wc -L rockyou.txt

shows that one line in the file is 285 characters long. What is this password? The command

    grep -P '.{285}' rockyou.txt

shows that it’s some HTML code. Nice try whoever thought of that, but you’ve been pwned.

A similar search for all-digit passwords show that the longest numeric passwords are 255 digits long. One of these is a string of 255 zeros.

Dictionary words

A common bit of advice is to not choose passwords that can be found in a database. That’s good advice as far as it goes, but it doesn’t go very far.

I used the comm utility to see how many bad passwords are not in the dictionary by running

    comm -23 sorted dict | wc -l

and the answer was 14,310,684. Nearly all the bad passwords are not in a dictionary!

(Here sorted is a sorted version of the rockyou.txt file; I believe the file is initially sorted by popularity, worst passwords first. The comm utility complained that my system dictionary isn’t sorted, which I found odd, but I sorted it to make comm happy and dict is the sorted file.)

Curiously, the command

    comm -13 sorted dict | wc -l

shows there are 70,624 words in the dictionary (specifically, the american-english file on my Linux box) that are not on the bad password list.

Smallest ‘good’ numeric password

What is the smallest number not in the list of pure numeric passwords? The following command strips leading zeros from purely numeric passwords, sorts the results as numbers, removes duplicates, and stores the results in a file called nums.

    grep -P '^\d+$' rockyou.txt | sed 's/^0\+//' | sort -n | uniq > nums

The file nums begins with a blank. I removed this with sed.

    sed -i 1d nums

Next I used awk to print instances where the line number does not match the line in the file nums.

    awk '{if (NR-$0 < 0) print $0 }' nums | less

The first number this prints is 61. This means that the first line is 1, the second line is 2, and so on, but the 60th line is 61. That means 60 is missing. The file rockyou.txt does not contain 60. You can verify this: the command

    grep '^60$' rockyou.txt

returns nothing. 60 is the smallest number not in the bad password file. There are passwords that contain ’60’ as a substring, but just 60 as a complete password is not in the file.

Related posts

Recursive grep

The regular expression search utility grep has a recursive switch -R, but it may not work like you’d expect.

Suppose want to find the names of all .org files in your current directory and below that contain the text “cheese.”

You have four files, two in the working directory and two below, that all contain the same string: “I like cheese.”

    $ ls -R
    rootfile.org  rootfile.txt  sub
    subfile.org  subfile.txt

It seems that grep -R can either search all files of the form *.org in the current directory, ignoring the -R switch, or search all files recursively if you don’t give it a file glob, but it can’t do both.

    $ grep -R -l cheese *.org
    $ grep -R -l cheese .

One way to solve this is with find and xargs:

    $ find . -name '*.org' | xargs grep -l cheese 

I was discussing this with Chris Toomey and he suggested an alternative using a subshell that seems more natural:

    grep -l cheese $(find . -name '*.org')

Now the code reads more like an ordinary call to grep. From left to right, it essentially says “Search for ‘cheese’ in files ending in .org” whereas the version with find reads like “Find files whose names end in .org and search them for ‘cheese.'” It’s good to understand how both approaches work.

Related posts

A shell one-liner to search directories

I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people mind find the code to find the frequencies more interesting than the frequencies themselves.

So I’m splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.

I want to explore LaTeX files, so I’ll start by using find to find such files.

    find . -name "*.tex"

This searches for files ending in .tex, starting with the current directory (hence .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that’s not what you want.

Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep it will search the file names, but I want it to search the file contents. The xargs command takes care of this, receiving the file names and passing them along as file names, i.e. not as text input.

    find . -name "*.tex" | xargs grep ...

LaTeX commands have the form of a backslash followed by letters, so the regular expression I’ll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.

I’ll give grep four option flags. I’ll use -i to ask it to use case-insensitive matching, because LaTeX commands can begin contain capital letters. I’ll use -E to tell it I want to use extended regular expressions [1].

I’m after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that’s not enough. It would be enough if we were only search one file, but since we’re searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to only return the matches, no file names.

So now we’re up to this:

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'

Next I want to count how many times each command occurs, and I need to sort the output first so that uniq will count correctly.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c

And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending order than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.

    find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less

That’s my one-liner. In the next post I’ll look at the results.

More command line posts

[1] I learned regular expressions from writing Perl long ago. What I think of a simply a regular expression is what grep calls “extended” regular expressions, so adding the -E option keeps me out of trouble in case I use a feature that grep considers an extension. You could use egrep instead, which is essentially the same as grep -E.

Doing a database join with CSV files

relational database

It’s easy to manipulate CSV files with basic command line tools until you need to do a join. When your data is spread over two different files, like two tables in a normalized database, joining the files is more difficult unless the two files have the same keys in the same order. Fortunately, the xsv utility is just the tool for the job. Among other useful features, xsv supports database-like joins.

Suppose you want to look at weights broken down by sex, but weights are in one file and sex is in another. The weight file alone doesn’t tell you whether the weights belong to men or women.

Suppose a file weight.csv has the following rows:


and a file person.csv has the following:


Note that the two files have different ID values: 123 and 789 are in both files, 999 is only in weight.csv and 456 is only in person.csv. We want to join the two tables together, analogous to the JOIN command in SQL.

The command

    xsv join ID person.csv ID weight.csv

does just this, producing


by joining the two tables on their ID columns.

The command includes ID twice, once for the field called ID in person.csv and once for the field called ID in weight.csv. The fields could have different names. For example, if the first column of person.csv were renamed Key, then the command

    xsv join Key person.csv ID weight.csv

would produce


We’re not interested in the ID columns per se; we only want to use them to join the two files. We could suppress them in the output by asking xsv to select the second and fourth columns of the output

    xsv join Key person.csv ID weight.csv | xsv select 2,4

which would return


We can do other kinds of joins by passing a modifier to join. For example, if we do a left join, we will include all rows in the left file, person.csv, even if there isn’t a match in the right file, weight.csv. The weight will be missing for such records, and so

    $ xsv join --left Key person.csv ID weight.csv



Right joins are analogous, including every record from the second file, and so

    xsv join --right Key person.csv ID weight.csv



You can also do a full join, with

    xsv join --full Key person.csv ID weight.csv



More data wrangling posts

Exporting Excel files to CSV with in2csv

This post shows how to export an Excel file to a CSV file using in2csv from the csvkit package.

You could always use Excel itself to export an Excel file to CSV but there are several reasons you might not want to. First and foremost, you might not have Excel. Another reason is that you might want to work from the command line in order to automate the process. Finally, you might not want the kind of CSV format that Excel exports.

For illustration I made a tiny Excel file. In order to show how commas are handled, populations contain commas but areas do not.

See post content for text dump of data

When I ask Excel to export the file I get


Note that areas are exported as plain integers, but populations are exported as quoted strings containing commas.

Using csvkit

Now install csvkit and run in2csv.

    $ in2csv states.xlsx

The output goes to standard out, though of course you could redirect the output to a file. All numbers are exported as numbers, with no thousands separators. This makes the output easier to use from a program that does crude parsing [1]. For example, suppose we save states.xlsx to states.csv using Excel then ask cut for the second column. Then we don’t get what we want.

    $ cut -d, -f2 states.csv

But if we use in2csv to create states.csv then we get what we’d expect.

    cut -d, -f2 states.csv

Multiple sheets

So far we’ve assumed our Excel file has a single sheet. I added a second sheet with data on US territories. The sheet doesn’t have a header row just to show that the header row isn’t required.


I named the two sheets “States” and “Territories” respectively.

States, Territories

Now if we ask in2csv to export our Excel file as before, it only exports the first sheet. But if we specify the Territories sheet, it will export that.

    $ in2csv --sheet Territories states.xlsx

More CSV posts

[1] The cut utility isn’t specialized for CSV files and doesn’t understand that commas inside quotation marks are not field separators. A specialized utility like csvtool would make this distinction. You could extract the second column with

    csvtool col 2 states.csv


    csvtool namedcol Population states.csv

This parses the columns correctly, but you still have to remove the quotation marks and thousands separators.

Minimizing context switching between shell and Python

Sometimes you’re in the flow using the command line and you’d like to briefly switch over to Python without too much interruption. Or it could be the other way around: you’re in the Python REPL and need to issue a quick shell command.

One solution would be to run your shell and Python session in different terminals, but let’s assume that for whatever reason you’re working in just one terminal. For example, maybe you want the output of a shell command to be visually close when you run Python, or vice versa.

Calling Python from shell

You can run a Python one-liner from the shell by calling Python with the -c option. For example,

    $ python -c "print(3*7)"

I hardly ever do this because I want to run more than a one-liner. What I find more useful is to launch Python with the -q option to suppress all the start-up verbiage and simply bring up a prompt.

    $ python -q

More on this in my post on quiet modes.

Calling shell from Python

If you run Python with the ipython command rather than the default python you get much better shell integration. IPython let’s you type a shell command at any point simply by preceding it with a !. For example, the following command tells us this is the 364th day of the year.

    In [1]: ! date +%j

You can run some of the most common shell commands, such as cd and ls without even a bang prefix. These are “magic” commands that do what you’d expect if you forgot for a moment that you’re in a Python REPL rather than a command shell.

    In [2]: cd ..
    Out[2]: '/mnt/c/Users'

IPython also supports other forms of shell integration such as capturing the output of a shell command as a Python variable, or using a Python variable as an argument to a shell command.

More on context switching and shells

Why can’t grep find negative numbers?

Suppose you’re looking for instances of -42 in a file foo.txt. The command

    grep -42 foo.txt

won’t work. Instead you’ll get a warning message like the following.

    Usage: grep [OPTION]... PATTERN [FILE]...
    Try 'grep --help' for more information.

Putting single or double quotes around -42 won’t help. The problem is that grep interprets 42 as a command line option, and <codegrep doesn’t have such an option. This is a problem if you’re searching for negative numbers, or any pattern that beings with a dash, such as -able or --version.

The solution is to put -e in front of a regular expression containing a dash. That tells grep that the next token at the command line is a regular expression, not a command line option. So

    grep -e -42 foo.txt

will work.

You can also use -e several times to give grep several regular expressions to search for. For example,

    grep -e cat -e dog foo.txt

will search for “cat” or “dog.”

See the previous post for another example of where grep doesn’t seem to work. By default grep supports a restricted regular expression syntax and may need to be told to use “extended” regular expressions.

Why doesn’t grep work?

If you learned regular expressions by using a programming language like Perl or Python, you may be surprised when tools like grep seem broken. That’s because what you think of as simply regular expressions, these tools consider extended regular expressions. Tell them to search on extended regular expressions and some of your frustration will go away.

As an example, we’ll revisit a post I wrote a while back about searching for ICD-9 and ICD-10 codes with regular expressions. From that post:

Most ICD-9 diagnosis codes are just numbers, but they may also start with E or V. Numeric ICD-9 codes are at least three digits. Optionally there may be a decimal followed by one of two more digits. … Sometimes the decimals are left out.

Let’s start with the following regular expression.


This says to look for three instances of the digits 0 through 9, optionally followed by a literal period, followed by zero, one, or two more digits. (Since . is a special character in regular expressions, we have to use a backslash to literally match a period.)

The regular expression above will work with Perl or Python, but not with grep or sed by default. That’s because it uses two features of extended regular expressions (ERE), but programs like grep and sed support basic regular expressions (BRE) by default.

Basic regular expressions would use \{3\} rather than {3} to match a pattern three times. So, for example,

   echo 123 | grep "[0-9]\{3\}"

would return 123, but

   echo 123 | grep "[0-9]{3}"

would return nothing.


    echo 123 | sed -n "/[0-9]\{3\}/p"

would return 123 but

    echo 123 | sed -n "/[0-9]{3}/p"

returns nothing.

(The -n option to sed tells it not to print every line by default. The p following the regular expression tells sed to print those lines that match the pattern. Here there’s only one line, the output of echo, but typically grep and sed would be use on files with multiple lines.)

Turning on ERE support

You can tell grep and sed that you want to use extended regular expressions by giving either one the -E option. So, for example, both

   echo 123 | grep -E "[0-9]{3}"


    echo 123 | sed -E -n "/[0-9]{3}/p"

will print 123.

You can use egrep as a synonym for grep -E, at least with Gnu implementations.

Incidentally, awk uses extended regular expressions, and so

    echo 123 | awk "/[0-9]{3}/"

will also print 123.

Going back to our full regular expression, using \.? for an optional period works with grep and sed if we ask for ERE support. The following commands all print 123.4.

    echo 123.4 | grep -E "[0-9]{3}\.?[0-9]{0,2}"
    echo 123.4 | sed -E -n "/[0-9]{3}\.?[0-9]{0,2}/p"
    echo 123.4 | awk "/[0-9]{3}\.[0-9]{0,2}/"

Without the -E option, grep and sed will not return a match.

This doesn’t fix everything

At the top of the post I said that if you tell tools you want extended regular expression support “some of your frustration will go away.” The regular expression from my ICD code post was actually


rather than


I used the shortcut \d to denote a digit. Python, Perl, and Awk will understand this, but grep will not, even with the -E option.

grep will understand \d if instead you use the -P option, telling it you want to use Perl-compatible regular expressions (PCRE). The Gnu version of grep supports this option, but the man page says “This is experimental and grep -P may warn of unimplemented features.” I don’t know whether other implementations of grep support PCRE. And sed does not have an option to support PCRE.


Set theory at the command line

Often you have two lists, and you want to know what items belong to both lists, or which things belong to one list but not the other. The command line utility comm was written for this task.

Given two files, A and B, comm returns its output in three columns:

  1. AB
  2. BA
  3. AB.

Here the minus sign means set difference, i.e. AB is the set of things in A but not in B.

Venn diagram of comm parameters
The numbering above will be correspond to command line options for comm. More on that shortly.

Difference and intersection

Here’s an example. Suppose we have a file states.txt containing a list of US states


and a file names.txt containing female names.


Then the command

    comm states.txt names.txt

returns the following.


The first column is states which are not female names. The second column is female names which are not states. The third column is states which are also female names.

Filtering output

The output of comm is tab-separated. So you could pull out one of the columns by piping the output through cut, but you probably don’t want to do that for a couple reasons. First, you might get unwanted spaces. For example,

    comm states.txt names.txt | cut -f 1




Worse, if you ask cut to return the second column, you won’t get what you expect.



This is because although comm uses tabs, it doesn’t produce a typical tab-separated file.

The way to filter comm output is to tell comm which columns to suppress.

If you only want the first column, states that are not names, use the option -23, i.e. to select column 1, tell comm not to print columns 2 or 3. So

    comm -23 states.txt names.txt



with no blank lines. Similarly, if you just want column 2, use comm -13. And if you want the intersection, column 3, use comm -12.


The comm utility assumes its input files are sorted. If they’re not, it will warn you.

If your files are not already sorted, you could sort them and send them to comm in a one-liner [1].

    comm <(sort states.txt) <(sort names.txt)


If your input files are truly sets, i.e. no item appears twice, comm will act as you expect. If you actually have a multiset, i.e. items may appear more than once, the output may surprise you at first, though on further reflection you may agree it does what you’d hope.

Suppose you have a file places.txt of places


and a file of US Presidents.



    comm places.txt presidents.txt

produces the following output.


Cleveland was on the list of presidents twice. (Grover Cleveland was the 22nd and the 24th president, the only US president to serve non-consecutive terms.) The output of comm lists Cleveland once in the intersection of places and presidents (say 22nd president and a city in Ohio), but also lists him once as a president not corresponding to a place. That is because there are two Clevelands on our list of presidents, there is only one in our list of places.

Suppose we had included five places named Cleveland (There are cities named Cleveland in several states). Then comm would list Cleveland 3 times as a place not a president, and 2 times as a place and a president.

In general, comm utility follows the mathematical conventions for multisets. Suppose an item x appears m times in multiset A, and n times in a multiset B. Then x appears max(mn, 0) times in AB, max(nm, 0) times in BA, and min(m, n) times in AB.


To form the union of two files as multisets of lines, just combine them into one file, with duplicates. You can join file1 and file2 with cat (short for “concatenate”).

    cat file1 file2 > multiset_union_file

To find the union of two files as sets, first find the union as multisets, then remove duplicates.

    cat file1 file2 | sort | uniq > set_union_file

Note that uniq, like comm, assumes files are sorted.


[1] The commutility is native to Unix-like systems, but has been ported to Windows. The examples in this post will work on Windows with ports of the necessary utilities, except for where we sort a file before sending it on to comm with <(sort file). That’s not a feature of sort or comm but of the shell, bash in my case. The Windows command line doesn’t support this syntax. (But bash ported to Windows would.)