
Microsoft replacing SHA-1

According to this article, Microsoft is patching Windows 7 and Windows Server 2008 to look for SHA-2 hash functions of updates. These older versions of Windows have been using SHA-1, while newer versions are already using SHA-2.

This is a good move, but unnecessary. Here’s what I mean by that. The update was likely unnecessary for reasons I’ll explain below, but it was easy to do, and it increased consistency across Microsoft’s product line. It’s also good PR.

What are SHA-1 and SHA-2?

Let’s back up a bit. SHA-1 and SHA-2 are secure hash functions [1]. They take a file, in this case a Microsoft software update, and return a number that is small relative to the original file size. In the case of SHA-1, the result is 160 bits (20 bytes). They’re designed so that if a file is changed, the function value is nearly certain to change. That is, it’s extremely unlikely that a change to the file would not result in a change to the hash value.
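As a small illustration, the sketch below uses Python’s standard hashlib module to hash two byte strings that differ in a single character. The byte strings are made-up stand-ins for an update file, not anything Microsoft ships.

```python
import hashlib

# Made-up stand-ins for an update file and a tampered copy of it.
original = b"contents of a software update"
tampered = b"contents of a software updatE"  # last character changed

sha1_original = hashlib.sha1(original).hexdigest()
sha1_tampered = hashlib.sha1(tampered).hexdigest()
sha256_original = hashlib.sha256(original).hexdigest()

# Each hexadecimal digit encodes 4 bits.
print(len(sha1_original) * 4)          # 160 bits for SHA-1
print(len(sha256_original) * 4)        # 256 bits for SHA-256, a SHA-2 function
print(sha1_original != sha1_tampered)  # the small change alters the hash
```

The digest length depends only on the hash function, not on the size of the input.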

The concern isn’t accidental changes. The probability of accidentally producing two files with the same hash function value is tiny as I show here.

The concern is a clever attacker who could modify the software update in such a way that the hash value remains unchanged, bypassing the hash as a security measure. That would be harder to do with SHA-2 than with SHA-1, hence Microsoft’s decision years ago to move to SHA-2 for new versions of the operating system, and its recent decision to make the change retroactive.

How hard is it to produce collisions?

By a collision we mean two files that hash to the same value. It’s obvious from the pigeon hole principle [2] that collisions are possible, but how hard are they to produce deliberately?

Google demonstrated two years ago that it could produce two PDF files with the same SHA-1 hash value. But doing so required over 6,500 years of CPU time running in parallel [3]. Also, Google started with a file designed to make collisions possible. According to their announcement,

We started by creating a PDF prefix specifically crafted to allow us to generate two documents with arbitrary distinct visual contents, but that would hash to the same SHA-1 digest.

It would be harder to start with a specified input, such as a software update file and generate a collision. It would be harder still to generate a collision that had some desired behavior.

According to this page, it’s known how to tamper with two files simultaneously so that they will have the same SHA-1 hash values. This is what Google did, at the cost of thousands of CPU years. But so far, nobody has been able to start with a given file and create another file with the same SHA-1 value.
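To get a feel for the difference between the two problems, here is a toy sketch that truncates SHA-1 to 16 bits so that both searches finish quickly. The truncation, the candidate messages, and the target string are all assumptions made for illustration; nothing here says anything about the cost of attacking the full 160-bit function.

```python
import hashlib
from itertools import count

def toy_hash(data: bytes) -> bytes:
    """First 2 bytes (16 bits) of SHA-1: a deliberately weak stand-in."""
    return hashlib.sha1(data).digest()[:2]

def candidates():
    """Candidate messages b"0", b"1", b"2", ..."""
    return (str(i).encode() for i in count())

def find_collision():
    """Find ANY two messages with the same toy hash.

    With only 65,536 possible values, the pigeon hole principle
    guarantees success within 65,537 candidates.
    """
    seen = {}
    for tries, message in enumerate(candidates(), start=1):
        digest = toy_hash(message)
        if digest in seen:
            return seen[digest], message, tries
        seen[digest] = message

def find_second_preimage(target: bytes):
    """Find a different message whose toy hash matches a GIVEN target."""
    goal = toy_hash(target)
    for tries, message in enumerate(candidates(), start=1):
        if message != target and toy_hash(message) == goal:
            return message, tries

a, b, collision_tries = find_collision()
other, preimage_tries = find_second_preimage(b"software update")
print(collision_tries, preimage_tries)
```

For an n-bit hash, the first search is expected to take on the order of 2^(n/2) tries and the second on the order of 2^n, which is why matching a specified file is so much harder than producing some pair of colliding files.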

As I said at the beginning, it made sense for Microsoft to decide to move from SHA-1 to SHA-2 because the cost of doing so was small. But the use of SHA-1 hash codes is probably not the biggest security risk in Windows 7.


[1] SHA-1 is a hash function, but SHA-2 is actually a family of hash functions: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. All are believed to provide at least 112 bits of security, while SHA-1 provides less than 63.

The SHA-x functions output x bits. The SHA-x/y functions use x bits of internal state and output y bits. To be consistent with this naming convention, SHA-1 should be called SHA-160.

[2] The pigeon hole principle says that if you put more than n things into n boxes, one of the boxes has to have more than one thing. If you hash files of more than n bits to n-bit numbers, at least two files have to go to the same value.

[3] If you were to rent this much CPU time in the cloud at 5 cents per CPU hour, it would cost about $2,800,000. If the only obstacle were the cost of computing resources, someone might be willing to pay that to tamper with a Microsoft update.
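The arithmetic behind that estimate can be checked in a few lines. The 5 cent rate comes from the footnote; the 365.25-day year is an assumption.

```python
cpu_years = 6500
hours_per_year = 365.25 * 24   # 8766 hours, assuming a 365.25-day year
dollars_per_cpu_hour = 0.05    # rate assumed in the footnote

cpu_hours = cpu_years * hours_per_year
cost = cpu_hours * dollars_per_cpu_hour
print(f"{cpu_hours:,.0f} CPU hours, about ${cost:,.0f}")
# 56,979,000 CPU hours, about $2,848,950
```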

Supercookies


Supercookies, also known as evercookies or zombie cookies, are like browser cookies in that they can be used to track you, but are much harder to remove.

What is a supercookie?

The way I first heard supercookies described was as a cookie that you can appear to delete, but as soon as you do, software rewrites the cookie. Like the Hydra from Greek mythology, cutting off a head does no good because it grows back [1].

This explanation is oversimplified. It doesn’t quite work that way.

A supercookie is not a cookie per se. It’s anything that can be used to uniquely identify your browser: font fingerprinting, flash cache, cached images, browser plugins and preferences, etc. Deleting your cookies has no effect because a supercookie is not a cookie.

However, a supercookie can work with other code to recreate deleted cookies, and so the simplified description is not entirely wrong. A supercookie could alert web sites that a cookie has been deleted, and allow those sites to replace that cookie, or update the cookie if some browser data has changed.

What about ‘Do Not Track’?

You can ask sites to not track you, but this works on an honor system and is ignored with impunity, even (especially?) by the best known companies.

Apple has announced that it is removing Do Not Track from its Safari browser because the feature is worse than useless. Servers don’t honor it, and it gives a false sense of privacy. Not only that, the DNT setting is one more bit that servers could use to identify you! Because only about 9% of users turn on DNT, knowing that someone has it turned on gives about 3.5 bits of information toward identifying that person.
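The 3.5 bit figure is the surprisal of an event with probability 0.09, that is, the negative base-2 logarithm of that probability. A quick check, taking the 9% figure above as given:

```python
import math

p_dnt_on = 0.09  # share of users with DNT turned on, as quoted above

bits_if_on = -math.log2(p_dnt_on)       # information revealed when DNT is on
bits_if_off = -math.log2(1 - p_dnt_on)  # information revealed when DNT is off

print(round(bits_if_on, 2))   # 3.47
print(round(bits_if_off, 2))  # 0.14
```

Leaving DNT off reveals far less, because that is what most people do.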

How to remove supercookies

How do you remove supercookies? You can’t. As explained above, a supercookie isn’t a file that can be removed. It’s a procedure for exploiting a combination of data.

You could remove specific ways that sites try to identify you. You could, for example, remove Flash to thwart attempts to exploit Flash’s data, cutting off one head of the Hydra. This might block the way some companies track you, but there are others.

It’s an arms race. As fingerprinting techniques become well known, browser developers and users try to block them, and those intent on identifying you come up with more creative approaches.

The economics of identification

Given the efforts companies use to identify individuals (or at least their devices), it seems it must be worth it. At least companies believe it’s worth it, and for some it probably is. But there are reasons to believe that tracking isn’t as valuable as it seems. For example, this article argues that the most valuable targeting information is freely given. You know who is interested in buying weighted blankets? People who search on weighted blankets!

There have been numerous anecdotal stories recently of companies that have changed their marketing methods in order to comply with GDPR and have increased their sales. These are only anecdotes, but they suggest that at least for some companies, there are profitable alternatives to identifying customers who don’t wish to be identified.


[1] In the Greek myth, cutting off one head of the Hydra caused two heads to grow back. Does deleting a supercookie cause it to come back stronger? Maybe. Clearing your cookies is another user behavior that can be used to fingerprint you.

Probabilistic Identifiers in CCPA

The CCPA, the California Consumer Privacy Act, was passed last year and goes into effect at the beginning of next year. And just as the GDPR impacts businesses outside Europe, the CCPA will impact businesses outside California.

The law specifically mentions probabilistic identifiers.

“Probabilistic identifier” means the identification of a consumer or a device to a degree of certainty of more probable than not based on any categories of personal information included in, or similar to, the categories enumerated in the definition of personal information.

So anything that gives you better than a 50% chance of guessing personal data fields [1]. That could be really broad. For example, the fact that you’re reading this blog post makes it “more probable than not” that you have a college degree, and education is one of the categories mentioned in the law.

Personal information

What are these enumerated categories of personal information mentioned above? They start out specific:

Identifiers such as a real name, alias, postal address, unique personal identifier, online identifier, Internet Protocol address, email address, …

but then they get more vague:

purchasing or consuming histories or tendencies … interaction with an Internet Web site … professional or employment-related information.

And in addition to the vague categories are “any categories … similar to” these.

Significance

What is the significance of a probabilistic identifier? That’s hard to say. A large part of the CCPA is devoted to definitions, and some of these definitions don’t seem to be used. Maybe this is a consequence of the bill being rushed to a vote in order to avoid a ballot initiative. Maybe the definitions were included in case they’re needed in a future amended version of the law.

The CCPA seems to give probabilistic identifiers the same status as deterministic identifiers:

“Unique identifier” or “Unique personal identifier” means … or probabilistic identifiers that can be used to identify a particular consumer or device.

That seems odd. Data that can give you a “more probable than not” guess at someone’s “purchasing or consuming histories” hardly seems like a unique identifier.

Devices

It’s interesting that the CCPA says “a particular consumer or device.” That would seem to include browser fingerprinting. That could be a big deal. Identifying devices, but not directly people, is a major industry.


[1] Nothing in this blog post is legal advice. I’m not a lawyer and I don’t give legal advice. I enjoy working with lawyers because the division of labor is clear: they do law and I do math.

Font Fingerprinting

Browser fingerprint

Web sites may not be able to identify you, but they can probably identify your web browser. Your browser sends a lot of information back to web servers, and the combination of settings for a particular browser is usually unique. To get an idea what information we’re talking about, you could take a look at Device Info.

Installed fonts

One of the pieces of information that gets sent back to servers is the list of fonts installed on your device. Your font fingerprint is just one component of your browser fingerprint, but it’s an easy component to understand.

Application fonts

Various applications install their own fonts. If you’ve installed Microsoft Office, for example, that would be evident in your list of fonts. However, Office is ubiquitous, so that information doesn’t go very far toward identifying you. Maybe the lack of fonts installed with Office would be more conspicuous.

Less common software goes further toward identifying you. For example, I have Mathematica on one of my computers, and along with it Mathematica fonts, something that’s not too common.

Personal fonts

Then there are the fonts you’ve installed deliberately, many of them free. Maybe you’ve installed fonts to support various languages, such as Hebrew and Greek fonts for Bible scholars. Maybe you have dyslexia and have installed fonts that are easier for you to read. Maybe you’ve installed a font because it contains technical symbols you need for your work. These increase the chances that your combination of fonts is unique.

Commercial fonts

Maybe you have purchased a few commercial fonts. One of the reasons to buy fonts is to have something that doesn’t look so common. This also makes the font fingerprint of your browser less common.

Moderate obscurity

Servers have to query whether particular fonts are installed. An obscure font would go a long way toward identifying you. But if a font is truly obscure, the server isn’t likely to ask whether it’s installed. So the greatest privacy risk comes from moderately uncommon fonts [1].
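Here is a minimal sketch of why only probed fonts matter. The probe list, the device font sets, and the way the answers are condensed into a fingerprint are all hypothetical; real fingerprinting scripts run in the browser and differ in detail.

```python
import hashlib

# Hypothetical list of fonts a server chooses to ask about.
PROBE_FONTS = ["Arial", "Calibri", "Unifont", "Andika", "Inconsolata"]

def font_fingerprint(installed_fonts):
    """Condense the yes/no answers for each probed font into a short ID."""
    answers = "".join("1" if font in installed_fonts else "0"
                      for font in PROBE_FONTS)
    return hashlib.sha256(answers.encode()).hexdigest()[:16]

# Made-up devices for illustration.
device_a = {"Arial", "Calibri"}
device_b = {"Arial", "Calibri", "Inconsolata"}       # moderately uncommon font
device_c = {"Arial", "Calibri", "My Homemade Font"}  # obscure, never probed

print(font_fingerprint(device_a) != font_fingerprint(device_b))  # True
print(font_fingerprint(device_a) == font_fingerprint(device_c))  # True
```

The moderately uncommon font changes the fingerprint, while the truly obscure font goes unnoticed because nobody thought to ask about it.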

Advertising

Your browser fingerprint is probably unique, unless you have a brand new device, or you’ve made a deliberate effort to keep your fingerprint generic. So while a site may not know who you are, it can recognize whether you’ve been there before, and customize the content you receive accordingly. Maybe you’ve looked at the same product three times without buying, and so you get a nudge to encourage you to go ahead and buy.

(It’ll be interesting to see what effect the California Consumer Privacy Act has on this when it goes into effect at the beginning of next year.)

What about changes?

Since there’s more than enough information to uniquely identify your browser, fingerprints are robust to changes. Installing a new font won’t throw advertisers off your trail. If you still have the same monitor size, same geographic location, etc. then advertisers can update your fingerprint information to include the new font. You might even get an advertisement for more fonts if they infer you’re a typography aficionado.


[1] Except for a spearphishing attack. A server might check for the presence of fonts that, although uncommon in general, are likely to be on the target’s computer. For example, if someone wanted to detect my browser in particular, they know I have Mathematica fonts installed because I said so above. And they might guess that I have installed the Greek and Hebrew fonts I mentioned. They might also look for obscure fonts I’ve mentioned in the blog, such as Unifont, Andika, and Inconsolata.

3,000th blog post

I just saw that I’d written 2,999 blog posts, so that makes this one the 3,000th. About a year ago was the 10th anniversary, and Tim Hopper wrote his retrospective about my blog.

In addition to chronological blog posts, there are about 200 “pages” on the site, mostly technical notes. These include the most popular content on the site.

***

The following is a rambling discussion of things that have changed over the history of the blog.

I started out posting a little more than once a day on average. I’ve slowed down a little, but I still wrote about 300 posts over the last year. My posts are getting a little longer. They’re still fairly short, but a little longer than they used to be.

I’ve also started using more footnotes. That has worked out well, letting me write for two audiences at the same time: those who want a high-level overview and those who want technical details.

More people are reading from mobile devices, so I moved to a responsive design. I also use more SVG images because they look great across a variety of platforms.

HTTPS has become more common, and I switched to HTTPS a while back.

Google nearly killed RSS when it killed the most popular RSS reader. I announce most of my posts on Twitter because Twitter has sorta taken the place of RSS. You can still subscribe by RSS, and many do, but not as many as before the demise of Google Reader.

I’ve been writing more about privacy and security lately because I’m doing more work with privacy and security.

I was reluctant to start blogging, but I gave it a chance after several people exhorted me to do it. I was especially hesitant to allow comments, but the signal to noise ratio has been a pleasant surprise (aside from the millions of comments blocked by my spam filter). I’ve learned a lot from your feedback. I’ve met a lot of friends and clients through the blog and am very glad I started it.

The most low-key newsletter

My monthly newsletter is one of the most low-key ones around. It’s almost a secret. You can find it via the navigation menu if you look for it.

I won’t put a popup on my site cajoling you to subscribe, nor will I ask you to sign up before letting you read something I’ve written. My newsletter is just for people who want to read it.

I’ve been sending out a monthly newsletter for about four years. I’ve never sent out more than one email a month, though I might some day if I have a good reason to.

Each newsletter highlights a few of the most popular posts since the previous newsletter. I used to say a little about my business each month, or maybe something personal, but I quit when that got to be repetitive. I may go back to doing that occasionally if I have something to say that I believe you’ll find interesting.

Salting and stretching a password

This post will look at a progression of ways to store passwords, from naive to sophisticated.

Most naive: clear text

Storing passwords in plain text is the least secure thing a server could do. If this list is leaked, someone knows all the passwords with no effort.

Better: hash values

A better approach would be to run passwords through a secure hash function before storing them. The server would never store a password per se, only its hashed value. When someone enters a password, the server would check whether the hash of the password matches the stored hashed value. If the list of hashed values is leaked, it will take some effort to recover the original passwords, though maybe not much effort, as I wrote about in my previous post.

Better still: add salt

The next step in sophistication is to append a random value, called a salt, to the password before applying the hash function. The server would store each user’s salt value and the hash of the password with the salt added.

Now if the user has a naive password like “qwerty” you couldn’t crack the password by doing a simple Google search. You can find the hash value of “qwerty” by searching, but not the hash value of qwerty + random salt, because although the former is common the latter is probably unique. You could still crack the password if the hash is insecure, but it would take a little effort.
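A minimal sketch of salted storage follows, assuming SHA-256 as the hash and a 16-byte random salt. The function names are hypothetical, and this illustrates the idea rather than serving as a recommendation for production use.

```python
import hashlib
import hmac
import os

def make_record(password: str):
    """Return the (salt, hash) pair a server would store for a new password."""
    salt = os.urandom(16)  # random salt, stored alongside the hash
    digest = hashlib.sha256(password.encode() + salt).digest()
    return salt, digest

def check_password(attempt: str, salt: bytes, digest: bytes) -> bool:
    """Hash the attempt with the stored salt and compare to the stored hash."""
    candidate = hashlib.sha256(attempt.encode() + salt).digest()
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt1, digest1 = make_record("qwerty")
salt2, digest2 = make_record("qwerty")

print(digest1 != digest2)                         # same password, different stored hashes
print(check_password("qwerty", salt1, digest1))   # True
print(check_password("qwertz", salt1, digest1))   # False
```

Two users with the same weak password end up with different stored values, so a precomputed table of common password hashes no longer helps.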

Better still: key stretching

Suppose an attacker has a list of salt values and corresponding hash values for salt + password. They could guess passwords, hashing each with a salt value, to see if any hash values match. They would start with the most common passwords and probably get some matches.

Key stretching is a way to make this brute force search more time consuming by requiring repeated hashing. In the following stretching algorithm, p is the password, s is the salt, h is the hash function, and || means string concatenation.

\begin{align*}
x_0 &= \phi \\
x_i &= h(x_{i-1} || p || s) \text{ for } i = 1, \ldots, r \\
K &= x_r
\end{align*}

Now the time required to test each password has been multiplied by r. The idea is to pick a value of r that is affordable for legitimate use but expensive for attacks.
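The recurrence above translates directly into a short loop. This sketch assumes that h is SHA-256 and that the starting value x_0 is the empty string; the function name stretch and the example inputs are made up.

```python
import hashlib

def stretch(password: bytes, salt: bytes, r: int) -> bytes:
    """Compute K = x_r where x_i = h(x_{i-1} || p || s), with h = SHA-256."""
    x = b""  # x_0, assumed here to be the empty string
    for _ in range(r):
        x = hashlib.sha256(x + password + salt).digest()
    return x

# Example inputs, for illustration only.
key = stretch(b"qwerty", b"some random salt", r=100_000)
print(key.hex())
```

With r = 1 this reduces to a single salted hash, and each guess an attacker tests costs r hash evaluations. In practice one would more likely reach for a vetted key derivation function such as hashlib.pbkdf2_hmac in the Python standard library than hand-roll the loop.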

Please update your RSS subscription


When I started this blog I routed my RSS feed through Feedburner, and now Feedburner is no longer working for my site.

If you subscribed by RSS, please check the feed URL. It should be

https://www.johndcook.com/blog/feed

which was previously forwarded to a Feedburner URL. If you subscribe directly to the feed with my domain, it should work. I had some problems with it earlier, but I believe they’ve been fixed.

Also, if you subscribed by email, you went through Feedburner. You won’t receive any more posts by email that way. I’m considering possible replacements.

In the meantime, you can sign up for monthly highlights from the blog via email. Each month I highlight the three or four most popular posts. Once in a while I may add a paragraph about what I’m up to. Short and sweet.

You can also find me on Twitter.

Update: Maybe Feedburner is working after all. I’m not sure what’s going on. Still, it seems Feedburner is hanging on by a thread. The steps above are a good way to prepare for when it goes away.

Flattest US states

US physical map

I read somewhere that, contrary to popular belief, Kansas is not the flattest state in the US. Instead, Florida is the flattest, and Kansas was several notches further down the list.

(Update: Nevertheless, Kansas is literally flatter than a pancake. Thanks to Andreas Krause for the link.)

How would you measure the flatness of a geographic region? The simplest approach would be to look at the range of elevation from the highest point to the lowest point, so that’s what I did. Here are elevations and differences for each state, measured in feet.

|----------------+-------+------+-------|
| State          |  High |  Low |  Diff |
|----------------+-------+------+-------|
| Florida        |   345 |    0 |   345 |
| Delaware       |   450 |    0 |   450 |
| Louisiana      |   535 |   -7 |   542 |
| Mississippi    |   807 |    0 |   807 |
| Rhode Island   |   814 |    0 |   814 |
| Indiana        |  1257 |  322 |   935 |
| Illinois       |  1237 |  279 |   958 |
| Ohio           |  1549 |  456 |  1093 |
| Iowa           |  1670 |  479 |  1191 |
| Wisconsin      |  1952 |  581 |  1371 |
| Michigan       |  1982 |  571 |  1411 |
| Missouri       |  1772 |  230 |  1542 |
| Minnesota      |  2303 |  600 |  1703 |
| New Jersey     |  1804 |    0 |  1804 |
| Connecticut    |  2382 |    0 |  2382 |
| Alabama        |  2405 |    0 |  2405 |
| Arkansas       |  2756 |   56 |  2700 |
| North Dakota   |  3507 |  751 |  2756 |
| Pennsylvania   |  3215 |    0 |  3215 |
| Kansas         |  4042 |  679 |  3363 |
| Maryland       |  3363 |    0 |  3363 |
| Massachusetts  |  3491 |    0 |  3491 |
| South Carolina |  3563 |    0 |  3563 |
| Kentucky       |  4144 |  256 |  3888 |
| Vermont        |  4396 |   95 |  4301 |
| Nebraska       |  5427 |  840 |  4587 |
| West Virginia  |  4865 |  240 |  4625 |
| Oklahoma       |  4977 |  289 |  4688 |
| Georgia        |  4787 |    0 |  4787 |
| Maine          |  5269 |    0 |  5269 |
| New York       |  5348 |    0 |  5348 |
| Virginia       |  5732 |    0 |  5732 |
| South Dakota   |  7247 |  968 |  6279 |
| New Hampshire  |  6293 |    0 |  6293 |
| Tennessee      |  6647 |  177 |  6470 |
| North Carolina |  6690 |    0 |  6690 |
| Texas          |  8753 |    0 |  8753 |
| New Mexico     | 13169 | 2844 | 10325 |
| Wyoming        | 13812 | 3100 | 10712 |
| Montana        | 12808 | 1801 | 11007 |
| Colorado       | 14439 | 3314 | 11125 |
| Oregon         | 11247 |    0 | 11247 |
| Utah           | 13527 | 2001 | 11526 |
| Idaho          | 12667 |  712 | 11955 |
| Arizona        | 12631 |   69 | 12562 |
| Nevada         | 13146 |  479 | 12667 |
| Hawaii         | 13806 |    0 | 13806 |
| Washington     | 14419 |    0 | 14419 |
| California     | 14505 | -282 | 14787 |
| Alaska         | 20335 |    0 | 20335 |
|----------------+-------+------+-------|

By the measure used in the table above, Florida is about 10 times flatter than Kansas.
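The ranking and the ratio can be reproduced from the table. The sketch below uses only a handful of rows, copied from the table above.

```python
# (high, low) elevations in feet for a few states, copied from the table.
elevations = {
    "Florida":   (345, 0),
    "Delaware":  (450, 0),
    "Louisiana": (535, -7),
    "Kansas":    (4042, 679),
    "Alaska":    (20335, 0),
}

# Flatness measure: highest point minus lowest point.
diff = {state: high - low for state, (high, low) in elevations.items()}
ranked = sorted(diff, key=diff.get)  # flattest first

print(ranked)
print(round(diff["Kansas"] / diff["Florida"], 2))  # 9.75
```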

The centroid of the continental US is in Kansas, and if you look at the center of the map above, you’ll see that there’s an elevation change across Kansas.

If I remember correctly, the source I saw said that Kansas was something like 9th, though in the table above it’s 20th. Maybe that source measured flatness differently. If you had a grid of measurements throughout each state, you could compute something like a Sobolev norm that accounts for gradients.
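Here is a rough sketch of what a gradient-based measure might look like: the mean squared finite difference between neighboring grid points, in the spirit of a discrete Sobolev seminorm. The two grids are made-up numbers, not real elevation data, and this is only one of many possible definitions.

```python
def gradient_energy(grid, spacing=1.0):
    """Mean squared finite-difference slope between neighboring grid points."""
    rows, cols = len(grid), len(grid[0])
    total, pairs = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:  # neighbor in the next row
                total += ((grid[i + 1][j] - grid[i][j]) / spacing) ** 2
                pairs += 1
            if j + 1 < cols:  # neighbor in the next column
                total += ((grid[i][j + 1] - grid[i][j]) / spacing) ** 2
                pairs += 1
    return total / pairs

# Two made-up regions with the same 300 ft range from high to low.
ramp = [[0, 100, 200, 300],
        [0, 100, 200, 300]]   # steady tilt
cliff = [[0, 0, 300, 300],
         [0, 0, 300, 300]]    # flat, then an abrupt step

print(gradient_energy(ramp))   # 6000.0
print(gradient_energy(cliff))  # 18000.0
```

Both regions tie under the high-minus-low measure, but the steadily tilted one scores as flatter here, which may be closer to what people mean when they call a gently sloping state flat.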