Uncategorized

Riffing on mistakes

I mentioned on Twitter yesterday that one way to relieve the boredom of grading math papers is to explore mistakes. If a statement is wrong, what would it take to make it right? Is it approximately correct? Is there some different context where it is correct? Several people said they’d like to see examples, so this blog post is a sort of response.

***

One famous example of this is the so-called Freshman’s Dream theorem:

(a + b)^p = a^p + b^p

This is not true over the real numbers, but it is true, for example, when working with integers mod p.

(More generally, the Freshman’s Dream is true in any ring of characteristic p. This is more than an amusing result; it’s useful in applications of finite fields.)
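
A quick numerical check in Python (just a sketch; the prime p = 7 is an arbitrary choice):

```python
# Verify the Freshman's Dream (a + b)^p == a^p + b^p for integers mod p.
# The prime p = 7 is arbitrary; any prime works.
p = 7
for a in range(p):
    for b in range(p):
        assert pow(a + b, p, p) == (pow(a, p, p) + pow(b, p, p)) % p
print(f"(a + b)^{p} == a^{p} + b^{p} holds mod {p} for all residues a and b")
```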

***

A common misunderstanding in calculus is that a series converges if its terms converge to zero. The canonical counterexample is the harmonic series: its terms converge to zero, but the sum diverges.
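
Here is a small numerical sketch of the real-number situation: the terms 1/n go to zero, but the partial sums keep climbing, roughly like log n.

```python
# Harmonic series: terms 1/k go to zero, but partial sums grow without bound,
# tracking log(n) + 0.5772... (the Euler-Mascheroni constant).
import math

for n in (10**2, 10**4, 10**6):
    partial = sum(1 / k for k in range(1, n + 1))
    print(n, round(partial, 4), round(math.log(n) + 0.5772, 4))
```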

But this can’t happen in the p-adic numbers. There, if the terms of a series converge to zero, the series converges (though maybe not absolutely).

***

Here’s something sorta along these lines. It looks wrong, and someone might arrive at it via a wrong understanding, but it’s actually correct.

sin(x – y) sin(x + y) = (sin(x) – sin(y)) (sin(x) + sin(y))
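
If you’d like to convince yourself, here’s a quick spot check at a few arbitrary points:

```python
# Spot-check the identity sin(x - y) sin(x + y) = (sin x - sin y)(sin x + sin y)
# at a few arbitrary points.
from math import sin, isclose
from random import uniform

for _ in range(5):
    x, y = uniform(-10, 10), uniform(-10, 10)
    lhs = sin(x - y) * sin(x + y)
    rhs = (sin(x) - sin(y)) * (sin(x) + sin(y))
    assert isclose(lhs, rhs, abs_tol=1e-12)
print("identity holds at all sampled points")
```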

***

Odd integers end in odd digits, but that might not be true if you’re not working in base 10. See Odd numbers in odd bases.
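
A small sketch in base 3 (the helper function is just for illustration): 5 is odd, but written in base 3 it is 12, which ends in an even digit.

```python
# In base 10 (or any even base) odd numbers end in odd digits,
# but in an odd base like 3 that's no longer true: 5 is "12" in base 3.
def last_digit(n, base):
    return n % base  # last digit of n written in the given base

odd_even_ending = [n for n in range(1, 30, 2) if last_digit(n, 3) % 2 == 0]
print(odd_even_ending)  # [3, 5, 9, 11, 15, 17, 21, 23, 27, 29]
```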

***

You can misunderstand how percentages work but still get a useful result. See Sales tax included.

***

When probabilities are small, you can often get by with adding them together even when strictly speaking they don’t add. See Probability mistake can make a good approximation.
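
A one-line illustration, using two arbitrary small probabilities for independent events: the exact probability that at least one occurs is 1 − (1 − p)(1 − q), and the simple sum p + q misses it only by pq.

```python
# For small independent events, P(A or B) = p + q - p*q, so the naive sum p + q
# overestimates by only p*q.
p, q = 0.001, 0.002          # arbitrary small probabilities
exact = 1 - (1 - p) * (1 - q)
approx = p + q
print(exact, approx, approx - exact)   # error is p*q = 2e-06
```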

Scaling up differential privacy: lessons from the US Census

The paper Issues Encountered Deploying Differential Privacy describes some of the difficulties the US Census Bureau has run into while deploying differential privacy for the 2020 census. It’s not surprising that they would have difficulties. It’s surprising that they would even consider applying differential privacy on such an enormous scale.

If your data project is smaller than the US Census, you can probably make differential privacy work.

US Census Bureau embraces differential privacy

The US Census Bureau is convinced that traditional methods of statistical disclosure limitation have not done enough to protect privacy. These methods may have been adequate in the past, but it no longer makes sense to implicitly assume that those who would like to violate privacy have limited resources or limited motivation. The Bureau has turned to differential privacy for quantifiable privacy guarantees that are independent of the attacker’s resources and determination.

John Abowd, chief scientist for the US Census Bureau, gave a talk a few days ago (March 4, 2019) in which he discussed the need for differential privacy and how the Bureau is implementing it for the 2020 census. One remark from the talk:

“Absolutely the hardest lesson in modern data science is the constraint on publication that the fundamental law of information recovery imposes. I usually call it the death knell for traditional methods of publication, and not just in statistical agencies.”

Congress and the Equifax data breach

Dialog from a congressional hearing February 26, 2019.

Representative Katie Porter: My question for you is whether you would be willing to share today your social security, your birth date, and your address at this public hearing.

Equifax CEO Mark Begor: I would be a bit uncomfortable doing that, Congresswoman. If you’d so oblige me, I’d prefer not to.

KP: Could I ask you why you’re unwilling?

MB: Well that’s sensitive information. I think it’s sensitive information that I like to protect, and I think consumers should protect theirs.

KP: My question is then, if you agree that exposing this kind of information, information like that you have in your credit reports, creates harm, therefore you’re unwilling to share it, why are your lawyers arguing in federal court that there was no injury and no harm created by your data breach?

Miscellaneous

Image editor

Image editing software is complicated, and I don’t use it often enough to remember how to do much. I like Paint.NET on Windows because it is in a sort of sweet spot for me, more powerful than Paint and much less complicated than Photoshop.

I found out there’s a program Pinta for Linux that was inspired by Paint.NET. (Pinta runs on Windows, Mac, and BSD as well.)

Exponential sum of the day

I have a page that draws a different image every day, based on putting the month, day, and the last two digits of the year into an exponential sum. This year’s images have been more intricate than last year’s because 19 is prime.

I liked today’s image.

exponential sum for 2019-02-27

The page has a link to details explaining the equation behind the image, and an animate link to let you see the sequence in which the points are traversed.
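
The details link has the exact equation, so take the following only as a sketch of the idea: I’m assuming the image comes from plotting partial sums of exp(2πi(n/m + n²/d + n³/y)), where m, d, and y are the month, day, and last two digits of the year.

```python
# Sketch: plot partial sums of exp(2 pi i (n/m + n^2/d + n^3/y)) for 2019-02-27.
# The form of the sum is my assumption; the linked details page has the exact equation.
import cmath, math
import matplotlib.pyplot as plt

m, d, y = 2, 27, 19                 # month, day, last two digits of the year
N = math.lcm(m, d, y)               # the terms repeat with this period
z, points = 0, [0]
for n in range(1, N + 1):
    z += cmath.exp(2j * cmath.pi * (n / m + n**2 / d + n**3 / y))
    points.append(z)

plt.plot([p.real for p in points], [p.imag for p in points])
plt.axis("equal")
plt.show()
```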

Podcast interview

Rebecca Herold posted a new episode of her podcast yesterday in which she asks me questions about privacy and artificial intelligence.

Entropy update

I updated my blog post on solving for probability from entropy because Sjoerd Visscher pointed out that a crude approximation I used could be made much more accurate with a minor tweak.

As a bonus, the new error plot looks cool.

approximation error on log scale

Newsletter

My monthly newsletter comes out tomorrow. This newsletter highlights the most popular blog posts of the month.

I used to say something each month about what I’m up to. Then I stopped because it got repetitive. Tomorrow I’ll include a few words about projects I have coming up.

The letter S

I was helping my daughter with physics homework last night and she said “Why do they use s for arc length?!” I said that I don’t know, but that it is conventional.

By the way, this section heading is a reference to Donald Knuth’s essay The Letter S where he writes in delightful Knuthian detail about the design of the letter S in TeX. You can find the essay in his book Literate Programming.

What sticks in your head

This morning I read an article by Dennis Felsing about his impressive/intimidating Linux desktop setup. He uses a lot of tools that are not the easiest way to get things done immediately but are long-term productivity investments.

Remembrance of syntax past

Felsing apparently is able to remember the syntax of scores of tools and programming languages. I cannot. Part of the reason is practice. I cannot remember the syntax of any software I don’t use regularly. It’s tempting to say that’s the end of the story: use it or lose it. Everybody has their set of things they use regularly and remember.

But I don’t think that’s all. I remember bits of math that I haven’t used in 30 years. Math fits in my head and sticks. Presumably software syntax sticks in the heads of people who use a lot of software tools.

There is some software syntax I can remember, however, and that’s software closely related to math. As I commented here, it was easy to come back to Mathematica and LaTeX after not using them for a few years.

Imprinting

Imprinting has something to do with this too: it’s easier to remember what we learn when we’re young. Felsing says he started using Linux in 2006, and his site says he graduated college in 2012, so presumably he was a high school or college student when he learned Linux.

When I was a student, my software world consisted primarily of Unix, Emacs, LaTeX, and Mathematica. These are all tools that I quit using for a few years, later came back to, and use today. I probably remember LaTeX and Mathematica syntax in part because I used them when I was a student. (I also think Mathematica in particular has an internal consistency that makes its syntax easier to remember.)

Picking your memory battles

I see the value in Felsing’s choice of tools. For example, the xmonad window manager. I’ve tried it, and I could imagine that it would make you more productive if you mastered it. But I don’t see myself mastering it.

I’ve learned a few tools with lots of arbitrary syntax, e.g. Emacs. But since I don’t have a prodigious memory for such things, I have to limit the number of tools I try to keep loaded in memory. Other things I load as needed, such as a language a client wants me to use that I haven’t used in a while.

Revisiting a piece of math doesn’t feel to me like revisiting a programming language. Brushing up on something from differential equations, for example, feels like pulling a book off a mental shelf. Brushing up on C# feels like driving to a storage unit, bringing back an old couch, and struggling to cram it in the door.

Middle ground

There are things you use so often that you remember their syntax without trying. And there are things you may never use again, and it’s not worth memorizing their syntax just in case. Then there are things in the middle: things you don’t use often enough to remember naturally, but often enough that you’d like to remember them deliberately. Some of these are what I call bicycle skills, things that you can’t learn just-in-time. For things in this middle ground, you might try something like Anki, a flashcard program with spaced repetition.

However, this middle ground should be very narrow, at least in my experience/opinion. For the most part, if you don’t use something often enough to keep it loaded in memory, I’d say either let it go or practice using it regularly.

More of everything

If you want your music to have more bass, more mid-range, and more treble, then you just want the music louder. You can increase all three components in absolute terms, but not in relative terms. You can’t increase the proportions of everything.

Would you like more students to major in STEM subjects? OK, what subjects would you like fewer students to major in? English, perhaps? Administrators are applauded when they say they’d like to see more STEM majors, but they know better than to say which majors they’d like to see fewer of.

We have a hard time with constraints.

I’m all for win-win, make-the-pie-bigger solutions when they’re possible. And often they are. But sometimes they’re not.

Microsoft replacing SHA-1

According to this article, Microsoft is patching Windows 7 and Windows Server 2008 to check SHA-2 hashes of updates. These older versions of Windows have been using SHA-1, while newer versions already use SHA-2.

This is a good move, but unnecessary. Here’s what I mean by that. The update was likely unnecessary for reasons I’ll explain below, but it was easy to do, and it increased consistency across Microsoft’s product line. It’s also good PR.

What are SHA-1 and SHA-2?

Let’s back up a bit. SHA-1 and SHA-2 are secure hash functions [1]. They take a file, in this case a Microsoft software update, and return a relatively small number, small relative to the original file size. In the case of SHA-1, the result is 160 bits (20 bytes).  They’re designed so that if a file is changed, the function value is nearly certain to change. That is, it’s extremely unlikely that a change to the file would not result in a change to the hash value.
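
Here’s a quick illustration using Python’s hashlib (the input string is made up): changing a single byte of the input gives a completely different digest.

```python
# Changing one byte of the input changes the hash completely.
import hashlib

update = b"contents of a software update"
tampered = b"Contents of a software update"    # one byte changed

print(hashlib.sha1(update).hexdigest())        # 160 bits = 40 hex characters
print(hashlib.sha256(update).hexdigest())      # SHA-2 family, 256 bits
print(hashlib.sha256(tampered).hexdigest())    # bears no resemblance to the previous line
```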

The concern isn’t accidental changes. The probability of accidentally producing two files with the same hash function value is tiny as I show here.

The concern is a clever attacker who could modify the software update in such a way that the hash function remains unchanged, bypassing the hash as a security measure. That would be harder to do with SHA-2 than with SHA-1, hence Microsoft’s decision years ago to move to SHA-2 for new versions of the operating system, and its recent decision to make the change retroactive.

How hard is it to produce collisions?

By a collision we mean two files that hash to the same value. It’s obvious from the pigeon hole principle [2] that collisions are possible, but how hard are they to produce deliberately?

Google demonstrated two years ago that it could produce two PDF files with the same SHA-1 hash value. But doing so required over 6,500 years of CPU time running in parallel [3]. Also, Google started with a file designed to make collisions possible. According to their announcement,

We started by creating a PDF prefix specifically crafted to allow us to generate two documents with arbitrary distinct visual contents, but that would hash to the same SHA-1 digest.

It would be harder to start with a specified input, such as a software update file, and generate a collision. It would be harder still to generate a collision that had some desired behavior.

According to this page, it’s known how to tamper with two files simultaneously so that they will have the same SHA-1 hash values. This is what Google did, at the cost of thousands of CPU years. But so far, nobody has been able to start with a given file and create another file with the same SHA-1 value.
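
To get a feel for the scale involved, here’s a toy sketch: brute-forcing a collision on a hash truncated to 32 bits takes on the order of 2¹⁶ tries (the birthday bound), while the same generic attack on the full 160-bit digest would take on the order of 2⁸⁰, which is why even Google’s attack, with its cryptanalytic shortcuts, still cost thousands of CPU years.

```python
# Toy birthday attack: find two random inputs whose SHA-1 digests agree
# in the first 32 bits. Expected work is roughly 2**16 hashes; for the full
# 160-bit digest the corresponding generic bound is about 2**80.
import hashlib, os

seen = {}
while True:
    msg = os.urandom(8)
    tag = hashlib.sha1(msg).digest()[:4]      # keep only the first 32 bits
    if tag in seen and seen[tag] != msg:
        print("collision:", seen[tag].hex(), msg.hex(), "->", tag.hex())
        break
    seen[tag] = msg
```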

As I said at the beginning, it made sense for Microsoft to decide to move from SHA-1 to SHA-2 because the cost of doing so was small. But the use of SHA-1 hash codes is probably not the biggest security risk in Windows 7.

[1] SHA-1 is a hash function, but SHA-2 is actually a family of hash functions: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. All are believed to provide at least 112 bits of security, while SHA-1 provides less than 63.

The SHA-x functions output x bits. The SHA-x/y functions use x bits of internal state and output y bits. To be consistent with this naming convention, SHA-1 should be called SHA-160.

[2] The pigeon hole principle says that if you put more than n things into n boxes, one of the boxes has to have more than one thing. If you hash files of more than n bits to n-bit numbers, at least two files have to go to the same value.

[3] If you were to rent this much CPU time in the cloud at 5 cents per CPU hour, it would cost about $2,800,000. If the only obstacle were the cost of computing resources, someone might be willing to pay that to tamper with a Microsoft update.

Supercookies

superhero cookies

Supercookies, also known as evercookies or zombie cookies, are like browser cookies in that they can be used to track you, but are much harder to remove.

What is a supercookie?

The way I first heard supercookies described was as a cookie that you can appear to delete, but as soon as you do, software rewrites the cookie. Like the Hydra from Greek mythology, cutting off a head does no good because it grows back [1].

This explanation is oversimplified. It doesn’t quite work that way.

A supercookie is not a cookie per se. It’s anything that can be used to uniquely identify your browser: font fingerprinting, flash cache, cached images, browser plugins and preferences, etc. Deleting your cookies has no effect because a supercookie is not a cookie.

However, a supercookie can work with other code to recreate deleted cookies, and so the simplified description is not entirely wrong. A supercookie could alert web sites that a cookie has been deleted, and allow those sites to replace that cookie, or update the cookie if some browser data has changed.

What about ‘Do Not Track’?

You can ask sites to not track you, but this works on an honor system and is ignored with impunity, even (especially?) by the best known companies.

Apple has announced that it is removing Do Not Track from its Safari browser because the feature is worse than useless. Servers don’t honor it, and it gives a false sense of privacy. Not only that, the DNT setting is one more bit that servers could use to identify you! Because only about 9% of users turn on DNT, knowing that someone has it turned on gives about 3.5 bits of information toward identifying that person.
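
The arithmetic behind that figure: observing an event of probability p conveys −log₂ p bits, so seeing DNT turned on, which about 9% of users do, is worth about 3.5 bits.

```python
# Bits of identifying information from seeing DNT turned on,
# assuming roughly 9% of users turn it on.
from math import log2

p = 0.09
print(-log2(p))   # about 3.47 bits
```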

How to remove supercookies

How do you remove supercookies? You can’t. As explained above, a supercookie isn’t a file that can be removed. It’s a procedure for exploiting a combination of data.

You could remove specific ways that sites try to identify you. You could, for example, remove Flash to thwart attempts to exploit Flash’s data, cutting off one head of the Hydra. This might block the way some companies track you, but there are others.

It’s an arms race. As fingerprinting techniques become well known, browser developers and users try to block them, and those intent on identifying you come up with more creative approaches.

The economics of identification

Given the effort companies put into identifying individuals (or at least their devices), it seems it must be worth it. At least companies believe it’s worth it, and for some it probably is. But there are reasons to believe that tracking isn’t as valuable as it seems. This article, for example, argues that the most valuable targeting information is freely given: you know who is interested in buying weighted blankets? People who search on weighted blankets!

There have been numerous anecdotal stories recently of companies that have changed their marketing methods in order to comply with GDPR and have increased their sales. These are only anecdotes, but they suggest that at least for some companies, there are profitable alternatives to identifying customers who don’t wish to be identified.

[1] In the Greek myth, cutting off one head of the Hydra caused two heads to grow back. Does deleting a supercookie cause it to come back stronger? Maybe. Clearing your cookies is another user behavior that can be used to fingerprint you.