A statistical problem with “nothing to hide”

One problem with the nothing-to-hide argument is that it assumes innocent people will be exonerated certainly and effortlessly. That is, it assumes that there are no errors, or if there are, they are resolved quickly and easily.

Suppose the probability of correctly analyzing an email or phone call is not 100% but 99.99%. In other words, there’s one chance in 10,000 of an innocent message being flagged as incriminating. Imagine authorities analyzing one message each from 300,000,000 people, roughly the population of the United States. Then around 30,000 innocent people will have some ‘splaining to do. They will have to interrupt their dinner to answer questions from an agent knocking on their door, or maybe they’ll spend a few weeks in custody. If the legal system is 99.99% reliable, then three of them will go to prison.
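As a rough check of that arithmetic, here is a minimal sketch. The variable names are made up for illustration, and the results are expected counts, assuming every message is analyzed with the same error rate.

```python
from fractions import Fraction

population = 300_000_000                    # one message per person
false_positive_rate = Fraction(1, 10_000)   # analysis is 99.99% accurate
legal_error_rate = Fraction(1, 10_000)      # legal system is 99.99% reliable

# Expected number of innocent people whose message is flagged.
falsely_flagged = population * false_positive_rate

# Expected number of those who are then wrongly convicted.
wrongly_imprisoned = falsely_flagged * legal_error_rate

print(falsely_flagged)     # 30000
print(wrongly_imprisoned)  # 3
```

Exact fractions are used here only to avoid floating-point noise in such small rates.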

Now suppose false positives are really rare, one in a million. If you analyze 100 messages from each person rather than just one, you’re approximately back to the scenario above.
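To see why this is only approximately the same scenario, here is a sketch that counts people with at least one false flag. It assumes errors on different messages are independent, which is a simplification of mine, and the function name is hypothetical.

```python
import math

def expected_people_flagged(population, messages_per_person, false_positive_rate):
    """Expected number of innocent people with at least one falsely flagged message."""
    # P(at least one flag) = 1 - (1 - p)^n, computed in a way that stays
    # accurate when p is tiny.
    p_any_flag = -math.expm1(messages_per_person * math.log1p(-false_positive_rate))
    return population * p_any_flag

one_message = expected_people_flagged(300_000_000, 1, 1 / 10_000)
hundred_messages = expected_people_flagged(300_000_000, 100, 1 / 1_000_000)

print(round(one_message))
print(round(hundred_messages))
```

The second figure comes out slightly under 30,000 because a few unlucky people are flagged more than once, but the difference is negligible at these rates.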

Scientists call indiscriminately looking through large amounts of data “a fishing expedition” or “data dredging.” One way to mitigate the problem of massive false positives from data dredging is to demand a hypothesis: before you look through the data, say what you’re hoping to prove and why you think it’s plausible.

The legal analog of a plausible hypothesis is a search warrant. In statistical terms, “probable cause” is a judge’s estimation that the prior probability of a hypothesis is moderately high. Requiring scientists to have a hypothesis and requiring law enforcement to have a search warrant both dramatically reduce the number of false positives.
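The effect of a higher prior can be sketched with Bayes’ rule. The priors and the 99% detection rate below are hypothetical numbers chosen for illustration, not estimates of any real program; only the 1-in-10,000 false positive rate comes from the example above.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Bayes' rule: probability that a flagged person is actually guilty."""
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical inputs: one-in-a-million prior for a random person,
# versus a "moderately high" prior of 30% after probable cause.
dragnet = posterior_probability(prior=1e-6, sensitivity=0.99, false_positive_rate=1e-4)
warrant = posterior_probability(prior=0.3, sensitivity=0.99, false_positive_rate=1e-4)

print(f"no prior suspicion:    {dragnet:.4f}")
print(f"moderately high prior: {warrant:.4f}")
```

Under these assumed inputs, a flag raised without prior suspicion points to a guilty person less than 1% of the time, while a flag raised after probable cause is almost always correct. The same test gives very different answers depending on who it is applied to.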

Related:

You commit three felonies a day
You do too have something to hide

 

36 thoughts on “A statistical problem with ‘nothing to hide’”

  1. I often say what could be considered politically fiery (nothing of a terrorist nature) on the net. If they were to investigate me, they would soon realize I’m not a threat to anyone. Just exercising rational, objective, capitalist free speech. They would probably let me go. Still not keen on it though. I’m paranoid enough of unwarranted persecution as it is.

    Good article.

  2. James: I know someone who would describe himself the same way you do. His web site was shut down and his servers were confiscated.

  3. Yikes. Hopefully they didn’t find anything. They wouldn’t find anything with me. They’d be wasting their time. Now that I said that, they will probably investigate me. Maybe I just shot myself in the foot.

  4. Yeah, but you’re only looking at the false positives.

    What about the false negatives? If the “answer” is information in the data and you do not look at it, then your false negative rate is 100%. And what is the cost of a false negative? Three dead and hundreds injured in the latest attack in the US. A decapitated soldier in London. We don’t know the balance. They say attacks have been prevented.

    There is always a cost balance between the false positives and false negatives, in every experiment.

    I don’t say this lightly. I am probably under much higher scrutiny than you are because someone in my family has security clearance. I would be surprised if they WEREN’T tracking my email and phone calls. But when you sign up to serve your country that is a price you pay.

    I expect the cost is not borne evenly. 300 million are not under the same scrutiny. Americans calling their moms in Islamic countries are obviously much more likely to be a false positive than Americans who do not have overseas contacts. Most have nothing to worry about. A few have reason to be worried.

    So there is a price of safety, and a cost of freedoms. I drove out of Boston on marathon day with my son in the back seat as the police and ambulances drove in, sirens blaring. I am not saying I know the answer, but I don’t think it makes sense to leave out half of the equation.

  5. Sara: I’m only looking at one small slice of a large problem. I’m saying that the innocent person does have something to lose from surveillance.

  6. > So there is a price of safety, and a cost of freedoms.

    How much safety, though, for how much restriction of freedom? Consider the overall risk we face from terrorism. Imagine, just to take a completely absurdly exaggerated scenario, that there was a 9/11-sized attack every single year and we did absolutely nothing about it.

    That would be a death rate of about 3,000 people a year from terrorist attacks. (In the real world, there hasn’t been a year after 2001 in which the terrorism death rate inside the US was greater than 20. But I’m assuming a completely unrealistic level of complacency.)

    3,000 is less than a tenth of the number of Americans who die every year from exposure to secondhand smoke, about a tenth of the number who die in traffic accidents, and about a third of the number who die from gunshot wounds.

    We do have laws designed to protect us from all of those things, and they do limit our freedom to do what we want in public; but we don’t routinely trawl emails or tap phones to prevent them.

  7. I have argued above that we have to consider uncertainty when evaluating the cost of surveillance. One cost is lives interrupted by false accusations.

    The other side of the argument is to consider uncertainty in evaluating the benefits of surveillance. When people say they’re willing to abandon their civil liberties in order to prevent terrorist attacks, we have to evaluate the probability that such a goal would be achieved. That’s a subject for another post.

    But all this is a moot point if surveillance is illegal. Reading someone’s email without a search warrant is a flagrant violation of the Fourth Amendment. So it seems the debate should properly be over whether we should amend the Constitution to allow surveillance of a kind currently prohibited by law because we believe the benefits outweigh the costs.

  8. > How much safety, though, for how much restriction of freedom?

    I would gladly let anyone read my phone bill to decrease the probability of my son having his legs blown off by 0.001%, or really at all, because I couldn’t care less who reads my phone bill and I would like my son to keep his legs.

    But this is not decision to make. I do not know what the real true and false positive numbers that John describes are. The nature of this information means that to be effective it can only be known by a few. Also, I have not been elected senator.

    Edward Snowden made the decision to compromise this program, not only substituting his own subjective judgement for the subjective judgement of our democratically elected officials, but also doing so without having all of the information you need to correctly make such a decision.

    And that is the statistical argument for why Edward Snowden is an asshat.

  9. The false negative angle supports John’s argument. It says nothing less than that you simply won’t get all the real perpetrators, hence you still have to live in fear of terrorists gassing or bombing you on the train to work. Luckily there are far fewer terrorists than innocents. Unfortunately for analysis, that means there simply isn’t enough known structural data to evaluate all the e-mails, phone conversations and so on. Not to forget that those presumably targeted, that is, real terrorists, not any ol’ Muslim, are hiding; they only advertise after the fact.

    BTW, Benjamin Franklin wasn’t in favour of even one falsely sentenced to get safety: He that would exchange liberty for temporary safety deserves neither liberty nor safety.

  10. I think Sara is mistakenly assuming there’s little-to-no benefit from privacy. I’d argue that almost every American implicitly values it highly. It simply takes a line of questioning like, “do you close your bedroom blinds when you’re intimate with your spouse?” for someone to realize that privacy != illegality and privacy is valuable.
    Saying, “I’d gladly give up X privacy, in exchange for more security” assumes that all Americans value privacy as little as you do.

  11. Statistical thinking really helps here because you assume tests are never perfectly reliable. You always talk about the probabilities of false positives and false negatives together. No one will take you seriously if you discuss one without the other. Decision theory is even better because there you explicitly assign a cost to each kind of error.

    Political discussions are not that way. People implicitly assume perfection on one side or the other, in this case that all innocent people will be exonerated or that all terrorists will be stopped. They’ll discuss benefit without cost, or cost without benefit, but not a tradeoff between both.

  12. Sorry William, I have a typo there.

    It should say “But this is not MY decision to make.”

    And by the same logic, Snowden saying, “I’d gladly give up X security, in exchange for more privacy” assumes that all Americans value security as little as he does.

    And this is not his decision to make.

  13. This reminds me of an article I read a while ago that pretty much said never ever talk to cops unless they have a warrant. The reason was the cops will never use what they get from you in court for your defense. But they are free to use any circumstantial thing to help corroborate a theory they have that you are the perp.

    You literally have nothing to gain from cooperating. Add to that that it is perfectly legal for them to keep wording things slightly differently and play other word games to try to “catch you” in a lie. Then when you try to correct your misstatement they can just come back with “Oh, he realized his mistake, but it was too late; he already told ‘the truth.’”

    There is a cost to security. In the case of airports, security probably costs more per year of increased screening than all the lives lost on 9/11. Did it prevent more attacks? Quite possibly, but that doesn’t make the 30 minutes per flight that everyone pays free; it just shifts the cost from people living or working in high-value targets to everyone who flies (as presumably jihadis are going to crash into something “meaningful”).

  14. I thought this issue was settled, at least in broad strokes, in 1791 when the Bill of Rights was adopted.

  15. > And what is the cost of a false negative? Three dead and hundreds injured in the latest attack in the US. A decapitated soldier in London. We don’t know the balance. They say attacks have been prevented.

    Sara, I’m afraid your argument isn’t very persuasive, because terrorism would have to be pretty darned significant to meaningfully impact the death statistics. For instance, you’re almost 10x more likely to get killed by a *law enforcement officer* tomorrow than you are to get killed by a terrorist. This is the group of people that we now know are watching everything you do and say, and that’s only the statistical likelihood that they’ll just kill you outright (not necessarily with malice, certainly, but kill you nonetheless).

    This doesn’t even consider the likelihood they’ll just use their significant *legal* powers to arrest and detain without due process. Why aren’t you afraid of them?

  16. Two examples from history:
    1) During the Ford administration, the gov’t proposed giving swine flu vaccinations to everyone to prevent the deaths from the predicted swine flu epidemic. Fortunately, the death rate from the vaccination was known and it was shown that the massive vaccination program would almost certainly kill more than the possible flu epidemic. This program was cancelled.
    2) During the 1920s a lot of college-aged kids joined the Communist Party because it was fashionable and maybe a good way to meet members of the opposite sex. 30 years later, many of them had to explain to the government why their signatures were on those cards they had signed and forgotten about decades ago.

    Reacting to noise or false signal can and does result in costs, disruptions, and even death counts greater than would be incurred by doing nothing at all.

  17. Richard’s history on 1976 swine flu vaccination is wrong on all counts. The concern was the similarity of the flu strain to the one that killed 500,000 Americans in 1918-19 and perhaps 15-20 million around the world. The vaccine was developed and was administered to millions of Americans, but the flu strain died out rather than becoming pandemic. The vaccine did, probably, have a serious side effect. Many attribute approximately 500 cases of Guillain-Barré syndrome, including 25 deaths, to the vaccine. A consultant working with us was one of those cases. His goth-girl-engineer employee took over and ran his small consulting firm for the 3-4 months that it took him to recover from the paralysis induced by G-B.

  18. Sara: Allowing a government full surveillance of society is bad for many reasons. I will list some below.

    (1) Surveillance prohibits the liberty of choosing lifestyles.
    In a democracy the interests of a minority must be protected from the majority. For instance, the act of being gay, and many other styles of life, is frowned upon in many places. Knowing that they are being watched, people from such minorities will have their quality of life lowered. We do not want a society which prohibits diverse ways of living.

    (2) Surveillance breeds distrust between the people and their servants, the government. We, the citizens, have not opted in for being watched. Not everyone trusts the government, for a lot of different reasons. That is why we have the media to watch the government. In real democracies the citizens decide what the government should do for them. The government as idea and establishment is based upon trust, and being chosen by the people. That the government is secretly spying upon the entirety of its own citizens is a breach of this trust. Which really means that the citizens should choose themselves a new government.

    (3) Surveillance may be used maliciously. Information gathered by the people working for the government may be misused against people who are actually innocent by law, for example for political purposes; see https://en.wikipedia.org/wiki/Watergate_scandal

    (4) Information gathered by surveillance may go astray. It is important for many people to keep patient journals and other sensitive data private from certain other individuals (e.g. employers, possible romances, others). Spending the weekend at a gay bar, a comic convention, or a country music festival should be information that one chooses to disclose to whomever one wants to get this information.

    (5) Surveilling the entire population will have the effect of shifting the jurisdictional principle of “innocent until proven guilty” toward “guilty until proven otherwise”. Especially since the act of -planning- to do terrorism is now a crime, this is a big jurisdictional problem.

    (6) It should be the right of a citizen to have whatever strange friends he or she chooses to have, without fear of being spied upon or discriminated against. Also, being with the wrong people at the wrong time is even more dangerous since, again, just planning to do terrorism is a crime.

    There are probably lots of other good reasons for being very skeptical toward being spied upon by the government as well, including unforeseen consequences.

  19. @Sara:
    > Yeah, but you’re only looking at the false positives.

    > What about the false negatives? If the “answer” is information in the data and you do not look at it then your false negatives is 100%. And what is the cost of a false negative? Three dead and hundreds injured in the latest attack in the US. A decapitated soldier in London. We don’t know the balance. They say attacks have been prevented.

    Let’s say real terrorists are correctly flagged with a probability of 99.99% (a false negative rate of 0.01%), while innocent people are correctly cleared with a probability of 99.9% (a false positive rate of 0.1%). That is, you get nearly all terrorists and put up with getting some innocent people interrogated. Let’s also say that 0.01% of the population are terrorists (a wild exaggeration).

    Then you get: 30,000 false positives (0.1% of ~300,000,000), ten times the number of the 3,000 terrorists in total, while you will miss 0.3 (0 to 1) of the terrorists.

  20. Oops, my numbers are off by a factor of 10, so 300,000 false positives vs. 30,000 terrorists total and 3 false negatives.

  21. Thomas, you have no idea what the false positive and false negative rates are and don’t have the data to reasonably estimate them.

    And this is a dumb statistical model anyway because it assumes everyone is targeted equally which has no bearing on reality.

    The false positives will disproportionately target people who talk to people overseas.

    The false negatives will disproportionately target people who volunteer to work as soldiers or first responders, like Officer Collier or the 403 first responders who died on 9/11.

    The people worried someone is reading their phone bill are unlikely to be in either category, as the middle class generally limit their fighting for the freedom to fighting on the internet rather than actually fighting.

    Since the actual chances of being falsely targeted are dominated by prior likelihoods, a Bayesian model would make more sense.

  22. > The false positives will disproportionately target people who talk to people overseas.

    Otherwise known as discrimination via racial profiling.

  23. Discrimination via racial profiling: this is a tough one. Yeah, people of different races might be more likely to call overseas, but if calls overseas are seen as an (albeit weak) predictor of terrorist intent, it isn’t the race but the actions that are being discriminated on.

    Police do this all the time: park your car in an area known for drug activity, have a chat with a known prostitute, etc. Otherwise legal actions in the wrong context make you look suspicious. The problem with this bulk surveillance, to me, is that if you watch anyone long enough, they are bound to bump into someone or do something that would be suspicious. Filters need to be very narrowly defined so as to put the action into context: not just “sometime in your life X and oh, also some other time in your life Y” but more like “talked overseas 2 days before an IED went off in Baghdad to a known associate of Jerka Jerka Jihad, a known terrorist in Iraq”.

  24. Mike, your comment touches on something I plan to blog about: the difference between statistical evidence and legal evidence. Good statistical evidence may not be admissible legal evidence, and rightfully so.

  25. The algorithm the NSA describes is actually inherently unbiased. They take a phone number overseas suspected to belong to a terrorist and see what phone numbers in the US are calling that number. Race isn’t even in the Verizon data unless you specifically text mine surnames, but that would likely diminish the accuracy of the algorithm and there isn’t any data to suggest they are doing that.

  26. > They take a phone number overseas suspected to belong to a terrorist and see what phone numbers in the US are calling that number.

    1. How precisely do they ascertain that this phone number might belong to a terrorist?
    2. How do you even know this is how it works? This is perhaps how they are *alleging* it works, but there is a complete lack of transparency and oversight, so you’re just willing to take their word for it?

  27. @Jim: Thank you for correcting my Swine Flu info. I will have to stop using that as an example. I double-checked what you said and found that either my memory was wrong or, perhaps, I existed in an alternate time stream back in the ’70s.

    @Sara: Not looking at data does not result in 100% false negatives. To have a false negative (or positive) you must look at the data first.

    Lots of interesting discussion here.

  28. Sara: “I would gladly let anyone read my phone bill to decrease the probability of my son having his legs blown off”

    So, in order to incarcerate the criminals, we should allow the state to imprison 30k people? Is that the price you’d pay? What if one of them were your son, whom you wanted to protect in the first place? Would you “gladly let anyone” take him into custody, registering him as a terrorist or criminal, ruining his chances to get a job, an education, a proper family, and a free life, only to get stuck in prison for alleged crimes he did not commit? If your answer is yes, you need counsel. Not a single parent I know wants their kids prosecuted, especially when they’re freaking innocent. That’s why we, the people, allow certain criminals to walk away because of lack of evidence and certainty ‘beyond reasonable doubt’, just to make sure not a single innocent man has to spend decades of his only life in prison.

    Your argument undermines the very reason so many people died, though they did not want to either: to protect civil rights and liberties. That’s in the Constitution. It’s easy to forget about it when you haven’t shed blood to protect those liberties.
