From the monthly archives:

October 2008

Comparing two ways to fit a line to data

by John on October 20, 2008

Suppose you have a set of points and you want to fit a straight line to the data. I’ll present two ways of doing this, one that works better numerically than the other.

Given points (x1, y1), …, (xN, yN) we want to fit the best line of the form y = mx + b, best in the sense of minimizing the square of the differences between the actual data yi values and the predicted values mxi + b.

The textbook solution is the following sequence of equations.

The equations above are absolutely correct, but turning them directly into software isn’t the best approach. Given vectors of x and y values, here is C++ code to compute the slope m and the intercept b directly from the equations above.

	double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
	int n = x.size();
	for (int i = 0; i < n; ++i)
	{
		sx += x[i];
		sy += y[i];
		sxx += x[i]*x[i];
		sxy += x[i]*y[i];
	}
	double delta = n*sxx - sx*sx;
	double slope = (n*sxy - sx*sy)/delta;
	double intercept = (sxx*sy - sx*sxy)/delta;

This direct method often works. However, it presents multiple opportunities for loss of precision. First of all, the Δ term is the same calculation criticized in this post on computing standard deviation. If the x values are large relative to their variance, the Δ term can be inaccurate, possibly wildly inaccurate.

The problem with computing Δ directly is that it could potentially require subtracting nearly equal numbers. You always want to avoid computing a small number as the difference of two big numbers if you can. The equations for slope and intercept have a similar vulnerability.

The second method is algebraically equivalent to the method above, but it is better suited to numerical calculation. Here it is in code.

	double sx = 0.0, sy = 0.0, stt = 0.0, sts = 0.0;
	int n = x.size();
	for (int i = 0; i < n; ++i)
	{
		sx += x[i];
		sy += y[i];
	}
	for (int i = 0; i < n; ++i)
	{
		double t = x[i] - sx/n;
		stt += t*t;
		sts += t*y[i];
	}

	double slope = sts/stt;
	double intercept = (sy - sx*slope)/n;

This method involves subtraction, and it too can fail for some data sets. But in general it does better than the former method.

Here are a couple examples to show how the latter method can succeed when the former method fails.

First we populate x and y as follows.

	for (int i = 0; i < numSamples; ++i)
	{
		x[i] = 1e10 + i;
		y[i] = 3*x[i] + 1e10 + 100*GetNormal();
	}

Here GetNormal() is generator for standard normal random variables. Aside from some small change due to the random noise, we would expect slope 3 and intercept 1010.

The most direct method gives slope -0.666667 and intercept intercept 5.15396e+10. The better method gives slope: 3.01055 and intercept: 9.8945e+9. If we remove the Gaussian noise and try again, the direct method gives slope 3.33333 and intercept 5.72662e+9 while the better method is exactly correct: slope 3 intercept and 1e+10.

As another example, fill x and y as follows.

	for (int i = 0; i < numSamples; ++i)
	{
		x[i] = 1e6*i;
		y[i] = 1e6*x[i] + 50 + GetNormal();
	}

Now both method compute the intercept exactly as 1e6 but the direct method computes the intercept as 60.0686 while the second method computes 52.416. Neither is very close to the theoretical value of 50, but the latter method is closer.

To read more about why one way can be so much better than the other, see this writeup on the analogous issue with computing standard deviation: theoretical explanation for numerical results.

Update: See related post on computing correlation coefficients.

{ 1 comment }

Networks and power laws

by John on October 19, 2008

In many networks — social networks, electrical power networks, networks of web pages, etc.— the number of connections for each node follows a statistical distribution known as a power law. Here’s a model of network formation that shows how power laws can emerge from very simple rules.

Albert-László Barabási describes the following algorithm in his book Linked. Create a network by starting with two connected nodes. Then add nodes one at a time. Every time a new node joins the network, it connects to two older nodes at random with probability proportional to the number of links the old nodes have. That is, nodes that already have more links are more likely to get new links. If a node has three times as many links as another node, then it’s three times as attractive to new nodes. The rich get richer, just like with Page Rank.

Barabási says this algorithm produces networks whose connectivity distribution follows a power law. That is, the number of nodes with n connections is proportional to n-k. In particular he says k = 3.

I wrote some code to play with this algorithm. As the saying goes, programming is understanding. There were aspects of the algorithm I never would have noticed had I not written code for it. For one thing, after I decided to write a program I had to read the description more carefully and I noticed I’d misunderstood a couple details.

If the number of nodes with n connections really is proportional to n-k, then when you plot the number of nodes with 1, 2, 3, … connections, the points should fall near a straight line when plotted on the log-log scale and the slope of this line should be -k.

In my experiment with 10,000 nodes, I got a line of slope -2.72 with a standard error of 0.04. Not exactly -3, but maybe the theoretical result only holds in the limit as the network becomes infinitely large. The points definitely fall near a straight line in log-log scale:

In this example, about half the nodes (5,073 out of 10,000) had only two connections. The average number of connections was 4, but the most connected node had 200 connections.

Note that the number of connections per node does not follow a bell curve at all. The standard deviation on the number of connections per node was 6.2. This means the node with 200 connections was over 30 standard deviations from the average. That simply does not happen with normal (Gaussian) distributions. But it’s not unusual at all for power law distributions because they have such thick tails.

Of course real networks are more complex than the model given here. For example, nodes add links over time, not just when they join a network. Barabási discusses in his book how some elaborations of the original model still produce power laws (possibly with different exponents) and while others do not.

{ 2 comments }

42nd Carnival of Mathematics hosted here soon

by John on October 18, 2008

There was some confusion over when and where the next Carnival of Mathematics will be hosted. It will be here on Friday, October 24. Please help spread the word since the date is earlier than the originally announced date.

Please submit your best recent math posts via the Blog Carnival site.

A few previous Carnivals of Mathematics: 41, 40, 39

{ 1 comment }

Table-driven text munging in PowerShell

by John on October 17, 2008

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

param($regex_file)

$lines = get-content $regex_file

$a = get-clipboard

foreach( $line in $lines )
{
    $line = ($line.trim() -replace "\s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    $a = $a -replace $pair
}

out-clipboard $a

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is pass in as an argument to the script.

&  &amp;
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of left over manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.

{ 0 comments }

Manipulating the clipboard with PowerShell

by John on October 17, 2008

The PowerShell Community Extensions contain a couple handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

$a = get-clipboard;
$a = $a -replace "&", "&amp;";
$a = $a -replace "<", "&lt;";
$a = $a -replace ">", "&gt;";
$a = $a -replace '"', "&quot;"
$a = $a -replace "'", "&#39;"
out-clipboard $a

So this C++ code

double& x = y;
char c = 'k';
string foo = "hello";
if (p < q) ...

turns into this HTML code

double&amp; x = y;
char c = &#39;k&#39;;
string foo = &quot;hello&quot;;
if (p &lt; q) ...

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:http://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX; it has been assigned a work item.

{ 1 comment }

Default arguments and lazy evaluation in R

by John on October 16, 2008

In C++, default function arguments must be constants, but in R they can be much more general. For example, consider this R function definition.

    f <- function(a, b=log(a)) { a*b }

If f is called with two arguments, it returns their product. If f is called with one argument, the second argument defaults to the logarithm of the first. That’s convenient, but it gets more surprising. Look at this variation.

    f <- function(a, b=c) {c = log(a); a*b}

Now the default argument is a variable that doesn’t exist until the body of the function executes! If f is called with one argument, the R interpreter chugs along until it gets to the last line of the function and says “Hmm. What is b? Let me go back and see. Oh, the default value of b is c, and now I know what c is.”

This behavior is called lazy evaluation. Expressions are not evaluated unless and until they are needed. It’s a common feature of functional programming languages.

There’s a little bit of lazy evaluation in many languages: the second argument of a Boolean expression is often not evaluated if the expression value is determined by the first argument. For example, the function bar() in the expression

    while (foo() && bar())

will not be called if the function foo() returns false. But functional languages like R take lazy evaluation much further.

R was influenced by Scheme, though you could use R for a long time without noticing its functional roots. That speaks well of the language. Many people find functional programming too hard and will not use languages that beat you over the head with their functional flavor. But it’s intriguing to know that the functional capabilities are there in case they’re needed.

See also R for programmers.

{ 2 comments }

Simple legacy

by John on October 15, 2008

Benoit Mandelbrot makes the following observation in The Fractal Geometry of Nature.

Many creative minds overrate their most baroque works, and underrate the simple ones. When history reverses such judgments, prolific writers come to be best remembered as authors of “lemmas,” of propositions they had felt “too simple” in themselves and had to be published solely as preludes to forgotten theorems.

If you’re not familiar with lemmas and theorems, think of a musician who is famous for a short prelude written as an introduction to a longer piece nobody remembers. For example, Rossini’s four-minute William Tell Overture is far more famous than the four-hour William Tell opera it introduces.

Returning to famous mathematicians, I remember as an undergraduate hearing of Schwarz’s lemma and waiting for the corresponding theorem that never came. The same applies to Poincaré’s lemma, Zorn’s lemma, and Fatou’s lemma.

We’re all naturally proud of things we work hard for. We expect other people to value our work in proportion to the amount of effort we put into it, but the world doesn’t work that way. It can be discouraging focus on the big, complex projects we’ve worked on that haven’t been appreciated. On the other hand, it can be very encouraging to think of the potential impact of small projects and simple ideas.

{ 10 comments }

I played around with Google’s translator a little after adding some notranslate directives as discussed in my previous post. Google did honor my requests to mark some sections as literal text to not be translated. Google’s translator was also able to recognize my name as a name without special markup. Yahoo, on the other hand, translated my name, turning “Cook” into “Cuisinier” in French.

Google treated text inside <code> tags as literals that should not be translated. That is, Google would leave my source code snippets alone and only translate the English prose surrounding the code. Yahoo, on the other hand, would translate everything, including source code. For example, I had some PowerShell code on my page with the keyword matches that Google left alone but Yahoo translated into “allumettes,” presumably good French prose but not a legal PowerShell keyword.

One puzzling thing about the Google translation engine was that it would change which text was hyperlinked. For example, the text “My résumé” was changed to “Mon CV,” linking on the translation for “my.” Yahoo produced what I expected, “Mon résumé.” There were several other instances in which Google produced odd links, such as hyperlinking the | marker between words that were linked before. For example, the footer of my web site has these links:

Home | Sitemap | My blog | Search

Yahoo turned this into

Maison | Sitemap | Mon blog | Recherche

while Google produced

Accueil | Plan du site | Mon blog | Recherche

So Google incorporated the separator bars as part of words, and moved the last link from “Recherche” to the bar separating “blog” and “Rescherche.”

One advantage of Google’s translation is that it lets you hover your mouse over a line of translated text and see the original text.

{ 2 comments }

Giving hints to automatic translators

by John on October 14, 2008

One problem with machine translation is that machines don’t know when to stop translating. For example Yahoo’s Babel Fish translator translates my last name “Cook” literally to “Cocinero” in Spanish and “Cuisinier” in French.

Today Google announced a way to tell its translator that text should not be translated. Place such text inside a <span> tag with the attribute class="notranslate". I tried this on a web page that explained that a certain piece of code printed out “Hello world.” Since “Hello world” is literal output, it should be left untranslated, not turned into, for example, “Bonjour le monde.” The solution was to modify the HTML to say

The code above prints &ldquo;<span class="notranslate">Hello world</span>.&rdquo;

To prevent an entire page from being translated, add the following tag in the <head> section of the page.

<meta name="google" value="notranslate">

I suppose other machine translation efforts, such as those from Microsoft and Yahoo, will follow Google’s lead and support the class=notranslate directive.

{ 1 comment }

API symmetry

by John on October 14, 2008

Symmetric APIs are easier to use. I was reminded of this when doing some regular expression programming in Python and comparing it to Perl. Perl’s regular expression operators for search and replace are symmetric in a way that their Python counterparts are not.

Perl uses m/pattern/ for matching and s/pattern/replacement/ for substitution. Both apply to the first instance of a pattern in a string by default. The g option following a match or substitute operator causes the command to apply to all instances of the pattern. The i option after either a match or substitute command causes the pattern to apply in a case-insensitive manner. Matching and substitution are symmetric.

Python uses re.search() for matching and re.sub() for substitution. The search function can only apply to the first instance of a pattern; to match all instances of a pattern, use re.findall(). The function re.sub() applies to all instances by default, but it has a max parameter that can be set to limit the number of instances it applies to. To make a search pattern case-insensitive, pass in re.IGNORECASE flag. To make a substitution case-insensitive, modify the regular expression itself by adding (?i).

In general, I find Python syntax much cleaner than Perl, but regular expressions are implemented more elegantly in Perl.

{ 3 comments }

In 1986, Lawrence Leemis published a diagram illustrating the relationships between a couple dozen probability distributions. In 2008, he published a much larger diagram, available online.

I’ve created a diagram similar to the original Leemis diagram with 21 of the most common distributions. You can click on a distribution name to find out its parameterization, and you can click on an arrow to get the details of the relationship it represents. Here’s a small version of the diagram.

See Clickable chart of distribution relationships for the full diagram.

{ 0 comments }

Watch what you name graphics files in LaTeX

by John on October 11, 2008

A while back I was trying to paste a figures into a LaTeX document this evening with names like foo_27.png and foo_32.2.png, putting a parameter value into the name of the plot. The former worked but the latter didn’t.

It turns out the \includegraphics command parses the file extension in a naive way to determine the file type. When it sees foo_27.png, it says “OK, a .png file”. But when it sees foo_32.2.png, it says “.2.png? I’ve never heard of that file type.”

Related post:

Including graphics in LaTeX documents

{ 11 comments }

Unusual security behavior in TightVNC

by John on October 11, 2008

A friend of mine told me recently about his adventures using TightVNC. When you install TightVNC it asks for two passwords.

Hmm. Wonder what that’s about.

Now its time to log in.

Surprise! The behavior of the software depends on what password you used to log in. Permissions are not tied to the user but to the password. If I’m John who came in with password Snoopy, I have one set of permissions, but if I’m John who came in with password Linus, I get another set of permissions.

I suppose this makes sense in isolation, but it’s completely contrary to convention. Yes, it could make sense for one person to have two sets of permissions. But this is nearly always done by having two accounts, not two passwords for the same account. Convention is to associate privilege with a user, not with how the user logged in. I see how it could be convenient to have two sets of privilege associated with one account, but there’s no indication in the log in dialog that it matters what password you enter. A better solution would be to have someone log in with one password, but if they have multiple privilege options, show radio buttons and ask which set of privileges they want to exercise.

{ 0 comments }

Quadratic interpolator

by John on October 10, 2008

I’ve added an quadratic interpolator to my web site similar to  the linear interpolator I wrote earlier. I find this useful when I’m doing a brute force search for some parameter and I’m unwilling or unable to do anything fancier.

{ 0 comments }

What’s good for you in red wine

by John on October 8, 2008

I’ve heard two podcasts that contradict each other somewhat as far as what it is in red wine that’s good for you.

According to this Scientific American interview with Charles Bamforth it’s really the alcohol in wine that’s good for your arteries and so you might as well drink beer, and that in fact beer has some health advantages over red wine. Bamforth is the Anheuser-Busch Endowed Professor of Brewing Science at U.C. Davis. (Hmm…)

But according to the November 30, 2006 podcast from Nature, the procyanidins in red wine are particularly good for your arteries. While there may be health benefits from alcohol in general, red wine has unique benefits, especially red wines from certain regions.

Red wine contains the anti-oxidant resveratrol which has received a lot of press. Another Nature podcast (November 2, 2006) reports research that indicates resveratrol has increased life spans in experiments with yeast, flies, worms, and mice, and there is reason to believe it might do the same in humans. However, the amount of resveratrol in red wine is insignificant: you’d have to drink hundreds of gallons of wine a day to get a beneficial dose of resveratrol.

Unlike resveratrol, there are enough procyanidins in a glass of red wine to make a difference. This has been tested by extracting the procyanidins and testing it on cultured endothelial cells. Also, people live longer in the regions that produce wines especially high in procyanidins, namely Sardinia and southwestern France. Here’s the Nature article reporting the research.

It seems that it may not be the soil in certain areas but rather the wine making traditions that increase the procyanidin levels. Vintners in these areas leave the crushed grape seeds in with the juice longer.

Related posts:

Wine, Beer, and statistics
Wine and politics

{ 1 comment }