From the monthly archives:

March 2009

Shorter URLs by using Unicode

by John on March 12, 2009

Tinyarro.ws is a service like tinyurl.com and others that shorten URLs. However, unlike similar services, Tinyarro.ws uses Unicode characters, allowing it to encode more possibilities into each character. These sub-compact URLs may contain Chinese characters, for example, or other symbols unfamiliar to many users. They’re no good for reading aloud, say over the phone or on a podcast. But they’re ideal for Twitter because you only have to click on the link, not type it into a browser.

Here’s a URL I got when I tried the Tinyarro.ws site:

screen shot from tinyarro.ws

The resulting URL may not display correctly in your browser depending on what fonts you have installed: http://➡.ws/㣸.

I pasted the URL into Microsoft Word and used Alt-x to see what the Unicode characters were. (See Three ways to enter Unicode characters in Windows.) The arrow is code point U+27A1 and the final character is code point U+38F8. I have no idea what that character means. I would appreciate someone letting me know in the comments.

Unicode character U+38F8
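The same lookup can be done without Word. Here is a minimal Python 3 sketch; the helper name code_points is hypothetical, and the two characters are written as escapes so the example does not depend on installed fonts.

```python
def code_points(text):
    """Return the Unicode code point of each character, formatted as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in text]

# The arrow and the final character of the shortened URL, written as escapes.
short_url_chars = "\u27a1\u38f8"
print(code_points(short_url_chars))
```

Pasting the characters directly into the string works just as well, provided the source file is saved as UTF-8.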

Related post: How to insert graphics in Twitter messages

{ 4 comments }

How to grep Twitter

by John on March 12, 2009

Twitter has an extensive search API. To build the URL for a query, start with the base http://search.twitter.com/search.atom?q=. To search for a word, just append that word to the base, such as http://search.twitter.com/search.atom?q=Coltrane to search for tweets containing “Coltrane.”

To search for a term within a particular user’s tweet stream, start with the base URL and append +from%3A and the user’s name. (The %3A is a URL-encoded colon.) See the search API page for other options, such as specifying the number of results per page to return (look for rpp) or restricting the language (look for lang).
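As a rough sketch, the URL-building rules above could be wrapped in a small Python function. The function name build_query is hypothetical, and passing rpp and lang as &rpp= and &lang= is an assumption; check the search API page for the exact form. The code only builds the string and makes no request.

```python
from urllib.parse import quote

BASE = "http://search.twitter.com/search.atom?q="

def build_query(term=None, user=None, rpp=None, lang=None):
    """Build a search URL following the pattern described in the text."""
    parts = []
    if term:
        parts.append(quote(term, safe=""))
    if user:
        # quote() turns the colon in "from:" into %3A.
        parts.append(quote("from:" + user, safe=""))
    url = BASE + "+".join(parts)
    # Assumed parameter syntax; rpp and lang are the names the API page uses.
    if rpp is not None:
        url += "&rpp=%d" % rpp
    if lang:
        url += "&lang=" + quote(lang, safe="")
    return url

print(build_query("Coltrane"))
print(build_query("Coltrane", user="halr9000"))
```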

As far as I can tell, the API does not support regular expressions, but you could loop over the search results programmatically. Here’s how you’d do it in PowerShell.

First, grab the search results as a string. Say we want to search through the latest tweets from Hal Rottenberg, @halr9000.

$base = "http://search.twitter.com/search.atom?q="
$query = "from%3Ahalr9000"
$str = (new-object net.webclient).DownloadString($base + $query)

Now $str contains an XML string of results formatted as an Atom feed. The root element is <feed> and individual tweets are contained in <entry> tags under that. The text of the tweet is in the <title> tag and the other tags contain auxiliary data such as time stamps. The following code shows how to search for the regular expression \d{4}, that is, for tweets containing four consecutive digits.

([xml] $str).feed.entry | where-object {$_.title -match "\d{4}"}

In English, the code says to cast $str to an XML document object and pipe the <entry> contents to the filter that selects those objects whose <title> strings match the regular expression.
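For comparison, here is a sketch of the same filtering step in Python, using only the standard library. The SAMPLE feed is invented for illustration and is far simpler than a real search result; the function name matching_titles is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

def matching_titles(atom_xml, pattern):
    """Return the <title> text of each <entry> whose title matches the regex."""
    root = ET.fromstring(atom_xml)
    regex = re.compile(pattern)
    titles = [entry.findtext(ATOM + "title", default="")
              for entry in root.iter(ATOM + "entry")]
    return [title for title in titles if regex.search(title)]

# Made-up feed standing in for the string downloaded from the search API.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>PowerShell v2 ships in 2009</title></entry>
  <entry><title>no digits here</title></entry>
  <entry><title>only 123 digits</title></entry>
</feed>"""

print(matching_titles(SAMPLE, r"\d{4}"))
```

Like -match in PowerShell, regex.search looks for the pattern anywhere in the title rather than requiring the whole title to match.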

The search API limits the number of entries it will return, so it’s best to do as much filtering as you can via the Twitter site before doing your own filtering.

Related posts:

Regular expressions in PowerShell and Perl
Table-driven text munging in PowerShell

{ 0 comments }

Periodic table of Typefaces

by John on March 12, 2009

Squidspot.com has created an interesting periodic table of typefaces.

thumbnail of periodic table of typefaces from Squidspot.com

Related post: Periodic table of Perl operators

{ 3 comments }

IronPython article on CodeProject

by John on March 11, 2009

It’s difficult to use SciPy from IronPython because much of SciPy is implemented in C rather than in Python itself. I wrote an article on CodeProject summarizing some things I’d learned about using Python modules with IronPython. (Many thanks to folks who left helpful comments here and answered my questions on StackOverflow.) The article gives stand-alone code for computing normal probabilities and explains why I wrote my own code rather than using SciPy.

Here’s the article: Computing Normal Probabilities in IronPython

Here’s some more stand-alone Python code for computing mathematical functions and generating random numbers. The code should work in Python 2.5, Python 3.0, and IronPython.
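To give a flavor of what stand-alone code for normal probabilities can look like, here is a short sketch. It is not the code from the CodeProject article. It leans on math.erfc, which was only added to the standard library in Python 2.7 and 3.2, so it assumes a newer interpreter than the versions named above.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X normal with mean mu and standard deviation sigma."""
    z = (x - mu) / (sigma * math.sqrt(2.0))
    # erfc keeps accuracy in the lower tail, where 1 + erf(z) would cancel.
    return 0.5 * math.erfc(-z)

print(normal_cdf(0.0))
print(normal_cdf(1.96))
```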

Related post: IronPython is a one-way gate

{ 4 comments }

Canonical examples from robust statistics

by John on March 11, 2009

The following definition of robust statistics comes from P. J. Huber’s book Robust Statistics.

… any statistical procedure should possess the following desirable features:

  1. It should have a reasonably good (optimal or nearly optimal) efficiency at the assumed model.
  2. It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly. …
  3. Somewhat larger deviations from the model should not cause a catastrophe.

Classical statistics focuses on the first of Huber’s points, producing methods that are optimal subject to some criteria. This post looks at the canonical examples used to illustrate Huber’s second and third points.

[click to continue...]

{ 1 comment }

Here are six of the best posts from February 2008. Four are on the list because they were popular, and two just because I liked them.

  • How to avoid being outsourced or open sourced summarizes advice from three authors regarding what skills are highly valued and difficult to outsource.
  • Free bitmap to vector software was an announcement for VectorMagic, software for turning bitmap images into vector images so they can be re-sized without becoming jagged. This software does a great job at solving a common problem, but a year later I still don’t think too many people know about it.
  • Everything begins with p was a surprisingly popular little rant about statistical notation.
  • Honeybee genealogy shows how Fibonacci numbers come up when examining a honeybee’s ancestry.
  • Enterprising software summarizes some comments from Cyndi Mitchell complaining that “enterprise” software is nothing like the connotation of “enterprise” elsewhere.
  • Rethinking interruptions argues that instead of only trying to reduce interruptions, we should accept that interruptions are inevitable and look at strategies for recovering quickly from interruptions.

{ 0 comments }

New math blog

by John on March 10, 2009

John Moeller started a new blog called On Topology. Today’s post looks at how the Gaussian distribution behaves as dimension increases.

So the points that follow the distribution mostly sit in a thin shell around the mean. But the density function still says that the density is highest at the mean. Now that’s weird.
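A quick simulation makes the thin shell visible. This is only a sketch, not code from Moeller's post: it draws standard normal points in several dimensions and reports the mean and standard deviation of their distances from the origin. The function name and the sample size are arbitrary choices.

```python
import math
import random

def norm_stats(dim, n_samples=500, seed=0):
    """Mean and standard deviation of |x| for standard normal x in dim dimensions."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_samples):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norms.append(math.sqrt(sum(x * x for x in point)))
    mean = sum(norms) / n_samples
    var = sum((r - mean) ** 2 for r in norms) / (n_samples - 1)
    return mean, math.sqrt(var)

for dim in (1, 10, 100):
    mean, sd = norm_stats(dim)
    print(dim, round(mean, 2), round(sd, 2))
```

The mean distance grows roughly like the square root of the dimension while the spread stays about the same size, so relative to its radius the shell gets thinner as the dimension goes up.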

A week ago Moeller wrote a post on a similar topic, looking at the volumes of spheres and cubes as dimension increases.

{ 1 comment }

Tool for turning off web page clutter

by John on March 9, 2009

“Readability” is a bookmarklet from Arc90 for turning off all the clutter that surrounds the main text on a web page. I just installed it and played with it a little while. Looks promising.

{ 1 comment }

Free PowerShell eBook from Keith Hill

by John on March 9, 2009

Keith Hill has turned his Effective PowerShell series of blog posts into a 50-page PDF. He just posted Effective Windows PowerShell yesterday.

{ 1 comment }

Spherical trigonometry

by John on March 6, 2009

When I was growing up, I had a copy of Handbook of Mathematical Tables and Formulas by R. S. Burington from 1940. When I first saw the book, much of it was mysterious. Now everything in the book is very familiar, and would be familiar to anyone who has gone through the typical college calculus sequence. But there is one exception: two pages on spherical trigonometry, the study of triangles drawn on a sphere.

picture of great circles on a sphere determining a spherical triangle

Until recently, the only place I’d ever heard of spherical trig was Burington’s old book. I managed to find a book on spherical trig at the Rice University library, Spherical Trigonometry by J. H. D. Donnay, written in 1945. Amazon has several recent books on spherical trig, but it appears they’re all reprints of much older books. It seems interest in teaching spherical trig died sometime after World War II.

I’m not sure why schools quit teaching spherical trig. It’s a practical subject; after all, we live on a sphere. Surveyors, navigators, and astronomers would find it useful. Somewhere along the way, solid geometry fell out of favor, and I suppose spherical trig fell out of favor with it. The standard math curriculum changed in order to make a beeline for calculus. Presumably this is to meet the needs of science and engineering students who need calculus as a prerequisite for their courses. In the process, subjects like solid geometry were squeezed out. I see the logic in the contemporary sequence, but it is interesting that the sequence was different a generation or two ago.

See Notes on Spherical Trigonometry for a list of some of the elegant identities from this now obscure area of math. For example, the interior angles of a spherical triangle must add up to more than 180°, and the area of the triangle is proportional to the amount by which the sum of these angles exceeds 180°.
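The area statement is Girard's theorem: with angles in radians, the area is R² times the excess of the angle sum over π. A small sketch follows; the function name is hypothetical.

```python
import math

def spherical_triangle_area(a, b, c, radius=1.0):
    """Area of a spherical triangle with interior angles a, b, c in radians."""
    excess = a + b + c - math.pi
    if excess <= 0:
        raise ValueError("interior angles must sum to more than pi")
    return radius ** 2 * excess

# A triangle with three right angles, e.g. the north pole and two points
# on the equator 90 degrees of longitude apart.
octant = spherical_triangle_area(math.pi / 2, math.pi / 2, math.pi / 2)
print(octant, 4 * math.pi / 8)
```

The triangle with three right angles covers one eighth of the sphere, which gives a handy check: its area should equal the total surface area 4πR² divided by 8.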

Related posts:

What is the shape of the earth?
Finding distances using longitude and latitude

{ 9 comments }

Math Teachers at Play #2

by John on March 6, 2009

The second edition of the new blog carnival Math Teachers at Play was posted this morning at Let’s Play Math.

[Photo by Sister72.]

The carnival has a wide variety of mathematical posts from elementary to advanced as well as posts on math education. Interspersed between the posts are interesting quotations such as the following from Hugo Rossi.

In the fall of 1972 President Nixon announced that the rate of increase of inflation was decreasing. This was the first time a sitting president used the third derivative to advance his case for reelection.

From Mathematics Is an Edifice, Not a Toolbox

{ 0 comments }

Two common ways to estimate the center of a set of data are the sample mean and the sample median. The sample mean is sometimes more efficient, but the sample median is always more robust. (I’m going to cut to the chase first, then go back and define basic terms like “median” and “robust” below.)

When the data come from distributions with thick tails, the sample median is more efficient. When the data come from distributions with thin tails, like the normal, the sample mean is more efficient. The Student-t distribution illustrates both since it goes from having thick tails to having thinner tails as the degrees of freedom, denoted ν, increase.

When ν = 1, the Student-t is a Cauchy distribution and the sample mean wanders around without converging to anything, though the sample median behaves well. As ν increases, the Student-t becomes more like the normal and the relative efficiency of the sample median decreases.

Here is a plot of the asymptotic relative efficiency (ARE) of the median compared to the mean for samples from a Student-t distribution as a function of the degrees of freedom ν. The vertical axis is ARE and the horizontal axis is ν.

The curve crosses the top horizontal line at 4.67879. For values of ν less than that cutoff, the median is more efficient. For larger values of ν, the mean is more efficient. As ν gets larger, the relative efficiency of the median approaches the corresponding relative efficiency for the normal, 2/π = 0.63662, indicated by the bottom horizontal line.
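The curve can be reproduced from two standard asymptotic results: for ν > 2 the sample mean has variance ν/((ν − 2)n), and the sample median has asymptotic variance 1/(4 f(0)² n), where f(0) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) is the Student-t density at zero. Their ratio gives ARE(ν) = 4 Γ((ν+1)/2)² / (π (ν − 2) Γ(ν/2)²). The sketch below is a reconstruction along those lines, not necessarily how the plot was produced, and the function names and the bisection bracket are arbitrary choices.

```python
import math

def are_median_vs_mean(nu):
    """ARE of the sample median relative to the sample mean for Student-t data."""
    if nu <= 2:
        raise ValueError("this formula needs nu > 2 (finite variance)")
    # Work with log-gamma so that large nu does not overflow.
    log_ratio = math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
    return 4.0 * math.exp(2.0 * log_ratio) / ((nu - 2.0) * math.pi)

def crossover(lo=2.5, hi=10.0, tol=1e-10):
    """Bisect for the nu at which the ARE equals 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if are_median_vs_mean(mid) > 1.0:
            lo = mid   # median still more efficient here
        else:
            hi = mid   # mean more efficient here
    return 0.5 * (lo + hi)

print(crossover())                  # compare with the cutoff quoted above
print(are_median_vs_mean(1e6))      # compare with 2/pi
```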

Backing up

The sample mean is just the average of the sample values. The median is the middle value when the data are sorted.

Since data have random noise, statistics based on the data are also random. Statistics are generally less random than the data they’re computed from, but they’re still random. If you were to compute the mean from many different samples, for example, you’d get a different result each time. The estimates bounce around. But there are multiple ways of estimating the same thing, and some ways give estimates that bounce around less than others. These are said to be more efficient. If your data come from a normal distribution, the sample median bounces around about 25% more than the sample mean. (The variance of the estimates is about 57% greater, so the standard deviations are about 25% greater.)
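The 57% figure is π/2 ≈ 1.57, the reciprocal of the normal ARE. A simulation gives a rough check. This is a sketch with arbitrary sample size, trial count, and seed, so expect the ratio to land near 1.57 rather than exactly on it.

```python
import random
import statistics

def variance_ratio(n=101, trials=2000, seed=0):
    """Variance of the sample median divided by variance of the sample mean."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return statistics.variance(medians) / statistics.variance(means)

ratio = variance_ratio()
print(ratio, ratio ** 0.5)   # variance ratio and standard deviation ratio
```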

But what if you’re wrong? What if you think the data are coming from a normal distribution but they’re not? Maybe they’re coming from another distribution, say a Student-t. Or maybe they’re coming from a mixture of normals. Say 99% of the time the samples come from a normal distribution with one mean, but 1% of the time they come from a normal distribution with another mean. Now what happens? That is the question robustness is concerned with.

Say you have 100 data points, and one of them is replaced with ∞. What happens to the average? It becomes infinite. What happens to the median? Either not much or nothing at all depending on which data point was changed. The sample median is more robust than the mean because it is more resilient to this kind of change.
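Here is the thought experiment as a few lines of Python, using the numbers 1 through 100 as stand-in data (an arbitrary choice for illustration).

```python
import math
import statistics

def mean_and_median(data):
    """Return (mean, median) of a list of numbers."""
    return sum(data) / len(data), statistics.median(data)

data = [float(i) for i in range(1, 101)]

# Replace the smallest point, then separately the largest, with infinity.
corrupted_low = [math.inf] + data[1:]
corrupted_high = data[:-1] + [math.inf]

print(mean_and_median(data))            # (50.5, 50.5)
print(mean_and_median(corrupted_low))   # (inf, 51.5)
print(mean_and_median(corrupted_high))  # (inf, 50.5)
```

Replacing the smallest point shifts the median by one position in the sorted list; replacing the largest leaves it alone. Either way the mean is ruined.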

Asymptotic relative efficiency (ARE) is a way of measuring how much statistics bounce around as the amount of data increases. If I take n data points and look at √n times the difference between my estimator and the thing I’m estimating, often that becomes approximately normally distributed as n increases. If I do that for two different estimators, I can take the ratio of the variances of the normal distributions that this process produces for each. That’s the asymptotic relative efficiency.

Often efficiency and robustness are in tension and you have to decide how much efficiency you’re willing to trade off for how much robustness. ARE gives you a way of measuring the loss in efficiency if you’re right about the distribution of the data but choose a more robust, more cautious estimator. Of course if you’re significantly wrong about the distribution of the data (and often you are!) then you’re better off with the more robust estimator.

{ 2 comments }

DSLs in PowerShell

by John on March 5, 2009

In an earlier post, I quoted John Lam saying that one reason Ruby is such a good language for implementing DSLs (domain specific languages) is that function calls do not require parentheses. This allows DSL authors to create functions that look like new keywords. I believe I heard Bruce Payette say in an interview that Ruby had some influence on the design of PowerShell. Maybe Ruby influenced the PowerShell team’s decision to not use parentheses around function arguments. (A bigger factor was convenience at the command line and shell language tradition.)

In what ways has Ruby influenced PowerShell? And if Ruby is good for implementing DSLs, how good would PowerShell be?

Update: See Keith Hill’s blog post on PowerShell function names and DSLs.

{ 2 comments }

Interesting post from Brendan O’Connor:

Comparison of data analysis packages: R, Matlab, SciPy, Excel, SAS, SPSS, Stata

{ 2 comments }

Why Ruby is a good language for DSLs

by John on March 4, 2009

I’ve heard people say that Ruby is an excellent language to use for creating DSLs (domain specific languages, little custom languages written for a specific problem domain) but I never understood why. I don’t know Ruby, so it was hard for me to imagine. But I heard John Lam give a very clear explanation on the ALT.NET podcast. He said that since Ruby doesn’t require parentheses around function call arguments, you can make function names look like keywords. For example, you could write functions “skin” and “this” and make them look like keywords. The expression “skin this cat” is actually parsed as skin(this(cat)) but the former can look more natural.

Another interesting quote from John Lam in that interview was “only a masochist would program Office from C#.” He said that because of late binding, automating Microsoft Office is much easier from Ruby. The Office object model was designed to be used from a language with late binding (i.e., VB) and so Ruby is easier to use than C#.

{ 4 comments }