C++ templates may reduce memory footprint

One of the complaints about C++ templates is that they can cause code bloat. But Scott Meyers pointed out in an interview that some people are using templates in embedded systems applications because templates result in smaller code.

C++ compilers only generate code for template methods that are actually used in an application, so it’s possible that code using templates may result in a smaller executable than code that takes a more traditional object-oriented approach.
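Here’s a minimal sketch of the effect, with a made-up FixedStack class: the compiler implicitly instantiates only the template member functions the program actually calls, so the unused one below contributes nothing to the executable.

    #include <cstdio>

    // A class template: no object code exists until members are instantiated.
    template <typename T, int N>
    class FixedStack {
        T data_[N];
        int top_ = 0;
    public:
        void push(const T& x) { data_[top_++] = x; }
        T pop() { return data_[--top_]; }
        // Never called below, so the compiler generates no code for it.
        void sort_descending() { /* imagine something large here */ }
    };

    int main() {
        FixedStack<int, 16> s;   // instantiates only the members used below
        s.push(42);
        std::printf("%d\n", s.pop());
        // An unused virtual function in a class hierarchy, by contrast,
        // usually can't be dropped because the vtable must point to it.
    }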

Greek letters and math symbols in (X)HTML

It’s not hard to use Greek letters and math symbols in (X)HTML, but apparently it’s not common knowledge either. Many pages insert little image files every time they need a special character. Such web pages look a little like ransom notes with letters cut from multiple sources.  Sometimes this is necessary but often it can be avoided.

I’ve posted a couple of pages on using Greek letters and math symbols in HTML, XML, XHTML, TeX, and Unicode. I included TeX because it’s the lingua franca for math typography, and I included Unicode because the (X)HTML representation of symbols is closely related to Unicode.

The notes give charts for encoding Greek letters and some of the most common math symbols. They explain how HTML and XHTML differ in this context and also discuss browser compatibility issues.
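For example, a lowercase alpha can be written either as the named entity &alpha; or as the numeric character reference &#945; (945 is the decimal value of the Unicode code point U+03B1). The numeric form is the safer choice when a page might be served as XML, since an XML parser only knows named entities that have been declared in a DTD.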

Solving the problem with Visual SourceSafe and time zones

If you use Microsoft Visual SourceSafe (VSS) with developers in more than one time zone, you may be in for an unpleasant surprise.

VSS uses the local time on each developer’s box as the time of a check-in or check-out. If every developer’s clock is set from the same reference, say an NTP server, then no problems will result. But what if one developer is in a different time zone? Say one developer is in Boston and one in Houston. The Boston developer checks in a file at 2:00 PM Eastern time, then 10 minutes later the Houston developer checks out the file, quickly makes a change, and checks the file back in at 2:20 Eastern time, which is 1:20 Central, the local time in Houston. VSS now says the latest version of the file is the one made at 2:00 Eastern time. When the Houston developer looks at VSS, the Boston developer’s changes were made 40 minutes in the future!

This has been a problem with VSS for all versions prior to VSS 2005, and is still a problem in the most recent version by default. Starting with the 2005 version, you can configure VSS 2005 to use “server local time.” This means all transactions will use the time on the server where the VSS repository is located. The time is stored internally as UTC (GMT) but displayed to each user according to his own time zone. In the example above, the server would record the Boston check-in as 7:00 PM UTC and the Houston check-in as 7:20 PM UTC. The Boston user would see the check-ins as happening at 2:00 and 2:20 Eastern time, and the Houston user would see the check-ins as happening at 1:00 and 1:20 Central time. Importantly, everyone agrees which check-in occurred first.
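To make the bookkeeping concrete, here’s a small sketch, assuming a C++20 standard library with time zone support (this is plain <chrono>, not anything VSS exposes). Both check-ins are converted to UTC, where their order is unambiguous, and the same instant can then be shown in each user’s local zone.

    #include <chrono>
    #include <format>
    #include <iostream>

    int main() {
        using namespace std::chrono;

        // Boston check-in: 2:00 PM Eastern (a January date, so EST = UTC-5).
        zoned_time boston{"America/New_York", local_days{2008y/January/15} + 14h};
        // Houston check-in: 1:20 PM Central, i.e. 2:20 PM Eastern.
        zoned_time houston{"America/Chicago", local_days{2008y/January/15} + 13h + 20min};

        // Stored in UTC, the order of events is unambiguous.
        std::cout << std::format("Boston  check-in (UTC): {:%F %H:%M}\n", boston.get_sys_time());
        std::cout << std::format("Houston check-in (UTC): {:%F %H:%M}\n", houston.get_sys_time());
        std::cout << (boston.get_sys_time() < houston.get_sys_time()
                          ? "Boston's check-in came first\n"
                          : "Houston's check-in came first\n");

        // Display the same instant in another user's time zone.
        std::cout << std::format("Houston check-in, as seen in Boston: {:%H:%M %Z}\n",
                                 zoned_time{"America/New_York", houston.get_sys_time()});
    }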

A more subtle version of the problem can occur even if all users are in the same time zone but have not synchronized their clocks. This is a good reason to use server local time even if everyone works in the same city.

Although it is possible to set server local time in VSS 2005, it still uses client local time by default, presumably for backward compatibility. You have to turn on server local time by opening the VSS Administrator tool and clicking on Tools/Options and going to the Time Zone tab.

[Screenshot: SourceSafe Options dialog, Time Zone tab]

Microsoft has written about this problem at http://support.microsoft.com/kb/931804. Note that the solution only applies to Visual SourceSafe 2005 and later.

When you set the time zone, you will get dire warnings discouraging you from doing so.

To avoid unintended data loss, do not change the time zone for a Visual SourceSafe database after it has been established and is being used.

You may have to let VSS sit idle for as many hours as your developers span time zones to let everything synchronize.

You have a web site?

I was talking to my wife about my web site last night. One of my daughters interrupted with “You have a web site?!” Then one of her sisters put things in perspective. “Yeah, but it doesn’t have any games.”

Programming language subsets

I just found out that Douglas Crockford has written a book JavaScript: The Good Parts. I haven’t read the book, but I imagine it’s quite good based on having seen the author’s JavaScript videos.

Crockford says JavaScript is an elegant and powerful language at its core, but it suffers from numerous egregious flaws that have been impossible to correct due to its rapid adoption and standardization.

I like the idea of carving out a subset of a language, the good parts, but several difficulties come to mind.

  1. Although you may limit yourself to a certain language subset, your colleagues may choose a different subset. This is particularly a problem with an enormous language such as Perl. Coworkers may carve out nearly disjoint subsets for their own use.
  2. Features outside your intended subset may be just a typo away. You have to have at least some familiarity with the whole language because every feature is a feature you might accidentally use.
  3. The parts of the language you don’t want to use still take up space in your reference material and make it harder to find what you’re looking for.

One of the design principles of C++ is “you only pay for what you use.” I believe the primary intention was that you shouldn’t pay a performance penalty for language features you don’t use, and C++ delivers on that promise. But there’s a mental price to pay for language features you don’t use. As I’d commented about Perl before, you have to use the language several hours a week just to keep it loaded in your memory.

There’s an old saying that when you marry a girl you marry her family. A similar principle applies to programming languages. You may have a subset you love, but you’re going to have to live with the rest of the language.

Revenue per megabyte

Here are some numbers on the revenue data carriers make per megabyte for various kinds of data.

  • Backbone Internet = $.0001 / MB
  • Residential Internet = $.01 / MB
  • Wireline voice calls = $.10 / MB
  • Cell voice calls = $1 / MB
  • SMS = $1000 / MB

Taken from Scott Lemon’s notes from Telecosm 2008.
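For a sense of where the SMS figure comes from: a text message carries at most 140 bytes of payload, so a megabyte holds roughly 7,000 to 7,500 messages. At 10 to 15 cents per message (a typical pay-per-message price at the time, assumed here just for the arithmetic), that works out to roughly $750 to $1,100 per megabyte, the same order of magnitude as the figure above.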

Robust priors

Yesterday I posted a working paper version of an article I’ve been working on with Jairo Fúquene and Luis Pericchi: A Case for Robust Bayesian priors with Applications to Binary Clinical Trials.

Bayesian analysis begins with a prior distribution, a function summarizing what is believed about an experiment before any data are collected. The prior is updated as data become available and becomes the posterior distribution, a function summarizing what is currently believed in light of the data collected so far. As more data are collected, the relative influence of the prior decreases and the influence of the data increases. Whether a prior is robust depends on the rate at which the influence of the prior decreases.
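In symbols, this is the standard Bayesian update (nothing specific to our paper). Writing \pi(\theta) for the prior, f(x \mid \theta) for the likelihood of data x, and \pi(\theta \mid x) for the posterior:

    \pi(\theta \mid x) \propto f(x \mid \theta) \, \pi(\theta)

Each new batch of data multiplies in another likelihood factor, which is why the relative influence of the prior shrinks as data accumulate.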

There are essentially three approaches to how the influence of the prior on the posterior should vary as a function of the data.

  1. Robustness with respect to the prior. When the data and the prior disagree, give more weight to the data.
  2. Conjugate priors. The influence of the prior is independent of the extent to which it agrees with the data.
  3. Robustness with respect to the data. When the data and the prior disagree, give more weight to the prior.

When I say “give more weight to the data” or “give more weight to the prior,” I’m not talking about making ad hoc exceptions to Bayes theorem. The weight given to one or the other falls out of the usual application of Bayes theorem. Roughly speaking, robustness has to do with the relative thickness of the tails of the prior and the likelihood. A model with thicker tails on the prior will be robust with respect to the prior, and a model with thicker tails on the likelihood will be robust with respect to the data.

Each of the three approaches above is appropriate in different circumstances. When priors come from well-understood physical principles, it may make sense to use a model that is robust with respect to the data, i.e. to suppress outliers. When priors are based on vague beliefs, it may make more sense to be robust with respect to the prior. Between these extremes, particularly when a large amount of data is available, conjugate priors may be appropriate.

When the data and the prior are in rough agreement, the contribution of a robust prior to the posterior is comparable to the contribution that a conjugate prior would have had. (And so using robust proper priors leads to greater variance reduction than using improper priors.) But as the level of agreement decreases, the contribution of a robust prior to the posterior also decreases.

In the paper, we show that with a binomial likelihood, the influence of a conjugate prior grows without bound as the prior mean goes to infinity. However, with a Student-t prior, the influence of the prior is bounded as the prior mean increases. For a Cauchy prior, the influence of the prior is bounded as the location parameter goes to infinity.
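To see the bounded-influence phenomenon numerically, here is a toy sketch. It is not the binomial model from the paper: it uses a normal likelihood with known scale and compares a conjugate normal prior with a Cauchy prior by brute-force numerical integration, and the function and parameter names are invented for the illustration.

    #include <cmath>
    #include <cstdio>

    // Posterior mean E[theta | x] by brute-force integration on a grid.
    // 'cauchy' selects a Cauchy prior kernel; otherwise a normal prior kernel.
    double posterior_mean(double x, double sigma, double prior_loc,
                          double prior_scale, bool cauchy) {
        const double lo = -200.0, hi = 200.0;
        const int steps = 200000;
        const double dt = (hi - lo) / steps;
        double num = 0.0, den = 0.0;
        for (int i = 0; i <= steps; ++i) {
            double t = lo + i * dt;
            double lik = std::exp(-0.5 * (t - x) * (t - x) / (sigma * sigma));
            double z = (t - prior_loc) / prior_scale;
            double prior = cauchy ? 1.0 / (1.0 + z * z)      // Cauchy kernel
                                  : std::exp(-0.5 * z * z);  // normal kernel
            num += t * lik * prior * dt;
            den += lik * prior * dt;
        }
        return num / den;
    }

    int main() {
        const double x = 0.0, sigma = 1.0, scale = 1.0;  // data at 0, unit scales
        std::puts(" prior location   normal prior   Cauchy prior");
        for (double mu0 : {0.0, 2.0, 5.0, 10.0, 20.0, 50.0}) {
            std::printf("%15.1f %14.3f %14.3f\n", mu0,
                        posterior_mean(x, sigma, mu0, scale, false),
                        posterior_mean(x, sigma, mu0, scale, true));
        }
    }

In this setup (equal prior and likelihood scales) the conjugate prior drags the posterior mean halfway to the prior location no matter how far away it sits, while under the Cauchy prior the pull levels off and the posterior mean settles back near the data.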

It’s easy to confuse a robust prior and a vague conjugate prior. Our paper shows how in a certain sense, even an “informative” Cauchy distribution is less informative than a “non-informative” conjugate prior.

Using Photoshop on experimental results

Greg Wilson pointed out an article in The Chronicle of Higher Education about scientists using Photoshop to manipulate the graphs of their results. The article has this to say about The Journal of Cell Biology.

So far the journal’s editors have identified 250 papers with questionable figures. Out of those, 25 were rejected because the editors determined the alterations affected the data’s interpretation.

This immediately raises suspicions of fraud, which is of course a serious concern. However, I’m more concerned about carelessness than fraud. As Goethe once said,

…misunderstandings and neglect create more confusion in this world than trickery and malice. At any rate, the last two are certainly much less frequent. 

Even if researchers had innocent motivations for manipulating their graphs, they’ve made it impossible for someone else to reproduce their results and have cast doubts on their integrity.

Specialization is for insects

From Robert A. Heinlein:

A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.

What's most important in a basic stat class?

I’m teaching part of a basic medical statistics class this summer. It’s been about a decade since I’ve taught basic probability and statistics and I now have different ideas about what is important. For example, I now think it’s more important that a beginning class understand the law of small numbers than the law of large numbers.

One reason for my change of heart is that over the intervening years I’ve talked with people who have had a class like the one I’m teaching now and I have some idea what they got out of it. They might summarize their course as follows.

First we did probability. You know, coin flips and poker hands. Then we did statistics. That’s where you look up these numbers in tables and if the number is small enough then what you’re trying to prove is true, and otherwise it’s false.

Too many people get through a course in probability and statistics without understanding what probability has to do with statistics. I think we’d be better off “covering” far less material but trying to ensure that students really grok two or three big ideas by the time they leave.

Experienced programmers and lines of code

I heard of a study recently that concluded inexperienced and experienced programmers write about the same number of lines of code per day. The difference is that experienced programmers keep more of those lines of code, making steady progress toward a goal. Less experienced programmers write large chunks of code only to rip them out and rewrite the same chunk many times until the code appears to work. Or instead of ripping out the code, they debug for days on end, changing one or two lines at a time, almost at random, until the code appears to work.

As Greg Wilson pointed out in his interview, focusing on quality in software development often results in increased productivity as well. More effort goes into forward progress and less goes into re-work.

Not only do experienced programmers produce more lines of code worth keeping each day, they also accomplish more per line of code, sometimes dramatically more. But that’s not news. It’s well known that the best programmers aren’t just a little more productive than average, they’re one or two orders of magnitude more productive. (See, for example, Joel Spolsky’s book Smart and Gets Things Done.) More interesting is that the best programmers don’t seem to have a much larger capacity for producing and understanding lines of code.

There have also been studies that show programmers produce about the same number of lines of code per day independent of the language they use. You might think that someone working in assembly language could produce more lines of code per day than someone writing in a higher-level language such as VB or Java, but that’s not the case. It seems that while counting lines of code is a terrible way to measure productivity, it is a good way to measure what you can expect someone to be able to hold in their head.