The duck-billed platypus is the most recent species to have its genome sequenced. These odd animals are even more strange at the DNA level. Some features of their DNA are avian, some are reptilian, and of course some are mammalian. See the Science Daily article for more details.

It’s infuriating to read published sample code that’s wrong. Sometimes code given in books is not even syntactically correct. I’ve wondered why publishers didn’t have a way to verify that the code at least compiles, and maybe even check that it gives the stated output.
Dave Thomas said in a recent interview that his publishing company, The Pragmatic Programmers, does just that. Authors write in a logical mark-up language and software turns that into a publishable form, compiling code samples and inserting the output. Sample code from one of their books is more likely to work the first time you type it in than code from other publishers.
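The idea of running every sample and capturing its output is easy to sketch. The following is not the Pragmatic Programmers’ tooling, which the interview does not describe in detail; it is a minimal illustration in Python, and the function name run_sample is made up for this example. It runs a snippet in a fresh interpreter and reports whether it succeeded, along with whatever it printed.

```python
import subprocess
import sys

def run_sample(code):
    """Run a Python snippet in a fresh interpreter.

    Returns (ok, text): ok is True if the snippet exited cleanly,
    and text is its standard output on success or its error output on failure.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=10,
    )
    ok = result.returncode == 0
    text = result.stdout if ok else result.stderr
    return ok, text

# A sample that works, and one with a syntax error a reader would trip over.
ok, output = run_sample("print(2 + 2)")
bad_ok, bad_output = run_sample("print(2 +")

print(ok, output.strip())
print(bad_ok)
```

A build step along these lines could refuse to publish a chapter whenever a sample fails, and paste the captured output into the text when it succeeds.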
The Stack Overflow podcast, episode 4, mentioned in passing that the Wikipedia database is about 10 GB. I was surprised it isn’t bigger. If that size is correct, you could download a snapshot of Wikipedia to your local hard drive.
Regular expressions are not a part of the C++ Standard Library quite yet, but there is a document (Technical Report 1, or TR1) that includes among other things a specification for regular expression support that will probably be added to the C++ standard eventually.
The Boost library has supported TR1 for a while. Microsoft just released a feature pack for Visual Studio 2008 a month ago that includes support for most of TR1. (They’ve left out support for mathematical special functions.) And Dinkumware sells a complete TR1 implementation.
I’ve added some notes to my web site for getting started with C++ TR1 regular expressions. I took my PowerShell regex notes as a starting point and implemented some of the same examples in C++. I changed the organization though, because the C++ implementation is fairly different from PowerShell.
Working with regular expressions is harder in C++ than in scripting languages such as Perl or Python, but not unnecessarily so. C++ is optimized for fine-grained control and efficiency rather than ease of use; that’s what C++ is for. The TR1 implementation is internally consistent and elegant in its own way.
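For comparison, here is roughly what the scripting-language side of that trade-off looks like. This is a small Python sketch, not one of the examples from my notes; the sample text and the pattern are invented for illustration. Searching, iterating over matches, and replacing each take one line, where TR1 has you work with regex objects, match results, and iterators.

```python
import re

text = "Call 555-1234 or 555-9876 before noon."

# Hypothetical pattern: three digits, a hyphen, four digits.
pattern = re.compile(r"\b\d{3}-\d{4}\b")

# First match only.
first = pattern.search(text).group(0)

# Every match, in order of appearance.
numbers = [m.group(0) for m in pattern.finditer(text)]

# Replace every match.
masked = pattern.sub("XXX-XXXX", text)

print(first)
print(numbers)
print(masked)
```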
It’s easy to find API-level documentation but harder to find examples for getting started. (I’ve heard good things about Pete Becker’s book The C++ Standard Library Extensions but I haven’t read it.) So I decided to keep some notes as I played with the Visual Studio implementation. I imagine most of the content applies to other implementations, but I’ve only tested the examples using Visual Studio.
Update: GCC just added support for C++ TR1 two days ago with their version 4.3 release. However, it appears support for regular expressions is not included.
An unbiased estimator, very roughly speaking, is a statistic that gives the correct result on average. For a precise definition, see Wikipedia. Unbiasedness is an intuitively desirable property. In fact, it seems indispensable at first.
In the colloquial sense, “bias” is practically synonymous with self-serving dishonesty. Who wants a self-serving, dishonest statistical estimate? But it’s important to remember that “bias” in the statistical sense has a technical meaning that may not correspond to the colloquial meaning.
Here’s the big problem with statistical bias: if U is an unbiased estimator of θ, f(U) is NOT an unbiased estimator of f(θ) in general. For example, standard deviation is the square root of variance, but the square root of an unbiased estimator for variance is not an unbiased estimator for standard deviation. This shows bias has nothing to do with accuracy, since the square root of an accurate estimate of variance is an accurate estimate of standard deviation. In fact, unbiased estimators can be terrible.
The fact that unbiasedness is not preserved under transformations calls into question its usefulness. People seldom care directly about abstract statistical parameters. Instead they care about some calculation based on those parameters. An unbiased estimate of the parameters does not generally lead to an unbiased estimate of what people really want to estimate.
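The square-root example is easy to check by simulation. Here’s a minimal sketch in Python, under assumptions I’ve picked for illustration: normal data with true variance 1, samples of size 5, 20,000 trials, and a fixed seed. It averages the usual sample variance (dividing by n − 1) and its square root over many samples. The average variance should land close to 1, while the average of the square roots should come out noticeably below 1.

```python
import random

def simulate(trials=20000, n=5, sigma=1.0, seed=0):
    """Average the unbiased sample variance and its square root over many samples."""
    rng = random.Random(seed)
    total_var = 0.0
    total_sd = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Dividing by n - 1 makes this an unbiased estimator of the variance.
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        total_var += var
        total_sd += var ** 0.5  # square root of the unbiased variance estimate
    return total_var / trials, total_sd / trials

mean_var, mean_sd = simulate()
print(f"average sample variance:      {mean_var:.3f} (true variance is 1)")
print(f"average sqrt of that variance: {mean_sd:.3f} (true standard deviation is 1)")
```

The gap shrinks as the sample size grows, so rerunning with a larger n is a quick way to see that the bias here is a small-sample effect.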
Roy Osherove just posted an article, Introducing LINQ to Regex, about his LINQ to Regex project.
LINQ stands for Language INtegrated Query, a way of baking query support into .NET programming languages. Microsoft has been promising a unified way to query all kinds of data for years now. Along the way they came out with a score of new libraries that were going to be the solution. They’d work for all kinds of data that happened to look very much like a relational database. But now with LINQ they’ve finally delivered something that works well not only with relational data but also with hierarchical data such as XML. With LINQ to Regex, you can query unstructured text with LINQ as well.
There are two big advantages to LINQ. First, you can query different kinds of data sources with similar code. Second, “language integrated” means that your programming language knows about your query language, making strong typing and better tool support possible. (By contrast, if you have a SQL statement inside VB, for example, VB knows nothing about SQL. The SQL command is just a string as far as VB is concerned. If the SQL is malformed, you won’t know until runtime. But with LINQ, malformed queries generate compile errors.)
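The “just a string” problem is not specific to VB. Here’s a small sketch of it in Python with the standard sqlite3 module; the table and the misspelled query are invented for illustration. The host language happily accepts the malformed SQL, and the mistake only surfaces when the statement is executed. Python has no compile step that would catch it either, so this shows only the problem LINQ addresses, not LINQ’s solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, tag TEXT)")
conn.execute("INSERT INTO posts VALUES ('Readable paths', 'PowerShell')")

# A well-formed query works as expected.
rows = conn.execute("SELECT title FROM posts WHERE tag = 'PowerShell'").fetchall()

# A malformed query is still a perfectly valid Python string.
bad_query = "SELEC title FROM posts"  # typo: SELEC instead of SELECT

try:
    conn.execute(bad_query)
    error = None
except sqlite3.OperationalError as exc:
    # The typo is reported only now, at runtime.
    error = str(exc)

print(rows)
print(error)
conn.close()
```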
Update: See Scott Hanselman’s discussion of LINQ to Regex.
I’ve made a few changes to my blog and my personal web site and would welcome your feedback.
I added a widget on my blog sidebar to make it easy to subscribe. It seems to work well. Let me know if you have problems.
I added tags to my blog posts. The tag links should help people find more closely related articles if they’re interested. I’m still figuring out how I want to use tags and categories. For now, categories are high-level groupings and tags are more detailed. Also, posts generally fall into one category, maybe two, but often have multiple tags. I appreciate what Thomas Guest said on his blog about eliminating categories and just having tags, but I haven’t decided what I want to do. I’ve thought about adding a tag cloud, but I don’t want the sidebar to be too cluttered. Maybe I’ll add a cloud and cut out the category list. I would appreciate your suggestions.
My personal website now has a sitemap for humans. I’ve had a sitemap for search engines but realized I needed to make it easier for humans to find things on the site as the number of pages has increased.
Update: I just looked at this site with Internet Explorer 6 for the first time. All the content that is supposed to be at the top of the right sidebar is at the bottom, and the main content is pushed off to the left. Has the site always looked bad under IE 6 or did a recent change cause this? Any suggestions how to fix it?
In a recent interview, Donald Knuth made this comment about reusable code.
I also must confess to a strong bias against the fashion for reusable code. To me, “re-editable code” is much, much better than an untouchable black box or toolkit. I could go on and on about this. If you’re totally convinced that reusable code is wonderful, I probably won’t be able to sway you anyway, but you’ll never convince me that reusable code isn’t mostly a menace.
Knuth didn’t elaborate on what he means by “re-editable” code, but I assume he means code that is easy to maintain. The best chance most code has at reuse is remaining useful in its original project over multiple versions, so maybe we’d get more reuse if we focused more on maintainability.
I think whether code should be editable or in “an untouchable black box” depends on the number of developers involved, as well as their talent and motivation. Knuth is a highly motivated genius working in isolation. Most software is developed by large teams of programmers with varying degrees of motivation and talent. I think the further you move away from Knuth along these three axes the more important black boxes become.
Here is my list of the top five gotchas when learning Windows PowerShell.
5. PowerShell will not run scripts by default.
4. PowerShell requires .\ to run a script in the current directory.
3. PowerShell uses -eq, -gt, etc. for comparison operators.
2. PowerShell uses backquote as the escape character.
1. PowerShell separates function arguments with spaces, not commas.
See PowerShell gotchas for more details and an explanation for why PowerShell made the design decisions it did. As surprising as these features are, there are good reasons for each.
Windows has never made it easy to read long environment variables. If I display the path on one machine I get something like this, both from cmd and from PowerShell.
C:\bin;C:\bin\Python25;C:\bin\TeX\miktex\bin;C:\bin\TeX\MiKTeX\miktex\bin;C:\bin\Perl\bin\;C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin; ...
The System Properties window is worse since you can only see a tiny slice of your path at a time.

Here’s a PowerShell one-liner to produce a readable path listing:
$env:path -replace ";", "`n"
This produces
C:\bin
C:\bin\Python25
C:\bin\TeX\miktex\bin
C:\bin\TeX\MiKTeX\miktex\bin
C:\bin\Perl\bin\
C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin
...
(If you’re not familiar with PowerShell, note the backquote before the n. It produces the newline character that replaces each semicolon. This is one of the most unconventional features of PowerShell since backslash is the escape character in most contexts. Because Windows uses either forward or backward slashes as path separators, PowerShell could not use backslash as an escape character. Think of the backquote as a little backslash. Once you get over the initial shock, you get used to the backquote quickly.)
Update: It occurred to me after the original post that there’s an even simpler way to display the path.
$env:path.split(';')
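The same trick works outside PowerShell. As a rough equivalent, here’s a Python sketch; the helper name path_entries is made up for this example. It uses os.pathsep, which is a semicolon on Windows and a colon on Unix-like systems, and it drops the empty entries left by a trailing separator.

```python
import os

def path_entries(path_value=None, separator=os.pathsep):
    """Split a PATH-style string into its entries, skipping empty ones."""
    if path_value is None:
        # Fall back to the current process's PATH.
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(separator) if entry]

# Print the current path, one directory per line.
for entry in path_entries():
    print(entry)
```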