From the category archives:

Computing

Managing passwords

by John on May 23, 2008

When everything you do requires a different password, how do you keep up with them all? The most common solution is to use the same username and password in as many contexts as possible. Not only is this ill-advised, it’s not all that practical. Maybe someone else is using your favorite username. Maybe your favorite password is too short or too long for some contexts, etc. So you end up with dozens of minor variations on a preferred username/password pair.

One solution is to keep all your passwords in one place and have a strong password that unlocks your password collection. A security professional friend of mine recommends Password Safe for this purpose. It works well as long as you’re at your own computer or at a computer where you can access Password Safe on a flash drive, but not if you’re using a public computer.

Another solution is to use a third-party authentication service like OpenID. Jeff Atwood posted a thorough discussion of the pros and cons of OpenID on his blog yesterday. OpenID can reduce the number of passwords you need to manage, but it won’t cut the number down much until more sites accept OpenID.

{ 1 comment }

Wikipedia in 10 GB

by John on May 9, 2008

The Stack Overflow podcast, episode 4, mentioned in passing that the Wikipedia database is about 10 GB. I was surprised it isn’t bigger. If that size is correct, you could download a snapshot of Wikipedia to your local hard drive.

{ 0 comments }

Preventing an unpleasant Sweave surprise

by John on April 29, 2008

Sweave is a tool for making statistical analyses more reproducible by using literate programming in statistics. Sweave embeds R code inside LaTeX and replaces the code with the result of running the code, much like web development languages such as PHP embed code inside HTML.

Sweave is often launched from an interactive R session, but this can defeat the whole purpose of the tool. When you run Sweave this way, the Sweave document inherits the session’s state. Here’s why that’s a bad idea.

Say you’re interactively tinkering with some plots to make them look like you want. As you go, you’re copying R code into an Sweave file. When you’re done, you run Sweave on your file, compile the resulting LaTeX document, and get beautiful output. You congratulate yourself on having gone to the effort to put all your R code in an Sweave file so that it will be self-contained and reproducible. You forget about your project then revisit it six months later. You run Sweave and to your chagrin it doesn’t work. What happened? What might have happened is that your Sweave file depended on a variable that wasn’t defined in the file itself but happened to be defined in your R session. When you open up R months later and run Sweave, that variable may be missing. Or worse, you happen to have a variable in your session with the right name that now has some unrelated value.

I recommend always running Sweave from a batch file. On Windows you can save the following two lines to a file, say sw.bat, and process a file foo.Rnw with the command sw foo.

  R.exe -e "Sweave('%1.Rnw')"
  pdflatex.exe %1.tex

This assumes R.exe and pdflatex.exe are in your path. If they are not, you could either add them to your path or put their full paths in the batch file.

Running Sweave from a clean session does not ensure that your file is self-contained. There could still be other implicit dependencies. But running from a clean session improves the chances that someone else will be able to reproduce the results.

See Troubleshooting Sweave for some suggestions for how to prevent or recover from other possible problems with Sweave.

Update: See the links provided by Gregor Gorjanc in the first comment below for related batch files and bash scripts.

{ 1 comment }

60-second description of feeds

by John on April 28, 2008

If you don’t know what a “feed” is, as in RSS feed etc., here’s a 60-second audio explanation.

Audio clip from Sixty Second Tech

Transcript

{ 1 comment }

IPv6

by John on April 17, 2008

The most recent RunAs Radio podcast interviews Sean Siler on IPv6, the eventual replacement for IPv4, the current Internet protocol address scheme. At the current rate, we will run out of IPv4 addresses in May 2010. Previous estimated dates for running out of IP addresses have come and gone, and we keep coming up with ways to postpone the date. Still, no one disputes that we’ll run out of IPv4 addresses some day.

IPv4 uses 32-bit addresses and IPv6 uses 128-bit addresses. The latter uses four times as many bits, but that represents 2^96 times as many addresses. Siler illustrates this by saying that if the IPv4 address space were as wide as an atomic nucleus, the IPv6 address space would be a light-month wide.

I checked Siler’s calculation. Thirty light-days is 9.8 x 2^96 femtometers, so the analogy is correct for a nucleus of diameter 9.8 fm. According to Wikipedia, atomic nuclei range from about 1.6 fm (hydrogen) to 15 fm (uranium), so there’s some element in between for which he’s right.
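The arithmetic can be reproduced in a few lines of Python. This is only a sanity check, assuming a "light-month" means thirty light-days as in the paragraph above; the variable names are mine.

```python
# Sanity check of the address-space ratio and the light-month analogy.
SPEED_OF_LIGHT_M_PER_S = 299_792_458   # exact, by definition of the meter
SECONDS_PER_DAY = 86_400
FEMTOMETERS_PER_METER = 10**15

# Number of IPv6 addresses per IPv4 address.
ratio = 2**128 // 2**32
assert ratio == 2**96

# Width of thirty light-days, expressed in femtometers.
light_month_fm = 30 * SECONDS_PER_DAY * SPEED_OF_LIGHT_M_PER_S * FEMTOMETERS_PER_METER

# Nucleus diameter for which the analogy holds exactly.
nucleus_diameter_fm = light_month_fm / ratio

print(f"address ratio    = 2**96 = {ratio:.3e}")
print(f"nucleus diameter = {nucleus_diameter_fm:.1f} fm")
```

The second line of output comes to roughly 9.8 fm, which falls inside the range of nuclear diameters quoted above.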

{ 0 comments }

XHTML is essentially a stricter form of HTML, but not quite. For the most part, you can satisfy the requirements of both standards at the same time. However, when it comes to closing tags, the two standards are incompatible. For example, the line break tag in HTML is <br> but in XHTML is <br/>. Most browsers will tolerate the unnecessary slash before the closing angle bracket in HTML, especially if you put a space before it. But it’s not strictly correct.

So is this just a pedantic point of markup language grammar? Chris Maunder says an error with closing tags caused Google to stop indexing his web site. He had XHTML-style end tags but had set his DOCTYPE to HTML.

I’ve also heard of browsers refusing to render a page at all because it had DOCTYPE set to XHTML but contained an HTML entity not supported in XHTML. I believe the person reporting this said that he had run the XHTML page through a validator that failed to find the error. Unfortunately I’ve forgotten where I saw this. Does anyone know about this?

{ 0 comments }

Why Unicode is subtle

by John on April 5, 2008

On its surface, Unicode is simple. It’s a replacement for ASCII to make room for more characters. Joel Spolsky assures us that it’s not that hard. But then how did Jukka Korpela have enough to say to fill his 678-page book Unicode Explained? Why is the Unicode standard 1472 printed pages?

It’s hard to say anything pithy about Unicode that is entirely correct. The best way to approach Unicode may be through a sequence of partially true statements.

The first approximation to a description of Unicode is that it is a 16-bit character set. Sixteen bits are enough to represent the union of all previous character set standards. It’s enough to contain nearly 30,000 CJK (Chinese-Japanese-Korean) characters with space left for mathematical symbols, braille, dingbats, etc.

Actually, Unicode is a 32-bit character set. It started out as a 16-bit character set. The first 16-bit range of the Unicode standard is called the Basic Multilingual Plane (BMP), and is complete for most purposes. The regions outside the BMP contain characters for archaic and fictional languages, rare CJK characters, and various symbols.

So essentially Unicode is just a catalog of characters with each character assigned a number and a standard name. What could be so complicated about that?

Well, for starters there’s the issue of just what constitutes a character. For example, Greek writes the letter sigma as σ in the middle of a word but as ς at the end of a word. Are σ and ς two representations of one character or two characters? (Unicode says two characters.) Should the Greek letter π and the mathematical constant π be the same character? (Unicode says yes.) Should the Greek letter Ω and the symbol for electrical resistance in ohms Ω be the same character? (Unicode says no.) The difficulties get more subtle (and politically charged) when considering Asian ideographs.
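You can see how Unicode catalogs these characters with Python’s standard unicodedata module. This is just an illustrative sketch; the list of code points is my choice of the characters mentioned above.

```python
import unicodedata

# The two sigmas, pi, the Greek capital omega, and the ohm sign.
characters = ["\u03c3", "\u03c2", "\u03c0", "\u03a9", "\u2126"]

for ch in characters:
    # Print each character's code point and its standard Unicode name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Omega and the ohm sign look alike but are distinct code points.
omega, ohm = "\u03a9", "\u2126"
print(omega == ohm)  # False
```

The two sigmas come out with different names (GREEK SMALL LETTER SIGMA and GREEK SMALL LETTER FINAL SIGMA), and the ohm sign has its own entry, OHM SIGN, separate from GREEK CAPITAL LETTER OMEGA.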

Once you have agreement on how to catalog tens of thousands of characters, there’s still the question of how to map the Unicode characters to bytes. You could think of each byte representation as a compression or compatibility scheme. The most commonly used systems are UTF-8 and UTF-16. The former is more compact (for Western languages) and compatible with ASCII. The latter is simpler to process. Once you agree on a byte representation, there’s the issue of how to order the bytes (endianness).
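Here is a small Python sketch of the difference between the two encodings and of byte order. The sample characters are my own picks, chosen to cover ASCII, Greek, CJK, and a character outside the BMP.

```python
# Sample characters; the labels and choices are for illustration only.
samples = {
    "ASCII letter A": "A",
    "Greek pi, U+03C0": "\u03c0",
    "CJK character, U+4E2D": "\u4e2d",
    "outside the BMP, U+1D11E": "\U0001d11e",
}

for label, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    print(f"{label}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")

# The same UTF-16 code unit in the two byte orders.
print("\u03c0".encode("utf-16-be").hex())  # 03c0
print("\u03c0".encode("utf-16-le").hex())  # c003
```

The ASCII letter takes one byte in UTF-8 and two in UTF-16, while the CJK character takes three bytes in UTF-8 and two in UTF-16. The character outside the BMP takes four bytes either way, since UTF-16 needs a pair of 16-bit units to represent it.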

Once you’ve resolved character sets and encoding, there remain issues of software compatibility. For example, which web browsers and operating systems support which representations of Unicode? Which operating systems supply fonts for which characters? How do they behave when the desired font is unavailable? How do various programming languages support Unicode? What software can be used to produce Unicode? What happens when you copy a Unicode string from one program and paste it into another?

Things get even more complicated when you want to process Unicode text because this brings up internationalization and localization issues. These are extremely complex, though they’re not complexities with Unicode per se.

For more links, see my Unicode resources.

{ 0 comments }

Contrasting Microsoft Word and LaTeX

by John on April 3, 2008

Here’s an interesting graph from Marko Pinteric comparing Microsoft Word and LaTeX, the document preparation system built on Donald Knuth’s TeX.

Comparing MS Word and LaTeX

According to the graph, LaTeX becomes easier to use relative to Microsoft Word as the task becomes more complex. That matches my experience, though I’d add a few footnotes.

  1. Most people spend most of their time working with documents of complexity to the left of the crossover.
  2. Your first LaTeX document will take much longer to write than your first Word document.
  3. Word is much easier to use if you need to paste in figures.
  4. LaTeX documents look better, especially if they contain mathematics.

See Charles Petzold’s notes about the lengths he went to in order to produce his upcoming book in Word. I imagine someone of less talent and persistence than Petzold could not have pulled it off using Word, though they would have stood a better chance using LaTeX.

Before the 2007 version, Word documents were stored in an opaque binary format. This made it harder to compare two documents. A version control system, for example, could not diff two Word documents the same way it could diff two text files. It also made Word documents difficult to troubleshoot since you had no way to look beneath the WYSIWYG surface.

However, a Word 2007 document is a zip file containing a directory of XML files and embedded resources. You can change the extension of any Office 2007 file to .zip and unzip it, inspect and possibly change the contents, then re-zip it. This opens up many new possibilities.
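You don’t even need to rename the file to look inside. Here’s a minimal Python sketch using the standard zipfile module. The function names and the file name report.docx are hypothetical, and I’m assuming the main document body lives in word/document.xml, which is where Word normally puts it.

```python
import zipfile

def list_parts(path):
    """Return the names of the files stored inside an Office 2007 document."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

def read_part(path, part_name):
    """Return one of the XML files inside the document as text."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

For example, list_parts("report.docx") would list the XML files and embedded resources, and read_part("report.docx", "word/document.xml") would return the raw XML of the document body, which you could then diff as ordinary text.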

I’ve written some notes that may be useful for people wanting to try out LaTeX on Windows.

{ 2 comments }

Innovation III

by John on March 25, 2008

In his book Diffusion of Innovations Everett Rogers lists five factors in determining rate of adoption of an innovation.

First is the relative advantage of the innovation. This is not limited to objective improvements but also includes factors such as social prestige.

The second is compatibility with existing systems and values.

Third is complexity, especially perceived complexity.

The fourth is trialability, how easily someone can try out the innovation without making a commitment.

The fifth is observability, whether the advantages of the innovation are visible.

Innovators are often criticized for compatibility, for not making a larger break from the past. After Bjarne Stroustrup invented the C++ programming language, many people said he should have sacrificed compatibility with C in order to make C++ a better language. However, had he done so, C++ would not have become popular enough to gain the critics’ attention. As Stroustrup said in an interview, “There are just two kinds of languages: the ones everybody complains about and the ones nobody uses.”

{ 0 comments }

Alphabetical order is wrong

by John on March 12, 2008

Seth Godin posted an article today entitled Alphabetical order is obsolete. He makes a good argument that alphabetically sorted lists are often not the best user interface. I particularly liked his idea that an email client should sort junk mail according to the probability that each message is spam.

{ 1 comment }

A mountain of DVDs

by John on March 6, 2008

The March 6 Nature podcast has a story about the Large Hadron Collider. The LHC is expected to gather 15 petabytes (15,000,000 gigabytes) of data. One of the people interviewed said that 15 petabytes of data would require a stack of DVDs the height of Mont Blanc.


Mont Blanc and Dome du Gouter

{ 0 comments }

UT Ranger supercomputer

by John on March 3, 2008

Sun Microsystems CEO Jonathan Schwartz posted an article on his blog this morning about UT’s Ranger supercomputer.

Ranger computer with longhorn horns on top

Schwartz calls Ranger the world’s largest supercomputing cloud, and no doubt by some definition it is. The world probably has dozens of largest clusters depending on how you define the term.

{ 0 comments }

Code to make an XML sitemap

by John on February 25, 2008

Here’s some Python code to create a sitemap in the format specified by sitemaps.org and read by search engines. Download the file sitemapmaker.txt and change the extension from .txt to .py.

Change the url variable in the script before running it or else you’ll point search engines to my web site rather than yours. Also, edit the extensions_to_keep variable if you want to index any file types besides HTML and PDF.

Copy the file sitemapmaker.py to the directory on your computer where you have your files. Run the script and direct its output to a file, sitemapmaker.py > sitemap.xml. See sitemaps.org for instructions on how to let search engines know about your sitemap.

This code assumes all the files to index in your sitemap are in one directory, the directory you run the script from. It also assumes the timestamps on your computer match those on your web server. Optional fields are left out of the sitemap.
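For readers who just want the idea, here is a rough sketch of how such a script might look. It is not the contents of sitemapmaker.py. The url and extensions_to_keep names follow the description above, but their values, the make_sitemap function, and the decision to URL-quote file names are my assumptions for illustration.

```python
import os
import time
from urllib.parse import quote
from xml.sax.saxutils import escape

# Placeholder values: change these to match your own site.
url = "http://www.example.com/"
extensions_to_keep = (".html", ".htm", ".pdf")

def make_sitemap(directory=".", base_url=url, extensions=extensions_to_keep):
    """Build a sitemap for the files in one directory and return it as a string."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files whose extension is not on the list.
        if not os.path.isfile(path) or not name.lower().endswith(extensions):
            continue
        # Use the local file timestamp as the last-modified date.
        modified = time.strftime("%Y-%m-%d", time.gmtime(os.path.getmtime(path)))
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(base_url + quote(name))}</loc>")
        lines.append(f"    <lastmod>{modified}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(make_sitemap())
```

Like the script described above, this sketch looks at a single directory, takes dates from local file timestamps, and leaves out the changefreq and priority fields.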

{ 0 comments }

OpenDNS

by John on February 21, 2008

Phil Windley recently released a podcast about OpenDNS. My first thought was to wonder why anyone would want to tweak their DNS, except for the most sophisticated users. But in some ways, the least sophisticated users have the most to gain from a service like OpenDNS since it provides extra protection from online mischief.

{ 0 comments }

Introduction to Mac for Windows developers

by John on February 13, 2008

Here are a couple podcasts introducing Windows developers to software development on the Macintosh.

Scott Hanselman: What’s it like for Mac Developers, an interview with Steven Frank

.NET Rocks: Miguel de Icaza and Geoff Norton on Mono, mostly about .NET development on the Mac

Also, there are a lot of Mac-related talks on the GeekCruise podcast. The talks from January 2007 were directed at a general audience new to the Mac.

Hanselman’s podcast talks about some of the cultural differences between Microsoft and Apple customers. For example, Mac users update their OS more often and complain less about OS changes that break software.

{ 1 comment }