Syntax coloring for code samples in HTML

Syntax coloring makes it much easier to read source code, especially when you become accustomed to a particular color scheme. For example, I’m used to the default color scheme in Visual Studio: comments are green, keywords are blue, string literals are red, etc. Once you get used to color-coded source code, it’s hard to go back to black-and-white. However, the code samples here are monochrome and I’ve been thinking about doing something about that.

It’s possible to mark up code samples on the web just like any other chunk of HTML, but that would be time-consuming. Also, you want to leave code samples as plain text so readers can copy the source and paste it into their code. So what people usually do is use client-side JavaScript to change the way code samples are displayed in the browser. That way the code appears to be marked up but it’s still unadorned underneath.

I like the way code samples look on Scott Hanselman’s blog, and so I started to use the JavaScript solution that he recommends, SyntaxHighlighter. However, I decided to hold off when I read this comment on his blog.

I tried to use it. Unfortunately the scrolling errors I see on your and other sites are too much to bear.

More recently I heard that StackOverflow is using prettify.js to add syntax coloring for their code samples. I haven’t noticed any problems on that site, so I’m starting to try out prettify. I’ve heard that the script doesn’t always handle VB code correctly, but that doesn’t matter to me.

The prettify script is easy to use. You don’t have to tell it what programming language you’re highlighting. However, you do have the option of specifying the language, which presumably can’t hurt. I’ve tried it on a page containing C++ code and on another page containing Python code and they both look OK. My only complaint is that the default color scheme is not what I would have chosen. However, the color scheme can be modified by editing a style sheet and I intend to do that.

I’m going to start by experimenting with static files on my site. I’m more cautious about incorporating syntax coloring with the blog since I don’t know what interaction problems there could be with the WordPress software on the server or with blog reader software on the clients. Scott Hanselman says on his blog

… I couldn’t find a syntax highlighting solution that worked in EVERY feed reader. There are lots of problems with online ones like Google Reader and BlogLines if I, as the publisher, try to get tricky with the CSS.

So I may leave the code samples on the blog alone. Also, I’m not sure what I’ll do about PowerShell samples. The prettify script works well with C-family languages, but PowerShell syntax may be too far afield from what it expects.

Does anyone have suggestions in general? For PowerShell in particular?

RDFa

Phil Windley had a recent interview with Elias Torres and Ben Adida on RDFa. This is an emerging standard for adding semantic information to HTML documents. The “a” in RDFa stands for attributes. Rather than creating new documents, RDFa allows you to add RDF-like semantic information to your existing HTML pages via element attributes. A great deal of thought has gone into how to accomplish this without unwanted side effects such as changing how pages are rendered.

As I listened to the interview, I tried to think of how I would apply this to my web sites. There are standardized vocabularies for expressing friendship relationships, address and calendar information, audio file meta data, etc. However, little of this is relevant to my web sites. I’m ready to jump on the RDFa bandwagon once there are standard vocabularies more applicable to the kind of content I publish.

To learn more about RDFa, listen to Phil Windley’s podcast or read the RDFa primer.

Manipulating the clipboard with PowerShell

The PowerShell Community Extensions contain a couple handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

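# Read the clipboard, HTML-escape the text, and write it back (PSCX cmdlets).
$a = Get-Clipboard
# Escape & first so the entities added below are not escaped a second time.
$a = $a -replace "&", "&amp;"
$a = $a -replace "<", "&lt;"
$a = $a -replace ">", "&gt;"
$a = $a -replace '"', "&quot;"
$a = $a -replace "'", "&#39;"
Out-Clipboard $a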

So this C++ code

double& x = y;
char c = 'k';
string foo = "hello";
if (p < q) ...

turns into this HTML code

double&amp; x = y;
char c = &#39;k&#39;;
string foo = &quot;hello&quot;;
if (p &lt; q) ...
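
The order of the replacements matters: the ampersand has to be handled first, or the ampersands introduced by the later replacements would themselves be escaped. Here is a minimal Python sketch of the same five substitutions. The function name escape_for_pre is my own, and this is an illustration of the idea, not a port of html.ps1.

```python
def escape_for_pre(source: str) -> str:
    """Escape the five characters handled by the script, in the same order."""
    replacements = [
        ("&", "&amp;"),   # must come first, see the note above
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ]
    for raw, entity in replacements:
        source = source.replace(raw, entity)
    return source
```

Calling escape_for_pre on the C++ lines above should give the same HTML as the PowerShell script. Text that already contains an entity, such as &lt;, is escaped again to &amp;lt;, which is what you want when the goal is to display the source literally.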

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:http://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX; it has been assigned a work item.
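
The output above looks like the Windows HTML clipboard format: a short header followed by an HTML document in which comment markers delimit the copied fragment. Until the -Text option works, one workaround is to pull out whatever sits between those markers. The following Python sketch shows the idea. The function name is hypothetical, and it assumes the clipboard text has already been captured as a string.

```python
START = "<!--StartFragment-->"
END = "<!--EndFragment-->"

def extract_fragment(clipboard_text: str) -> str:
    """Return the text between the fragment markers, or the input unchanged
    if the markers are not present (for example, for plain-text clipboard data)."""
    start = clipboard_text.find(START)
    if start == -1:
        return clipboard_text
    body_start = start + len(START)
    end = clipboard_text.find(END, body_start)
    if end == -1:
        return clipboard_text
    return clipboard_text[body_start:end]
```

For the sample above this returns the word snippets. Note that the fragment is still HTML, so text copied with formatting may contain tags that need to be stripped separately.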

Comparing Google and Yahoo automatic translation

I played around with Google’s translator a little after adding some notranslate directives as discussed in my previous post. Google did honor my requests to mark some sections as literal text to not be translated. Google’s translator was also able to recognize my name as a name without special markup. Yahoo, on the other hand, translated my name, turning “Cook” into “Cuisinier” in French.

Google treated text inside <code> tags as literals that should not be translated. That is, Google would leave my source code snippets alone and only translate the English prose surrounding the code. Yahoo, on the other hand, would translate everything, including source code. For example, I had some PowerShell code on my page with the keyword matches that Google left alone but Yahoo translated into “allumettes,” presumably good French prose but not a legal PowerShell keyword.

One puzzling thing about the Google translation engine was that it would change which text was hyperlinked. For example, the text “My résumé” was changed to “Mon CV,” with the link placed only on the translation for “my.” Yahoo produced what I expected, “Mon résumé.” There were several other instances in which Google produced odd links, such as hyperlinking the | marker between words that were linked before. For example, the footer of my web site has these links:

Home | Sitemap | My blog | Search

Yahoo turned this into

Maison | Sitemap | Mon blog | Recherche

while Google produced

Accueil | Plan du site | Mon blog | Recherche

So Google incorporated the separator bars as part of words, and moved the last link from “Recherche” to the bar separating “blog” and “Recherche.”

One advantage of Google’s translation is that it lets you hover your mouse over a line of translated text and see the original text.

Giving hints to automatic translators

One problem with machine translation is that machines don’t know when to stop translating. For example, Yahoo’s Babel Fish translator translates my last name “Cook” literally to “Cocinero” in Spanish and “Cuisinier” in French.

Today Google announced a way to tell its translator that text should not be translated. Place such text inside a <span> tag with the attribute class="notranslate". I tried this on a web page that explained that a certain piece of code printed out “Hello world.” Since “Hello world” is literal output, it should be left untranslated, not turned into, for example, “Bonjour le monde.” The solution was to modify the HTML to say

The code above prints &ldquo;<span class="notranslate">Hello world</span>.&rdquo;

To prevent an entire page from being translated, add the following tag in the <head> section of the page.

<meta name="google" value="notranslate">

I suppose other machine translation efforts, such as those from Microsoft and Yahoo, will follow Google’s lead and support the class="notranslate" directive.
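
To make the span directive concrete, here is a toy Python sketch of how a tool might separate text that can be translated from text marked as literal. It uses the standard library’s html.parser. The class and function names are my own invention, and this says nothing about how Google’s translator actually works.

```python
from html.parser import HTMLParser

class TranslatableText(HTMLParser):
    """Split page text into translatable text and text protected by notranslate."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []          # one flag per open element: is it protected?
        self.translatable = []
        self.literal = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        inherited = bool(self.stack and self.stack[-1])
        # An element is protected if it or any ancestor carries the class.
        self.stack.append(inherited or "notranslate" in classes)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        protected = bool(self.stack and self.stack[-1])
        (self.literal if protected else self.translatable).append(data)

def split_text(html: str):
    """Return (translatable_text, literal_text) for an HTML fragment."""
    parser = TranslatableText()
    parser.feed(html)
    parser.close()
    return "".join(parser.translatable), "".join(parser.literal)
```

Applied to the “Hello world” line above, split_text puts “Hello world” in the literal part and everything else in the translatable part. The sketch assumes reasonably well-formed markup; unbalanced tags would confuse the simple stack.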

The Holy Grail of CSS

Basic tasks are simple in CSS, but even slightly harder tasks can be incredibly difficult. Controlling fonts, margins, and so forth is a piece of cake. But controlling page layout is another matter. In his book Refactoring HTML, Elliotte Rusty Harold describes a technique as

so tricky that it took many smart people quite a few years of experimentation to develop the technique shown here. In fact, so many people searched for this while believing that it didn’t actually exist that this technique goes under the name “The Holy Grail.”

What is the incredibly difficult task that took so many years to discover? Teaching a web browser to play chess using only style sheets? No, three column layout. I kid you not. He goes on to say

The goal is simple: two fixed-width columns on the left and the right and a liquid center for the content in the middle. (That something so frequently needed was so hard to invent doesn’t speak well of CSS as a language, but it’s the language we have to work with.)

You can read more about the Holy Grail of CSS in an article by Matthew Levine.

I appreciate the advantages of CSS, though I do wish it didn’t have such a hockey stick learning curve. I’ve heard people say not to bother learning overly difficult technologies because if you find it too difficult, so will everyone else and it will die off. But CSS seems to be firmly established with no competitor.

Side benefits of accessibility

Elliotte Rusty Harold makes the following insightful observation in his new book Refactoring HTML:

Wheelchair ramps are far more commonly used by parents with strollers, students with bicycles, and delivery people with hand trucks than they are by people in wheelchairs. When properly done, increasing accessibility for the disabled increases accessibility for everyone.

For example, web pages accessible to the visually impaired are also more accessible to search engines and mobile devices.

Migrating from HTML to XHTML

I migrated the HTML pages on my web site to XHTML this weekend. I’ve been hesitant to do this after hearing a couple horror stories of how a slight error could have big consequences (for example, see how extra slashes caused Google to stop indexing CodeProject) but I took the plunge. Mostly this was a matter of changing <br> to <br /> etc. But I did discover a few errors such as missing </p> tags. I was surprised to find such things because I had previously validated the site as HTML 4.01 strict.
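
The mechanical part of the conversion, closing empty elements such as <br>, can be roughed out with a regular expression. The Python sketch below is only an illustration with a function name of my choosing, not the process I actually used. It covers a handful of empty elements, lowercases their tag names, and would be fooled by a > inside an attribute value, so the result still needs to go through a validator.

```python
import re

# Matches a few empty ("void") elements, with or without a trailing slash.
VOID_TAG = re.compile(r"<(br|hr|img|input|meta|link)\b([^<>]*?)\s*/?>", re.IGNORECASE)

def close_void_tags(html: str) -> str:
    """Rewrite tags such as <br> or <IMG ...> in XHTML style: <br />, <img ... />."""
    return VOID_TAG.sub(
        lambda m: f"<{m.group(1).lower()}{m.group(2)} />",
        html,
    )
```

Tags that are already self-closed pass through unchanged, so running the function twice is harmless. It does nothing about missing </p> tags, which is the kind of error that needs a real parser or a validator to find.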

I thought that HTML entities such as &beta; for the Greek letter β were illegal in XHTML. Apparently not. Three validators (Microsoft Expression Web 2, W3C, and WDG) all seem to think they’re OK. Apparently they’re defined in XHTML though not in XML in general. I looked at the official W3C docs and didn’t see anything ruling these out.

Also, I’ve read that <i> and <b> are not allowed in strict XHTML. That’s what Elliotte Rusty Harold says in Refactoring HTML, and he certainly knows more about (X)HTML than I do. But the three validators I mentioned before all approved of these tags on pages marked as XHTML strict. I changed the <i> and <b> tags to <em> and <strong> respectively just to be safe, but I didn’t see anything in the W3C docs suggesting that the <i> and <b> tags were illegal or even deprecated. (I understand that italic and bold refer to presentation rather than content, but it seems pedantic to me to suggest that <em> and <strong> are any different than their frowned-upon counterparts.)

Validating a web site

The W3C HTML validator is convenient for validating HTML on a single page. The site lets you specify a file to validate by entering a URL, uploading a file, or pasting the content into a form. But the site is not designed for bulk validation. There is an offline version of the validator intended for bulk use, but it’s a Perl script and difficult to install, at least on Windows. There is also a web service API, but apparently it has no WSDL file and so is inaccessible from tools requiring such a file.

The WDG HTML validator is easier to use for a whole web site. It lets you enter a URL and it will crawl the entire site and report the validation results for each page. (It’s not clear how it knows what pages to check. Maybe it looks for a sitemap or follows internal links.) If you’d like more control over the list of files it validates, you can paste a list of URLs into a form. (This was the motivation for my previous post on filtering an XML sitemap to create a list of URLs.)
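
As a rough illustration of producing such a list, here is a Python sketch that reads a sitemap in the standard sitemaps.org format and keeps only the URLs that look like HTML pages. The function name and the suffix filter are assumptions for the sake of the example, and this is not the script from the earlier post.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_xml: str, suffixes=(".html", ".htm")):
    """Return the <loc> URLs from a sitemap whose paths end in one of the suffixes."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # Hypothetical filter: keep only pages the validator should check.
    return [u for u in urls if u.lower().endswith(tuple(suffixes))]
```

Joining the result with newlines gives text that can be pasted into the validator’s form. Adjust the suffixes to match how your own site names its pages.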