HTML parsing landmines

Posted on 10 November 2008 by John

Phil Haack explains why parsing HTML isn’t as easy as it sounds in The Landmine of Parsing HTML and Stripping HTML comments.

Update: Phil Haack’s follow-up post.

Manipulating the clipboard with PowerShell

Posted on 17 October 2008 by John

The PowerShell Community Extensions contain a couple handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

Update: Looks like

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

    # Escape & first so the entities added below are not escaped a second time.
    $a = get-clipboard
    $a = $a -replace "&", "&amp;"
    $a = $a -replace "<", "&lt;"
    $a = $a -replace ">", "&gt;"
    $a = $a -replace '"', "&quot;"
    $a = $a -replace "'", "&#39;"
    out-clipboard $a

So this C++ code

    double& x = y;
    char c = 'k';
    string foo = "hello";
    if (p < q) ...

turns into this HTML code

    double&amp; x = y;
    char c = &#39;k&#39;;
    string foo = &quot;hello&quot;;
    if (p &lt; q) ...
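
The order of the replacements matters: the ampersand has to be handled first, otherwise the `&` in `&lt;` or `&quot;` would itself be escaped on a later pass. Here is a minimal sketch of the same idea in Python rather than PowerShell. The function name `escape_for_pre` is made up for illustration, and the clipboard step is left out.

```python
def escape_for_pre(text: str) -> str:
    """Escape the same five characters as html.ps1, ampersand first."""
    replacements = [
        ("&", "&amp;"),   # must run first, or it would re-escape the entities below
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ]
    for raw, entity in replacements:
        text = text.replace(raw, entity)
    return text


cpp_source = "double& x = y;\nchar c = 'k';\nstring foo = \"hello\";\nif (p < q) ..."
print(escape_for_pre(cpp_source))
```

Running the replacements in the opposite order would turn `<` into `&amp;lt;`, which a browser displays as the literal text `&lt;`.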

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

    Version:0.9
    StartHTML:00000136
    EndHTML:00000214
    StartFragment:00000170
    EndFragment:00000178
    SourceURL:https://www.codeproject.com/
    <html><body>
    <!--StartFragment-->snippets<!--EndFragment-->
    </body>
    </html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX.
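
The output above is the Windows "HTML Format" clipboard layout: a few header lines followed by markup, with the copied content sitting between the StartFragment and EndFragment comments. As a workaround, the fragment can be pulled out with a little text processing. The following is a rough Python sketch, not part of PSCX; the function name `extract_fragment` is hypothetical, and it relies on the comment markers rather than the byte offsets in the header.

```python
import re


def extract_fragment(clipboard_html: str) -> str:
    """Return the text between the fragment markers, or the input unchanged."""
    match = re.search(
        r"<!--StartFragment-->(.*?)<!--EndFragment-->",
        clipboard_html,
        re.DOTALL,  # the copied fragment may span several lines
    )
    return match.group(1) if match else clipboard_html


sample = """Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:https://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>
"""

print(extract_fragment(sample))  # snippets
```

The fragment is still HTML, so a richer selection than a single word may contain tags that need to be stripped separately.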

Comparing Google and Yahoo automatic translation

Posted on 14 October 2008 by John

I played around with Google’s translator a little after adding some notranslate directives as discussed in my previous post. Google did honor my requests to mark some sections as literal text to not be translated. Google’s translator was also able to recognize my name as a name without special markup. Yahoo, on the other hand, translated my name, turning “Cook” into “Cuisinier” in French.

Google treated text inside <code> tags as literals that should not be translated. That is, Google would leave my source code snippets alone and only translate the English prose surrounding the code. Yahoo, on the other hand, would translate everything, including source code. For example, I had some PowerShell code on my page with the keyword matches that Google left alone but Yahoo translated into “allumettes,” presumably good French prose but not a legal PowerShell keyword.

One puzzling thing about the Google translation engine was that it would change which text was hyperlinked. For example, the text “My résumé” was changed to “Mon CV,” linking on the translation for “my.” Yahoo produced what I expected, “Mon résumé.” There were several other instances in which Google produced odd links, such as hyperlinking the | marker between words that were linked before. For example, the footer of my website has these links:

Home | Sitemap | My blog | Search

Yahoo turned this into

Maison | Sitemap | Mon blog | Recherche

while Google produced

Accueil | Plan du site | Mon blog | Recherche

So Google incorporated the separator bars as part of words, and moved the last link from “Recherche” to the bar separating “blog” and “Recherche.”

One advantage of Google’s translation is that it lets you hover your mouse over a line of translated text and see the original text.

Giving hints to automatic translators

Posted on 14 October 2008 by John

One problem with machine translation is that machines don’t know when to stop translating. For example, Yahoo’s Babel Fish translator translates my last name “Cook” literally to “Cocinero” in Spanish and “Cuisinier” in French.

Today Google announced a way to tell its translator that text should not be translated. Place such text inside a <span> tag with the attribute class="notranslate". I tried this on a web page that explained that a certain piece of code printed out “Hello world.” Since “Hello world” is literal output, it should be left untranslated, not turned into, for example, “Bonjour le monde.” The solution was to modify the HTML to say

The code above prints <span class="notranslate">“Hello world.”</span>

To prevent an entire page from being translated, add the following tag in the <head> section of the page.

<meta name="google" value="notranslate">

I suppose other machine translation efforts, such as those from Microsoft and Yahoo, will follow Google’s lead and support the class=notranslate directive.

The Holy Grail of CSS

Posted on 14 August 2008 by John

Basic tasks are simple in CSS, but even slightly harder tasks can be incredibly difficult. Controlling fonts, margins, and so forth is a piece of cake. But controlling page layout is another matter. In his book Refactoring HTML, Elliotte Rusty Harold describes a technique as

so tricky that it took many smart people quite a few years of experimentation to develop the technique shown here. In fact, so many people searched for this while believing that it didn’t actually exist that this technique goes under the name “The Holy Grail.”

What is the incredibly difficult task that took so many years to discover? Teaching a web browser to play chess using only style sheets? No, three column layout. I kid you not. He goes on to say

The goal is simple: two fixed-width columns on the left and the right and a liquid center for the content in the middle. (That something so frequently needed was so hard to invent doesn’t speak well of CSS as a language, but it’s the language we have to work with.)

You can read more about the Holy Grail of CSS in an article by Matthew Levine.

I appreciate the advantages of CSS, though I do wish it didn’t have such a hockey stick learning curve. I’ve heard people say not to bother learning overly difficult technologies because if you find it too difficult, so will everyone else and it will die off. But CSS seems to be firmly established with no competitor.

Side benefits of accessibility

Posted on 3 August 2008 by John

Elliotte Rusty Harold makes the following insightful observation in his new book Refactoring HTML:

Wheelchair ramps are far more commonly used by parents with strollers, students with bicycles, and delivery people with hand trucks than they are by people in wheelchairs. When properly done, increasing accessibility for the disabled increases accessibility for everyone.

For example, web pages accessible to the visually impaired are also more accessible to search engines and mobile devices.

Migrating from HTML to XHTML

Posted on 3 August 2008 by John

I migrated the HTML pages on my website to XHTML this weekend. I’ve been hesitant to do this after hearing a couple horror stories of how a slight error could have big consequences (for example, see how extra slashes caused Google to stop indexing CodeProject [link died]) but I took the plunge. Mostly this was a matter of changing <br> to <br /> etc. But I did discover a few errors such as missing tags. I was surprised to find such things because I had previously validated the site as HTML 4.01 strict.

I thought that HTML entities such as &beta; for the Greek letter β were illegal in XHTML. Apparently not. Three validators (Microsoft Expression Web 2, W3C, and WDG) all seem to think they’re OK. Apparently they’re defined in XHTML though not in XML in general. I looked at the official W3C docs and didn’t see anything ruling these out.
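
The difference between XHTML and XML in general can be seen with a generic XML parser. XML itself predefines only five named entities (amp, lt, gt, quot, apos); names like beta come from the XHTML DTD. Below is a small Python sketch using the standard library's ElementTree, which does not load the XHTML DTD. The helper name `parses_as_plain_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_plain_xml(markup: str) -> bool:
    """Report whether a generic XML parser (no XHTML DTD) accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_plain_xml("<p>&beta;</p>"))  # False: beta is not defined without the DTD
print(parses_as_plain_xml("<p>&#946;</p>"))  # True: numeric references always work
print(parses_as_plain_xml("<p>&amp;</p>"))   # True: one of XML's five built-in entities
```

Numeric character references such as `&#946;` are therefore the safe choice if the same markup may be processed as plain XML.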

Also, I’ve read that <i> and <b> are not allowed in strict XHTML. That’s what Elliotte Rusty Harold says in Refactoring HTML, and he certainly knows more about (X)HTML than I do. But the three validators I mentioned before all approved of these tags on pages marked as XHTML strict. I changed the <i> and <b> tags to <em> and <strong> respectively just to be safe, but I didn’t see anything in the W3C docs suggesting that the <i> and <b> tags were illegal or even deprecated. (I understand that italic and bold refer to presentation rather than content, but it seems pedantic to me to suggest that <em> and <strong> are any different than their frowned-upon counterparts.)

Validating a website

Posted on 1 August 2008 by John

The W3C HTML validator is convenient for validating HTML on a single page. The site lets you specify a file to validate by entering a URL, uploading a file, or pasting the content into a form. But the site is not designed for bulk validation. There is an offline version of the validator intended for bulk use, but it’s a Perl script and difficult to install, at least on Windows. There is also a web service API, but apparently it has no WSDL file and so is inaccessible from tools requiring such a file.

The WDG HTML validator is easier to use for a whole website. It lets you enter a URL and it will crawl the entire site and report the validation results for each page. (It’s not clear how it knows what pages to check. Maybe it looks for a sitemap or follows internal links.) If you’d like more control over the list of files it validates, you can paste a list of URLs into a form. (This was the motivation for my previous post on filtering an XML sitemap to create a list of URLs.)
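
Producing such a list of URLs from a sitemap takes only a few lines. The sketch below is in Python and is not the script from the earlier post; it assumes a sitemap in the standard sitemaps.org format, and the function name `page_urls`, the suffix filter, and the example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_urls(sitemap_xml: str, suffixes=(".html", ".htm", "/")) -> list[str]:
    """Return the <loc> URLs whose endings suggest an HTML page."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if url.endswith(suffixes)]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/notes/accents.html</loc></url>
  <url><loc>https://example.com/files/paper.pdf</loc></url>
</urlset>"""

# One URL per line, ready to paste into a validator's batch form.
print("\n".join(page_urls(sample)))
```

Filtering by suffix keeps PDFs and other non-HTML files out of the list, since an HTML validator would only report noise for them.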

Accented letters in HTML, TeX, and MS Word

Posted on 14 July 2008 by John

I frequently need to look up how to add diacritical marks to letters in HTML, TeX, and Microsoft Word, though not quite frequently enough to commit the information to my long-term memory. So today I wrote up a set of notes on adding accents for future reference. Here’s a chart summarizing the notes.

Accent	HTML	TeX	Word
grave	`grave`	`` \` ``	`CTRL` + `` ` ``
acute	`acute`	`\'`	`CTRL` + `'`
circumflex	`circ`	`\^`	`CTRL` + `^`
tilde	`tilde`	`\~`	`CTRL` + `SHIFT` + `~`
umlaut	`uml`	`\"`	`CTRL` + `SHIFT` + `:`
cedilla	`cedil`	`\c`	`CTRL` + `,`
æ, Æ	`&aelig;`, `&AElig;`	`\ae`, `\AE`	`CTRL` + `SHIFT` + `&` + a or A
ø, Ø	`&oslash;`, `&Oslash;`	`\o`, `\O`	`CTRL` + `/` + o or O
å, Å	`&aring;`, `&Aring;`	`\aa`, `\AA`	`CTRL` + `SHIFT` + `@` + a or A

The notes go into more detail about how accents function in each environment and what limitations each has. For example, LaTeX will let you combine any accent with any letter, but MS Word and HTML only support letter/accent combinations that are common in spoken languages.
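
The accent names in the HTML column of the chart are suffixes: an accented letter's entity is the letter followed by the accent name, so é is `&eacute;`. The HTML limitation is easy to check against a table of entity names. Here is a short Python sketch that uses the standard library's table of HTML 4 entity names as a stand-in for "what HTML supports"; the function name `accent_entity` is hypothetical.

```python
from html import unescape
from html.entities import name2codepoint


def accent_entity(letter: str, accent: str):
    """Return the named entity for letter + accent, or None if HTML 4 has none."""
    name = letter + accent
    if name in name2codepoint:
        return f"&{name};"
    return None


print(accent_entity("e", "acute"))   # &eacute;
print(accent_entity("n", "tilde"))   # &ntilde;
print(accent_entity("q", "acute"))   # None: no language uses this combination
print(unescape("&ccedil;"))          # ç
```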

Greek letters and math symbols in (X)HTML

Posted on 14 June 2008 by John

It’s not hard to use Greek letters and math symbols in (X)HTML, but apparently it’s not common knowledge either. Many pages insert little image files every time they need a special character. Such web pages look a little like ransom notes with letters cut from multiple sources. Sometimes this is necessary but often it can be avoided.

I’ve posted a couple pages on using Greek letters and math symbols in HTML, XML, XHTML, TeX, and Unicode. I included TeX because it’s the lingua franca for math typography, and I included Unicode because the X(HT)ML representation of symbols is closely related to Unicode.

The notes give charts for encoding Greek letters and some of the most common math symbols. They explain how HTML and XHTML differ in this context and also discuss browser compatibility issues.
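
To make the connection with Unicode concrete: each named entity stands for one code point, and the same character can be written as a decimal or hexadecimal numeric reference. The Python sketch below uses the standard library's HTML 4 entity table; the function name `references` is made up for illustration.

```python
from html.entities import name2codepoint


def references(name: str) -> dict:
    """Show the named, decimal, and hex forms of an HTML entity, plus the character."""
    code_point = name2codepoint[name]
    return {
        "named": f"&{name};",
        "decimal": f"&#{code_point};",
        "hex": f"&#x{code_point:X};",
        "char": chr(code_point),
    }


for entity_name in ("alpha", "beta", "Sigma", "infin"):
    print(references(entity_name))
```

For beta this gives `&beta;`, `&#946;`, and `&#x3B2;`, all of which stand for the Unicode character U+03B2.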