Notes on HTML, XML, TeX, and Unicode

This week’s resource post: some notes on typesetting, Unicode, etc.

See also blog posts tagged LaTeX, HTML, and Unicode and the Twitter account TeXtip.

Last week: C++ resources

Next week: Special functions

Gutenberg + Readability

Here’s a very simple idea: Use Project Gutenberg for content and Readability for style.

Project Gutenberg has a large collection of public domain books in digital form. The books are available in several formats, none of which are ideal for reading. Project Gutenberg provides text without much styling in order to make it easier for people to use the content as they please.

You can go to the HTML version of a book on Gutenberg and use Readability (or Instapaper) to format it for easier reading. Importing the HTML page to a Kindle similarly improves the formatting.

***

Has anyone made a style sheet to approximate the look of Readability or Instapaper? I’d like to use something like that to improve the appearance of the static HTML pages on my site.

Styling HTML for mobile devices

Yesterday I thought about adding a style sheet for mobile devices to some static HTML pages. How hard could it be? CSS has a media type. Just set the media to handheld, specify a style sheet for mobile browsers, and you’re done.

Style sheet media types

One problem is that hand-held devices don’t always look for the handheld style sheet. According to Ben Henick, the handheld media type is “poorly supported by all but the very recently marketed devices, as of 2009.” From what I gather, most web sites try to infer the browser type on the server side and generate different HTML for mobile devices using PHP or something similar. Apparently static HTML markup has its limitations for now.

The iPhone doesn’t consider itself a hand-held device as far as CSS is concerned. Fair enough: perhaps the handheld designation is more for tiny screens like more traditional cell phones. But it’s not a desktop either.

You can’t target the iPhone with a simple media type, but you can use the following link element, which carries a CSS media query.

<link rel="stylesheet" type="text/css" href="iphone.css"
media="only screen and (max-device-width: 480px)" />

Of course a device other than an iPhone could grab this style sheet, and the iPhone will not grab the style sheet if Apple ever adds one more pixel to the browser width. But this is the approach Apple gives in their online documentation.

Testing

It’s difficult to find emulators to test how pages appear on mobile devices. Someone said on Stack Overflow that Opera has a Small Screen view that is useful for emulating mobile devices. However, that recommendation was from November 2008. The current version of Opera either no longer supports that feature or has moved it somewhere else.

The Web Developer plug-in for Firefox lets you specify whether you want to display your page with the screen or handheld style sheet, but it does not emulate a hand-held device.

Apple makes an emulator for the iPhone, but only for Macintosh computers. MobiOne makes an emulator for the iPhone that runs on Windows. However, the emulator does not recognize the CSS statement above. I don’t have an iPhone, but I was able to borrow an iPod Touch, which runs the same browser as the iPhone. My pages worked correctly on the iPod when they did not work on the MobiOne emulator.

Suggestions?

Does anyone have any suggestions for making static HTML pages more friendly to mobile browsers? Any suggestions for testing?

Complexity of HTML and LaTeX

Sometime around 1994, my office mate introduced me to HTML by saying it was 10 times simpler than LaTeX. At the time I thought he was right. Now I’m not so sure. Maybe he was right in 1994 when the expectations for HTML were very low.

It is easier to bang out a simple, ugly HTML page than to write your first LaTeX document. When you compare the time required to make an attractive document, the effort becomes more comparable. The more sophisticated you get, the simpler LaTeX becomes by comparison.

Of course the two languages are not exactly comparable. HTML targets a web browser while LaTeX targets paper. HTML would be much simpler if people only used it to create documents to print out on their own printer. A major challenge with HTML is not knowing how someone else will use your document. You don’t know what browser they will view it with, at what resolution, etc. For that matter, you don’t know whether they’re even going to view it at all — they may use a screen reader to listen to the document.

Writing HTML is much more complicated than writing LaTeX if you take a broad view of all that is required to do it well: learning about accessibility and internationalization, keeping track of browser capabilities and market shares, adapting to evolving standards, etc. The closer you look into it, the less HTML has in common with LaTeX. The two languages are not simply two systems of markup; they address different problems.


The good parts

I’ve written before about how I liked Douglas Crockford’s book JavaScript: The Good Parts and how I wish someone would write the corresponding book for R. I just found out this week that O’Reilly has published three more books along the lines of Crockford’s book:

I’m reading the HTML & CSS book. It’s a good read, but not quite what you might expect from the title. It’s not an introductory book on HTML or CSS. It assumes the reader is familiar with the basics of both languages. Instead it focuses on strategy for how to use the two languages.

HTML & CSS: The Good Parts reminds me of Scott Meyers’ Effective C++ books. These books assumed you knew the syntax of C++ but were looking for strategic advice for making the best use of the language. Some have argued that the fact Meyers needed to write these books is evidence that C++ is too complicated. The same could be said of HTML and especially CSS. Both C++ and web standards have evolved over time and are burdened with backward compatibility. But as Bjarne Stroustrup remarked,

There are just two kinds of languages: the ones everybody complains about and the ones nobody uses.


Sharps and flats in HTML

Apparently there’s no HTML entity for the flat symbol, ♭. In my previous post, I just spelled out B-flat because I thought that was safer; it’s possible not everyone would have the fonts installed to display B♭ correctly.

So how do you display music symbols for flat, sharp, and natural in HTML? You can insert any symbol if you know its Unicode value, though you run the risk that someone viewing the page may not have the necessary fonts installed to view the symbol. The Unicode values are U+266D for flat (♭), U+266E for natural (♮), and U+266F for sharp (♯).

Since the flat sign has Unicode value U+266D, you could enter &#x266d; into HTML to display that symbol.
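As a rough illustration (a Python sketch, not something from the original post), the numeric character reference for any symbol can be derived from its code point. The helper name char_ref is made up for this example.

```python
import unicodedata


def char_ref(ch: str) -> str:
    """Return the hexadecimal HTML numeric character reference for one character."""
    return f"&#x{ord(ch):x};"


# Flat, natural, and sharp signs, written as escapes so the source stays ASCII.
for symbol in "\u266d\u266e\u266f":
    # Show the code point, the official Unicode name, and the HTML reference.
    print(f"U+{ord(symbol):04X}", unicodedata.name(symbol), char_ref(symbol))
```

The same helper works for the ordinary number sign, which is one way to see that # and the sharp sign are distinct characters as far as a search engine is concerned.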

The sharp sign raises an interesting question. I’m sure most web pages referring to G-sharp would use the number sign # (U+0023) rather than the sharp sign ♯ (U+266F). And why not? The number sign is conveniently located on a standard keyboard and the sharp sign isn’t. It would be nice if people used sharp symbols rather than number signs. It would make it easier to search on specifically musical terms. But it’s not going to happen.

Update: See this post on font support for Unicode. Most people can see all three symbols, but some, especially Android users, might not see the natural sign.


Google Reader and HTML lists

Yesterday I wrote a post about how to start numbering a list in HTML at some point other than 1. Mark Reid and Thomas Guest pointed out that my example did not show up correctly in Google Reader.

Here’s how the list shows up when I browse directly to the post using Firefox 3 on Windows XP.

[Screen shot: the list as rendered in the browser]

But here’s what the same list looks like when I look at the post inside Google Reader.

[Screen shot: the list as rendered in Google Reader]

I don’t understand why the difference. In fact, I don’t understand in general why posts often look different in Google Reader. For example, the screenshots above are centered when you visit the blog directly, but are left-aligned in Google Reader. Also, the space between the images and the text is removed.

Do other RSS readers similarly mangle HTML? Any suggestions how to fix the problem?

Update: Changed the way images are centered per Thomas Guest’s suggestion.

Starting number for HTML lists

I recently found out how to make an HTML list start numbering somewhere other than at 1. This is handy when you have a list interrupted by some text and want to continue the original numbering without starting over. I’ve only been using HTML for 15 years. Maybe one of these days I’ll really learn it.

In the <ol> tag, add the attribute start="7", for example, to make the list start numbering with 7. The start attribute can be any integer, even negative.

For example, the seven dwarfs are

  1. Dopey
  2. Grumpy
  3. Doc
  4. Happy
  5. Bashful
  6. Sneezy

and last but not least

  7. Sleepy.
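For anyone generating such lists from a script, here is a minimal Python sketch of the markup involved. The function name ordered_list is hypothetical, and the paragraph between the two lists is just one way to write the interrupting text.

```python
from html import escape


def ordered_list(items, start=1):
    """Render items as an HTML ordered list whose numbering begins at `start`."""
    rows = "\n".join(f"  <li>{escape(item)}</li>" for item in items)
    return f'<ol start="{int(start)}">\n{rows}\n</ol>'


first = ["Dopey", "Grumpy", "Doc", "Happy", "Bashful", "Sneezy"]
print(ordered_list(first))
print("<p>and last but not least</p>")
# Continue the numbering where the first list stopped.
print(ordered_list(["Sleepy."], start=len(first) + 1))
```

The second list carries start="7", so a browser that honors the attribute numbers Sleepy as 7 rather than 1.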

Update: As pointed out in the comments below, the example in this post may not render correctly in your reader. See this post for a discussion of the problem.

Faint praise for Expression Web

I really like Expression Web, when it doesn’t crash. It generates standard-compliant XHTML, it’s pleasant to use, etc. But I’ve had it crash many times. It crashes when I have too many files open, or when I edit too big a file for too long.

I need to find something better, but I haven’t taken the time to evaluate other HTML editing environments. I’d like to keep using it if Microsoft would fix the crashes. They have posted an update, but the description implies it doesn’t fix the problems I’ve seen.

Syntax coloring for code samples in HTML

Syntax coloring makes it much easier to read source code, especially when you become accustomed to a particular color scheme. For example, I’m used to the default color scheme in Visual Studio: comments are green, keywords are blue, string literals are red, etc. Once you get used to color-coded source code, it’s hard to go back to black-and-white. However, the code samples here are monochrome and I’ve been thinking about doing something about that.

It’s possible to mark up code samples on the web just like any other chunk of HTML, but that would be time-consuming. Also, you want to leave code samples as plain text so readers can copy the source and paste it into their code. So what people usually do is use client-side JavaScript to change the way code samples are displayed in the browser. That way the code appears to be marked up but it’s still unadorned underneath.

I like the way code samples look on Scott Hanselman’s blog, and so I started to use the JavaScript solution that he recommends, SyntaxHighlighter. However, I decided to hold off when I read this comment on his blog.

I tried to use it. Unfortunately the scrolling errors I see on your and other sites are too much to bear.

More recently I heard that StackOverflow is using prettify.js to add syntax coloring for their code samples. I haven’t noticed any problems on that site, so I’m starting to try out prettify. I’ve heard that the script doesn’t always handle VB code correctly, but that doesn’t matter to me.

The prettify script is easy to use. You don’t have to tell it what programming language you’re highlighting. However, you do have the option of specifying the language, which presumably can’t hurt. I’ve tried it on a page containing C++ code and on another page containing Python code and they both look OK. My only complaint is that the default color scheme is not what I would have chosen. However, the color scheme can be modified by editing a style sheet and I intend to do that.

I’m going to start by experimenting with static files on my site. I’m more cautious about incorporating syntax coloring with the blog since I don’t know what interaction problems there could be with the WordPress software on the server or with blog reader software on the clients. Scott Hanselman says on his blog

… I couldn’t find a syntax highlighting solution that worked in EVERY feed reader. There are lots of problems with online ones like Google Reader and BlogLines if I, as the publisher, try to get tricky with the CSS.

So I may leave the code samples on the blog alone. I’m also not sure what I’ll do about PowerShell samples. The prettify script works well with C-family languages, but PowerShell syntax may be too far afield from what it expects.

Does anyone have suggestions in general? For PowerShell in particular?

RDFa

Phil Windley had a recent interview with Elias Torres and Ben Adida on RDFa. This is an emerging standard for adding semantic information to HTML documents. The “a” in RDFa stands for attributes. Rather than creating new documents, RDFa allows you to add RDF-like semantic information to your existing HTML pages via element attributes. A great deal of thought has gone into how to accomplish this without unwanted side effects such as changing how pages are rendered.

As I listened to the interview, I tried to think of how I would apply this to my web sites. There are standardized vocabularies for expressing friendship relationships, address and calendar information, audio file meta data, etc. However, little of this is relevant to my web sites. I’m ready to jump on the RDFa bandwagon once there are standard vocabularies more applicable to the kind of content I publish.

To learn more about RDFa, listen to Phil Windley’s podcast or read the RDFa primer.

Manipulating the clipboard with PowerShell

The PowerShell Community Extensions contain a couple of handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

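# Read the clipboard, escape HTML special characters, and write the result back.
# The ampersand is replaced first so the entities added later are not re-escaped.
$a = Get-Clipboard
$a = $a -replace "&", "&amp;"
$a = $a -replace "<", "&lt;"
$a = $a -replace ">", "&gt;"
$a = $a -replace '"', "&quot;"
$a = $a -replace "'", "&#39;"
Out-Clipboard $a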

So this C++ code

double& x = y;
char c = 'k';
string foo = "hello";
if (p < q) ...

turns into this HTML code

double&amp; x = y;
char c = &#39;k&#39;;
string foo = &quot;hello&quot;;
if (p &lt; q) ...
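The same sequence of replacements can be sketched in Python for readers without PSCX. This is an illustration only, not the script described above, and the function name escape_html is invented here. The order of the replacements matters for the same reason as in the PowerShell version: the ampersand has to be handled first.

```python
def escape_html(text: str) -> str:
    """Escape the five characters that html.ps1 handles, ampersand first."""
    replacements = [
        ("&", "&amp;"),   # must come first, or the entities below get re-escaped
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&#39;"),
    ]
    for raw, entity in replacements:
        text = text.replace(raw, entity)
    return text


source = 'double& x = y;\nchar c = \'k\';\nstring foo = "hello";\nif (p < q) ...'
print(escape_html(source))
```

Python’s standard library function html.escape does nearly the same job, except that it writes the apostrophe as &#x27; rather than &#39;.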

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:http://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX; it has been assigned a work item.
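Until then, one possible workaround is to pull the text out from between the fragment markers yourself. The sketch below is in Python rather than PowerShell and is not part of PSCX. The name extract_fragment is hypothetical, and the code assumes the StartFragment and EndFragment comments appear exactly as in the dump above.

```python
import re

# Non-greedy match between the two marker comments; DOTALL lets it span lines.
FRAGMENT = re.compile(r"<!--StartFragment-->(.*?)<!--EndFragment-->", re.DOTALL)


def extract_fragment(clipboard_text: str) -> str:
    """Return the text between the fragment markers, or the input if none are found."""
    match = FRAGMENT.search(clipboard_text)
    return match.group(1) if match else clipboard_text


sample = """Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:http://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>
"""
print(extract_fragment(sample))
```

If the copied selection contains markup, the fragment will still include HTML tags, so this only recovers plain text in simple cases like the one shown.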

Comparing Google and Yahoo automatic translation

I played around with Google’s translator a little after adding some notranslate directives as discussed in my previous post. Google did honor my requests to mark some sections as literal text to not be translated. Google’s translator was also able to recognize my name as a name without special markup. Yahoo, on the other hand, translated my name, turning “Cook” into “Cuisinier” in French.

Google treated text inside <code> tags as literals that should not be translated. That is, Google would leave my source code snippets alone and only translate the English prose surrounding the code. Yahoo, on the other hand, would translate everything, including source code. For example, I had some PowerShell code on my page with the keyword matches that Google left alone but Yahoo translated into “allumettes,” presumably good French prose but not a legal PowerShell keyword.

One puzzling thing about the Google translation engine was that it would change which text was hyperlinked. For example, the text “My résumé” was changed to “Mon CV,” linking on the translation for “my.” Yahoo produced what I expected, “Mon résumé.” There were several other instances in which Google produced odd links, such as hyperlinking the | marker between words that were linked before. For example, the footer of my web site has these links:

Home | Sitemap | My blog | Search

Yahoo turned this into

Maison | Sitemap | Mon blog | Recherche

while Google produced

Accueil | Plan du site | Mon blog | Recherche

So Google incorporated the separator bars as part of words, and moved the last link from “Recherche” to the bar separating “blog” and “Recherche.”

One advantage of Google’s translation is that it lets you hover your mouse over a line of translated text and see the original text.