Validating a web site

The W3C HTML validator is convenient for validating HTML on a single page. The site lets you specify a file to validate by entering a URL, uploading a file, or pasting the content into a form. But the site is not designed for bulk validation. There is an offline version of the validator intended for bulk use, but it’s a Perl script and difficult to install, at least on Windows. There is also a web service API, but apparently it has no WSDL file and so is inaccessible from tools requiring such a file.

The WDG HTML validator is easier to use for a whole web site. It lets you enter a URL and it will crawl the entire site report the validation results for each page. (It’s not clear how it knows what pages to check. Maybe it looks for a sitemap or follows internal links.) If you’d like more control over the list of files it validates, you can paste a list of URLs into a form. (This was the motivation for my previous post on filtering an XML sitemap to create a list of URLs.)

PowerShell one-liner to filter a sitemap

Suppose you have an XML sitemap and you want to extract a flat list of URLs. This PowerShell code will do the trick.

        ([ xml ] (gc sitemap.xml)).urlset.url | % {$_.loc}

This code calls Get-Content, using the shortcut gc, to read the file sitemap.xml and casts the file to an XML document object. It then makes an array of all blocks of XML inside a <url> tag. It then pipes the array to the foreach command, using the shortcut %, and selects the content of the <loc> tag which is the actual URL.

Now if you want to filter the list further, say to pull out all the PDF files, you can pipe the previous output to a Where-Object filter.

        ([ xml ] (gc sitemap.xml)).urlset.url | % {$_.loc} |
        ? {$_ -like *.pdf}

This code uses the ? shortcut for the Where-Object command. The -like filter uses command line style matching. You could use -match to filter on a regular expression.

Related resources: PowerShell script to make an XML sitemap, Regular expressions in PowerShell

PowerShell script to make an XML sitemap

A while back I wrote a post on how to create a sitemap in the standard sitemap.org format using Python. This post does the same task using PowerShell. The solution presented here is an ideomatic PowerShell solution using pipes, not a direct translation of the Python code. I’ll introduce the script in pieces, then present the entire script at the end.

The final line of the script is

dir | wrap | out-file -encoding ASCII sitemap.xml

The heart of the script is the function wrap that wraps each file’s properties in the necessary XML tags. This function uses the pipeline, and so it has begin, process, and end blocks. The begin block prints out the XML header and the opening <urlset> tag. The end block prints out the closing </urlset> tag. In between is the process block that does most of the work.

Since all unassigned expressions are returned from PowerShell functions, the code is very clean. No need for print statements, just state the strings that make up the output. Variable interpolation helps keep the code succinct as well: simply use the name of a variable where you want to insert that variable’s value in a string. (Be sure to use double quotes if you want interpolation.)

The wrap function uses the implicit variable $_ which means “the next thing in the pipeline.” Since we’re piping in the output of dir (alias for Get-ChildItem), $_ represents a FileSystemInfo object. We look at the extension property on this object to see whether the file is one of the types we want to include in the sitemap. In this case, .html, .htm, or .pdf. Obviously you can edit the value of the variable $extensions if you want to include different file types in your sitemap.

Getting the file timestamp in the necessary format is particularly easy. The format specifier {0:s} causes the date and time to be written in the ISO 8601 format that the sitemap standard requires. The Z tacked on at the end says that time is UTC rather than some other time zone.

This script will produce a file sitemap.xml in the standard format. Once you upload the sitemap to your server, you’ve got to let the search engines know how to find it. The simplest way to do this is to create a file called robots.txt at the top of your site containing one line, Sitemap: followed by the URL of your sitemap.

Sitemap: http://www.yourdomain.com/sitemap.xml

Now here’s the full script.

# Change this to your URL
$domain = "http://www.yourdomain.com"

# file extensions to include in sitemap
$extensions = ".htm", ".html", ".pdf"

# wrap file information in XML tags
function wrap
{
    begin
    {
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    }

    process
    {
        if ($extensions -contains $_.extension)
        {
        "`t<url>"
        "`t`t<loc>$domain/$_</loc>"
        "`t`t<lastmod>{0:s}Z</lastmod>" -f $_.LastWriteTimeUTC
        "`t</url>"
        }
    }

    end
    {
        "</urlset>"
    }
}

dir | wrap | out-file -encoding ASCII sitemap.xml

Code to make an XML sitemap

Here’s some Python code to create a sitemap in the format specified by sitemaps.org and read by search engines. Download the file sitemapmaker.txt and change the extension from .txt to .py.

Change the url variable in the script before running it or else you’ll point search engines to my web site rather than yours. Also, edit the file extensions_to_keep variable if you want to index any file types besides HTML and PDF.

Copy the file sitemapmaker.py to the directory on your computer where you have your files. Run the script and direct its output to a file, sitemapmaker.py > sitemap.xml. See sitemaps.org for instructions on how to let search engines know about your sitemap.

This code assumes all the files to index in your sitemap are in one directory, the directory you run the script from. It also assumes the timestamps on your computer match those on your web server. Optional fields are left out of the sitemap.