PowerShell one-liner to filter a sitemap

Suppose you have an XML sitemap and you want to extract a flat list of URLs. This PowerShell code will do the trick.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc}

This code calls Get-Content, using the alias gc, to read the file sitemap.xml and casts the contents to an XML document object. It then takes the array of all <url> elements, pipes that array to the ForEach-Object command, using the alias %, and selects the content of each <loc> tag, which is the actual URL.
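
As a rough illustration, here is the same pipeline with the alias spelled out. The inline sample XML and the example.com URLs are made up so that the snippet runs without a sitemap.xml file on disk.

```powershell
# Hypothetical two-entry sitemap, inlined so the example is self-contained
$sample = @'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/index.html</loc></url>
  <url><loc>http://www.example.com/report.pdf</loc></url>
</urlset>
'@

# Same pipeline as above, with ForEach-Object in place of the % alias
$urls = ([xml] $sample).urlset.url | ForEach-Object { $_.loc }
$urls
```

Each item in $urls is a plain string, so the list can be piped on to other commands.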

Now if you want to filter the list further, say to pull out all the PDF files, you can pipe the previous output to a Where-Object filter.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc} |
        ? {$_ -like "*.pdf"}

This code uses the ? alias for the Where-Object command. The -like operator uses command-line-style wildcard matching. You could use -match to filter on a regular expression instead.
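
To make the difference concrete, here is a small sketch of both operators applied to a made-up list of URLs (the example.com addresses are placeholders, not from a real sitemap).

```powershell
# Made-up URLs for illustration
$urls = "http://www.example.com/a.pdf",
        "http://www.example.com/b.html",
        "http://www.example.com/c.PDF"

# Wildcard match; -like ignores case by default
$byLike = $urls | Where-Object { $_ -like "*.pdf" }

# Regular expression match; the dot is escaped and $ anchors the end
$byMatch = $urls | Where-Object { $_ -match '\.pdf$' }
```

Both filters keep a.pdf and c.PDF and drop b.html. The case-sensitive variants are -clike and -cmatch.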

Related resources: PowerShell script to make an XML sitemap, Regular expressions in PowerShell

Launch PowerShell 6x faster

In the latest Windows PowerShell blog post, Jeffrey Snover points to an earlier post he wrote about how to make PowerShell launch much faster. On my desktop, the time to launch PowerShell went from around 13 seconds to around 2 seconds after applying the fix Snover recommends.

Comparing PowerShell and Bash

On the Windows PowerShell blog, Jeffrey Snover links to an article in Linux Magazine by Marcus Nasarek comparing Windows PowerShell and Linux’s bash shell.

The article’s sequence is unexpected. Not until near the end of the article does Nasarek get to the main difference between PowerShell and bash: PowerShell pipes objects, not text. Nasarek says regarding PowerShell’s object pipeline “Bash cannot compete here.” He says that the disadvantage of bash in this regard is that “it relies on the abilities of external programs to handle data structures.” That is an understatement. The disadvantage of bash is that it requires fragile, ad hoc text manipulation to pluck data out of the pipeline.

Nasarek was fair to PowerShell, but he was limited by space. He had only two pages for his article, and only about half of those two pages were devoted to text.

Monitoring legacy code that fails silently

Clift Norris and I just posted an article on CodeProject entitled Monitoring Unreliable Scheduled Tasks. It describes some software Clift wrote to resolve problems we had calling legacy software that would fail silently. His software adds, from the outside, the monitoring and logging functions that better software would have provided on the inside.

The monitoring and logging software, called RunAndWait, kicks off a child process and waits a specified amount of time for the process to complete. If the child does not complete in time, a list of people is notified by email. The software also checks return codes and writes all its activity to a log.
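
The pattern is easy to sketch in PowerShell itself. This is not the RunAndWait source, only a minimal illustration of the idea. The function name Invoke-WithTimeout, its parameters, and the log file name are all made up, and the email step is left as a comment.

```powershell
# Hypothetical helper, not the actual RunAndWait code
function Invoke-WithTimeout
{
    param(
        [string] $FilePath,
        [string] $Arguments = "",
        [int]    $TimeoutSeconds = 60,
        [string] $LogPath = "runandwait-sketch.log"
    )

    # Start the child process directly, without going through the shell
    $info = New-Object System.Diagnostics.ProcessStartInfo
    $info.FileName = $FilePath
    $info.Arguments = $Arguments
    $info.UseShellExecute = $false
    $proc = [System.Diagnostics.Process]::Start($info)

    # WaitForExit returns $false if the timeout expires first
    $finished = $proc.WaitForExit($TimeoutSeconds * 1000)
    $exitCode = $null
    if ($finished)
    {
        $exitCode = $proc.ExitCode
    }
    else
    {
        # This is where a real tool would email the people to notify
    }

    # Record what happened, with a sortable timestamp
    $line = "{0:s} {1} finished={2} exit={3}" -f (Get-Date), $FilePath, $finished, $exitCode
    Add-Content -Path $LogPath -Value $line

    # Return a summary object so the caller can react to failures
    New-Object PSObject -Property @{ Finished = $finished; ExitCode = $exitCode }
}
```

A caller would treat either Finished being false or a nonzero ExitCode as a failure worth reporting.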

RunAndWait is a simple program, but it has proven very useful over the last year and a half since it was written. We use RunAndWait in combination with PowerShell for scheduling our nightly processes to interact with the legacy system. Since PowerShell has verbose error reporting, calling RunAndWait from PowerShell rather than from cmd.exe gives additional protection against possible silent failures.

PowerShell script to make an XML sitemap

A while back I wrote a post on how to create a sitemap in the standard sitemap.org format using Python. This post does the same task using PowerShell. The solution presented here is an idiomatic PowerShell solution using pipes, not a direct translation of the Python code. I’ll introduce the script in pieces, then present the entire script at the end.

The final line of the script is

dir | wrap | out-file -encoding ASCII sitemap.xml

The heart of the script is the function wrap that wraps each file’s properties in the necessary XML tags. This function uses the pipeline, and so it has begin, process, and end blocks. The begin block prints out the XML header and the opening <urlset> tag. The end block prints out the closing </urlset> tag. In between is the process block that does most of the work.

Since all unassigned expressions are returned from PowerShell functions, the code is very clean. No need for print statements, just state the strings that make up the output. Variable interpolation helps keep the code succinct as well: simply use the name of a variable where you want to insert that variable’s value in a string. (Be sure to use double quotes if you want interpolation.)
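
Here is a toy example of both points, separate from the sitemap script. The function name greet and the sample values are made up for illustration.

```powershell
# Toy pipeline function; the name "greet" is just for illustration
function greet
{
    begin   { "start" }        # emitted once, before any input
    process { "hello, $_" }    # emitted once per pipeline item
    end     { "done" }         # emitted once, after all input
}

$out = "a", "b" | greet
# $out now holds four strings: start / hello, a / hello, b / done

$name = "sitemap"
$double = "file: $name"    # double quotes interpolate: file: sitemap
$single = 'file: $name'    # single quotes do not:     file: $name
```

No print statements appear anywhere; each bare string simply becomes part of the function's output.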

The wrap function uses the implicit variable $_, which means “the current object in the pipeline.” Since we’re piping in the output of dir (alias for Get-ChildItem), $_ represents a FileSystemInfo object. We look at the extension property on this object to see whether the file is one of the types we want to include in the sitemap, in this case .html, .htm, or .pdf. You can edit the value of the variable $extensions if you want to include different file types in your sitemap.

Getting the file timestamp in the necessary format is particularly easy. The format specifier {0:s} causes the date and time to be written in the ISO 8601 format that the sitemap standard requires. The Z tacked on at the end says that time is UTC rather than some other time zone.
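
A quick way to see what the format specifier produces is to apply it to a fixed date. The timestamp below is made up so that the output is predictable.

```powershell
# A fixed, made-up timestamp so the output is predictable
$stamp = New-Object DateTime 2008, 6, 15, 9, 30, 0

# {0:s} is the sortable date/time format: yyyy-MM-ddTHH:mm:ss
$lastmod = "<lastmod>{0:s}Z</lastmod>" -f $stamp
$lastmod    # <lastmod>2008-06-15T09:30:00Z</lastmod>
```

The format specifier does no time zone conversion, so the Z is only truthful when the value is already in UTC. That is why the script reads the UTC version of the file's last write time.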

This script will produce a file sitemap.xml in the standard format. Once you upload the sitemap to your server, you’ve got to let the search engines know how to find it. The simplest way to do this is to create a file called robots.txt at the top of your site containing one line, Sitemap: followed by the URL of your sitemap.

Sitemap: http://www.yourdomain.com/sitemap.xml

Now here’s the full script.

# Change this to your URL
$domain = "http://www.yourdomain.com"

# file extensions to include in sitemap
$extensions = ".htm", ".html", ".pdf"

# wrap file information in XML tags
function wrap
{
    begin
    {
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    }

    process
    {
        if ($extensions -contains $_.extension)
        {
            "`t<url>"
            # use .Name explicitly; "$_" expands to the full path in newer PowerShell versions
            "`t`t<loc>$domain/$($_.Name)</loc>"
            "`t`t<lastmod>{0:s}Z</lastmod>" -f $_.LastWriteTimeUtc
            "`t</url>"
        }
    }

    end
    {
        "</urlset>"
    }
}

dir | wrap | out-file -encoding ASCII sitemap.xml

Uninitialized variables in PowerShell

I just got a bug report about an uninitialized variable in a PowerShell script I’d written. I’d gone through and renamed most instances of a variable, but not all. If I’d put Set-PsDebug -strict in my profile, the instance I missed would have been caught as an uninitialized variable error. I always used the analogous warning feature in other languages, such as use strict in Perl or option explicit in VB, but I haven’t gotten into the habit yet of using Set-PsDebug -strict in PowerShell.

Jeffrey Snover published an article a few days ago about the new Set-StrictMode cmdlet that will be part of version 2 of PowerShell and will replace Set-PsDebug -strict. The new feature will be more strict and will be more finely configurable.
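
The kind of bug described above is easy to reproduce. The sketch below assumes PowerShell version 2 or later, where Set-StrictMode is available. The function name Test-Typo and the variable names are made up.

```powershell
# Hypothetical example of a renamed variable with one missed instance
function Test-Typo
{
    # Strict mode applies to this function's scope and its children
    Set-StrictMode -Version Latest

    $count = 5
    try
    {
        $cuont + 1    # misspelled on purpose
    }
    catch
    {
        "caught: $($_.Exception.Message)"
    }
}

$result = Test-Typo
```

Without the Set-StrictMode line, the misspelled variable is silently treated as null and the function returns 1.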

Rounding and integer division in PowerShell

The way PowerShell converts floating point numbers to integers makes perfect sense, unless you’ve been exposed to another way first. PowerShell rounds floating point numbers to the nearest integer when casting to int. For example, in PowerShell [int] 1.25 evaluates to 1 but [int] 1.75 evaluates to 2.

When there isn’t a unique nearest integer, i.e. when the decimal part of a number is exactly 0.5, PowerShell rounds to the nearest even integer. This is known as banker’s rounding or round-to-even. So, for example, [int] 1.5 would round to 2 but so would [int] 2.5. The motivation for banker’s rounding is that it is unbiased in the sense that numbers of the form n + 0.5 will round up as often as down on average.
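
A few casts show the pattern, along with two .NET calls for the cases where you want a different rule. The variable names are only for illustration.

```powershell
# Casting to int rounds to nearest, with ties going to the even integer
$a = [int] 1.5    # 2
$b = [int] 2.5    # 2, not 3
$c = [int] 3.5    # 4

# Truncation, for anyone who wants the C behavior
$t = [math]::Truncate(1.75)    # 1

# Ties rounded away from zero, the rule many people learn in school
$r = [math]::Round(2.5, [MidpointRounding]::AwayFromZero)    # 3
```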

Apart from the detail of handling numbers ending in exactly one half, PowerShell does what most people would expect. However, people who program in C and related languages have different expectations. These languages truncate when converting floating point numbers to integers. For example, in C++ both int(1.25) and int(1.75) evaluate to 1. When I learned C, I found its behavior surprising. But now that PowerShell does what I once expected C to do, I find PowerShell surprising.

The PowerShell folks made the right decision for a couple of reasons. For one, they are being consistent with their decision to break with tradition when necessary to do what they believe is right. Also, their primary audience is system administrators, not programmers steeped in C++ or C#.

Another way PowerShell breaks from C tradition is integer division. For example, 5/4 evaluates to 1 in C, but 1.25 in PowerShell. Both language designs make sense in context. C is explicitly typed, and so the ratio of two integers is an integer. PowerShell is implicitly typed, so integers are converted to doubles when necessary.
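
The following lines show the conversion happening, and one way to get a C-style quotient when you need it. Note that casting the quotient to int is not the same as truncating, because of the rounding rule described above.

```powershell
$q = 5 / 4
$qType = $q.GetType().Name          # Double; $q is 1.25

# When the division is exact, the result stays an integer
$exactType = (4 / 2).GetType().Name # Int32

# C-style integer quotient
$cStyle = [math]::Truncate(7 / 2)   # 3

# Casting rounds instead: 3.5 goes to the nearest even integer
$cast = [int] (7 / 2)               # 4
```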

(As an aside, Python initially followed the C tradition regarding integer division, but future versions of the language will act more like PowerShell. In the future, the / operator will perform floating point division and the new // operator will perform integer division.)

Customizing the PowerShell command prompt II

I just picked up a copy of Windows PowerShell Cookbook by Lee Holmes. One of the first examples in the book is customizing the PowerShell command prompt. His example sets the command window title as part of the prompt function. For example, adding

$host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"

to the function given in my previous post would display the computer name and full path to the working directory in the title bar. The full code would be

function prompt
{
    $m = 30 # maximum prompt length
    $str = $pwd.Path
    if ($str.length -ge $m)
    {
        # The prompt will begin with "...",
        # end with ">", and in between contain
        # as many of the path characters as will fit,
        # reading from the end of the path.
        $str = "..." + $str.substring($str.length - $m + 4)
    }
    $host.UI.RawUI.WindowTitle = "$env:computername $($pwd.Path)"
    "$str> "
}
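
To check the truncation arithmetic without changing your prompt, the same logic can be pulled into a small helper and run on a sample path. The helper name Get-ShortPath and the path below are made up for illustration.

```powershell
# Hypothetical helper that applies the same truncation rule as the prompt function
function Get-ShortPath([string] $path, [int] $m = 30)
{
    if ($path.Length -ge $m)
    {
        # keep the last $m - 4 characters, leaving room for "..." and ">"
        "..." + $path.Substring($path.Length - $m + 4)
    }
    else
    {
        $path
    }
}

$short = Get-ShortPath "C:\Users\someone\Documents\projects\sitemap"
$short    # ...Documents\projects\sitemap
```

The sample path is 43 characters long. The shortened version is 29 characters, so with the closing ">" the visible prompt comes to exactly 30.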