PowerShell one-liner to filter a sitemap

by John on August 1, 2008

Suppose you have an XML sitemap and you want to extract a flat list of URLs. This PowerShell code will do the trick.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc}

This code calls Get-Content, using the alias gc, to read the file sitemap.xml and casts its contents to an XML document object. It then collects every <url> element into an array, pipes that array to the ForEach-Object command, using the alias %, and selects the content of each <loc> element, which is the actual URL.
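To see each step separately, here is a rough sketch of the same pipeline written out with full command names. The sample sitemap text and the variable names ($sample, $doc, $entries, $urls) are made up for illustration; the XML is inlined so the example does not depend on a file on disk.

```powershell
# Hypothetical sample sitemap, inlined in place of sitemap.xml.
$sample = @'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/paper.pdf</loc></url>
  <url><loc>http://www.example.com/notes.html</loc></url>
</urlset>
'@

# Cast the text to an XML document, as the one-liner does with the file contents.
$doc = [xml] $sample

# One element per <url> block.
$entries = $doc.urlset.url

# Pull the <loc> text out of each entry.
$urls = $entries | ForEach-Object { $_.loc }

# Print the flat list of URLs.
$urls
```

The one-liner is the same thing with the intermediate variables removed and the aliases gc and % substituted for Get-Content and ForEach-Object.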

Now if you want to filter the list further, say to pull out all the PDF files, you can pipe the previous output to a Where-Object filter.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc} |
        ? {$_ -like '*.pdf'}

This code uses the ? alias for the Where-Object command. The -like operator does wildcard matching in the style of command-line file patterns. To filter on a regular expression instead, use -match.
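As a small sketch of the difference between the two operators, the following filters the same list both ways. The URLs and the variable names ($urls, $byLike, $byMatch) are hypothetical stand-ins for the output of the sitemap one-liner.

```powershell
# Hypothetical URL list standing in for the output of the sitemap one-liner.
$urls = @(
    'http://www.example.com/',
    'http://www.example.com/paper.pdf',
    'http://www.example.com/REPORT.PDF',
    'http://www.example.com/pdf-guide.html'
)

# Wildcard match: * stands for any run of characters.
$byLike = @($urls | Where-Object { $_ -like '*.pdf' })

# Regular expression match: a literal dot, then "pdf", anchored at the end.
$byMatch = @($urls | Where-Object { $_ -match '\.pdf$' })

# Print both result lists.
$byLike
$byMatch
```

Both operators ignore case by default, so each filter keeps paper.pdf and REPORT.PDF and drops pdf-guide.html. The case-sensitive variants are -clike and -cmatch.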

Related resources: PowerShell script to make an XML sitemap, Regular expressions in PowerShell


1

Codewiz51 08.03.08 at 17:50

John,

You are really into this PowerShell thing. The syntax reminds me of Perl and bash. How long did it take you to get up to speed?

2

John 08.03.08 at 18:04

It doesn’t take long to get up to speed with PowerShell. It’s very consistent. Bruce Payette’s book is a good place to start. He has an appendix on coming to PowerShell from various backgrounds: Unix shells, DOS, Perl.

Everyone runs into the same small number of gotchas when learning PowerShell.

There are a ton of resources at PowerShellCommunity.org including a forum where you can post a chunk of code and have people critique it.
