PowerShell one-liner to filter a sitemap

Suppose you have an XML sitemap and you want to extract a flat list of URLs. This PowerShell code will do the trick.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc}

This code calls Get-Content, using the alias gc, to read the file sitemap.xml, and casts the result to an XML document object. The expression .urlset.url then yields an array with one element for each <url> block in the file. That array is piped to the ForEach-Object command, using the alias %, which selects the content of each <loc> tag, the actual URL.
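To see the steps separately, here is the same pipeline written out with full command names. This is only a sketch: the sitemap is an inline string with made-up example.com URLs rather than a file on disk, and the variable names are mine.

```powershell
# Hypothetical sample sitemap, inlined so the sketch runs without a file.
$sitemapText = @'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/report.pdf</loc></url>
  <url><loc>https://example.com/notes.html</loc></url>
</urlset>
'@

# Cast the text to an XML document object.
$doc = [xml] $sitemapText

# One element per <url> block.
$entries = $doc.urlset.url

# ForEach-Object (alias %) emits the text of each <loc> element.
$urls = $entries | ForEach-Object { $_.loc }
$urls
```

With a real file you would replace the inline string with the output of Get-Content, as in the one-liner.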

Now if you want to filter the list further, say to pull out all the PDF files, you can pipe the previous output to a Where-Object filter.

        ([xml] (gc sitemap.xml)).urlset.url | % {$_.loc} |
        ? {$_ -like "*.pdf"}

This code uses the ? alias for the Where-Object command. The -like operator does wildcard matching in the style of command line file patterns. To filter on a regular expression instead, use -match.
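Here is a small side-by-side sketch of the two operators. The URL list is invented for illustration, and the regular expression is just one reasonable way to say "ends in .pdf". Both operators ignore case by default.

```powershell
# Hypothetical URLs standing in for the sitemap output.
$urls = @(
    'https://example.com/',
    'https://example.com/report.pdf',
    'https://example.com/archive/summary.PDF',
    'https://example.com/pdf-guide.html'
)

# -like: wildcard pattern, must match the whole string.
$pdfByLike = $urls | Where-Object { $_ -like '*.pdf' }

# -match: regular expression; \. is a literal dot and $ anchors at the end.
$pdfByMatch = $urls | Where-Object { $_ -match '\.pdf$' }

$pdfByMatch
```

On this sample both filters keep the same two URLs, but -match lets you go further, for example '\.(pdf|docx?)$' to pick up Word documents as well.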

Related resources: PowerShell script to make an XML sitemap, Regular expressions in PowerShell

3 thoughts on “PowerShell one-liner to filter a sitemap”

  1. John,

    You are really into this PowerShell thing. The syntax reminds me of Perl and bash. How long did it take you to get up to speed?

  2. It doesn’t take long to get up to speed with PowerShell. It’s very consistent. Bruce Payette’s book is a good place to start. He has an appendix on coming to PowerShell from various backgrounds: Unix shells, DOS, Perl.

    Everyone runs into the same small number of gotchas when learning PowerShell.

    There are a ton of resources at PowerShellCommunity.org including a forum where you can post a chunk of code and have people critique it.

  3. I’m really enjoying the design and layout of your site. It’s very easy on the eyes, which makes it much more pleasant for me to come here and visit more often. Did you hire a developer to create your theme? Outstanding work!
