Table-driven text munging in PowerShell

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

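param($regex_file)

# Each line of the file holds a pattern and its replacement, separated by whitespace.
$lines = get-content $regex_file

$a = get-clipboard

foreach( $line in $lines )
{
    # Collapse runs of whitespace, then split the line into a (pattern, replacement) pair.
    $line = ($line.trim() -replace "\s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    # -replace accepts the two-element array as pattern and replacement.
    $a = $a -replace $pair
}

out-clipboard $a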

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is passed in as an argument to the script.

&  &amp;
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of left over manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)
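Here is a minimal sketch of the same idea as a function that works on a string rather than the clipboard, so it runs without PSCX. The function name Invoke-ReplacementTable and its parameters are made up for illustration, and the table is held in memory instead of being read from a file.

```powershell
# Hypothetical helper: apply pattern/replacement lines to a string, top to bottom.
function Invoke-ReplacementTable {
    param(
        [string]$Text,
        [string[]]$TableLines
    )
    foreach ($line in $TableLines) {
        # Split on runs of whitespace; skip blank or malformed lines.
        $pair = $line.Trim() -split '\s+'
        if ($pair.Count -ne 2) { continue }
        # $pair[0] is treated as a regular expression, $pair[1] as the replacement.
        $Text = $Text -replace $pair[0], $pair[1]
    }
    return $Text
}

# Same layout as the pattern file: pattern, whitespace, replacement.
$table = @(
    '&  &amp;'
    '<  &lt;'
    '>  &gt;'
)

$result = Invoke-ReplacementTable -Text 'a < b && c > d' -TableLines $table
$result   # a &lt; b &amp;&amp; c &gt; d
```

Two limitations of this whitespace-delimited format are worth keeping in mind. A pattern or replacement cannot itself contain a space, and every pattern is a regular expression, so a literal such as . or ( would need to be escaped in the table (for example with [regex]::Escape).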

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.
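A short sketch makes the ordering problem concrete. It applies the same two substitutions to one line of text in both orders; the variable names are only for illustration.

```powershell
$text = 'if (p < q) ...'

# Ampersands first, then less-than signs: the intended result.
$ampFirst = ($text -replace '&', '&amp;') -replace '<', '&lt;'

# Less-than signs first: the & introduced by &lt; is escaped a second time.
$ltFirst = ($text -replace '<', '&lt;') -replace '&', '&amp;'

$ampFirst   # if (p &lt; q) ...
$ltFirst    # if (p &amp;lt; q) ...
```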

Manipulating the clipboard with PowerShell

The PowerShell Community Extensions contain a couple handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

Update: Looks like

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

    $a = get-clipboard;
    $a = $a -replace "&", "&amp;";
    $a = $a -replace "<", "&lt;";
    $a = $a -replace ">", "&gt;";
    $a = $a -replace '"', "&quot;"
    $a = $a -replace "'", "&#39;"
    out-clipboard $a

So this C++ code

    double& x = y;
    char c = 'k';
    string foo = "hello";
    if (p < q) ...

turns into this HTML code

    double&amp; x = y;
    char c = &#39;k&#39;;
    string foo = &quot;hello&quot;;
    if (p &lt; q) ...
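The same substitutions can be tried without the clipboard. The sketch below assumes the text arrives as an array of lines, the way Get-Content returns a file; whether the PSCX cmdlet hands back an array or a single string is not something this example depends on. When the left side of -replace is an array, the operator processes each element and returns an array.

```powershell
# The sample C++ lines from above, held as an array of strings.
$code = @(
    'double& x = y;'
    "char c = 'k';"
    'string foo = "hello";'
    'if (p < q) ...'
)

# Chained -replace operators run left to right, so & is still escaped first.
$html = $code -replace '&', '&amp;' -replace '<', '&lt;' -replace '>', '&gt;' -replace '"', '&quot;' -replace "'", '&#39;'

$html
```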

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

    Version:0.9
    StartHTML:00000136
    EndHTML:00000214
    StartFragment:00000170
    EndFragment:00000178
    SourceURL:https://www.codeproject.com/
    <html><body>
    <!--StartFragment-->snippets<!--EndFragment-->
    </body>
    </html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX.

Experimenting with Out-Speech in PowerShell

I’ve played around with the PSCX script Out-Speech at home and at work. At home, running Vista, words come out in a natural female voice. At work, running XP, words come out in a robotic male voice.

The voice is somewhat configurable. I didn’t try it at home, but at work I opened the Speech Properties applet in the control panel. It offered three voices, all of them mechanical. I went to Microsoft’s website to see if I could download a natural voice. The site said that Microsoft does not provide other voices but it gives a link to third party providers.

My guess is that Microsoft deliberately put lame voices in XP for fear of a lawsuit and that they were braver by the time Vista was released.

Another difference I noticed between Vista and XP is tolerance of misspellings. XP will correctly pronounce “Fahrenheit” but pronounces the incorrect “Farenheit” so that it rhymes with “heat” rather than “height”. Vista correctly pronounces the misspelled word.

Depend on objects, not their presentation

The most recent blog post by Jeffrey Snover emphasizes that PowerShell pipes objects, not text. When you use single PowerShell commands, you can get the impression that they output text. But everything is an object until the pipeline spills onto the command line.

In UNIX, text output is effectively a programming contract because that is what the whole system is built upon. One command outputs text and other programs know what to expect so they parse the text to get the appropriate data elements so that they can code against it. In this model, if you change the text output of a command—you run the risk of breaking a bunch of scripts. … In PowerShell … We reserve the right to radically change our text rendering to improve our customer experience.

(Emphasis in the original.)

The object interfaces won’t change, but the text rendering probably will.
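As a rough illustration of coding against objects rather than their rendering, the sketch below lists the three largest files in the PowerShell installation directory by reading the Length and Name properties. The choice of $PSHOME as the directory is only to have something that exists on any installation.

```powershell
# Keep only files; directories have no meaningful Length.
$files = Get-ChildItem -Path $PSHOME | Where-Object { -not $_.PSIsContainer }

# Sort on the Length property of each object, not on columns of displayed text.
$largest = $files | Sort-Object -Property Length -Descending | Select-Object -First 3

foreach ($f in $largest) {
    '{0,12:N0} bytes  {1}' -f $f.Length, $f.Name
}
```

Nothing here depends on how a directory listing happens to be displayed, so a change to the text rendering would not affect it.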

PowerShell posts classified

Here’s a summary of the blog posts I’ve written so far regarding PowerShell, grouped by topic.

Three posts announced CodeProject articles related to PowerShell: automated software builds, text reviews for software, and monitoring legacy code.

Three posts on customizing the command prompt: I, II, III.

Two posts on XML sitemaps: making a sitemap and filtering a sitemap.

Two Unix-related posts: cross-platform PowerShell and comparing PowerShell and bash.

The rest of the PowerShell posts I’ve written so far fall under miscellaneous.

Much to my surprise, the post on integer division in PowerShell has been one of the most popular.

PowerShell output redirection: Unicode or ASCII?

What does the redirection operator > in PowerShell do to text: leave it as Unicode or convert it to ASCII? The answer depends on whether the thing to the right of the > operator is a file or a program.

Strings inside PowerShell are 16-bit Unicode, instances of .NET’s System.String class. When you redirect the output to a file, the file receives Unicode text. As Bruce Payette says in his book Windows PowerShell in Action,

myScript > file.txt is just syntactic sugar for myScript | out-file -path file.txt

and out-file defaults to Unicode. The advantage of explicitly using out-file is that you can then specify the output format using the -encoding parameter. Possible encoding values include Unicode, UTF8, ASCII, and others.
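Here is a small sketch of choosing the encoding explicitly. It writes the same string to two temporary files, once as ASCII and once as Unicode, and compares the file sizes. The file names are made up for the example.

```powershell
$dir = [System.IO.Path]::GetTempPath()
$asciiPath = Join-Path $dir 'redirect-demo-ascii.txt'
$unicodePath = Join-Path $dir 'redirect-demo-unicode.txt'

# Same text, two encodings.
'hello' | Out-File -FilePath $asciiPath -Encoding ASCII
'hello' | Out-File -FilePath $unicodePath -Encoding Unicode

$asciiSize = (Get-Item $asciiPath).Length
$unicodeSize = (Get-Item $unicodePath).Length
"ASCII: $asciiSize bytes, Unicode: $unicodeSize bytes"

# Clean up the temporary files.
Remove-Item $asciiPath, $unicodePath
```

The Unicode file comes out larger because each character takes two bytes and the file starts with a byte order mark.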

If the thing on the right side of the redirection operator is a program rather than a file, the encoding is determined by the variable $OutputEncoding. This variable defaults to ASCII encoding because most existing applications do not handle Unicode correctly. However, you can set this variable so PowerShell sends applications Unicode. See Jeffrey Snover’s blog post OuputEncoding to the rescue for details.
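A minimal sketch of changing the setting follows. It assumes the external program is reached through a pipe, and legacy-tool.exe is a placeholder name, not a real program.

```powershell
# Remember the current setting so it can be restored afterwards.
$previous = $OutputEncoding
"Current encoding for piped output: $($OutputEncoding.WebName)"

# Ask PowerShell to encode text as UTF-8 when handing it to external programs.
$OutputEncoding = [System.Text.Encoding]::UTF8
$newName = $OutputEncoding.WebName
"New encoding for piped output: $newName"

# Hypothetical usage; legacy-tool.exe stands in for whatever program receives the text:
# 'naïve café' | legacy-tool.exe

# Restore the original setting.
$OutputEncoding = $previous
```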

Of course if you’re passing strings between pieces of PowerShell code, everything stays in Unicode.

Thanks to J_Tom_Moon_79 for suggesting a blog post on this topic.

Improved PowerShell prompt

A while back I wrote a post on how to customize your PowerShell prompt. Last week Tomas Restrepo posted an article on a PowerShell prompt that adds color and shortens the path in a more subtle way. I haven’t tried it out yet, but his prompt looks much better than what I’ve been using.

If you’re a long-time Windows user you might be worried that all this PowerShell stuff is starting to look a lot like Unix. Well, it is. Some of the folks on the PowerShell team have a Unix background and they’re bringing some of the best of Unix to Windows. The Unix world has more experience operating from the command line and so it’s wise to learn from them.

On the other hand, PowerShell is emphatically not bash for Windows. PowerShell is thoroughly object oriented and in that respect unlike any Unix shell. Also, PowerShell is strongly tied to Microsoft libraries, particularly .NET but also COM and WMI.