Unicode function names

Keith Hill has a fun blog post on using Unicode characters in PowerShell function names. Here’s an example from his article using the square root symbol for the square root function.

PS> function √($num) { [Math]::Sqrt($num) }
PS> √ 81
9

As Keith points out, these symbols are not practical since they’re difficult to enter, but they’re fun to play around with.

Here’s another example using the symbol for pounds sterling

for the function to convert British pounds to US dollars.

PS> function £($num) { 1.44*$num }
PS> £ 300.00
432

(As I write this, a British pound is worth $1.44 USD. If you wanted to get fancy, you could call a web service in your function to get the current exchange rate.)

I read once that someone (Larry Wall?) had semi-seriously suggested using the Japanese Yen currency symbol

for the “zip” function in Perl 6 since the symbol looks like a zipper.

Mathematica lets you use Greek letters as variable and function names, and it provides convenient ways to enter these characters, either graphically or via their TeX representations. I think this is a great idea. It could make mathematical source code much more readable. But I don’t use it because I’ve never got into the habit of doing so.

There are some dangers to allowing Unicode characters in programming languages. Because Unicode characters are semantic rather than visual, two characters may have the same graphical representation. Here are a couple examples. The Roman letter A (U+0041) and the capital Greek letter Α (U+0391) look the same but correspond to different characters. Also, the the Greek letter Ω (U+03A9) and the symbol Ω (U+2126) for Ohms (unit of electrical resistance) have the same visual representation but are different characters. (Or at least they may have the same visual representation. A font designer may choose, for example, to distinguish Omega and Ohm, but that’s not a concern to the Unicode Consortium.)

How to grep Twitter

Twitter has an extensive search API. To build the URL for a query, start with the base http://search.twitter.com/search.atom?q=. To search for a word, just append that word to the base, such as http://search.twitter.com/search.atom?q=Coltrane to search for tweets containing “Coltrane.”

To search for a term within a particular user’s tweet stream, start with the base URL and append +from%3A and the user’s name. (The %3A is a URL- encoded colon.) See the search API page for other options, such as specifying the number of requests per page to return (look for rpp) or restricting the language (look for lang).

As far as I can tell, the API does not support regular expressions, but you could loop over the search results programmatically. Here’s how you’d do it in PowerShell.

First, grab the search results as a string. Say we want to search through the latest tweets from Hal Rottenberg, @halr9000.

$base = "http://search.twitter.com/search.atom?q="
$query = "from%3Ahalr9000"
$str = (new-object net.webclient).DownloadString($base + $query)

Now $str contains an XML string of results formatted as an Atom feed. The root element is <feed> and individual tweets are contained in <entry> tags under that. The text of the tweet is in the <title> tag and the other tags contain auxiliary data such as time stamps. The following code shows how to search for the regular expression d{4}. (Look for four-digit numbers.)

(

$str).feed.entry | where-object {$_.title -match "d{4}"}
In English, the code says to cast $str to an XML document object and pipe the <entry> contents to the filter that selects those objects whose <title> strings match the regular expression.

The search API limits the number of entries it will return, so it’s best to do as much filtering as you can via the Twitter site before doing your own filtering.

Related posts:

Regular expressions in PowerShell and Perl
Table-driven text munging in PowerShell

DSLs in PowerShell

In an earlier post, I quoted John Lam saying that one reason Ruby is such a good language for implementing DSLs (domain specific languages) is that function calls do not require parentheses. This allows DSL authors to create functions that look like new keywords. I believe I heard Bruce Payette say in an interview that Ruby had some influence on the design of PowerShell. Maybe Ruby influenced the PowerShell team’s decision to not use parentheses around function arguments. (A bigger factor was convenience at the command line and shell language tradition.)

In what ways has Ruby influenced PowerShell? And if Ruby is good for implementing DSLs, how good would PowerShell be?

Update: See Keith Hill’s blog post on PowerShell function names and DSLs.

Table-driven text munging in PowerShell

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

param($regex_file)

$lines = get-content $regex_file

$a = get-clipboard

foreach( $line in $lines )
{
    $line = ($line.trim() -replace "s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    $a = $a -replace $pair
}

out-clipboard $a

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is pass in as an argument to the script.

&  &amp;
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of left over manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.

Manipulating the clipboard with PowerShell

The PowerShell Community Extensions contain a couple handy cmdlets for working with the Windows clipboard: Get-Clipboard and Out-Clipboard. One way to use these cmdlets is to copy some text to the clipboard, munge it, and paste it somewhere else. This lets you avoid creating a temporary file just to run a script on it.

For example, occasionally I need to copy some C++ source code and paste it into HTML in a <pre> block. While <pre> turns off normal HTML formatting, special characters still need to be escaped: < and > need to be turned into &lt; and &gt; etc. I can copy the code from Visual Studio, run a script html.ps1 from PowerShell, and paste the code into my HTML editor. (I like to use Expression Web.)

The script html.ps1 looks like this.

$a = get-clipboard;
$a = $a -replace "&", "&amp;";
$a = $a -replace "<", "&lt;";
$a = $a -replace ">", "&gt;";
$a = $a -replace '"', "&quot;"
$a = $a -replace "'", "&#39;"
out-clipboard $a

So this C++ code

double& x = y;
char c = 'k';
string foo = "hello";
if (p < q) ...

turns into this HTML code

double&amp; x = y;
char c = &#39;k&#39;;
string foo = &quot;hello&quot;;
if (p &lt; q) ...

Of course the PSCX clipboard cmdlets are useful for more than HTML encoding. For example, I wrote a post a few months ago about using them for a similar text manipulation problem.

If you’re going to do much text manipulation, you may may want to look at these notes on regular expressions in PowerShell.

The only problem I’ve had with the PSCX clipboard cmdlets is copying formatted text. The cmdlets work as expected when copying plain text. But here’s what I got when I copied the word “snippets” from the CodeProject home page and ran Get-Clipboard:

Version:0.9
StartHTML:00000136
EndHTML:00000214
StartFragment:00000170
EndFragment:00000178
SourceURL:http://www.codeproject.com/
<html><body>
<!--StartFragment-->snippets<!--EndFragment-->
</body>
</html>

The Get-Clipboard cmdlet has a -Text option that you might think would copy content as text, but as far as I can tell the option does nothing. This may be addressed in a future release of PSCX; it has been assigned a work item.

Regular expressions in PowerShell and Perl

This is one of the most popular pages on my web site:

Regular expressions in PowerShell and Perl

It’s about how you use regular expressions in PowerShell — how to do matches, replacements, etc. — rather than the grammar of regular expressions. It makes comparisons to Perl, in case you’re already familiar with how to use regular expressions there.

Experimenting with Out-Speech in PowerShell

I’ve played around with the PSCX script Out-Speech at home and at work. At home, running Vista, words come out in a natural female voice. At work, running XP, words come out in a robotic male voice.

The voice is somewhat configurable. I didn’t try it at home, but at work I opened the Speech Properties applet in the control panel. All three are mechanical voices. I went to Microsoft’s web site to see if I could download a natural voice. The site said that Microsoft does not provide other voices but it gives a link to third party providers.

My guess is that Microsoft deliberately put lame voices in XP for fear of a lawsuit and that they were braver by the time Vista was released.

Another difference I noticed between Vista and XP is tolerance of misspellings. XP will correctly pronounce “Fahrenheit” but pronounces the incorrect “Farenheit” so that it rhymes with “heat” rather than “height”. Vista correctly pronounces the misspelled word.

Depend on objects, not their presentation

The most recent blog post by Jeffrey Snover emphasizes that PowerShell pipes objects, not text. When you use single PowerShell commands, you can get the impression that they output text.  But everything is an object until the pipeline spills onto the command line.

In UNIX, text output is effectively a programming contract because that is what the whole system is built upon.  One command outputs text and other programs know what to expect so they parse the text to get the appropriate data elements so that they can code against it.  In this model, if you change the text output of a command – you run the risk of breaking a bunch of scripts.  … In PowerShell … We reserve the right to radically change our text rendering to improve our customer experience.

(Emphasis in the original.)

The object interfaces won’t change, but the text rendering probably will.

PowerShell posts classified

Here’s a summary of the blog posts I’ve written so far regarding PowerShell, grouped by topic.

Three posts announced CodeProject articles related to PowerShell:  automated software builds, text reviews for software, and monitoring legacy code.

Three posts on customizing the command prompt: I, II, III.

Two posts on XML sitemaps: making a sitemap and filtering a sitemap.

Two Unix-related posts: cross-platform PowerShell and comparing PowerShell and bash.

The rest of the PowerShell posts I’ve written so far fall under miscellaneous.

gotchas
PolyMon
here-strings
readable paths
uninitialized variables
ASP.NET in PowerShell
clipboard and command line
launching PowerShell faster
rounding and integer division
one program to rule them all
redirection and Unicode vs ASCII

Much to my surprise, the post on integer division in PowerShell has been one of the most popular.

PowerShell output redirection: Unicode or ASCII?

What does the redirection operator > in PowerShell do to text: leave it as Unicode or convert it to ASCII? The answer depends on whether the thing to the right of the > operator is a file or a program.

Strings inside PowerShell are 16-bit Unicode, instances of .NET’s System.String class. When you redirect the output to a file, the file receives Unicode text. As Bruce Payette says in his book Windows PowerShell in Action,

myScript > file.txt is just syntactic sugar for myScript | out-file -path file.txt

and out-file defaults to Unicode. The advantage of explicitly using out-file is that you can then specify the output format using the -encoding parameter. Possible encoding values include Unicode, UTF8, ASCII, and others.

If the thing on the right side of the redirection operator is a program rather than a file, the encoding is determined by the variable $OutputEncoding. This variable defaults to ASCII encoding because most existing applications do not handle Unicode correctly. However, you can set this variable so PowerShell sends applications Unicode. See Jeffrey Snover’s blog post OuputEncoding to the rescue for details.

Of course if you’re passing strings between pieces of PowerShell code, everything says in Unicode.

Thanks to J_Tom_Moon_79 for suggesting a blog post on this topic.