Posts tagged as:

Regular expressions

Twitter daily tip news

by John on February 8, 2010

I have five Twitter accounts that send out one tip per day, including a new one I just added last week.

Regular expressions

@RegexTip started over today. It’s a cycle of tips for learning regular expressions. It sticks to the regular expression features common to Python, Perl, C#, and many other programming languages. This account posts Monday through Friday.

Keyboard shortcuts

@SansMouse gives one tip a day on using Windows without a mouse. By practicing one keyboard shortcut a day, you can get into the habit of using your mouse less and your keyboard more. This cycle of tips started over January 29 with the most common and most widely useful shortcuts. I’m also sprinkling in a few extra tips that are less well known. This account also posts Monday through Friday.

Math

I have three mathematical accounts. These post seven days a week.

@AlgebraFact just started February 2. It will be a mixture of linear algebra, number theory, group theory, etc.

@ProbFact gives one fact per day from probability. Usually these facts are theorems, but sometimes they include a note on history or applications.

@AnalysisFact gives facts from real and complex analysis. The topics range from elementary to advanced.

What if I don’t use Twitter?

You can visit the page for a Twitter account just like any other web page. And every Twitter account has an RSS feed link allowing you to subscribe just as you would subscribe to a blog.

How do you write these?

I write up content for these accounts in bulk. I may sit down on a Saturday and come up with several weeks’ worth of tips. Then I use HootSuite to schedule the tips weeks in advance. Sometimes I’ll post something spontaneously, such as a link to something relevant, but most of the work is done in advance. I use my personal Twitter account for live interaction.

Related links:

Using Windows without a mouse

Regular expressions in

Chart of probability distribution relationships


Regular expressions in R

by John on January 4, 2010

Notes on using regular expressions in R. R uses POSIX regular expression syntax by default but you can ask it to use Perl’s flavor of regular expressions.

Related links:

Regular expressions in C++, Mathematica, Python, R, PowerShell
R for programmers coming from other languages
R: The good parts
Daily regular expression tips


Regular expressions in Mathematica

by John on January 4, 2010

Regular expressions are fairly portable. There are two main flavors of regular expressions — POSIX and Perl — and more languages these days use the Perl flavor. There are some minor differences in what it means to be “like Perl” but for the most part languages that say they follow Perl’s lead specify regular expressions the same way. The differences lie in how you use regular expressions: how you form matches, how you replace strings, etc.

Mathematica uses Perl’s regular expression flavor. But how do you use regular expressions in Mathematica? I’ll give a few tips here and give more details in the notes Regular expressions in Mathematica.

First of all, unlike Perl, Mathematica specifies regular expressions with ordinary strings. This means that metacharacters have to be doubly escaped. For example, to represent the regular expression \d{4} you must use the string "\\d{4}".
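The same double-escaping issue shows up in any language that writes regular expressions as ordinary string literals. As a rough illustration in Python rather than Mathematica (the sample text is made up for this sketch), the pattern \d{4} can be written either with a doubled backslash or as a raw string:

```python
import re

# Hypothetical sample text for illustration.
text = "Posted in 2010, updated in 2012."

# In an ordinary string literal the backslash must be doubled...
escaped = "\\d{4}"
# ...whereas a raw string passes the backslash through untouched.
raw = r"\d{4}"

# Both literals denote the same five-character pattern.
same = (escaped == raw)
years = re.findall(escaped, text)
print(same, years)
```

Either spelling hands the regular expression engine the same pattern; only the string-literal syntax of the host language differs.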

The function StringCases returns a list of all matches of a regular expression in a string. If you simply want to know whether there was a match, you can use the function StringFreeQ. However, note that you probably want the opposite of the return value from StringFreeQ because it returns whether a string does not contain a match.

By default, the function StringReplace replaces all matches of a regular expression with a given replacement pattern. You can limit the number of replacements it makes by specifying an additional argument.

Related links:

Regular expressions in Mathematica
Tips for getting started with regular expressions
Languages that are easy to pick back up
Regular expressions in C++, Python, R, PowerShell


New daily tip feeds: RegexTip and ProbFact

by John on December 1, 2009

A few weeks ago I started a Twitter account @SansMouse with daily tips on Windows keyboard shortcuts. That’s gone well, so I decided to start two more daily tip accounts: @RegexTip and @ProbFact. If you don’t use Twitter, you can follow these tips via your blog reader. Here are the RSS feeds for RegexTip and ProbFact.

RegexTip will give one tip per day on using regular expressions. I’ll sometimes post more than once a day, but I’ll only give one tip. Other posts might be housekeeping notes etc. The idea is that people who have intended to learn regular expressions but don’t have the time can make time to absorb one tip per day.

ProbFact will give one fact per day from probability. I’ll often have a link with each fact for more details. Many of these facts will be theorems, but some will be statements about applications or history.

The SansMouse and RegexTip posts are loosely arranged in order of familiarity, starting with the most basic and most widely used material. ProbFact posts will be more random. Some will be elementary, some more advanced.

I have scheduled tips to come out at regular times starting tomorrow. SansMouse will post at 9 AM, RegexTip at 10 AM, and ProbFact at 11 AM. These times are Central Standard Time (UTC-6).

Related links:

Diagram of probability distribution relationships
Regular expressions in PowerShell, Python, and C++


Table-driven text munging in PowerShell

by John on October 17, 2008

In my previous post, I mentioned formatting C++ code as HTML by doing some regular expression substitutions. I often need to write something that carries out a list of pattern substitutions, so I decided to rewrite the previous script to read a list of patterns from a file. Another advantage of putting the list of substitutions in an external file is that the same file could be used from scripts written in other languages.

Here’s the code:

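# Path to a text file with one "pattern replacement" pair per line
param($regex_file)

$lines = get-content $regex_file

# The text to transform comes from the clipboard
# (get-clipboard and out-clipboard are assumed to be available)
$a = get-clipboard

foreach( $line in $lines )
{
    # Collapse runs of whitespace so each line splits into two fields
    $line = ($line.trim() -replace "\s+", " ")
    $pair = $line.split(" ", [StringSplitOptions]::RemoveEmptyEntries)
    # -replace takes a two-element array: pattern, replacement
    $a = $a -replace $pair
}

# Put the transformed text back on the clipboard
out-clipboard $a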

The part of the script that is unique to formatting C++ as HTML is moved to a separate file, say cpp2html.txt, that is passed in as an argument to the script.

&  &amp;
<  &lt;
>  &gt;
"  &quot;
'  &#39;

Now I could use the same PowerShell script for any sort of task that boils down to a list of pattern replacements. (Often this kind of rough translation does not have to be done perfectly. It only has to be done well enough to reduce the amount of leftover manual work to an acceptable level. You start with a small list of patterns and add more patterns until it’s less work to do the remaining work by hand than to make the script smarter.)

Note that the order of the lines in the file can be important. Substitutions are done from the top of the list down. In the example above, we want to first convert & to &amp; then convert < to &lt;. Otherwise, < would first become &lt; and then become &amp;lt;.
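Since the substitution table lives in a plain text file, a script in another language can read the same format. Here is a minimal Python sketch of that idea, not a port of the PowerShell script: the function name apply_substitutions and the sample inputs are hypothetical, and the table is passed in as a string rather than read from a file. It also shows why the & line has to come first.

```python
import re

def apply_substitutions(text, table):
    """Apply each 'pattern replacement' line of table to text, top to bottom."""
    for line in table.splitlines():
        parts = line.split()          # split on any run of whitespace
        if len(parts) != 2:           # skip blank or malformed lines
            continue
        pattern, replacement = parts
        # Using a function as the replacement keeps it literal, so a
        # backslash in it is not read as an escape sequence.
        text = re.sub(pattern, lambda m: replacement, text)
    return text

# HTML-escaping table with & handled first, as the post recommends.
good_order = "&  &amp;\n<  &lt;\n>  &gt;"
# Hypothetical bad ordering: < is converted before &.
bad_order = "<  &lt;\n&  &amp;"

print(apply_substitutions("a < b && c > d", good_order))
print(apply_substitutions("a < b", bad_order))
```

With the bad ordering, the & introduced by the first substitution gets escaped a second time, which is the &amp;lt; problem described above.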


API symmetry

by John on October 14, 2008

Symmetric APIs are easier to use. I was reminded of this when doing some regular expression programming in Python and comparing it to Perl. Perl’s regular expression operators for search and replace are symmetric in a way that their Python counterparts are not.

Perl uses m/pattern/ for matching and s/pattern/replacement/ for substitution. Both apply to the first instance of a pattern in a string by default. The g option following a match or substitute operator causes the command to apply to all instances of the pattern. The i option after either a match or substitute command causes the pattern to apply in a case-insensitive manner. Matching and substitution are symmetric.

Python uses re.search() for matching and re.sub() for substitution. The search function can only apply to the first instance of a pattern; to match all instances of a pattern, use re.findall(). The function re.sub() applies to all instances by default, but it has a count parameter that can be set to limit the number of instances it applies to. To make a search pattern case-insensitive, pass in the re.IGNORECASE flag. To make a substitution case-insensitive, modify the regular expression itself by adding (?i).
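A short sketch of the Python side of this comparison, using a made-up sample string. It shows first-match versus all-match searching, a case-insensitive search via a flag, and a case-insensitive substitution via an inline (?i) with the number of replacements capped:

```python
import re

text = "Cat, cat, CAT"  # hypothetical sample string

# re.search() reports only the first match.
first = re.search("cat", text).group()

# re.findall() returns every match; here case is ignored via a flag.
all_matches = re.findall("cat", text, re.IGNORECASE)

# re.sub() replaces every match by default.
replaced_all = re.sub("cat", "dog", text)

# An inline (?i) makes the substitution case-insensitive,
# and count caps the number of replacements.
replaced_one = re.sub("(?i)cat", "dog", text, count=1)

print(first, all_matches, replaced_all, replaced_one)
```

Four different spellings for what Perl expresses with m//, s///, and the g and i options.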

In general, I find Python syntax much cleaner than Perl, but regular expressions are implemented more elegantly in Perl.


Regular expressions in PowerShell and Perl

by John on October 7, 2008

This is one of the most popular pages on my web site:

Regular expressions in PowerShell and Perl

It’s about how you use regular expressions in PowerShell — how to do matches, replacements, etc. — rather than the grammar of regular expressions. It makes comparisons to Perl, in case you’re already familiar with how to use regular expressions there.


Tips for using regular expressions

by John on June 27, 2008

Jeff Atwood just posted a good article on regular expressions. Not the syntax of regular expressions but rather the strategy of when and how to use them.


Regular expressions in C++ TR1

by John on May 7, 2008

Regular expressions are not a part of the C++ Standard Library quite yet, but there is a document (Technical Report 1, or TR1) that includes among other things a specification for regular expression support that will probably be added to the C++ standard eventually.

The Boost library has supported TR1 for a while. Microsoft just released a feature pack for Visual Studio 2008 a month ago that includes support for most of TR1. (They’ve left out support for mathematical special functions.) And Dinkumware sells a complete TR1 implementation.

I’ve added some notes to my web site for getting started with C++ TR1 regular expressions. I took my PowerShell regex notes as a starting point and implemented some of the same examples in C++. I changed the organization though, because the C++ implementation is fairly different from PowerShell.

Working with regular expressions is harder in C++ than in scripting languages such as Perl or Python, but not unnecessarily so. C++ is optimized for fine-grained control and efficiency rather than ease of use; that’s what C++ is for. The TR1 implementation is internally consistent and elegant in its own way.

It’s easy to find API-level documentation but harder to find examples for getting started. (I’ve heard good things about Pete Becker’s book The C++ Standard Library Extensions but I haven’t read it.) So I decided to keep some notes as I played with the Visual Studio implementation. I imagine most of the content applies to other implementations, but I’ve only tested the examples using Visual Studio.

Update: GCC just added support for C++ TR1 two days ago with their version 4.3 release. However, it appears support for regular expressions is not included.


LINQ to Regex

by John on May 6, 2008

Roy Osherove just posted an article, Introducing LINQ to Regex, about his LINQ to Regex project.

LINQ stands for Language INtegrated Query, a way of baking query support into .NET programming languages. Microsoft has been promising a unified way to query all kinds of data for years now.  Along the way they came out with a score of new libraries that were going to be the solution. They’d work for all kinds of data that happened to look very much like a relational database. But now with LINQ they’ve finally delivered something that works well not only with relational data but also with hierarchical data such as XML. With LINQ to Regex, you can query unstructured text with LINQ as well.

There are two big advantages to LINQ. First, you can query different kinds of data sources with similar code. Second, “language integrated” means that your programming language knows about your query language, making strong typing and better tool support possible. (By contrast, if you have a SQL statement inside VB, for example, VB knows nothing about SQL. The SQL command is just a string as far as VB is concerned. If the SQL is malformed, you won’t know until runtime. But with LINQ, malformed queries generate compile errors.)

Update: See Scott Hanselman’s discussion of LINQ to Regex.


Readable path listings

by John on May 1, 2008

Windows has never made it easy to read long environment variables. If I display the path on one machine I get something like this, both from cmd and from PowerShell.

C:\bin;C:\bin\Python25\;C:\bin\TeX\miktex\bin;C:\bin\TeX\MiKTeX\miktex\bin;C:\bin\Perl\bin\;C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin; ...

The System Properties window is worse since you can only see a tiny slice of your path at a time.

[Screenshot of the path UI in the System Properties window]

Here’s a PowerShell one-liner to produce a readable path listing:

$env:path -replace ";", "`n"

This produces

C:\bin
C:\bin\Python25\
C:\bin\TeX\miktex\bin
C:\bin\TeX\MiKTeX\miktex\bin
C:\bin\Perl\bin\
C:\Program Files\Compaq\Compaq Management Agents\Dmi\Win32\Bin
...

(If you’re not familiar with PowerShell, note the backquote before the n to indicate the newline character that replaces each semicolon. This is one of the most unconventional features of PowerShell since backslash is the escape character in most other languages. Because Windows paths use backslashes as separators (forward slashes are accepted too), PowerShell could not use backslash as an escape character. Think of the backquote as a little backslash. Once you get over the initial shock, you get used to the backquote quickly.)

Update: It occurred to me after the original post that there’s an even simpler way to display the path.

$env:path.split(';')
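For comparison, the same listing can be sketched in Python. The helper name path_entries and the sample value are made up for this illustration; os.pathsep is the platform's separator (';' on Windows). Unlike the PowerShell one-liners, this version also drops the empty entries that a stray or trailing separator would produce.

```python
import os

def path_entries(path_value, sep=os.pathsep):
    """Split a PATH-style string into its non-empty entries."""
    return [entry for entry in path_value.split(sep) if entry]

# Print the current PATH one entry per line.
print("\n".join(path_entries(os.environ.get("PATH", ""))))

# Hypothetical Windows-style value, split on ';' explicitly.
sample = r"C:\bin;C:\bin\Perl\bin\;C:\Program Files\Compaq"
print(path_entries(sample, sep=";"))
```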


Tips for learning regular expressions

by John on January 14, 2008

Here are a few realizations that helped me the most when I was learning regular expressions.

1. Regular expressions aren’t trivial. If you think they’re trivial, but you can’t get them to work, then you feel stupid. They’re not trivial, but they’re not that hard either. They just take some study.

2. Regular expressions are not command line wild cards. They contain some of the same symbols but they don’t mean the same thing. They’re just similar enough to cause confusion.

3. Regular expressions are a little programming language. Regular expressions are usually contained inside another programming language, like JavaScript or PowerShell. Think of the expressions as little bits of a foreign language, like a French quotation inside English prose. Don’t expect rules from the outside language to have any relation to the rules inside, no more than you’d expect English grammar to apply inside that French quote.

4. Character classes are a little sub-language within regular expressions. Character classes are their own little world. Once you realize that and don’t expect the usual rules for regular expressions outside character classes to apply, you can see that they’re not very complicated, just different. Failure to realize that they are different is a major source of bugs.
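Points 2 and 4 are easy to see in a few lines of code. The sketch below uses Python and made-up file names: the first half contrasts a* as a shell-style wildcard with a* as a regular expression, and the second half shows a period meaning different things outside and inside a character class.

```python
import fnmatch
import re

names = ["a.txt", "aaa", "b.txt"]  # hypothetical file names

# As a shell wildcard, "a*" means "a followed by anything".
wildcard_hits = [n for n in names if fnmatch.fnmatchcase(n, "a*")]

# As a regular expression, "a*" means "zero or more a's", so it
# fully matches only strings made up entirely of a's.
regex_hits = [n for n in names if re.fullmatch("a*", n)]

# Outside a character class "." matches any character (except newline);
# inside one it is a literal period.
dot_outside = re.findall(r".", "a.b")
dot_inside = re.findall(r"[.]", "a.b")

print(wildcard_hits, regex_hits, dot_outside, dot_inside)
```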

Once you’re ready to dive into regular expressions, read Jeffrey Friedl’s book. It’s by far the best book on the subject. Read the first few chapters carefully, but then flip the pages quickly when he goes off into NFA engines and all that.
