tcgrep: grep rewritten in Perl

In The Perl Cookbook, Tom Christiansen gives his rewrite of the Unix utility grep that he calls tcgrep. You don’t have to know Perl to use tcgrep, but you can send it Perl regular expressions.

Why not grep with PCRE?

You can get basically the same functionality as tcgrep by using grep with its PCRE option -P. Since tcgrep searches directories recursively, a more direct comparison would be

    grep -R -P

However, your version of grep might not support -P. And if it does, its Perl-compatible regular expressions might not be completely Perl-compatible. The man page for grep on my machine says

    -P, --perl-regexp
        Interpret the pattern as a Perl-compatible regular 
        expression (PCRE). This is experimental and grep -P 
        may warn of unimplemented features.

The one implementation of regular expressions guaranteed to be fully Perl-compatible is Perl.

If the version of grep on your system supports the -P option and is adequately Perl-compatible, it will run faster than tcgrep. But if you find yourself on a computer that has Perl but not a recent version of grep, you may find tcgrep handy.
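To make the comparison concrete, here is a minimal sketch of the core idea behind a recursive grep that takes Perl regular expressions. It is not tcgrep's source: the function name `grep_tree` is hypothetical, it only looks at plain text files, and it ignores all of tcgrep's options.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper, not part of tcgrep: return "path:line_number:text"
# for every line under $dir that matches the Perl regex $pattern.
sub grep_tree {
    my ($pattern, $dir) = @_;
    my $re = qr/$pattern/;
    my @hits;
    find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path && -T $path;   # plain text files only
            open my $fh, '<', $path or return;    # skip unreadable files
            while (my $line = <$fh>) {
                chomp $line;
                push @hits, "$path:$.:$line" if $line =~ $re;
            }
            close $fh;
        },
    }, $dir);
    return @hits;
}

# Usage: perl grep_tree.pl PATTERN DIRECTORY
if (@ARGV == 2) {
    print "$_\n" for grep_tree(@ARGV);
}
```

Because the pattern is compiled by Perl itself, any construct Perl accepts works here, which is the point of the comparison with grep -P.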

Installation

tcgrep is included as part of the Unicode::Tussle Perl module; since tcgrep is a wrapper around Perl, it is as Unicode-compliant as Perl is. So you could install tcgrep (and several more utilities) with

    cpan Unicode::Tussle

This worked for me on Linux without any issues but the install failed on Windows.

I installed tcgrep on Windows by simply copying the source code. (I don’t recall now where I found the source code. I didn’t see it this morning when I searched for it, but I imagine I could have found it if I’d been more persistent.) I commented out the definition of %Compress to disable searching inside compressed files since this feature required Unix utilities not available on Windows.

Consistency

Another reason to use tcgrep is consistency. Perl is criticized for being inconsistent. The Camel book itself says

In general, Perl functions do exactly what you want—unless you want consistency.

But Perl’s inconsistencies are different, and in my opinion less annoying, than the inconsistencies of Unix tools.

Perl is inconsistent in the sense that functions behave differently in different contexts, such as a scalar context or a list context.
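As a small illustration of context sensitivity, the built-in reverse does two different things depending on whether its result is assigned to a list or to a scalar. The variable names are only for illustration.

```perl
use strict;
use warnings;

my @words = ("ab", "cd");

# List context: reverse returns the elements in the opposite order.
my @list_result = reverse @words;

# Scalar context: reverse concatenates its arguments
# and reverses the characters of the resulting string.
my $scalar_result = reverse @words;

print "@list_result\n";     # cd ab
print "$scalar_result\n";   # dcba
```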

Unix utilities are inconsistent across platforms and across tools. For example, a tool like sed will have different features on different platforms, and it will not support the same regular expressions as another tool such as awk.

Perl was written to be a “portable distillation of Unix culture.” As inconsistent as Perl is, it’s more consistent than Unix.


Finding pi in pi with Perl

Here’s a frivolous problem whose solution illustrates three features of Perl:

  1. Arbitrary precision floating point
  2. Lazy quantifiers in regular expressions
  3. Returning the positions of matched groups.

Our problem is to look for the digits 3, 1, 4, and 1 in the decimal part of π.

First, we get the first 100 digits of π after the decimal as a string. (It turns out 100 is enough, but if it weren’t we could try again with more digits.)

    use Math::BigFloat "bpi";

    $x = substr bpi(101)->bstr(), 2;

This loads Perl’s extended precision library Math::BigFloat, gets π to 101 significant figures, converts the result to a string, then lops off the first two characters “3.” at the beginning leaving “141592…”.

Next, we want to search our string for a 3, followed by some number of digits, followed by a 1, followed by some number of digits, followed by a 4, followed by some number of digits, and finally another 1.

A naive way to search the string would be to use the regex /3.*1.*4.*1/. But the star operator is greedy: it matches as much as possible. So the .* after the 3 would match as many characters as possible before backtracking to look for a 1. But we’d like to find the first 1 after a 3 etc.

The solution is simple: add a ? after each star to make the match lazy rather than greedy. So the regular expression we want is

   /3.*?1.*?4.*?1/

This will tell us whether our string contains the pattern we’re after, but we’d like to also know where the string contains the pattern. So we make each segment a captured group.

   /(3.*?)(1.*?)(4.*?)(1)/

Perl automatically populates an array @- with the positions of the matches, so it has the information we’re looking for. Element 0 of the array is the position of the entire match, so it is redundant with element 1. The advantage of this bit of redundancy is that the starting position of group $1 is in the element with index 1, the starting position of $2 is at index 2, etc.

We use the shift operator to remove the redundant first element of the array. Since shift modifies its argument, we can’t apply it directly to the constant array @-, so we apply it to a copy.

    if ($x =~ /(3.*?)(1.*?)(4.*?)(1)/) {
        @positions = @-;
        shift  @positions;
        print "@positions\n";
    }

This says that our pattern appears at positions 8, 36, 56, and 67. Note that these are array indices, and so they are zero-based. So if you count from 1, the first 3 appears in the 9th digit etc.

To verify that the digits at these indices are 3, 1, 4, and 1 respectively, we make the digits into an array, and slice the array by the positions found above.

    @digits = split(//, $x);
    print "@digits[@positions]\n";

This prints 3 1 4 1 as expected.
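To see how much the lazy quantifiers matter, the sketch below runs the greedy and lazy versions of the pattern on a short made-up digit string (not digits of π) and compares the starting positions of the groups. The helper `group_starts` is hypothetical; it just packages the @- handling used above.

```perl
use strict;
use warnings;

# Return the start position of each captured group, or an empty list
# if the regex does not match.
sub group_starts {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    my @starts = @-;   # copy, since @- itself is read-only
    shift @starts;     # drop the start of the whole match
    return @starts;
}

my $s = "3915142191";

my @lazy   = group_starts($s, qr/(3.*?)(1.*?)(4.*?)(1)/);
my @greedy = group_starts($s, qr/(3.*)(1.*)(4.*)(1)/);

print "lazy:   @lazy\n";     # 0 2 5 7
print "greedy: @greedy\n";   # 0 4 5 9
```

Both patterns match, but only the lazy one picks the first 1 after the 3 and the first 1 after the 4.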

Expressiveness

Programmers like highly expressive programming languages, but programming managers do not. I wrote about this on Twitter a few months ago.

Q: Why do people like Lisp so much?

A: Because Lisp is so expressive.

Q: Why don’t teams use Lisp much?

A: Because Lisp is so expressive.

Q: Why do programmers complain about Java?

A: Because it’s not that expressive.

Q: Why do businesses use Java?

A: Because it’s not that expressive.

A highly expressive programming language offers lots of options. This can be a good thing. It makes programming more fun, and it can lead to better code. But it can also lead to more idiosyncratic code.

A large programming language like Perl allows developers to carve out language subsets that hardly overlap. A team member has to learn not only the parts of the language he understands and wants to use, but also all the parts that his colleagues might use. And those parts that he might accidentally use.

While Perl has maximal syntax, Lisp has minimal syntax. But Lisp is also very expressive, albeit in a different way. Lisp makes it very easy to extend the language via macros. While Perl is a big language, Lisp is an extensible language. This can also lead to each programmer practically having their own language.

With great expressiveness comes great responsibility. A team using a highly expressive language needs to develop conventions for how the language will be used in order to avoid fracturing into multiple de facto languages.

But what if you’re a team of one? Now you don’t need to be as concerned with how other people use your language. You still may need to care somewhat. You want to be able to grab sample code online, and you may want to share code or ask others for help. It pays not to be entirely idiosyncratic, though you’re free to wander further from the mainstream.

Even when you’re working in a team, you still may have code that only you use. If your team is producing C# code, and you secretively use a Perl script to help you find things in the code, no one needs to know. On the other hand, there’s a tendency for personal code to become production code, and so personal tools in a team environment are tricky.

But if you’re truly working by yourself, you have great freedom in your choice of tools. This can take a long time to sort out when you leave a team environment to strike out on your own. You may labor under your previous restrictions for a while before realizing they’re no longer necessary. At the same time, you may choose to stick to your old tools, not because they’re optimal for your new situation, but because it’s not worth the effort to retool.

Related posts

(Regarding the last link, think myth as in Joseph Campbell, not myth as in Myth Busters.)

From shell to system

Routine computer tasks and system programming require different tools, though I’m not entirely sure why.

Many people have thought about how inconsistent shells and system programming languages are and tried to unite them. Wouldn’t it be nice to use one language for everything? But attempts to bring system languages down to the shell, or to push shell programming up to large programs, have not been very successful.

I learned Perl in college so I wouldn’t have to learn shell programming. That’s what Perl was initially designed to be: an alternative to shell scripting. Larry Wall called Perl a “distillation of Unix culture.”

Perl is the most disliked programming language according to Stack Overflow. And yet I imagine many who complain about Perl gladly use the menagerie of quirky tools that Perl was created to unify. Bash is popular while Perl is unpopular, and yet the quirkiest parts of Perl are precisely those it shares with bash.

I expect much of the frustration with Perl comes from using it as a language for writing larger programs. Perl is very terse and expressive. These features are assets for one-liners and individual use. They are liabilities for large programs and team development.

Compared to a system programming language like Java, Perl is complex, inconsistent, and unsafe. But compared to shell scripting, Perl is simple, consistent, and safe!


The most disliked programming language

According to this post from Stack Overflow, Perl is the most disliked programming language.

I have fond memories of writing Perl, though it’s been a long time since I used it. I mostly wrote scripts for file munging, the task it does best, and never had to maintain someone else’s Perl code. Under different circumstances I probably would have had less favorable memories.

Perl is a very large, expressive language. That’s a positive if you’re working alone but a negative if working with others. Individuals can carve out their favorite subsets of Perl and ignore the rest, but two people may carve out different subsets. You may personally avoid some feature, but you have to learn it anyway if your colleague uses it. Also, in a large language there’s greater chance that you’ll accidentally use a feature you didn’t intend to. For example, in Perl you might use an array in a scalar context. This works, but not as you’d expect if you didn’t intend to do it.
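Here is a minimal example of the array-in-scalar-context surprise. The only difference between the two assignments is a pair of parentheses; the file names are made up.

```perl
use strict;
use warnings;

my @files = ("a.txt", "b.txt", "c.txt");

my ($first) = @files;   # list assignment: the first element
my $count   = @files;   # scalar assignment: the number of elements

print "$first\n";   # a.txt
print "$count\n";   # 3
```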

I suspect that people who like large languages like C++ and Common Lisp are more inclined to like Perl, while people who prefer small languages like C and Scheme have opposite inclinations.


Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

Unicode character U_022B

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)

   
    use v5.10;                             # enables say
    binmode(STDOUT, ":encoding(UTF-8)");   # print wide characters cleanly

    # U+022B, o with diaeresis and macron
    my $a = "\x{22B}";

    # U+00F6 U+0304, (o with diaeresis) + macron
    my $b = "\x{F6}\x{304}";

    # o U+0308 U+0304, o + diaeresis + macron
    my $c = "o\x{308}\x{304}";

    my @versions = ($a, $b, $c);

    # All versions display the same.
    say @versions;

    # The versions have length 1, 2, and 3.
    # Only $a contains one character and so matches .
    say map {length $_ if /^.$/} @versions;

    # All versions consist of one grapheme.
    say map {length $_ if /^\X$/} @versions;
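The glossary entry says the length depends on normalization. As a sketch of that point, the core module Unicode::Normalize can convert between the forms: NFC composes as far as possible and NFD decomposes as far as possible. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{22B}";           # a single code point
my $decomposed = "o\x{308}\x{304}";   # o + diaeresis + macron

my $nfc_length = length NFC($decomposed);   # composed form
my $nfd_length = length NFD($composed);     # fully decomposed form

print "NFC length: $nfc_length\n";   # 1
print "NFD length: $nfd_length\n";   # 3
```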

For daily tips on regular expressions, follow @RegexTip on Twitter.


Perl regex twitter account

I’ve started a new Twitter account @PerlRegex for Perl regular expressions. My original account, @RegexTip, is for regular expressions in general and doesn’t go into much detail regarding any particular implementation. @PerlRegex goes into the specifics of regular expressions in Perl.

Why specifically Perl regular expressions? Because Perl has the most powerful support for regular expressions (strictly speaking, “pattern matching.”) Other languages offer “Perl compatible” regular expressions, though the degree of compatibility varies and is always less than complete.

I imagine more people have ruled England than have mastered the whole of the Perl language. But it’s possible to use Perl for regular expression processing without learning too much of the wider language.


Update: I’ve stopped posting to this account. Here’s a list of my current accounts.

Perl as a better …

Today I ran across Minimal Perl: For UNIX and Linux People. The book was published a few years ago but I hadn’t heard of it because I haven’t kept up with the Perl world. The following chapters from the table of contents jumped out at me because I’ve been doing a fair amount of awk and sed lately:


3. Perl as a (better) grep command
4. Perl as a (better) sed command
5. Perl as a (better) awk command
6. Perl as a (better) find command

These chapters can be read a couple ways. The most obvious reading would be “Learn a few features of Perl and use it as a replacement for a handful of separate tools.”

But if you find these tools familiar and are not looking to replace them, you could read the book as saying “Here’s an introduction to Perl that teaches you the language by comparing it to things you already know well.”

The book suggests learning one tool instead of several, and in the bargain getting more powerful features, such as more expressive pattern matching. It also suggests not necessarily committing to learn the entire enormous Perl language, and not necessarily committing to use Perl for every programming task.

Regarding Perl’s pattern matching, I could relate to the following quip from the book.

What’s the only thing worse than not having a particular metacharacter … in a pattern-matching utility? Thinking you do, when you don’t! Unfortunately, that’s a common problem when using Unix utilities for pattern matching.

That was my experience just yesterday. I wrote a regular expression containing \d for a digit and couldn’t understand why it wasn’t matching.

Most of the examples rely on giving Perl command line options such as -e so that it acts more like a command line utility. The book gives numerous examples carrying out common tasks in grep etc. and with Perl one-liners. The latter tend to be a little more verbose. If a task falls in the sweet spot of a common tool, that tool’s syntax will be more succinct. But when a task falls outside that sweet spot, such as matching a pattern that cannot be easily expressed with traditional regular expressions, the Perl solution will be shorter.

More specifics

This is an update, written March 3, 2021.

If you’re going to use Perl as a replacement for command line tools, you’ll need to know about one-liners and quoting.

Here is a post that covers Perl as a better grep.

If your main use for sed is to run commands like s/foo/bar/g, you can do this in Perl with

    perl -ple 's/foo/bar/g'

I talk more about using Perl to replace sed here.
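To show what the switches are doing, here is a rough sketch of the one-liner written out as an ordinary script. The -p switch wraps the code in a loop that reads each line and prints it afterward, and -l strips the trailing newline on input and puts it back on output. The function name `substitute_lines` and the sample input are made up, and the real switches work on standard input or named files rather than a list.

```perl
use strict;
use warnings;

# Approximately what `perl -ple 's/foo/bar/g'` does to each input line.
sub substitute_lines {
    my @lines = @_;
    for my $line (@lines) {
        chomp $line;            # -l: remove the newline on input
        $line =~ s/foo/bar/g;   # the code given to -e
        $line .= "\n";          # -l: add the newline back on output
    }
    return @lines;              # -p: every line is printed, changed or not
}

my @input  = ("foo bar\n", "no match\n", "foo and foo\n");
my @output = substitute_lines(@input);
print @output;
```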

If you want to use Perl as a replacement for awk, the main thing you need to know about is the -a option. This populates an array @F which corresponds to $1, $2, $3, etc. in awk. Note however that Perl arrays are indexed from 0, so $F[0] corresponds to $1 etc. A few more correspondences between the languages are given in the table below.

    | awk | perl  |
    |-----+-------|
    | $0  | $_    |
    | $2  | $F[1] |
    | RS  | $/    |
    | ORS | $\    |
    | OFS | $,    |

Perl can have BEGIN and END blocks just like awk.

You can set the field separator in Perl with -F, such as -F: to make the field separator a colon. In newer versions of Perl 5 you don’t have to specify -a if you specify -F; it figures that if you’re setting the field separator, you must want an array of fields to play with.
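The correspondence with awk may be easier to see in a script that does the splitting by hand. This sketch assumes colon-separated records in the style of /etc/passwd; the sample data is made up. A one-liner such as perl -F: -lane 'print $F[0]' performs the same split for each input line.

```perl
use strict;
use warnings;

# Made-up records in the colon-separated style of /etc/passwd.
my @records = ("alice:x:1001:users\n", "bob:x:1002:staff\n");

my @names;
my $uid_total = 0;
for my $record (@records) {
    chomp(my $line = $record);
    my @F = split /:/, $line;   # the split that -a with -F: performs per line
    push @names, $F[0];         # awk's $1
    $uid_total += $F[2];        # awk's $3
}

print "@names\n";       # alice bob
print "$uid_total\n";   # 2003, the kind of summary an END block would print
```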

Extreme syntax

In his book Let Over Lambda, Doug Hoyte says

Lisp is the result of taking syntax away, Perl is the result of taking syntax all the way.

Lisp practically has no syntax. It simply has parenthesized expressions. This makes it very easy to start using the language. And above all, it makes it easy to treat code as data. Lisp macros are very powerful, and these macros are made possible by the fact that the language is simple to parse.

Perl has complex syntax. Some people say it looks like line noise because of its liberal use of non-alphanumeric characters as operators. Perl is not easy to parse — there’s a saying that only Perl can parse Perl — nor is it easy to start using. But the language was designed for regular users, not beginners, because you spend more time using a language than learning it.

There are reasons I no longer use Perl, but I don’t object to the rich syntax. Saying Perl is hard to use because of its symbols is like saying Greek is hard to learn because it has a different alphabet. It takes years to master Greek, but you can learn the alphabet in a day. The alphabet is not the hard part.

Symbols can make text more expressive. If you’ve ever tried to read mathematics from the 18th or 19th century, you’ll see what I mean. Before the 20th century, math publications were very verbose. It might take a paragraph to say what would now be said in a single equation. In part this is because notation has developed and standardized over time. Also, it is now much easier to typeset the symbols someone would use in handwriting. Perl’s repertoire of symbols is parsimonious compared to mathematics.

I imagine that programming languages will gradually expand their range of symbols.

People joke about how unreadable Perl code is, but I think a page of well-written Perl is easier to read than a page of well-written Lisp.  At least the Perl is easier to scan: Lisp’s typographical monotony makes it hard to skim for landmarks. One might argue that a page of Lisp can accomplish more than a page of Perl, and that may be true, but that’s another topic.

* * *

Any discussion of symbols and programming languages must mention APL. This language introduced a large number of new symbols and never gained wide acceptance. I don’t know that much about APL, but I’ll give my impression of why APL’s failure is not proof that programmers won’t use more symbols.

APL required a special keyboard to input. That would no longer be necessary. APL also introduced a new programming model; the language would have been hard to adopt even without the special symbols. Finally, APL’s symbols were completely unfamiliar and introduced all at once, unlike math notation that developed world-wide over centuries.

* * *

What if programming notation were more like music notation? Music notation is predominately non-verbal, but people learn to read it fluently with a little training. And it expresses concurrency very easily. Or maybe programs could look more like choral music, a mixture of symbols and prose.

Learn one Perl command

A while back I wrote a post Learn one sed command. In a nutshell, I said it’s worth learning sed just to do commands of the form sed s/foo/bar/ to replace “foo” with “bar.”

Dan Haskin and Will Fitzgerald suggested in their comments that instead of sed use perl -pe with the same command. The advantage is that you could use Perl’s more powerful regular expression syntax. Will said he uses Perl like this:

    cat file | perl -lpe "s/old/new/g" > newfile

I think they’re right. Except for the simplest regular expressions, sed’s regular expression syntax is too restrictive. For example, I recently needed to remove commas that immediately follow a digit and this did the trick:

    cat file | perl -lpe "s/(?<=\d),//g" > newfile

Since sed does not have the look-behind feature or \d for digits, the corresponding sed code would be more complicated.
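Here is the same substitution as a short script, run on a made-up sentence. The look-behind (?<=\d) requires a digit immediately before the comma without making that digit part of the match, so only the comma is removed.

```perl
use strict;
use warnings;

my $text = "1,234,567 apples, pears, and 8,900 plums";

# Copy $text, then delete every comma that directly follows a digit.
(my $cleaned = $text) =~ s/(?<=\d),//g;

print "$cleaned\n";   # 1234567 apples, pears, and 8900 plums
```

The commas after "apples" and "pears" follow letters, so they are left alone.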

I quit writing Perl years ago. I don’t miss Perl as a whole, but I do miss Perl’s regular expression support.

Learning Perl is a big commitment, but just learning Perl regular expressions is not. Perl is the leader in regular expression support, and many programming languages implement a subset of Perl’s regex features. You could just use a subset of Perl features you already know, but you’d have the option of using more features.
