Python triple quoted strings & multiline regular expressions

There are several ways to quote strings in Python. Triple quotes let strings span multiple lines. Line breaks in your source file become line break characters in your string. A triple-quoted string in Python acts something like a “here doc” in other languages.

However, Python’s indentation rules complicate matters because the indentation becomes part of the quoted string. For example, suppose you have the following code outside of a function.

x = """\
abc
def
ghi
"""

Then you move this into a function foo and change its name to y.

def foo():
    y = """\
    abc
    def
    ghi
    """

Now x and y are different strings! The former begins with a and the latter begins with four spaces. (The backslash after the opening triple quote prevents the following newline from being part of the quoted string. Otherwise x and y would begin with a newline.) The string y also has four spaces in front of def and four spaces in front of ghi. You can’t push the string contents to the left margin because that would violate Python’s formatting rules. (Update: Oh yes you can! See Aaron Meurer’s comment below.)
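A quick way to see the difference is to print the repr of each string. This is a minimal sketch: I've added a return statement to foo so the string can be inspected, and bar is a hypothetical variant whose string contents sit at the left margin inside the function body, which Python accepts.

```python
x = """\
abc
def
ghi
"""

def foo():
    y = """\
    abc
    def
    ghi
    """
    return y  # return added here only so we can inspect the string

def bar():
    # Hypothetical variant: the string contents are flush left.
    # Text inside a string literal is not subject to indentation rules.
    z = """\
abc
def
ghi
"""
    return z

print(repr(x))      # 'abc\ndef\nghi\n'
print(repr(foo()))  # '    abc\n    def\n    ghi\n    '
print(bar() == x)   # True
```

Note that y also ends with four spaces, because the closing triple quote is indented.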

We now give three solutions to this problem.

Solution 1: textwrap.dedent

There is a function in the Python standard library that will strip the unwanted space out of the string y.

import textwrap 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = textwrap.dedent(y)
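As a quick check, here is a sketch that returns the dedented string so it can be printed. The second string, nested, is a made-up example showing that dedent removes only the margin common to all lines and keeps any indentation relative to that margin.

```python
import textwrap

def foo():
    y = """\
    abc
    def
    ghi
    """
    return textwrap.dedent(y)  # return added only for inspection

print(repr(foo()))  # 'abc\ndef\nghi\n'

# Hypothetical example with two levels of indentation.
nested = """\
    if a:
        b()
    """
print(repr(textwrap.dedent(nested)))  # 'if a:\n    b()\n'
```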

This works, but in my opinion a better approach is to use regular expressions [1].

Solution 2: Regular expression with a flag

We want to remove white space, and the regular expression for a white space character is \s. We want to remove one or more white spaces so we add a + on the end. But in general we don’t want to remove all white space, just white space at the beginning of a line, so we stick ^ on the front to say we want to match white space at the beginning of a line.

import re 

def foo():
    y = """\
    abc
    def
    ghi
    """
    y = re.sub(r"^\s+", "", y)

Unfortunately this doesn’t work. By default ^ only matches the beginning of a string, not the beginning of a line. So it will only remove the white space in front of the first line; there will still be white space in front of the following lines.
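To see this concretely, here is a small sketch that writes the contents of y as an ordinary string literal (my own shorthand for the triple-quoted string above) and applies the substitution without any flag.

```python
import re

# Same contents as the indented triple-quoted string y.
y = "    abc\n    def\n    ghi\n    "

result = re.sub(r"^\s+", "", y)
print(repr(result))  # 'abc\n    def\n    ghi\n    '
```

Only the white space at the very start of the string is removed.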

One solution is to add the flag re.MULTILINE to the substitution function. This will signal that we want ^ to match the beginning of every line in our multi-line string.

    y = re.sub(r"^\s+", "", y, re.MULTILINE)

Unfortunately that doesn’t quite work either! The fourth positional argument to re.sub is a count of how many substitutions to make. It defaults to 0, which actually means infinity, i.e. replace all occurrences. You could set count to 1 to replace only the first occurrence, for example. If we’re not going to specify count we have to set flags by name rather than by position, i.e. the line above should be
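The sketch below shows what the positional call amounts to. The flag is an integer constant, so it is silently accepted as a count. I pass it as count= explicitly here, which should behave the same as the positional form; as far as I know, recent Python releases warn when count is passed by position.

```python
import re

# Same contents as the indented triple-quoted string y.
y = "    abc\n    def\n    ghi\n    "

print(int(re.MULTILINE))  # 8

# The flag value is read as count=8, and no flag is actually set,
# so ^ still matches only at the start of the string.
result = re.sub(r"^\s+", "", y, count=int(re.MULTILINE))
print(repr(result))  # 'abc\n    def\n    ghi\n    '
```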

    y = re.sub(r"^\s+", "", y, flags=re.MULTILINE)

That works.
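Here is a small check of the keyword form, again using a literal copy of y. It also illustrates a caveat that is my own addition: \s matches newline characters too, so a blank line inside the string is swallowed along with the indentation. Restricting the pattern to spaces and tabs, as in the hypothetical variant at the end, avoids that.

```python
import re

# Same contents as the indented triple-quoted string y.
y = "    abc\n    def\n    ghi\n    "
stripped = re.sub(r"^\s+", "", y, flags=re.MULTILINE)
print(repr(stripped))  # 'abc\ndef\nghi\n'

# Made-up string with a blank line between abc and def.
z = "    abc\n\n    def\n"
greedy = re.sub(r"^\s+", "", z, flags=re.MULTILINE)
careful = re.sub(r"^[ \t]+", "", z, flags=re.MULTILINE)
print(repr(greedy))   # 'abc\ndef\n'    (blank line lost)
print(repr(careful))  # 'abc\n\ndef\n'  (blank line kept)
```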

You could also abbreviate re.MULTILINE to re.M. The former is more explicit and the latter is more compact. To each his own. There’s more than one way to do it. [2]

Solution 3: Regular expression with a modifier

In my opinion, it is better to modify the regular expression itself than to pass in a flag. The modifier (?m) specifies that in the rest of the regular expression the ^ character should match the beginning of each line.

    y = re.sub(r"(?m)^\s+", "", y)
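A short sketch comparing the modifier with the flag, using a literal copy of y:

```python
import re

# Same contents as the indented triple-quoted string y.
y = "    abc\n    def\n    ghi\n    "

with_modifier = re.sub(r"(?m)^\s+", "", y)
with_flag = re.sub(r"^\s+", "", y, flags=re.MULTILINE)

print(with_modifier == with_flag)  # True
print(repr(with_modifier))         # 'abc\ndef\nghi\n'
```

One thing to keep in mind: as far as I know, Python 3.11 and later require a global modifier such as (?m) to appear at the very start of the pattern, as it does here.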

One reason I believe this is better is that it moves information from a language-specific implementation of regular expressions into a regular expression syntax that is supported in many programming languages.

For example, the regular expression

    (?m)^\s+

would have the same meaning in Perl and Python. The two languages have the same way of expressing modifiers [3], but different ways of expressing flags. In Perl you paste an m on the end of a match operator to accomplish what Python does by setting flags=re.MULTILINE.

One of the most commonly used modifiers is (?i), which indicates that a regular expression should match in a case-insensitive manner. Perl and Python (and other languages) accept (?i) in a regular expression, but each language has its own way of setting the corresponding flag. Perl adds an i after the match operator, and Python uses

    flags=re.IGNORECASE

or

    flags=re.I

as a function argument.
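The three spellings can be compared directly. The sample text and pattern below are made up for illustration.

```python
import re

text = "Python and PERL"

a = re.findall(r"(?i)perl|python", text)
b = re.findall(r"perl|python", text, flags=re.IGNORECASE)
c = re.findall(r"perl|python", text, flags=re.I)

print(a)            # ['Python', 'PERL']
print(a == b == c)  # True
```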

[1] Yes, I’ve heard the quip about two problems. It’s funny, but it’s not a universal law.

[2] “There’s more than one way to do it” is a mantra of Perl and contradicts The Zen of Python. I use the line here as a good-natured jab at Python. Despite its stated ideals, Python has more in common with Perl than it would like to admit and continues to adopt ideas from Perl.

[3] Python’s re module doesn’t support every regular expression modifier that Perl supports. I don’t know about Python’s regex module.

8 thoughts on “Python triple quote strings and regular expressions”

Jonathan

30 January 2021 at 10:52

Heh. I wondered if it was deliberate when you invoked TIMTOWTDI. I think of Python and Perl as two brothers who disagree more intensely *because* they are so similar.

BobC

30 January 2021 at 15:35

I use similar techniques to convert canonical JSON to single-line form. Didn’t know about (?m). Thanks!

Frank Patz-Brockmann

30 January 2021 at 16:58

ICYMI: indentation is not required for multiline strings, so you can have exactly the whitespace you want. Emacs’s Python mode e.g. handles this correctly. Unindented multiline string literals in otherwise indented code are not exactly pretty, however. People tend to put them at the unindented top-level for that reason, or, alternatively, use multiple string fragments and explicit newlines (\n) with line continuations or in parentheses.

Waldir Pimenta

31 January 2021 at 03:43

The regex solution has the downside that any significant whitespace at the beginning of a line (e.g. indentation) will also be removed. Dedent only removes leading whitespace matching that of the line where the triple-quoted string begins.

John

31 January 2021 at 05:28

@Waldir: Thanks. I didn’t think about that. I’ve only used triple quoted strings that are flush left, but I could see how you might quote code or prose that has its own levels of indentation.

You could change the regex from \s+ to \s{4} to remove four spaces, but of course then the number would have to change if the level of indentation changes. But dedent keeps track of that for you.

Pablo Marin-Garcia

12 April 2021 at 18:15

I would prefer inspect.cleandoc() over textwrap.dedent() as the former deals correctly with the first line. Also, if you have text with different levels of indentation, it only removes the baseline indentation common to all lines, not all the leading space of each line.

---
import sys

def cleandoc(doc):
    """Clean up indentation from docstrings.

    Any whitespace that can be uniformly removed from the second line
    onwards is removed."""
    try:
        lines = doc.expandtabs().split('\n')
    except UnicodeError:
        return None
    else:
        # Find minimum indentation of any non-blank lines after first line.
        margin = sys.maxsize
        for line in lines[1:]:
            content = len(line.lstrip())
            if content:
                indent = len(line) - content
                margin = min(margin, indent)
        # Remove indentation.
        if lines:
            lines[0] = lines[0].lstrip()
        if margin < sys.maxsize:
            for i in range(1, len(lines)):
                lines[i] = lines[i][margin:]
        # Remove any trailing or leading blank lines.
        while lines and not lines[-1]:
            lines.pop()
        while lines and not lines[0]:
            lines.pop(0)
        return '\n'.join(lines)

---

Aaron Meurer

4 August 2022 at 17:40

> You can’t push the string contents to the left margin because that would violate Python’s formatting rules.

Yes you can. I do this all the time.

John

5 August 2022 at 09:55

Thanks! I didn’t realize that. Was it not allowed at some point in the past? Maybe I just assumed it wouldn’t work. Will update the post.