Regex Backreferences

Summary: in this tutorial, you’ll learn how to use regex backreferences and their practical applications.

Introduction to Regex Backreferences

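When constructing a regular expression, you can put \n within the pattern, where n is the number of a capturing group (for example, \1 or \2). In this case, \n is a backreference to capturing group n: it matches the same text that the group captured earlier in the match.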
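
As a quick illustration, here is a minimal sketch that uses a pattern made up for this example, /(\d)\1/, to find a digit immediately followed by the same digit. The input string and variable names are illustrative only.

```php
<?php

// Hypothetical sample input for illustration.
$input = 'a1223';

// (\d) captures one digit as group 1; \1 must then match that same digit again.
$pattern = '/(\d)\1/';

if (preg_match($pattern, $input, $matches)) {
    echo $matches[0] . PHP_EOL; // the full match: 22
    echo $matches[1] . PHP_EOL; // the text captured by group 1: 2
}
```

The pair "12" does not match because \1 requires a repeat of whatever group 1 captured, not just any digit.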

Regex Backreference Examples

Let’s take some examples of using regex backreferences.

1) Using regex backreferences to detect doubled words in the text

Suppose you have a text that has some doubled words. For example:

$str = "It's the the PHP 8.0";

In this example, the word 'the' is doubled in the text.

To detect the doubled word, you can use the following regular expression:

'/\b(\w+)\s+\1\b/'

In this regular expression:

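  • \b matches a word boundary. The trailing \b prevents a match when the second word merely starts with the first, as in "the then".
  • (\w+) matches one or more word characters and captures them as group number 1.
  • \s+ matches one or more whitespace characters, such as spaces, tabs, and newlines.
  • \1 is a backreference to capturing group 1; it matches the same text that the group captured.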
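
To see what the trailing \b contributes, the following sketch compares the pattern with and without it. The input "the then" and the variable names are made up for this illustration.

```php
<?php

// Made-up input: "the" is followed by "then", not by a second "the".
$text = 'the then';

$withBoundary    = '/\b(\w+)\s+\1\b/';
$withoutBoundary = '/\b(\w+)\s+\1/';

// preg_match() returns 1 on a match and 0 otherwise.
var_dump(preg_match($withBoundary, $text));    // int(0)
var_dump(preg_match($withoutBoundary, $text)); // int(1)
```

Only the pattern with the trailing \b rejects this input; without it, the backreference matches the first three letters of "then".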

Here’s the complete code:

<?php

$str = "It's the the PHP 8.0";
$pattern = '/\b(\w+)\s+\1\b/';

if (preg_match($pattern, $str, $matches)) {
    print_r($matches);
}

Output:

Array
(
    [0] => the the
    [1] => the
)
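
Once a doubled word can be detected, removing it is a small step. The sketch below is one possible approach, assuming PHP's preg_replace() with the same pattern; the variable name $fixed is illustrative.

```php
<?php

$str = "It's the the PHP 8.0";
$pattern = '/\b(\w+)\s+\1\b/';

// In the replacement string, $1 refers to the text captured by group 1,
// so each doubled word is replaced with a single copy of itself.
$fixed = preg_replace($pattern, '$1', $str);

echo $fixed . PHP_EOL; // It's the PHP 8.0
```

Note that this pattern collapses pairs only. A word repeated three or more times in a row would likely need a variant such as '/\b(\w+)(?:\s+\1\b)+/'.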

2) Using regex backreferences to match the text in single & double quotes

Suppose you need to get the text inside double quotes ("), for example:

"text here"

or single quotes:

'text here'

But not text that opens with one kind of quote and closes with the other, like this:

'will not match."

To do that, you can use the backreferences as shown in the following regular expression:

'/([\'"])(.*?)\1/'

In this regular expression:

  • The [\'"] matches a single character that is either a single quote or a double quote. Since the pattern is written in a single-quoted PHP string, the single quote must be escaped with the backslash character (\).
  • The ([\'"]) creates the first capturing group, with group number 1, which captures the opening quote.
  • The (.*?) creates the second capturing group. It uses a non-greedy quantifier, so it matches as few characters (except the newline) as possible.
  • The \1 is a backreference to the first capturing group, so the closing quote must be the same character as the opening quote.
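
The effect of the non-greedy quantifier is easiest to see next to its greedy counterpart. This sketch uses a made-up input with two quoted strings; the $greedy pattern is a variation introduced only for comparison.

```php
<?php

// Made-up input containing two separate quoted strings.
$text = 'say "a" and "b"';

$lazy   = '/([\'"])(.*?)\1/'; // non-greedy: stops at the first closing quote
$greedy = '/([\'"])(.*)\1/';  // greedy: runs to the last matching quote

preg_match($lazy, $text, $lazyMatches);
preg_match($greedy, $text, $greedyMatches);

echo $lazyMatches[0] . PHP_EOL;   // "a"
echo $greedyMatches[0] . PHP_EOL; // "a" and "b"
```

With the greedy version, the two quoted strings are swallowed as one match, which is usually not what you want when extracting quoted text.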

Here’s the complete code:

<?php

$messages = [
    'They said: "PHP is awesome"',
    "They said: 'PHP is awesome'",
    'They said: "PHP\'s awesome"'
];

$pattern = '/([\'"])(.*?)\1/';

foreach ($messages as $message) {
    if (preg_match($pattern, $message, $matches)) {
        echo $matches[0] . PHP_EOL;
    }
}

Output:

"PHP is awesome"
'PHP is awesome'
"PHP's awesome"
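
The example above prints the whole match, quotes included. As a short follow-up sketch, the individual groups give you the quote character and the inner text separately, and the mismatched input from earlier produces no match at all. The variable $mismatched is a name chosen for this illustration.

```php
<?php

$pattern = '/([\'"])(.*?)\1/';

// Group 1 holds the quote character; group 2 holds the text between the quotes.
if (preg_match($pattern, 'They said: "PHP is awesome"', $matches)) {
    echo $matches[1] . PHP_EOL; // "
    echo $matches[2] . PHP_EOL; // PHP is awesome
}

// Mismatched quotes: neither the ' nor the " is closed by the same character.
$mismatched = '\'will not match."';
var_dump(preg_match($pattern, $mismatched)); // int(0)
```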

Summary

  • The \n in the pattern is a backreference that references the capturing group n, where n is an integer greater than zero.