Using RegEx Subexpressions in MySQL

As I have discussed, regular expressions can be a powerful tool for developers. When we use regular expressions to match a pattern and need to replace part of that pattern, subexpressions make it much easier. Subexpressions are specific parts of a pattern that we can reference elsewhere in the process.
The Requirement
The requirement for this exercise is simple. If the word ‘ipsum’ exists in our string, and there is a word before the instance of ‘ipsum’, replace that word with four asterisks. For example, if we had a string that contained the text ...Scott ipsum...
, or solution must return ...**** ipsum...
.
The Setup
This exercise will use the following query as our starting point.
I chose those specific rows because ipsum
is the second word in each of them. The result of this query resembles the following:
The Solution
As I said above, a subexpression is an annotated portion of our pattern in regex. In our regular expression, we separate subexpressions by wrapping them in parentheses ( )
. Each subexpression can then be referenced using a dollar sign $
followed by the subexpression number you want (numbered from left to right).
Let’s take a look at a query that satisfies our requirements.
The second argument of regexp_replace()
shows that two regex parts are wrapped in parentheses. This annotation means that we will have two subexpressions. Let’s break these down individually. We will start with (\\b.*\\b)
.
(
: The left parenthesis indicates the start of our subexpression. It is NOT used as part of the pattern match.\\b
: This part of the regex says we want to match a word boundary..+
: Next, we specify we want to match one or more characters.- The
.
indicates any character. - The
+
indicates one or more.
- The
\\b
: This part of the regex says we want to match a word boundary.)
: The right parenthesis indicates the end of our subexpression. It is NOT used as part of the pattern match.
The second subexpression, (ipsum)
, is easier to read.
(
: The left parenthesis indicates the start of our subexpression. It is NOT used as part of the pattern match.ipsum
: Indicates we want to match the literal string ‘ipsum’.)
: The right parenthesis indicates the end of our subexpression. It is NOT used as part of the pattern match.
When we combine these two, our pattern translates to a word boundary followed by one or more characters, followed by another word boundary, followed by the text ‘ipsum’.
We reference the second subexpression using $2
in the third argument. Now, regexp_repalce()
will replace our matching pattern with the literal string ****
and the value of our second subexpression.
Here are the results of this query:
The word before ipsum
in each row has been replaced with ****
.
Having Some Fun
The solution above satisfies our requirements, but let’s see how else we can use these subexpressions.
If we wanted to replace ipsum
with ****
, we could use this query:
The only change we made was to the text we want to use as a replacement. We are using the value of the first subexpression followed by ****
. The results of this query look like:
This example shows how to use subexpressions, but we would not need regex to replace the text ipsum
with ****
. But what if we wanted to swap the words that match our pattern?
We can accomplish that simply by updating our replacement text to $2 $1
like in the query below:
The result of this query would be:
Now, each snippet starts with ipsum
and is followed by the word that preceded it in the original text.
If this were a real-life scenario, I would probably add logic to ensure the capitalization made sense, but I think this is good enough for this example.
Wrap Up
Subexpressions in regex can help isolate parts of a pattern so we can reuse them as part of a replacement process. In this post, I showed how we can use each of two subexpressions in different configurations. For more information on regular expressions in MySQL, check out the documentation.
Photo by Ferdinand Stöhr on Unsplash