Db2 for IBM i: Confounding and Constricting Concatenations

This month’s title is a bit of a tongue twister, but the
title does capture a problematic SQL coding practice with concatenations that
has been popping up in recent SQL Performance Assessments performed by our IBM
Technology Services team. Speaking of tongue twisters, did you know that MIT
researchers developed the toughest tongue twister in the world of “Pad kid poured curd pulled pod”? I have no
idea what this phrase means, but I do know about the confusion and poor
performance that can be caused by suboptimal usage of the SQL concatenation
support.

Let’s start by looking at the concatenation coding practices
that we’ve been finding when analyzing customer’s SQL requests. The following
two SELECT statements contain the subpar coding pattern of comparing the concatenation
of two column values with the concatenated result of two other columns. This
concatenated comparison shows up on the join condition of the first query and
the WHERE clause of the second query.

SELECT * FROM tab1 a INNER JOIN tab2 b
ON a.c1 || a.c2 = b.c1 || b.c2

SELECT * FROM tab1 a
WHERE CONCAT(a.c1, a.c2) IN
(SELECT CONCAT(b.c1, b.c2) FROM tab2 b)

The
alternative to the concatenated column comparisons is comparing the individual
columns and then logically ANDing the results of those individual comparisons
(e.g., a.c1 = b.c1 AND a.c2 = b.c2). When comparing the two approaches, I
believe that the individual column comparisons are easier to read and
understand. While this opinion can be debated as programmer preference, there’s
no debating that the concatenated column comparison will result in slower query
performance.

There
are a couple of reasons that this concatenation coding practice is a poor
performer. First of all, comparisons that include an expression or derivation
severely limit the query optimizer’s ability to estimate how many rows in the table
will be selected or processed. When this number of rows estimate generated by
the query optimizer is inaccurate, there is a greater chance for poor query
performance. Second, the concatenated columns comparison prevents the use of
normal indexes to speed up the query execution. Indexes are often the fastest
way to perform the specified column comparison, but normal indexes are not an
option due to the concatenated expressions in these two examples. Yes, a
derived key index could be created that includes the concatenation in the
column example. However, there’s a good chance that you already have a normal
index created over the columns being concatenated.

Now that
you understand the drawbacks of the concatenated column comparisons, let’s look
at the optimal way of coding these two example queries. As you can see, the
join query can be rewritten in one of two ways to improve performance and
readability.

SELECT *
FROM tab1 a INNER JOIN tab2 b
ON a.c1 = b.c1 AND a.c2 = b.c2

SELECT *
FROM tab1 a INNER JOIN tab2 b
ON (a.c1, a.c2) = (b.c1, b.c2)

The
second join query utilizes a row value expression to simplify the join condition by removing the need to include
the logical AND operator. These two queries will perform exactly the same, so
this is strictly a programming style preference when deciding which syntax to
use.

The
rewrite of our second example SELECT statement also uses the row value
expression syntax to eliminate the concatenated column comparison.

SELECT * FROM tab1 a
WHERE (a.c1, a.c2) IN
(SELECT b.c1, b.c2 FROM tab2 b)

Like many programming languages, SQL offers more than one
way to get your work done. While this flexibility can be a good thing, you have
also now learned that not all SQL solutions are created equal in terms of
performance.

This is my last entry for 2022 – have a Merry Christmas
& wonderful holiday season and we’ll talk in 2023!