Introduction
I’d been wanting to post about the joys of SQL Server Transact SQL’s (new to SQL Server 2005) ranking functions since learning of them. I wrote this article originally as an explanation to my fellow development team members about what I had done and why I did it.
It then morphed into a blog/article posting that never found a home. It was lost forever until this morning when I resurrected a copy of it from a hard drive that had been sitting in a long dead notebook.
Overview
As a developer, there is so much to love about SQL Server 2005.
In my current role, our primary responsiblity, an Intranet, is largely AJAX based, so we’ve taken great pleasure in converting such things as the formerly clunky FOR XML EXPLICIT syntax to the far more elegant FOR XML PATH syntax; allowing us to generate XML documents in a largely “native” manner.
That specific need aside, the new feature that has afforded us the greatest improvements in code clarity and performance are the SQL Server ranking functions.
This article will describe a recently encountered problem and how it was solved using those ranking functions to improve:
- the performance of the code
- the readability of the code and;
- maintainability of the code
Boxing Algorithms for Mailouts
In about three weeks time, I’ll be starting a new job; as such, I am, in my current role, starting to wind down. This means that the majority of my time has been allocated to finishing up current projects, documenting what I know and transferring knowledge to other team members.
In between times, I’ve been devoting my “spare” time to those areas of the system that I’ve “always wanted to improve, but never had the time”.
A large sub-system of our Intranet is devoted to the selection of “contact” data and the subsequent mailout of invitations to those contacts. A couple of years ago, the company expanded it’s operations into New Zealand; this brought with it, some unique issues, not least of which were some very specific requirements from NZ Post, with respect to the “boxing” of bulk mail in order to receive bulk discounts.
In New Zealand, postcodes are allocated in groups to “Lodgement Centres”. The mail for each lodgement centre must be “boxed” into groups of 300, sorted by descending postcode.
Ever since we upgraded our system to SQL Server 2005 from it’s former SQL Server 2000, and I was introduced to T-SQL’s new ranking functions, I could see that this problem was crying out to be solved using them; but I never had the time to re-write existing code. Yesterday, I finally got around to doing so.
NB: You’ll see the use of the new XML datatype in both the old and new version of the code; this is due to that portion of the code already having been converted post SQL Server 2005 upgrade.
The Old Solution (Cursors – ewww)
The former solution uses a cursor to iterate through the contact/invitation records, such that we can deal with them per lodgement centre:
-- @xmlSeminars is actually passed into a stored procedure, but is explicitly declared for -- clarity in this example
DECLARE @xmlSeminars XML;
SELECT @xmlSeminars = '<seminars><seminar id="123" /><seminar id="456" /></seminars>';
CREATE TABLE #invite
(
intSeminarContactID INT,
intLodgementCentreID INT,
strPostCode VARCHAR(6)
);
CREATE TABLE #lodgement_centre
(
intOrder INT IDENTITY(0, 1),
intSeminarContactID INT,
strPostCode VARCHAR(6)
);
CREATE TABLE #invite_lodgement_centre
(
intOrder INT,
intSeminarContactID INT,
intLodgementCentreID INT,
strPostCode VARCHAR(6),
intBox INT
);
INSERT
#invite
SELECT
sc.intSeminarContactID,
lcb.intLodgementCentreID,
a.strPostCode
FROM
tblSeminarContact sc
JOIN
tblContact c ON c.intContactID = sc.intContactID
JOIN
tblAddress a ON a.intAddressID = c.intAddressID
JOIN
tblLodgementCentreBand lcb ON a.strPostCode BETWEEN lcb.strPostCode_From AND lcb.strPostCode_To
JOIN
@xmlSeminars.NODES('seminars/seminar') AS seminars(seminar) ON seminars.seminar.VALUE('@id', 'INT') = sc.intSeminarID;
DECLARE @intLodgementCentreID INT;
DECLARE curLodgementCentre CURSOR FOR
SELECT DISTINCT
#invite.intLodgementCentreID
FROM
#invite;
OPEN curLodgementCentre;
FETCH NEXT FROM curLodgementCentre
INTO @intLodgementCentreID;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT
#lodgement_centre
SELECT
#invite.intSeminarContactID,
#invite.strPostCode
FROM
#invite
WHERE
#invite.intLodgementCentreID = @intLodgementCentreID
ORDER BY
#invite.strPostCode;
INSERT
#invite_lodgement_centre
SELECT
#invite.intOrder,
#inviteintSeminarContactID,
@intLodgementCentreID,
#invite.strPostCode,
#invite.intOrder / 300
FROM
#invite;
FETCH NEXT FROM curLodgementCentre
INTO @intLodgementCentreID;
END
CLOSE curLodgementCentre;
DEALLOCATE curLodgementCentre;
This causes the following problems:
- it hampers performance, due to:
- using programmatic iterative operations instead of set based operations, which is what SQL Server is good at
- the overhead of cursors
- it obfuscates the intention of the boxing algorithm
The New Solution (Ranking Functions – ROW_NUMBER and DENSE_RANK)
The new code solves all of these problems in two simply elegant, set based operations.
1. Ordering and Partitioning
-- @xmlSeminars is actually passed into a stored procedure, but is explicitly declared for clarity in this example
DECLARE @xmlSeminars XML;
SELECT @xmlSeminars = '<seminars><seminar id="123" /><seminar id="456" /></seminars>';
CREATE TABLE #invite
(
intSeminarContactID INT,
intLodgementCentreID INT,
strPostCode VARCHAR(6),
intOrder INT
);
CREATE TABLE #invite_lodgement_centre
(
intSeminarContactID INT,
intLodgementCentreID INT,
strPostCode VARCHAR(6),
intOrder INT,
intBox INT
);
INSERT
#invite
SELECT
sc.intSeminarContactID,
lcb.intLodgementCentreID,
a.strPostCode,
(ROW_NUMBER() OVER (PARTITION BY lc.intLodgementCentreID ORDER BY a.strPostCode DESCENDING)) - 1 -- (1)
FROM
tblSeminarContact sc
JOIN
tblContact c ON c.intContactID = sc.intContactID
JOIN
tblAddress a ON a.intAddressID = c.intAddressID
JOIN
tblLodgementCentreBand lcb ON a.strPostCode BETWEEN lcb.strPostCode_From AND lcb.strPostCode_To
JOIN
@xmlSeminars.NODES('seminars/seminar') AS seminars(seminar) ON seminars.seminar.VALUE('@id', 'INT') = sc.intSeminarID;
(1)
This initial INSERT statement makes use of the new ROW_NUMBER()
ranking function that allows us to generate an ordinal number (ORDER BY
) for each record in the set, optionally partitioning those ordinal numbers into groups (PARTITION BY
); the results generated look something like the following:
intSeminarContactID |
intLodgementCentreID |
strPostCode |
intOrder |
123 |
1 |
9999 |
0 |
234 |
1 |
9999 |
1 |
345 |
1 |
9998 |
2 |
…elided… |
456 |
1 |
9996 |
299 |
567 |
1 |
9996 |
300 |
678 |
1 |
9996 |
301 |
789 |
1 |
9995 |
302 |
4234 |
2 |
6666 |
0 |
1234 |
2 |
6666 |
1 |
1235 |
2 |
6666 |
2 |
1236 |
2 |
6665 |
3 |
1237 |
2 |
6665 |
4 |
1238 |
2 |
6665 |
5 |
1239 |
2 |
6664 |
6 |
1211 |
2 |
6664 |
7 |
…elided… |
NB: I decrement the ordinal number returned by ROW_NUMBER()
by one in (1) above, to allow a simple integer division calculation in the following statement:
2. Boxing
INSERT
#invite_lodgement_centre
SELECT
#invite.intSeminarContactID,
#invite.intLodgementCentreID,
#invite.strPostCode,
#invite.intOrder,
#invite.intOrder / 300 + DENSE_RANK() OVER (ORDER BY #invite.intLodgementCentreID); -- (2)
(2)
The first part of statement (2) divides #invite.intOrder / 300
our invitations into boxes of 300.
The second part of the statement “corrects” the boxing such that each lodgement centre has a distinct set of boxes; this works because the DENSE_RANK()
“ranks” the dataset according to the content of the ORDER BY CLAUSE
. We use DENSE_RANK()
instead of RANK()
so that the intBox
box identifiers are kept contiguous.
So now, our new dataset has each invitation record, allocated to a “box” of 300 or fewer, where each box contains records from only a single lodgement centre.
intSeminarContactID |
intLodgementCentreID |
strPostCode |
intOrder |
intBox |
123 |
1 |
9999 |
0 |
1 |
234 |
1 |
9999 |
1 |
1 |
345 |
1 |
9998 |
2 |
1 |
…elided… |
456 |
1 |
9996 |
299 |
1 |
567 |
1 |
9996 |
300 |
2 |
678 |
1 |
9996 |
301 |
2 |
789 |
1 |
9995 |
302 |
2 |
4234 |
2 |
6666 |
0 |
3 |
1234 |
2 |
6666 |
1 |
3 |
1235 |
2 |
6666 |
2 |
3 |
1236 |
2 |
6665 |
3 |
3 |
1237 |
2 |
6665 |
4 |
3 |
1238 |
2 |
6665 |
5 |
3 |
1239 |
2 |
6664 |
6 |
3 |
1211 |
2 |
6664 |
7 |
3 |
…elided… |
Conclusion
As you can see, the ranking functions clean up the code quite considerably; in addition to being far simpler, the code is now far easier to maintain, with the original intention of the code no longer obfuscated behind a wall of crazy cursor logic.
A DBMS will always be far happier working with set based operations; the introduction of ranking functions in SQL Server 2005 is one more way in which we can keep SQL Server, and those of us who work with it everyday, happy.