Pogo69's Blog

July 12, 2012

SQL Server (2005+) Ranking Functions – A Better Mailout Boxing Algorithm

Filed under: Cutting Code, SQL Server, Transact SQL — pogo69 [Pat Janes] @ 15:41

Introduction

I’d been wanting to post about the joys of SQL Server Transact SQL’s (new to SQL Server 2005) ranking functions since learning of them.  I wrote this article originally as an explanation to my fellow development team members about what I had done and why I did it.

It then morphed into a blog/article posting that never found a home.  It was lost forever until this morning when I resurrected a copy of it from a hard drive that had been sitting in a long dead notebook.

Overview

As a developer, there is so much to love about SQL Server 2005.

In my current role, our primary responsiblity, an Intranet, is largely AJAX based, so we’ve taken great pleasure in converting such things as the formerly clunky FOR XML EXPLICIT syntax to the far more elegant FOR XML PATH syntax; allowing us to generate XML documents in a largely “native” manner.

That specific need aside, the new feature that has afforded us the greatest improvements in code clarity and performance are the SQL Server ranking functions.

This article will describe a recently encountered problem and how it was solved using those ranking functions to improve:

  • the performance of the code
  • the readability of the code and;
  • maintainability of the code

Boxing Algorithms for Mailouts

In about three weeks time, I’ll be starting a new job; as such, I am, in my current role, starting to wind down. This means that the majority of my time has been allocated to finishing up current projects, documenting what I know and transferring knowledge to other team members.

In between times, I’ve been devoting my “spare” time to those areas of the system that I’ve “always wanted to improve, but never had the time”.

A large sub-system of our Intranet is devoted to the selection of “contact” data and the subsequent mailout of invitations to those contacts. A couple of years ago, the company expanded it’s operations into New Zealand; this brought with it, some unique issues, not least of which were some very specific requirements from NZ Post, with respect to the “boxing” of bulk mail in order to receive bulk discounts.

In New Zealand, postcodes are allocated in groups to “Lodgement Centres”. The mail for each lodgement centre must be “boxed” into groups of 300, sorted by descending postcode.

Ever since we upgraded our system to SQL Server 2005 from it’s former SQL Server 2000, and I was introduced to T-SQL’s new ranking functions, I could see that this problem was crying out to be solved using them; but I never had the time to re-write existing code. Yesterday, I finally got around to doing so.

NB: You’ll see the use of the new XML datatype in both the old and new version of the code; this is due to that portion of the code already having been converted post SQL Server 2005 upgrade.

The Old Solution (Cursors – ewww)

The former solution uses a cursor to iterate through the contact/invitation records, such that we can deal with them per lodgement centre:

        -- @xmlSeminars is actually passed into a stored procedure, but is explicitly declared for -- clarity in this example
        DECLARE @xmlSeminars XML;
        SELECT @xmlSeminars = '<seminars><seminar id="123" /><seminar id="456" /></seminars>';

        CREATE TABLE #invite
        (
            intSeminarContactID INT,
            intLodgementCentreID INT,
            strPostCode VARCHAR(6)
        );

        CREATE TABLE #lodgement_centre
        (
            intOrder INT IDENTITY(0, 1),
            intSeminarContactID INT,
            strPostCode VARCHAR(6)
        );

        CREATE TABLE #invite_lodgement_centre
        (
            intOrder INT,
            intSeminarContactID INT,
            intLodgementCentreID INT,
            strPostCode VARCHAR(6),
            intBox INT
        );

        INSERT
            #invite
        SELECT
            sc.intSeminarContactID,
            lcb.intLodgementCentreID,
            a.strPostCode
        FROM
            tblSeminarContact sc
        JOIN
            tblContact c ON c.intContactID = sc.intContactID
        JOIN
            tblAddress a ON a.intAddressID = c.intAddressID
        JOIN
            tblLodgementCentreBand lcb ON a.strPostCode BETWEEN lcb.strPostCode_From AND lcb.strPostCode_To
        JOIN
            @xmlSeminars.NODES('seminars/seminar') AS seminars(seminar) ON seminars.seminar.VALUE('@id', 'INT') = sc.intSeminarID;

        DECLARE @intLodgementCentreID INT;

        DECLARE curLodgementCentre CURSOR FOR
            SELECT DISTINCT
                #invite.intLodgementCentreID
            FROM
                #invite;

        OPEN curLodgementCentre;

        FETCH NEXT FROM curLodgementCentre
        INTO @intLodgementCentreID;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            INSERT
                #lodgement_centre
            SELECT
                #invite.intSeminarContactID,
                #invite.strPostCode
            FROM
                #invite
            WHERE
                #invite.intLodgementCentreID = @intLodgementCentreID
            ORDER BY
                #invite.strPostCode;

            INSERT
                #invite_lodgement_centre
            SELECT
                #invite.intOrder,
                #inviteintSeminarContactID,
                @intLodgementCentreID,
                #invite.strPostCode,
                #invite.intOrder / 300
            FROM
                #invite;

            FETCH NEXT FROM curLodgementCentre
            INTO @intLodgementCentreID;
        END

        CLOSE curLodgementCentre;
        DEALLOCATE curLodgementCentre;

This causes the following problems:

  • it hampers performance, due to:
    • using programmatic iterative operations instead of set based operations, which is what SQL Server is good at
    • the overhead of cursors
  • it obfuscates the intention of the boxing algorithm

The New Solution (Ranking Functions – ROW_NUMBER and DENSE_RANK)

The new code solves all of these problems in two simply elegant, set based operations.

1. Ordering and Partitioning

        -- @xmlSeminars is actually passed into a stored procedure, but is explicitly declared for clarity in this example
        DECLARE @xmlSeminars XML;
        SELECT @xmlSeminars = '<seminars><seminar id="123" /><seminar id="456" /></seminars>';

        CREATE TABLE #invite
        (
            intSeminarContactID INT,
            intLodgementCentreID INT,
            strPostCode VARCHAR(6),
            intOrder INT
        );

        CREATE TABLE #invite_lodgement_centre
        (
            intSeminarContactID INT,
            intLodgementCentreID INT,
            strPostCode VARCHAR(6),
            intOrder INT,
            intBox INT
        );

        INSERT
            #invite
        SELECT
            sc.intSeminarContactID,
            lcb.intLodgementCentreID,
            a.strPostCode,
            (ROW_NUMBER() OVER (PARTITION BY lc.intLodgementCentreID ORDER BY a.strPostCode DESCENDING)) - 1      -- (1)
        FROM
            tblSeminarContact sc
        JOIN
            tblContact c ON c.intContactID = sc.intContactID
        JOIN
            tblAddress a ON a.intAddressID = c.intAddressID
        JOIN
            tblLodgementCentreBand lcb ON a.strPostCode BETWEEN lcb.strPostCode_From AND lcb.strPostCode_To
        JOIN
            @xmlSeminars.NODES('seminars/seminar') AS seminars(seminar) ON seminars.seminar.VALUE('@id', 'INT') = sc.intSeminarID;

(1)
This initial INSERT statement makes use of the new ROW_NUMBER() ranking function that allows us to generate an ordinal number (ORDER BY) for each record in the set, optionally partitioning those ordinal numbers into groups (PARTITION BY); the results generated look something like the following:

intSeminarContactID intLodgementCentreID strPostCode intOrder
123 1 9999 0
234 1 9999 1
345 1 9998 2
…elided…
456 1 9996 299
567 1 9996 300
678 1 9996 301
789 1 9995 302
4234 2 6666 0
1234 2 6666 1
1235 2 6666 2
1236 2 6665 3
1237 2 6665 4
1238 2 6665 5
1239 2 6664 6
1211 2 6664 7
…elided…

NB: I decrement the ordinal number returned by ROW_NUMBER() by one in (1) above, to allow a simple integer division calculation in the following statement:

2. Boxing

        INSERT
            #invite_lodgement_centre
        SELECT
            #invite.intSeminarContactID,
            #invite.intLodgementCentreID,
            #invite.strPostCode,
            #invite.intOrder,
            #invite.intOrder / 300 + DENSE_RANK() OVER (ORDER BY #invite.intLodgementCentreID);                                                         -- (2)

(2)
The first part of statement (2) divides #invite.intOrder / 300 our invitations into boxes of 300.

The second part of the statement “corrects” the boxing such that each lodgement centre has a distinct set of boxes; this works because the DENSE_RANK() “ranks” the dataset according to the content of the ORDER BY CLAUSE. We use DENSE_RANK() instead of RANK() so that the intBox box identifiers are kept contiguous.

So now, our new dataset has each invitation record, allocated to a “box” of 300 or fewer, where each box contains records from only a single lodgement centre.

intSeminarContactID intLodgementCentreID strPostCode intOrder intBox
123 1 9999 0 1
234 1 9999 1 1
345 1 9998 2 1
…elided…
456 1 9996 299 1
567 1 9996 300 2
678 1 9996 301 2
789 1 9995 302 2
4234 2 6666 0 3
1234 2 6666 1 3
1235 2 6666 2 3
1236 2 6665 3 3
1237 2 6665 4 3
1238 2 6665 5 3
1239 2 6664 6 3
1211 2 6664 7 3
…elided…

Conclusion

As you can see, the ranking functions clean up the code quite considerably; in addition to being far simpler, the code is now far easier to maintain, with the original intention of the code no longer obfuscated behind a wall of crazy cursor logic.

A DBMS will always be far happier working with set based operations; the introduction of ranking functions in SQL Server 2005 is one more way in which we can keep SQL Server, and those of us who work with it everyday, happy.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: