Friday, June 29, 2012

SQL 2012 Upgrade Part 5: Query Playback Testing

Trace replay was a very important part of our testing plan. Our desire was to complete the SQL Server 2012 upgrade without the end-users noticing any change.There is no practical way to completely ensure that we could accomplish this but being able to successfully execute a batch of queries that emulates the live production environment would go a long way towards getting the "warm and fuzzy" feeling.

Initiating SSAS Replay

Playing back the Analysis Services queries was pretty easy, if a bit on the manual side. I basically loaded up the traces collected previously in SQL Server Profiler and used the Start option from the Replay menu. This generates a prompt to log into a server which I connected to the cloned SSAS server. As I mentioned in my previous post, I set up the replay to minimize the amount of disk space used on the C: drive. I also configured it to save the results to disk and replay using multiple threads. In the latter case I was more interested in blasting through the test than being able to debug any problems.

The Replay Configuration dialog with the settings I chose for playback.

Initiating Data Engine Replay

For the Data Engine playback, I found Profiler to be difficult to use and, instead, turned to the SQL Server Distributed Replay tool. This comes with SQL Server 2012 and is actually very well documented. Rather than bloat this article with a ton of information regurgitated from the official documentation, I will just cover what I did differently from what the documentation suggests.

First of all, my scenario wasn't very "distributed". I just ran one replay client and ran it on the same machine as the controller. This kept security simple as I didn't have to configure Kerberos in my cloned environment.

When I first started running the playback it was incredibly slow. It turns out that the replay tool retains all the idle time between queries. I just wanted to the queries to fire off one after the other. To accomplish this I set the Maximum idle time (<MaxIdleTime>) setting in the DReplay.exe.preprocess.config file to "0" which "indicates that there is some activity going on at any given point in time." (See the Preprocess Configuration File topic in the documentation.)

That was it, pretty simple. Both the pre-process and replay steps ran pretty quickly. Some of the queries were slow but I didn't have a whole lot of traffic to replay. Most of our BI queries go against cubes so this part of the testing went very smoothly.

As a side note, the reason I could not use Profiler for this was that it was running the queries in exactly the order captured. This sounds like a good thing but consider the following scenario.
  1. Connection A established with DatabaseA
  2. Connection A runs a query
  3. Connection B established with DatabaseB
  4. Connection A runs another query and gets an error indicating that the object doesn't exist or the user doesn't have permission to use it
What's going on here?! Well, when Profiler plays the queries back it only uses one connection. Since the two connections in the original capture had their own context that couldn't conflict with each other. What happens during the replay is that Step 3 actually becomes a Change Database (USE DatabaseB) command and the query that Connection A sent won't work if it's in the wrong database.

Analyzing Replay Results

Once the replay is complete, it's pretty easy to see how things went. The final record in the results trace provides some metrics that look like this (highlight added) for an SSAS replay:
    Replay server: MYSSASSERVER
    Total events = 2537
    Replayed events = 2527
    Provider errors = 22
    Internal errors = 0
    Total errors = 22
    Provider hit ratio = 99.13 %
    Internal hit ratio = 100.00 %
    Overall hit ratio = 99.13 %
    Replay time = 0 hr 3 min 58 sec 908 ms

or this for a Data Engine replay:
    <?xml version="1.0" encoding="UTF-16"?>
        <ReplayDuration>63 hr 13 min 31 sec 265 ms</ReplayDuration>

Don't ask me why one is XML and the other isn't. You'd think it would be the other way around at the very least!

Those error counts are what you're looking for to determine success or failure. Non-zero don't necessarily indicate failure though; they just tell that you have some further investigation to do. Internal Errors indicate problems with the trace playback, network errors and the like. Provider Errors are the ones you really need to look at.

The easiest way I found to tackle the errors was to load the replay results into a SQL Server table and search for the errors.The Event Class for Provider Errors is 63485 so you can write a query like the one below to get a list of all your errors.

SELECT RowNumber,
  FROM dbo.Trace_Replay_Results
 WHERE EventClass = 63485

If this turns up some error messages that concern you and you want to see the query that resulted in the error you can query the same table for the row number the preceded the error you are interested in. For example, if you have a troublesome error at row number 1280, you can use the query below to get the event with the query that triggered the error.

SELECT TextData,
  FROM dbo.Trace_Replay_Results
 WHERE RowNumber = 1279

Some of the error scenarios we encountered (all of which were with queries against the SSAS servers) are described below with some sample error messages.

Original query was invalid

The query that was captured for playback was invalid and would have resulted in an error response for the original user or during playback.

Error Messages:
  • Parser: The following syntax error occurred during parsing: Invalid token, Line 1, Offset 1, [2012 2016.
  • Query (11, 8) The STRTOMEMBER function expects a member expression for the 1 argument. A string or numeric expression was used.

Excel query using temporary objects

Excel sometimes creates and uses temporary objects on the server side. If these objects are not created before the query is executed an error will result. The playback process is not able to handle these scenarios.

Error Messages:
  • The MyCube_XL_GROUPING0 cube either does not exist or has not been processed.
  • The level '&[1200001234]' object was not found in the cube when the string, [Item Details].[Line Num].&[11], was parsed.
  • The Current Flag hierarchy is used more than once in the Crossjoin function.

Old data or schema

The playback was performed against database that had not been processed against live data in several weeks or may have even been changed structurally. Queries against old data can result in errors if hierarchies are built on the data requested. Additionally, schema changes made in Production between the time the test databases were copied and when the replay queries were collected would not have been applied to the test environment.

Error Messages:
  • The [Order Item].[Order Number].[Order Number].[Actual Delivery- Confirmed Delivery] dimension attribute was not found.
  • The 'Employee History' dimension contains more than one hierarchy, therefore the hierarchy must be explicitly specified.


Overall, the replay testing went very well. The tools provided with SQL Server worked as expected for the most part. This check step gave us the confidence we needed to proceed with the upgrade. Have you had different results or anything to add to this information? If so, please share your experiences so that others may benefit.

1 comment:

  1. Dear Bryant Byrd,
    How do we know EventClass 63485 is a error?