One of the more challenging types of performance problem is the dreaded scenario in which the problem occurs only intermittently and is difficult to trap with your performance measurement tools. In such a situation, trace data can be an irreplaceable ally, allowing you to understand even phenomena that you’ve not yet actually measured.
Recently, a client came to us with a problem:
“We have this batch job. It processes pretty much the same amount of data every time we run it. It usually runs in a little over an hour, but sometimes—out of the blue—it’ll run nearly two and a half hours. We have no idea when it’s going to happen. There must be a pattern to it; we just can’t figure out what it is. It was slow last Tuesday, but it’s not slow every Tuesday. It’s slow sometimes between three and four o’clock, but not always, and sometimes it’s slow at other times. We thought maybe it was interference with our daily batch jobs, but we’ve proven that that’s not it, either. We just can’t correlate it to anything…”
The easiest way for us to kill a problem like this is to collect two trace files: (1) a trace of a fast execution (our baseline), and (2) a trace of a slow execution. Then we could see in vivid color what’s different when the program runs slowly.
The client ran a baseline trace of a fast execution right away, but they haven't yet traced the program in the act of running slowly. Here's the top-level profile for the fast run (created by our Method R Workbench software):
The thing that really caught my eye in this profile was the mean duration per "db file sequential read" call. I expect disk read calls to consume 2–5 ms each, but this program executed over a million read calls with an average latency of just 0.849 ms.
That’s awesome, but what’s going to happen someday when the storage array is too busy to serve up sub-millisecond read calls? Well, the profile makes it easy to find out. With a simple spreadsheet, I can calculate what the total job duration would be if any of the call counts or durations were to change.
The “baseline execution” table contains data from the profile. The “forecast execution” table contains the same rows and columns, but the duration for each subroutine uses the formula duration = calls × duration per call, so that the duration will recalculate whenever a call count or latency value in a green cell changes. Now I can what-if to my heart’s content.
The 0.005090 value shown in the figure answers the question, "If all else were held equal, what 'db file sequential read' latency would it take to drive the job's total duration to 8,700 s (roughly double its baseline value of 4,358 s)?" The model makes that easy to figure out: the latency that would do it is 0.005090 s, or about 5 ms.
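The same what-if arithmetic can be sketched in a few lines of code. This is a minimal model, not the Workbench software itself: the call count of 1,020,000 is a hypothetical stand-in (the profile says only "over a million" calls), while the 4,358 s total, the 0.000849 s mean read latency, and the 8,700 s target come from the figures discussed above.

```python
def latency_to_hit_target(total_s, calls, baseline_latency_s, target_s):
    """Solve duration = other + calls * latency for the read latency that
    would drive the job's total duration to target_s, all else held equal."""
    other_s = total_s - calls * baseline_latency_s  # time spent outside read calls
    return (target_s - other_s) / calls

latency = latency_to_hit_target(
    total_s=4358.0,             # baseline job duration from the profile
    calls=1_020_000,            # hypothetical: the profile says "over a million"
    baseline_latency_s=0.000849,
    target_s=8700.0,            # roughly double the baseline duration
)
print(f"{latency:.6f} s")      # close to the 0.005090 s value in the figure
```

With these assumed inputs the solved latency lands near 5 ms, which is the point of the exercise: a small change in one cell of the model reveals how far the total duration moves.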
This is interesting, because 5 ms is a typical read call latency on a lot of systems I see. At a latency that is merely typical for many sites, this job's duration would double, which is exactly what it has been doing intermittently.
Of course, the model doesn't explain why the job did run slowly (we'll still need the second trace file for that), but it certainly shows why the job will run slowly if disk I/O latencies degrade. This program's sensitivity to perturbations in disk latency is our cue to eliminate as many OS read calls as we can.
Once you can comprehend any program’s duration as nothing more complicated than the sum of a list of count × latency values, you’ll have a new lens through which previously intractable performance problems can surrender to logical explanation. Being able to “see” how a program spends your time can help you understand how that program will behave even under conditions that you have not yet measured. This is a vital capability to have.
This post is an excerpt from a new chapter in the forthcoming Third Edition of Cary Millsap’s The Method R Guide to Mastering Oracle Trace Data.
Author’s edit 2019-04-12: The Method R Guide to Mastering Oracle Trace Data is now released and available for purchase.