Categories
Blog Posts

The Problem with “Wait”

There’s a problem with the Oracle technical term wait.

The Oracle Wait Interface

In 1991, Oracle Corporation released some of the most important software instrumentation of all time: the “wait” statistics that were implemented in Oracle 7.0. Here’s part of the story, in Juan Loaiza’s words, as told in Nørgaard et. al (2004), Oracle Insights: Tales of the Oak Table.

This stuff was developed because we were running a benchmark that we could not get to perform. We had spent several weeks trying to figure out what was happening with no success. The symptoms were clear—the system was mostly idle—we just couldn’t figure out why.

We looked at the statistics and ratios and kept coming up with theories, the trouble was that none of them were right. So we wasted weeks tuning and fixing things that were not the problem. Finally we ran out of ideas and were forced to go back and instrument the code to figure out what the problem was.

Once the waits were instrumented the problem was diagnosed in minutes. We were having “free buffer” waits because the DBWR was not writing blocks fast enough. It’s amazing how hard that was to figure out with statistics, and how easy it was to figure out once the waits were instrumented.

…In retrospect a lot of the names could be greatly improved. The wait interface was added after the freeze date as a “stealth” project so it did not get as well thought through as it should have. Like I said, we were just trying to solve a problem in the course of a benchmark. The trouble is that so many people use this stuff now that if you change the names it will break all sorts of thing tools, so we have to leave them alone.

Before Juan’s team added this code, the Oracle kernel would show you only how much time its user calls (like parse, exec, and fetch) were taking. The new instrumentation, which included a set of new fixed views like v$session_wait and new WAIT lines in our trace files, showed how much time Oracle’s system calls (like reads, writes, and semops) were taking.

The Working-Waiting Model

The wait interface begat a whole new mental model about Oracle performance, based on the principle of working versus waiting:

Response Time = Service Time + Wait Time

In this formula, Oracle defines service time as the duration of the CPU used by your Oracle session (the duration Oracle spent working), and wait time as the sum of the durations of your Oracle wait events (the duration that Oracle spent waiting). Of course, response time in this formula means the duration spent inside the Oracle Database kernel.

Why I Don’t Say Wait, Part 1

There are two reasons I don’t use the word wait. The first is simply that it’s ambiguous.

The Oracle formula is okay for talking about database time, but the scope of my attention is almost never just Oracle’s response time—I’m interested in the business’s response time. And when you think about the whole stack (which, of course you do; see holistic), there are events we could call wait events all the way up and down:

  • The customer waits for an answer from a user.
  • The user waits for a screen from the browser.
  • The browser waits for an HTML page from the application server.
  • The application server waits for a database call from the Oracle kernel.
  • The Oracle kernel waits for a system call from the operating system.
  • The operating system’s I/O request waits to clear the device’s queue before receiving service.

If I say waits, the users in the room will think I’m talking about application response time, the Oracle people will think I’m talking about Oracle system calls, and the hardware people will think I’m talking about device queueing delays. Even when I’m not.

Why I Don’t Say Wait, Part 2

There is a deeper problem with wait than just ambiguity, though. The word wait invites a mental model that actually obscures your thinking about performance.

Here’s the problem: waiting sounds like something you’d want to avoid, and working sounds like something you’d want more of. Your program is waiting?! Unacceptable. You want it to be working. The connotations of the words working and waiting are unavoidable. It sounds like, if a program is waiting a lot, then you need to fix it; but if it’s working a lot, then it is probably okay. Right?

Actually, no.

The connotations “work is a virtue” and “waits are a vice” are false connotations in Oracle. One is not inherently better or worse than the other. Working and waiting are not accurate value judgments about Oracle software. On the contrary, they’re not even meaningful; they’re just arbitrary labels. We could just as well have been taught to say that an Oracle program is “working on disk I/O” and “waiting to finish its CPU instructions.”

The terms working and waiting really just refer to different subroutine call types:

“Oracle is workingmeans“your Oracle kernel process is executing a user call”
“Oracle is waitingmeans“your Oracle kernel process is executing a system call”

The working-waiting model implies a distinction that does not exist, because these two call types have equal footing. One is no worse than the other, except by virtue of how much time it consumes. It doesn’t matter whether a program is working or waiting; it only matters how long it takes.

Working-Waiting Is a Flawed Analogy

The working-waiting paradigm is a flawed analogy. I’ll illustrate. Imagine two programs that consume 100 seconds apiece when you run them:

Program AProgram B
DurationCall typeDurationCall type
98system calls (waiting)98user calls (working)
2user calls (working)2system calls (waiting)
100Total100Total

To improve program A, you should seek to eliminate unnecessary system calls, because that’s where most of A’s time has gone. To improve B, you should seek to eliminate unnecessary user calls, because that’s where most of B’s time has gone. That’s it. Your diagnostic priority shouldn’t be based on your calls’ names; it should be based solely on your calls’ contributions to total duration. Specifically, conclusions like, “Program B is okay because it doesn’t spend much time waiting,” are false.

A Better Model

I find that discarding the working-waiting model helps people optimize better. Here’s how you can do it. First, understand the substitute phrasing: working means executing a user call; and waiting means executing a system call. Second, understand that the excellent ideas people use to optimize other software are excellent ideas for optimizing Oracle, too:

  1. Any program’s duration is a function of all of its subroutine call durations (both user calls and system calls), and
  2. A program is running as fast as possible only when (1) its unnecessary calls have been eliminated, and (2) its necessary calls are running at hardware speed.

Oracle’s wait interface is vital because it helps us measure an Oracle program’s complete execution duration—not just Oracle’s user calls, but its system calls as well. But I avoid saying wait to help people steer clear of the incorrect bias introduced by the working-waiting analogy.

Categories
Blog Posts Videos

Why You Need a Profiler for Oracle

If you use Oracle and you care about performance, then you need a profiler for Oracle. This video, created and narrated by Cary Millsap, explains why.

Categories
Blog Posts

Quantifying SQL Shareability

Do you have SQL in your system that doesn’t use placeholders, like this?

Categories
Blog Posts

Oceans, Islands, and Rivers

Connection pools help solve a big performance problem, but they also make using trace data more difficult. Method R Tools, part of the Method R Workbench software package, makes it easier to measure individual user response time experiences on connection pooling systems. Now you can look at performance problems the way you’ve always wanted to see them.

Categories
Blog Posts

The Story behind “Thinking Clearly about Performance”

When I wrote Optimizing Oracle Performance with Jeff Holt back in 2003, my goal was to define a reliable, teachable method for fixing software performance problems. After a few months of contentment having finishing the project, I began to notice a trend in how people were responding to it. Many of the questions coming in had started to repeat themselves: “Sure, fixing problems is important, but how can I prevent them?” “The book was fun, but of course I didn’t read the chapter on queueing theory; is that stuff really important, anyway?” “Where does capacity planning fit in?” …And, sadly, I continued to see people make some of the same mistakes that we had tried to warn people about in the book.

Categories
Blog Posts Testimonials

Method R beats the “free” stuff: an OraSRP comparison

The Profiler (combined with Method R) has enabled me to resolve performance problems that could not have been resolved in any way other than by pure luck. The return of the modest investment cannot be beat. I saved a failing project as soon as I got it. I highly recommend it over the so-called “free” stuff. The authors of the free tools are trying to contribute to the Oracle community but I found it detrimental to my efforts. And I wanted to kick myself for trying to write my own for so long when I should have conducted a simple buy-vs-build assessment.