
Coherency Delay

Today, a reader of How to Make Things Faster asked a question about coherency delay (chapter 47): How does coherency delay manifest itself in trace files?

My own awareness of the concept of coherency delay came from studying Neil Gunther’s work, where he defines it mathematically as the strength β that’s required to make your scalability curve fit the Universal Scalability Law (USL) model.
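
For readers who haven’t seen it, the USL is usually written (my paraphrase of Gunther, not a quote) as

    C(N) = N / (1 + α(N − 1) + βN(N − 1))

where N is the concurrency (or node count), C(N) is the capacity relative to N = 1, α is the contention coefficient, and β is the coherency coefficient. The β term grows roughly as N², which is Gunther’s mathematical stand-in for pairwise “everybody has to talk to everybody” coordination cost.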

Such a definition can be correct without being especially helpful. It doesn’t answer the question of what kinds of Oracle behaviors cause the value of β to change.

I think that one answer may be “the gc stuff.” But I’m not sure. One way to find out would be to do the following:

  1. Find a workload that motivates a lot of gc events on a multi-node RAC system.
  2. Fit that configuration’s behavior to Gunther’s USL model (see the fitting sketch after this list).
  3. Run the same workload on a single node.
  4. Fit that configuration’s behavior to USL.
  5. If the value of β decreases significantly from the first configuration to the second, then there’s been a reduction in coherency delay, and there is a strong possibility that the gc events were the cause of it.
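
Here is a minimal sketch, in Python, of how steps 2 and 4 might look, assuming you’ve measured throughput at several concurrency levels on a given configuration. The usl function, the parameter names, and the sample numbers are my own placeholders, not anything prescribed by Gunther or by the book:

    # A sketch (not from the book) of fitting measured throughput to Gunther's
    # USL so you can read off the coherency coefficient beta (steps 2 and 4).
    import numpy as np
    from scipy.optimize import curve_fit

    def usl(n, lam, alpha, beta):
        # Throughput form of the USL:
        # X(N) = lambda*N / (1 + alpha*(N-1) + beta*N*(N-1))
        return lam * n / (1.0 + alpha * (n - 1.0) + beta * n * (n - 1.0))

    # Placeholder measurements (made up for illustration): concurrency levels
    # and the throughput observed at each one. Substitute your own numbers.
    n = np.array([1, 2, 4, 8, 16, 32], dtype=float)
    x = np.array([100, 190, 350, 590, 850, 990], dtype=float)

    (lam, alpha, beta), _ = curve_fit(
        usl, n, x,
        p0=[x[0], 0.05, 0.001],             # rough starting guesses
        bounds=([0, 0, 0], [np.inf, 1, 1])  # keep the coefficients non-negative
    )
    print(f"lambda = {lam:.1f}, alpha = {alpha:.4f}, beta = {beta:.6f}")

Run the fit once with the multi-node RAC measurements and once with the single-node measurements; step 5 is then just a comparison of the two β values.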

That may not feel like a particularly practical proof (it would require a lot of assets to conduct), but it’s the best proposal I can think of here at the moment.

The biggest problem with executing this “proof” (whether it really is a proof or not is subject to debate) is that there’s probably not much payoff at the end of it. Because what does it really matter if you know how to categorize a particular response time contributor? If a type of event (like “gc calls”) dominates your response time, then who cares what its name or its category is? Your job is to reduce its call count and call durations.

(Lots of people have already discovered this peril of categorization when they realized that—oops!—an event that contributes to response time is important, even if the experts have all agreed to categorize it using a disparaging name, such as “idle event.”)

Whether the gc events are aptly categorizable as “coherency delays” or not, my teammates and I have certainly seen RAC configurations where the important user experience durations are dominated by gc events. In fact, back in roughly the year 2000, when RAC was still called Oracle Parallel Server (OPS), our very first Profiler product customer was having ~20% of their response times consumed by gc stuff on a two-node cluster.

We solved their problem by helping them optimize their application (indexes, SQL rewrites, etc.), so that their workload ended up fitting comfortably on a single node. When they decommissioned the second node of their 2-node cluster, their gc event counts and response time contributions dropped to 0. And another department got a new computer the company didn’t have to pay for.

The way you’d fix that problem if you could not run single-instance is to make all your buffer cache accesses as local as possible. It’s usually not easy, but that’s the goal. And of course, RAC does a much better job of minimizing gc event durations than OPS did 23 years ago, so globally it’s not as big of a problem as it used to be.

Bottom line, it might be an interesting beer-time conversation to wonder whether gc events are coherency delays or not, but the categorization exercise is only a curiosity. It’s not something you have to do in order to fix real problems.