Blog Posts

Troubleshooting a Complex Network of Overnight Batch Jobs

On Oracle-L this week is a discussion about buying Dynatrace vs. Oracle Enterprise Manager. Both tools work harmoniously together, but there’s a third choice you should consider, too: Method R Workbench. Our Workbench helps you mine unbelievable detail from Oracle trace files, even if you’re getting thousands of trace files at a time. Workbench works harmoniously with both Dynatrace and OEM and fills a gap that no other tool can fill.

Every Method R Workbench customer is an OEM customer. That’s just a given. OEM does a million things and is enormously popular. We see Dynatrace less often, but it’s a wonderful tool, and we love meeting people who own it. It means that they already care about precision performance measurement, like we do.

A few years ago, my team and I used Method R Workbench at a client who used Dynatrace extensively. I told this story in chapter 18 of my book called Faster. Here’s a synopsis, in a little more detail:

  1. Client had a complex network of overnight batch jobs. The aggregate duration of this batch work kept creeping longer and longer until it affected their customers. The nightly batch was running beyond 8:30 a.m. When the batch work wasn’t finished by 8:00, our client had to pay cash penalties to their customers for missing their SLAs.
  2. We recommended tracing the entire batch. Everything. Literally dbms_monitor.database_trace_enable for the whole night.
  3. The client was fearful that so much tracing would make their problem even worse. But the situation was so painful that they agreed to test it.
  4. They traced the whole production database for an hour. Dynatrace measurements proved that enabling trace (with waits=>true, binds=>false, plan_stat='first_execution') made no discernible difference in program durations. This gave them confidence to trace the whole night’s job.
  5. So they did. The database-wide overnight trace resulted in more than 5,000 trace files, which they zipped up and sent to us. Can you imagine making sense of 5,000 traces with tkprof? Method R Workbench processed all 5,000 traces in about 3½ minutes. This gave us program start and end times, resource consumptions, network round-trip counts, and even a detailed HTML report for every single program that had run through the night.
  6. We used the Method R Workbench “Details by file, for plotting in Excel” feature to create data that’s easily ingestible by Excel. 
  7. In Excel, I created a Gantt chart that showed the time relationships among all 5,000 files. It took me about an hour to do that, including the time it took me to learn how to create a Gantt chart in Excel.
  8. Once we saw the Gantt chart, the problem stuck out like a sore thumb. Everybody, both technical and non-technical, could see it. The conversation went like this:
    • “Hold on. Why is there a huge gap in time between the end of this program’s execution and the beginning of that program’s execution?”
    • “I don’t know. What are the two programs?”
    • “Well, this program’s final SQL statement was SELECT ABC…, and that program’s first SQL statement was SELECT DEF… So it’s between those.”
    • “Oh. I know exactly where in the application that is. Huh. It was #42 on my list of things to look at someday. I had no idea it was actually our bottleneck. I’ll get on that right now.”
    • A couple hours later, after manipulating something in the application (had nothing to do with Oracle or SQL), the network of overnight jobs was running quickly enough that it could finish by 5:00 a.m. Problem solved.

It was a feel-good story for everyone involved. We solved a huge problem, quickly, using a standard technology (Oracle tracing), that nobody else had considered using because they didn’t know that a tool existed that could cope with all its detail (Method R Workbench).

Nobody was ever going to solve that problem until someone got to #42 on his list.

By Cary Millsap

Owner and president of Method R Corporation.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.