Categories
Blog Posts

How Did You Make mrskew 20× Faster?

A couple of years ago (June 30, 2020), we released Method R Workbench version 9.0.0.66. It had 113 new features and bug fixes. One of those features was case 7800: “mrskew is now 10–20× faster than before.”

We’re prone to sneaking in performance improvements like that. It’s because we, too, use the software we sell, and we don’t like waiting around for answers any more than you do.

mrskew, in case you don’t know, creates flexible, variable-dimension profiles. It’s a skew analyzer for Oracle trace files.

…A what?

It’s a tool that can query across trace files (thousands of them, if that’s how many you have) and answer questions like these:

  1. What kinds of calls dominated the response time of your user’s experience? Imagine for the sake of this example that the answer is “read calls.” How much time did read calls take? How many read calls did your program make?
  2. Were all your read calls the same duration? Or did some take longer than the others? How much time could you save if you eliminated the slowest 10,000 read calls?
  3. How many blocks did the longest read calls read?
  4. What are the file and block IDs of the longest read calls?
  5. Are the slowest read calls associated with a particular file?
  6. Are they associated with a particular SQL statement?
  7. On what line of what trace file can you find information about your longest read call?

Our mrskew tool can answer questions like these and more.

Here are the commands to do it. Don’t let these scare you. You can summon any one of them (or any of 30+ others) with just a click, in our Method R Workbench:

  1. mrskew *.trc
  2. mrskew --name='read' --rc=p10.rc *.trc
  3. mrskew --name='read' --group='"$p3"' --gl='"BLKS/READ"' *.trc
  4. mrskew --name='read' --group='"$p1:$p2"' --gl='"FILE#:BLOCK#"' *.trc
  5. mrskew --name='read' --group='"$p1"' --gl='"FILE"' *.trc
  6. mrskew --name='read' --group='"$sqlid"' --gl='"SQLID"' *.trc
  7. mrskew --name='read' --group='"$base:$line"' --gl='"FILE:LINE"' *.trc

Now, imagine trying to ask 2GB of trace data all these questions. Without mrskew, it would probably take you a day or more to fish the answers out of your trace files (don’t bother looking in AWR or ASH; they’re not there).

A Workbench 8 mrskew execution on 2GB of input takes about 4 minutes. That’s about half an hour to run all seven commands. That’s pretty good compared to a day or two of fishing.

A Workbench 9 mrskew execution on the same input takes only about 12 seconds. That’s less than 2 minutes to answer all the questions I’ve posed here. That’s remarkable.

2019 MacBook Pro (Intel)mrskew *.trc (2GB)
Method R Workbench 8240 seconds (4 minutes)
Method R Workbench 912 seconds
mrskew execution times before and after the Method R Workbench 9 upgrade.

So, an interesting question, then, might be, “How did you do that?”

Well, that’s easy: a long time ago, I hired Jeff Holt.

How did Jeff do it?

Simple. He rewrote mrskew in C.

In Workbench 8, mrskew was a Perl program that I had written in 2009. Perl is admittedly slow, but I was interested in having a program that users could interact with using Perl’s full expression syntax.

mrskew worked really well, and we used it a lot. But it always felt weird that it was so much slower than our other utilities that do even more work (like mrprof). So Jeff, in his spare time, investigated whether he could rewrite mrskew in C. It was no small feat, given that I insisted upon keeping the full Perl expression interface.

One day he surprised me. The new C version of mrskew was passing all our automated tests and we could probably ship it now. I asked him how much faster it was. He said about 20×.

I’m used to this kind of thing with Jeff by now. But still.

The result of Jeff’s investigation is that now we have a skew analysis tool that works just as fast as our other outrageously fast tools, even when you’re battling data by the gigabyte. Today, mrskew is a standard feature of pretty much every performance improvement project we hook into, and we’re grateful that it doesn’t make us wait a long time for the answers we need.

To get a better understanding of what skew is and why it’s important, see chapter 38 of How to Make Things Faster. If you’re interested in more detail about mrskew, visit our mrskew manual page.

Categories
Blog Posts

How Slow Programs are Like Christmas

Method R Corporation makes systems faster. And we make people who make systems faster, faster. We train people to become performance optimization heroes.

Here is my story about why you should be interested in our training.


Slow programs remind me of Christmas when I was a kid. In early December, my parents would put a present for me under our Christmas tree. It would be wrapped up so I couldn’t see what was inside it. The unwrapping would not happen until Christmas morning, December 25. (That was the best-case scenario. If my Dad couldn’t be home because of work, then Christmas morning would come a day or two late.)

So, every day, for nearly a month, I would see that present under the tree and wonder what was in it. 

I’d take any clue I could find. What shape is it? What does it weigh? What does it sound like when I shake it? (Sometimes, my Mom and Dad would prohibit me from shaking it.) No matter how desperate the curiosity, all I could do was guess.

When Christmas came, I’d finally get to tear the paper off, and I would now see, plain as day, what had been in that box the whole time. All the clues and possibilities would collapse into a single reality. Finally, there was no more mystery, no more guessing.

Slow programs are like that. The clues aren’t enough. You guess a lot. But with slow programs, there’s no specially designated morning when your programs reveal their mysteries to you. They just keep irritating you, with no end in sight. You need somebody who knows how to tear the wrapping paper off those programs so you can see what they’re doing wrong. 

That’s the somebody I like being. The role scares a lot of people, but it doesn’t scare me. That’s because I trust my preparation. I know that I have three particular assets that tilt the game in my favor. 

Those assets are knowledgetools, and community. With the knowledge and tools I have, I don’t get stumped very often. But when I do, I have a network of friends who’ll help me out. My friends and I can solve just about anything. These three assets are huge for both my effectiveness and my confidence.

These three assets aren’t just for me. They’re for you, too. 

That’s the aim of my online course called “Mastering Oracle Trace Data.” In this course, I’ve bundled everything you need to claim those three assets for your own:

  1. You’ll learn the details about Oracle traces and the stories they’re trying to tell you. This is your knowledge asset.
  2. You’ll have, for the duration of your choosing, full-feature access to Method R Workbench, the most comprehensive software in the world for mining, managing, and manipulating Oracle traces. This is your tools asset.
  3. You’ll have access to our Slack channel, a global community of Oracle trace enthusiasts that can help you whenever you get stumped. You won’t be alone. You’ll have people who are there for you. This is your community asset. 

You can also fortify all three of your new assets by purchasing office hours for the course. Office hours are Zoom sessions, where you can spend time with my friends and me, discussing your questions about the material.

If you’re interested in becoming a more effective and confident optimizer, you can get started now. Just visit our course page for details.


That’s my story. I hope you’ll contact us if you’re interested.

And if you like stories like this, you’ll find a lot more in my How To Make Things Faster book, available wherever books are sold.

Categories
Blog Posts

A Design Decision

This week, my team at Method R devoted some time to an enhancement request that required an interesting design decision. This post is about the analysis behind that decision.

The enhancement request was for our flagship product called Method R Workbench. It’s an application that people use to mine, manage, and manipulate Oracle trace files.

One of its features, called mrskew, is a tool that allows a Workbench user to filter, group, and sort the raw dbcall and syscall data that Oracle Database processes write to their trace files. You can use mrskew from within the Workbench application, or from a *nix command line.

Here’s an example of a mrskew command. It’s what you would use to find out how long your program spent reading Oracle blocks from secondary storage. It will show you which blocks it read, how long they took, and how many times each block was read:

mrskew --name='db.*read' \
   --group='sprintf("%8d %8d", $p1, $p2)' \
   x_ora_1492.trc

Here’s the output:

sprintf("%8d %8d", $p1, $p2) DURATION      % CALLS     MEAN ... 
---------------------------- -------- ------ ----- --------
                  2        2 0.072918   1.0%    26 0.002805 ...
                 33   698186 0.051940   0.7%     1 0.051940 ...
                 50   339841 0.049261   0.7%     1 0.049261 ...
...

The important thing in this report is the meaning of the $p1 and $p2 variables. The combination of these two variables happens to represent the data block address (the file number and block number) of an Oracle block that was read by some kind of an Oracle read call. It would be nice for the report to tell you that instead of just telling you that the first two columns of numbers are the output of an sprintf function call.

We have a command-line option for that. The ‑‑group-label option lets you assign your own title for the group column. So, with some careful character counting, you could use…

‑‑group-label='    FILE    BLOCK'

…to get exactly the heading you want:

    FILE    BLOCK DURATION      % CALLS     MEAN ... 
----------------- -------- ------ ----- -------- 
       2        2 0.072918   1.0%    26  0.002805 ...
      33   698186 0.051940   0.7%     1  0.051940 ...
      50   339841 0.049261   0.7%     1  0.049261 ... 
...

That makes sense. Now it’s easy to see that Oracle has read one block (file #2, block #2) 26 times, consuming a total of 0.072918 seconds reading it.

The group label fits the output, only because of the careful character counting. The enhancement request was to allow the ‑‑group-label option to take an expression, not just a string. Like this:

--group-label='sprintf("%8s %8s", "FILE", "BLOCK")'

That way, he could print out the header he wanted, perfectly aligned, by just syncing his ‑‑group‑label expression to his ‑‑group expression, without having to count space characters that are literally invisible.

It’s a smart idea. The group label option should have been designed that way from the beginning. We eagerly approved the enhancement request and began thinking about the design.

When we thought it through, we ended up with two different ideas about how we could implement this new idea:

  1. Redefine ‑‑group‑label to take an expression instead of a string. mrskew will calculate the value of the expression before printing the column label.
  2. Create a new option, say, ‑‑new‑group‑label, that takes an expression as its argument. And leave ‑‑group‑label as it is.

The first idea is how the enhancement request was worded. The second idea entered our minds because the first idea creates a compatibility problem: if we change the spec of the existing ‑‑group‑label option, it will break some existing mrskew scripts. For example, these will work in Workbench 9.2.5:

--group-label=FILE
--group-label="FILE BLOCK"

But if we redefine ‑‑group‑label to take an expression instead of a string, then these won’t work anymore. People will need to quote their string expressions like this:

--group-label='"FILE"' 
--group-label='"FILE BLOCK"'

In the end, we decided to redefine the existing option and live with the compatibility breach.

The way we make decisions like this is that we create strenuous arguments for each idea. Here are some of the arguments we considered en route to our decision.

First, the customer experience (cognitive expenditure).

Everyone who participated in the debate had the customer experience foremost in mind. But how can we objectively measure “customer experience”? How do you structure a scientific debate about the superiority of one experience over another?

One way to do it is to measure cognitive expenditure—the amount of mental effort that a user has to invest to get the desired outcome from our software. We want to minimize cognitive expenditure, to maximize a customer’s return on investment of effort.

We began by realizing that responding to this enhancement request with one of our two ideas would necessarily force the user into one of two new regimes:

  1. The syntax of ‑‑group-label has changed.
  2. There’s a new ‑‑new-group-label option.

In regime 1, our users would have to learn the new syntax. That’s a cognitive expenditure. But it’s a one-time expenditure, which is good. The new syntax would be consistent with the existing ‑‑group syntax, which is actually a cognitive savings for our users over what we have now. However, if a customer had saved any scripts that used the old syntax, then the customer would have to convert those scripts. That’s a cognitive expenditure in a loop (one for each script), which is bad.

In regime 2, our users would have to learn about ‑‑new-group‑label, which is a cognitive expenditure. They’d still have to remember (or relearn) about ‑‑group‑label, too, which is a similar cognitive expenditure as the one in regime 1. They wouldn’t have to modify any old scripts, but they would have to make the choice of whether to use ‑‑group‑label or ‑‑new-group‑label, every time they wrote a script in the future. That’s another cognitive expenditure in a loop (one for each script), which is bad.

Second, the developer experience (technical debt).

We also need to consider the developer’s experience. We don’t want to create code that increases technical debt that makes the product unnecessarily difficult to support.

If we redefine ‑‑group-label, there’s no long-term effect to worry about. But if we add ‑‑new‑group‑label to the story, I would expect for people to wonder, why are there two such similar group label options, when one (the one that takes an expression) is clearly superior? And why does the inferior one have the better name?

At some point in the future, I envision wanting to clean up the cruft and have just the one group label feature. Naturally, the right name for it would be ‑‑group‑label. But of course, changing the spec that way would introduce a compatibility problem. To make things worse, this would occur in the future when—one would hope, if our business is growing—such a decision would impact even more customers than it would today. So then, why create the cruft in the first place? It’ll be a worse problem later than it is now.

The question that really seals the deal, is who will the change really affect? It’s really a probability question about customer experiences.

Most users who use the Workbench application will never experience our group label option directly. It’s there for everybody to use, but our Workbench has so many predefined reports built into it, most users never need to touch the group label option for themselves. When they do need to modify it, they’re usually tweaking a report that we’ve predefined for them, which is a low–cognitive-expenditure experience.

In the end, the Method R bears almost the entire cost of the ‑‑group‑label redefinition. It required us to revise:

Most users will experience the benefit of the ‑‑group‑label change, without ever knowing that, once upon a time, it changed. And that’s the way we want it. We want the product to be as smart as possible so that our customers get the most out of their investment, both cognitive and financial.

Categories
Blog Posts Videos

Insum Insider: How to Optimize a System with Cary Millsap

Today, Michelle Skamene, Monty Latiolais, and Richard Soule of Insum chatted with me for an hour on their live stream. I hope you’ll enjoy it as much as I did.

Categories
Papers

Batch Queue Management and the Magic of ‘2’

Trying to run too many simultaneous Concurrent Manager jobs hurts system performance. The temptation to make this error is powerful: the Concurrent Manager lets you specify any number of queues that you want, and certainly your intuition might tell you that more queues might yield better batch throughput. But executing more than about two simultaneous batch jobs on a CPU degrades response times for everyone and actually reduces the total amount of work the system can do for your business. This paper explains the science behind why the number 2.0 is the magical key to batch queue management that maximizes performance for everyone.

This article is marked © 2000 by Hotsos Enterprises, Ltd., but the current copyright owner is Cary Millsap.

Categories
Oracle Monographs

The Case of the 2× Slower Report

Why does a report run twice as fast on an identically configured sandbox? It’s not the execution plan. Method R Workbench makes it easy to find out.

This is article number six in the Method R Oracle® Performance Monograph series. I hope you’ll enjoy it.