Connection pools help solve a big performance problem, but they also make using trace data more difficult. Method R Tools, part of the Method R Workbench software package, makes it easier to measure individual user response time experiences on connection pooling systems. Now you can look at performance problems the way you’ve always wanted to see them.
Life Before Connection Pooling
Do you remember life before connection pooling? For a user to get some work done in his Oracle Database, he would have to connect, do the work, and then disconnect. On systems with just a few users, it worked just fine. However, on systems with tens of thousands of users, the connection management overhead would dominate the workload. All the connect/disconnect requests and all the database prepare (parse) calls would drive CPU loads and coherency delays (like waiting for latch acquisitions) through the roof.
Connection pooling helps you solve those problems. Now, instead of creating a new connection to Oracle and then parsing your own PL/SQL statement (which probably looks just like the statement that all the other users are using, too), you merely grab an existing connection handle for the kind of PL/SQL statement you want to execute, pass in the specific values that you want bound into the prepared statement’s placeholder variables, and you get your work done without creating a connection or making even a single parse call. Systems built like this can scale to millions of users.
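If it helps to see what that looks like in application code, here is a minimal Java/JDBC sketch. Everything in it is an illustrative assumption (the pool-backed javax.sql.DataSource named ds, the orders table, the column names), but it shows the essential shape: getConnection() borrows an existing pooled handle, the placeholder value gets bound into a prepared statement, and close() merely returns the handle to the pool.

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OrderLookup {
    private final DataSource ds;  // pool-backed DataSource, configured elsewhere

    public OrderLookup(DataSource ds) {
        this.ds = ds;
    }

    // Borrow a pooled connection, execute a (typically cached) prepared
    // statement with a bind variable, and return the connection to the
    // pool on close().
    public String customerName(long orderId) throws SQLException {
        try (Connection c = ds.getConnection();
             PreparedStatement s = c.prepareStatement(
                 "select customer_name from orders where order_id = ?")) {
            s.setLong(1, orderId);  // bind the placeholder value
            try (ResultSet r = s.executeQuery()) {
                return r.next() ? r.getString(1) : null;
            }
        }
    }
}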
But Connection Pools Complicate Diagnosis
But, as “everyone knows,” connection pooling systems are much more difficult to diagnose for performance problems. The reason: because thousands of different people can use a given connection over the course of a day, and because a person may use dozens of different connections throughout that same day, it’s difficult on connection pooling systems to track down how the system has responded to a given individual user experience. For example, try pushing a trace file created on a connection pooling system through the Method R Profiler. Even though the Method R Profiler is unquestionably the World’s Greatest Software Tool for explaining how an Oracle Database has consumed the response time for a single individual user experience, the Profiler can’t make sense of a connection pooling trace file, because that file does not describe a single individual user experience. Such a file, in fact, describes many user experiences, and to most people, it’s too difficult to distinguish those experiences from each other.
Don’t Give Up
At this point, a lot of people just give in to the Dark Side and use statistics that are aggregated across countless user experiences. They give up on even trying to explain exactly how the time for a specific, given experience was consumed. Well, you don’t need to give up. The information you need is in your trace files. You just need to understand how to get it out. The mrskew utility in the Method R Tools package can do it for you.
What’s in a Connection Pool Trace File
To understand how to get response times out of your trace files, you need to understand what’s in a trace file generated on a connection pooling system. It’s usually a sequence that looks something like this:
WAIT ... nam='SQL*Net message from client' ela= 1202689 ...
A sequence of trace lines explaining time consumption for a single user experience; call it Experience A
WAIT ... nam='SQL*Net message from client' ela= 4260917 ...
A sequence of trace lines explaining time consumption for Experience B
WAIT ... nam='SQL*Net message from client' ela= 5213365 ...
A sequence of trace lines explaining time consumption for Experience C
WAIT ... nam='SQL*Net message from client' ela= 2044420 ...
Oceans and Islands
I think of each trace file emitted on a connection pooling system as an ocean of idle time speckled with islands of at-work time spent servicing database calls. The experience “islands” are delimited by a pair of long-latency SQL*Net message from client oceans. Each SQL*Net message from client line that the Oracle Database writes into a trace file is just a record of the kernel’s having blocked on an OS read() call for ela microseconds, waiting for its client to ask it to do something. (The ela= 1202689 on the first line of the excerpt above, for example, represents roughly 1.2 seconds of waiting.)
Enter Method R Tools
The key, then, is to split the trace file into pieces that can each be measured—if you like, even profiled. That’s where mrskew comes in. The new txnz RC file shipped with Method R Tools version 3.1 makes it simple. Here is an example of using mrskew with txnz:
$ mrskew v11203_ora_26827.trc --rc=txnz
   EXP-ID  DURATION      %  CALLS     MEAN      MIN      MAX
--------- --------- ------ ------ -------- -------- --------
       24 88.238004 100.0% 17,359 0.005083 0.000000 0.283979
--------- --------- ------ ------ -------- -------- --------
TOTAL (1) 88.238004 100.0% 17,359 0.005083 0.000000 0.283979
Now, this is not a very auspicious first attempt, because it makes my file look like it contains only one experience, when I am pretty sure that it contains lots of much shorter-duration experiences.
Oceans and Islands—oh, and Rivers
Here’s the problem. Of course, not every SQL*Net message from client call is an “ocean.” Some SQL*Net message from client calls are “rivers”—they’re short-duration calls representing network round-trips between the client and the database server that are part of the user’s experience. Each such river is thus part of some island. The txnz RC file defines an “ocean” as any SQL*Net message from client call that lasts longer than 1.0 seconds. For my trace file, which is from a really fast and busy system, I’m pretty sure that I need to change the definition of an “ocean.”
I can use mrskew to see what the SQL*Net message from client latencies look like in my trace file:
$ mrskew v11203_ora_26827.trc --name="message from client" --group='$line'
      '$line'  DURATION      %  CALLS     MEAN      MIN      MAX
------------- --------- ------ ----- -------- -------- --------
        21312  0.283979   0.3%     1 0.283979 0.283979 0.283979
        37077  0.174480   0.2%     1 0.174480 0.174480 0.174480
        21396  0.168399   0.2%     1 0.168399 0.168399 0.168399
         8810  0.167489   0.2%     1 0.167489 0.167489 0.167489
        36148  0.166965   0.2%     1 0.166965 0.166965 0.166965
        21410  0.165523   0.2%     1 0.165523 0.165523 0.165523
        29360  0.164519   0.2%     1 0.164519 0.164519 0.164519
        21036  0.163736   0.2%     1 0.163736 0.163736 0.163736
        29388  0.163088   0.2%     1 0.163088 0.163088 0.163088
        36316  0.161805   0.2%     1 0.161805 0.161805 0.161805
 5,031 others 86.231484  98.0% 5,031 0.017140 0.001152 0.161570
------------- --------- ------ ----- -------- -------- --------
TOTAL (5,041) 88.011467 100.0% 5,041 0.017459 0.001152 0.283979
Aha! My biggest SQL*Net message from client latency in the whole file is just 0.283979 seconds. No wonder txnz thought that all my SQL*Net message from client calls were rivers. With a little more mrskew work (specifically, adding the arguments --top=4000 --sort=2na to see the 4,000 smallest latencies), I was able to determine that a good criterion for distinguishing rivers of busy round-trip activity from oceans of idle time was a threshold latency of 0.05 seconds.
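In other words, something along these lines (just the earlier command with the two extra arguments appended; a sketch rather than the exact invocation):

$ mrskew v11203_ora_26827.trc --name="message from client" --group='$line' --top=4000 --sort=2na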
Seeing Real User Response Times
Now I can adjust txnz’s behavior. First, I copied my txnz RC file to my current working directory (the distributed file is in the directory `which mrskew`/../../rc), and then I edited its line number 6 as follows:
OLD: if ($name eq "SQL*Net message from client" and $af > 1) {
NEW: if ($name eq "SQL*Net message from client" and $af > .05) {
With this new RC file (which I called txnz.05), I obtained the following new report on my trace file:
$ mrskew v11203_ora_26827.trc --rc=txnz.05
     EXP-ID  DURATION      %  CALLS     MEAN      MIN      MAX
----------- --------- ------ ------ -------- -------- --------
          0 24.236626  27.5%    327 0.074118 0.050007 0.283979
      19547  2.212251   2.5%    807 0.002741 0.000000 0.049582
      27247  2.112561   2.4%    791 0.002671 0.000000 0.048360
      24221  1.927336   2.2%    267 0.007218 0.000000 0.048210
      16129  1.450686   1.6%    683 0.002124 0.000000 0.049147
      22289  0.997744   1.1%    643 0.001552 0.000000 0.045547
      29620  0.982700   1.1%    562 0.001749 0.000000 0.049281
       2843  0.967385   1.1%    655 0.001477 0.000000 0.048986
      33239  0.920264   1.0%    139 0.006621 0.000000 0.047733
      23031  0.917492   1.0%    647 0.001418 0.000000 0.049615
      17091  0.899165   1.0%    579 0.001553 0.000000 0.045020
      14701  0.864747   1.0%    123 0.007030 0.000000 0.049502
       6509  0.805075   0.9%    437 0.001842 0.000000 0.043662
        653  0.780152   0.9%    403 0.001936 0.000000 0.048553
      36583  0.773713   0.9%    484 0.001599 0.000000 0.030175
      26287  0.767064   0.9%    619 0.001239 0.000000 0.038591
       2333  0.750920   0.9%    103 0.007290 0.000000 0.045808
       9685  0.720571   0.8%    479 0.001504 0.000000 0.047614
      25107  0.718329   0.8%    115 0.006246 0.000000 0.043572
      28487  0.715467   0.8%    107 0.006687 0.000000 0.048749
 309 others 43.717756  49.5%  8,389 0.005211 0.000000 0.049996
----------- --------- ------ ------ -------- -------- --------
TOTAL (329) 88.238004 100.0% 17,359 0.005083 0.000000 0.283979
This is a list of transactions (the txn part of the RC file’s name) delimited by think times longer than 0.05 seconds (think time is abbreviated Z by mathematicians, which is, of course, the z part of the RC file’s name).
It is important to understand what you can see here:
This is a list of how long people waited for their clicks on this connection pooling system!
And it’s available to you in your plain old Oracle trace data, even if the developers of the application didn’t put any extra instrumentation into their code at all. (For example, if they had just set the Oracle client id, you could do even more, like see exactly who was making each click.)
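Setting the client id takes only a line or two of application code. Here is a hedged Java sketch (the SessionTagger class and endUserName argument are made-up names; the call to the standard DBMS_SESSION.SET_IDENTIFIER procedure is the part that matters), which an application would run right after borrowing a connection from the pool:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public final class SessionTagger {
    // Tag the borrowed database session with the end user's identity so the
    // Oracle client id shows up in the trace data for this user's experience.
    public static void tag(Connection c, String endUserName) throws SQLException {
        try (CallableStatement cs =
                 c.prepareCall("begin dbms_session.set_identifier(?); end;")) {
            cs.setString(1, endUserName);
            cs.execute();
        }
    }
}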
How txnz Works: a Quick Tour
The following example should make it a bit clearer how txnz works. In the output, you can see the trace file line number, the eid value that txnz assigns to that line, and the whole trace file line itself:
$ mrskew 5184.trc --rc=txnz --group='sprintf("%5d %5d %s", $line, eid(), $text)' --name=all --rc=all
EXP-ID
---------------------------------------------------------------------------------------------------------------------------------
    1     1 Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
    2     1
    3     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862
    4     4
    5     4 PARSE #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576
    6     4 EXEC #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576
    7     4 WAIT #46: nam='SQL*Net message from client' ela= 1 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360347577
    8     4 FETCH #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=1546270724,tim=9267360347577
    9     4 STAT #46 id=1 cnt=1 pid=0 pos=1 obj=0 op='FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)'
   10     4 CLOSE #46:c=0,e=0,dep=0,type=3,tim=9267360347577
   11     4
   12     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862
   13    13
   14    13 PARSE #46:c=10,e=10,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576
   15    13 EXEC #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1546270724,tim=9267360347576
   16    13 WAIT #46: nam='SQL*Net message from client' ela= 1000000 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267361347576
   17    13 FETCH #46:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=1546270724,tim=9267361347576
   18    13 STAT #46 id=1 cnt=1 pid=0 pos=1 obj=0 op='FAST DUAL (cr=0 pr=0 pw=0 time=0 us cost=2 size=0 card=1)'
   19    13 CLOSE #46:c=0,e=0,dep=0,type=3,tim=9267361347576
   20    13
   21     0 WAIT #0: nam='SQL*Net message from client' ela= 1000001 driver id=1952673792 #bytes=4 p3=0 obj#=-1 tim=9267360456862
   22    22
---------------------------------------------------------------------------------------------------------------------------------
TOTAL (22)
The SQL*Net message from client call on line 3 creates a new experience id for line 4, and lines 5 through 11 use that experience id (eid=4). Notice that the SQL*Net message from client call on line 7 does not create a new experience id: its short latency means that it is only a river, so it is part of experience 4. The experience id changes again on line 12, and lines 14 through 20 carry experience id 13. On line 16, there is a SQL*Net message from client call that’s almost long-latency enough to be an ocean, but not quite: it lasted exactly 1.000000 seconds, and remember, txnz defines an ocean as a SQL*Net message from client call with $af > 1.
Notice that each ocean line itself carries an eid value of 0 instead of its own line number. There’s a good reason for this: each ocean itself is not part of any island. So all of the ocean water goes into a separate bucket, with eid=0. Now you can understand why the report on my big real trace file earlier put 24.236626 seconds toward eid=0: this is how big my oceans, combined, were for that file.
There’s Lots More You Can Do
Perhaps now it’s not that difficult for you to imagine using mrskew to drill in on the response time for an individual experience (e.g., with --rc=txnz --where="eid() == 42"). Or chop up a connection pooling trace file into the appropriate pieces that would be required to create individual profiles for all the individual experiences with the Method R Profiler….
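For instance, a command along these lines (borrowing one of the EXP-ID values from the report above) would restrict mrskew’s attention to a single experience:

$ mrskew v11203_ora_26827.trc --rc=txnz.05 --where='eid() == 19547'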
It used to be that when I knew my trace data was going to be “connection pooled,” I’d get very nervous. Not so much anymore.