Ran into a problem trying to upgrade my Oracle instance from 10.2.0.1 to 10.2.0.4:

“Preparing to launch Oracle Universal Installer from /tmp/OraInstall2009-07-14_10-28-22AM. Please wait …oracle@dbserver:/u04/database$ Exception in thread “main” java.lang.UnsatisfiedLinkError: /tmp/OraInstall2009-07-14_10-28-22AM/jre/1.4.2/lib i386/motif21/libmawt.so: ld.so.1: java: fatal: libXm.so.4: open failed: No such file or directory”

but looks like it’s been seen before from looking at this thread on OpenSolaris.org

The workaround is to use the Java version that comes with OpenSolaris:

./runInstaller -jreLoc /usr/java/jre -ignoreSysPrereqs

Once that was complete it was easy to complete the upgrade:

SQL> STARTUP UPGRADE
SQL> SPOOL patch.log
SQL> @?/rdbms/admin/catupgrd.sql
SQL> SPOOL OFF

Some time later:


TIMESTAMP
——————————————————————————–
COMP_TIMESTAMP UPGRD_END 2009-07-14 11:19:21
.
Oracle Database 10.2 Upgrade Status Utility 07-14-2009 11:19:21
.
Component Status Version HH:MM:SS
Oracle Database Server VALID 10.2.0.4.0 00:14:04
JServer JAVA Virtual Machine VALID 10.2.0.4.0 00:02:48
Oracle XDK VALID 10.2.0.4.0 00:00:34
Oracle Database Java Packages VALID 10.2.0.4.0 00:00:30
Oracle Text VALID 10.2.0.4.0 00:00:51
Oracle XML Database VALID 10.2.0.4.0 00:02:28
Oracle Workspace Manager VALID 10.2.0.4.3 00:00:41
Oracle Data Mining VALID 10.2.0.4.0 00:00:24
OLAP Analytic Workspace VALID 10.2.0.4.0 00:00:27
OLAP Catalog VALID 10.2.0.4.0 00:01:04
Oracle OLAP API VALID 10.2.0.4.0 00:00:57
Oracle interMedia VALID 10.2.0.4.0 00:03:22
Spatial VALID 10.2.0.4.0 00:01:29
Oracle Expression Filter VALID 10.2.0.4.0 00:00:14
Oracle Enterprise Manager VALID 10.2.0.4.0 00:01:10
Oracle Rule Manager VALID 10.2.0.4.0 00:00:11
.
Total Upgrade Time: 00:31:22
DOC>#######################################################################
DOC>#######################################################################
DOC>
DOC> The above PL/SQL lists the SERVER components in the upgraded
DOC> database, along with their current version and status.
DOC>
DOC> Please review the status and version columns and look for
DOC> any errors in the spool log file. If there are errors in the spool
DOC> file, or any components are not VALID or not the current version,
DOC> consult the Oracle Database Upgrade Guide for troubleshooting
DOC> recommendations.
DOC>
DOC> Next shutdown immediate, restart for normal operation, and then
DOC> run utlrp.sql to recompile any invalid application objects.
DOC>
DOC>#######################################################################
DOC>#######################################################################

Next stop will be documenting using Swingbench to benchmark Oracle.

Posted by andy, filed under Oracle. Date: July 14, 2009, 5:45 pm | No Comments »

Oracle came to us via the VOSJEC channel for help in diagnosing a customer getting the following error in the alert log during various DML queries:

ORA-27505: IPC error destroying a port
ORA-27300: OS system dependent operation:close failed with status: 9
ORA-27301: OS failure message: Bad file number
ORA-27302: failure occurred at: skgxpdelpt1
ORA-03135: connection lost contact

and

ORA-27509: IPC error receiving a message
ORA-27300: OS system dependent operation:recvmsg failed with status: 95
ORA-27301: OS failure message: Socket operation on non-socket
ORA-27302: failure occurred at: sskgxp_rcvms
Non critical error ORA-00001 caught while writing to trace file
"/opt/oracle/diag/rdbms/tacor3p/tacor3p2/trace/tacor3p2_ora_5758.trc"
Error message: SVR4 Error: 9: Bad file number

Interestingly enough Oracle commented that the customer was not running out of file descriptors and also had enough UDP ports available for their load. Oracle were also unable to replicate this in house but appeared to be consistent at the customer site and only happened when the customer enabled Oracle auditing with the syslog audit trail option. So why did Oracle believe this was an interoperability issue? Well they provided the following truss data:

27873: open("/var/run/syslog_door", O_RDONLY)
= 0
->the bad code now opens the syslog file again, at free fd 0
<snip>
27873: door_info(0, 0xFFFFFFFF7FFF78D8) = 0
27873: getpid() = 27873 [27340]
27873: door_call(0, 0xFFFFFFFF7FFF78A8) = 0
27873: close(0) = 0
->the bad code now closes 0 since it opened the file at 0
<snip>
27873: close(0) Err#9 EBADF
->again, closes fd 0 incorrectly, it is no longer open to anyone
<snip>
27873: recvmsg(0, 0xFFFFFFFF7FFF6F88, 0) Err#95 ENOTSOCK
->SKGXP now tries to use its socket without knowing someone else
closed it and fails
<snip>

So Oracle believed that the open("/var/run/syslog_door" seemed suspicious as the call didn’t originate from the rdbms code and hence some help was required from Sun to progress. Unfortunately that limited truss output didn’t actually show us the where in the process lifecycle we were going wrong so we needed to collect some more data from the customer to work it all out. With truss you can use the -u option to do user level function call tracing i.e:



truss -faeo /var/tmp/out -u a.out:: -u libc:: ./a.out

So, what did this new data actually show us:

5389:
execve("/opt/oracle/product/11.1.0/bin/sqlplus", 0xFFFFFFFF7FFFFB38,
0xFFFFFFFF7FFFFB48) argc = 1
5389: argv: sqlplus
6085/1@1: -> libc:_open(0xffffffff7b3e7498, 0×0, 0×0, 0×0)
6085/1@1: -> libc:__open(0xffffffff7b3e7498, 0×0, 0×0, 0×0)
6085/1: open("/var/run/syslog_door", O_RDONLY) = 0
6085/1@1: <- libc:__open() = 0
6085/1@1: <- libc:_open() = 0
6085/1@1: -> libc:_door_info(0×0, 0xffffffff7fff7978, 0×0, 0×2000)
6085/1: door_info(0, 0xFFFFFFFF7FFF7978) = 0
6085/1@1: <- libc:_door_info() = 0
6085/1@1: -> libc:getpid(0×0, 0xffffffff7fff7978, 0×0, 0×2000)
6085/1: getpid() = 6085 [5389]
6085/1@1: <- libc:getpid() = 6085
6085/1@1: -> libc:_door_call(0×0, 0xffffffff7fff7948, 0×0, 0×2000)
6085/1: door_call(0, 0xFFFFFFFF7FFF7948) = 0
6085/1@1: <- libc:_door_call() = 0
6085/1@1: -> libc:_close(0×0, 0xffffffff7fff7948, 0×0, 0×2000)
6085/1: close(0) = 0
6085/1@1: <- libc:_close() = 0
6085/1@1: <- libc:syslogd_ok() = 1
6085/1@1: <- libc:vsyslog() = 0xffffffff7fff83c8
6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0×0, 0×0, 0×0)
6085/1@1: -> libc:_cancelon(0×0, 0×0, 0×0, 0×0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0×0, 0×1, 0×0, 0×0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0×9, 0×1, 0×0, 0×0)
6085/1@1: -> libc:___errno(0×0, 0×0, 0×0, 0×0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0×1, 0×0, 0×0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0×108c65000, 0×108c65,
0×108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683
6085/1@1: -> libc:gethrtime(0×0, 0×108c00, 0×10a0c7, 0×400)
6085/1@1: -> libc:setjmp(0xffffffff7fffa7e0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0×10a1025b0,
0×10a0bc4b0, 0×46100)
6085/1@1:
-> libc:gettimeofday(0xffffffff7fffa620, 0×0, 0xffffffffffffffff,
0xfffffffffffffff8)
6085/1@1: -> libc:localtime_r(0xffffffff7fffa618,
0xffffffff7fffa5f4, 0×0, 0xec23d)
6085/1@1: -> libc:getsystemTZ(0xffffffff7fffa528, 0×0, 0×0, 0×0)
6085/1@1: -> libc:getenv(0xffffffff7b3e5b68, 0×0, 0×0, 0×0)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

384 /*
385 * Use a door call to syslogd to see if it’s alive.
386 */
387 static int
388 syslogd_ok(void)
389 {
390 int d;
391 int s;
392 door_arg_t darg;
393 door_info_t info;
394
395 if ((d = open(DOORFILE, O_RDONLY)) < 0)
396 return (0);
397 /*
398 * see if our pid matches the pid of the door server.
399 * If so, syslogd has called syslog(), probably as
400 * a result of some name service library error, and
401 * we don’t want to let syslog continue and possibly
402 * fork here.
403 */
404 info.di_target = 0;
405 if (__door_info(d, &info) < 0 || info.di_target == getpid())
{
406 (void) close(d);
407 return (0);
408 }
409 darg.data_ptr = NULL;
410 darg.data_size = 0;
411 darg.desc_ptr = NULL;
412 darg.desc_num = 0;
413 darg.rbuf = NULL;
414 darg.rsize = 0;
415 s = __door_call(d, &darg);
416 (void) close(d);
417 if (s < 0)
418 return (0); /* failure – syslogd dead */
419 else
420 return (1);
421 }

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

340 /* output the message to the local logger */
341 if ((putmsg(LogFile, &ctl, &dat, 0) >= 0) &&
syslogd_ok())
342 return;
343 if (!(LogStat & LOG_CONS))
344 return;

so syslogd_ok returns 1 and then we return out of vsyslog():

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

158 void
159 syslog(int pri, const char *fmt, …)
160 {
161 va_list ap;
162
163 va_start(ap, fmt);
164 vsyslog(pri, fmt, ap);
165 va_end(ap);
166 }

So we’ve returned from vsyslog() then into gettimeofday() and then our
second close() returns EBADF.

6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0×0, 0×0, 0×0)
6085/1@1: -> libc:_cancelon(0×0, 0×0, 0×0, 0×0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0×0, 0×1, 0×0, 0×0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0×9, 0×1, 0×0, 0×0)
6085/1@1: -> libc:___errno(0×0, 0×0, 0×0, 0×0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0×1, 0×0, 0×0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0×108c65000, 0×108c65,
0×108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683

So
we’ve returned from syslog() and are now somewhere else in the code so
nothing to do with the door_calls from my understanding. With that, Oracle development did some further digging and then announced they’d found a new defect in the Oracle code and was logged as Oracle bug 7519558. If Oracle hadn’t have come back then I would have probably used the following Dtrace script to confirm where the EBADF was actually being returned from close():

root@hippy-1 # more badclose.d
syscall::close:return
/pid == $target && errno == EBADF/
{
@badf_ustack[execname,ustack(20)] = count();
stop_pids[pid] = 1;
}

syscall::rexit:entry
/stop_pids[pid] != 0/
{
printf("stopping pid %d", pid);
stop();
system("prun %d", pid);
stop_pids[pid] = 0;
}

END
{
printa(@badf_ustack);
}

root@hippy-1 # dtrace -ws badclose.d -c ./p
dtrace: script ‘badclose.d’ matched 3 probes
dtrace: allowing destructive actions
andy1
CPU ID FUNCTION:NAME
0 28735 rexit:entry stopping pid 27953
dtrace: pid 27953 exited with status 6
1 2 :END
p libc.so.1`_close+0×4
p`poo+0×8
p`main+0×4c
p`_start+0×5c
1

So, another happy ending for all, in that VOSJEC managed to get to the bottom of it using plain old truss rather than Dtrace. It’s worth remembering that although Dtrace is a fantastic diagnosis tool, don’t forget the other great tools available in Solaris in trying to observe and diagnose problems you’re investigating! :)

Posted by hippy, filed under Oracle. Date: March 28, 2009, 3:50 am | No Comments »

Oracle "WARNING: aiowait timed out" error messages in the alert log are always something that needs further investigation as it can indicate some kind of I/O or system resource issue that stopping valuable I/O’s down to storage. Now there could be a whole bunch of reasons why our asynchronous I/O as got stuck. Oracle reports "WARNING: aiowait timed out" messages after 10 mins of non-response since our I/O was queued by aioread(3) or aiowrite(3). It’s also worth being aware of the two method of by which Asynchronous I/O is handled by Solaris. This is either via kernel asynchronous I/O (KAIO) if the underlying filesystem supports it (raw or filesystems with added s/w which give raw access) or the standard userland library implementation. For further background reading then have a look at the Solaris™ Internals: Core Kernel Components book by Jim Mauro; Richard McDougall, chapter 11.4. Asynchronous I/O.

So, to help understand the where in our life cycle we’re stuck we’ll need to collect some data. This also might help us define a better problem statement, which is obviously vitally important in our rational troubleshooting process. See my previous post about the importance of defining a good problem statement

Enabling the trigger environment

A full crash dump is much more preferred over a live core as things tend to change on the fly whilst the dump is being taken and corrupts some of the structures in the dump so we can get strange results. In this example I’m going to assume that we don’t really want to take down the entire box as it’ll result in all services being effected other than Oracle. Obviously, try the live method first and if that doesn’t yield results then try a full dump. So, here’s the steps in enabling a triggered live savecore on an aiowait timed out error message in an Oracle alert log.

1/ You’ll need to setup a dedicated dump device to collect a live savecore. You’ll either need a spare raw partition or have to create a large file using mkfile, see the man page dumpadm(1M) for details on how to do this.

2/ Download livecore_aio and guds script.

3/ Create collect_data.sh:

root@hippy-1 # more collect_data.sh
#!/bin/sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5  -w0 -s<case ref> -X2 -D /var/tmp &
/usr/bin/savecore -L
echo "aiowait coredump event – please guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root

In my case I’m going to fire off collecting guds to gather some performance stats on the box, but you could add anything else you want to run here including a "reboot -d" to take a full crash dump before rebooting the box.

4/ Change perms on binary and script before copying to /var/tmp

chmod +x livecore_aio collect_data.sh
cp collect_data.sh to /var/tmp

note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly

Test run

Test the program and script (as root):

# touch dummy_file
#./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file

This should produce a live savecore dump and kick off guds.

Deploy

*execute the livecore_aio binary
#./livecore_aio <full path of alert_log> &

When issue happens, upload the live core dump and guds data to Sun for analysis.

Posted by hippy, filed under Oracle. Date: January 16, 2009, 8:20 am | No Comments »

Being an engineer it’s always a good feeling getting to the bottom of a problem and none so as this one. Take a T5140, create a 3 way RAID 0 LUN using the internal disks and stick Oracle on top so you can do something useful with it for your application………..and what do you get……… a problem. I suspect after that opening some of you are thinking "where’s he going with this?" …. the answer nowhere, I’m not picking holes in either the T5140 or Oracle…..good, I’m glad we got that one clear! :)

Anyhow, so a customer came to us complaining that their application wasn’t running as expected on this platform and really wanted to know if there was a hardware fault /  bug with platform or operating system running on it. From the tests that the customer had been doing themselves they believed that the bottleneck was the underlying I/O subsystem and in this case the LSI H/W RAID. Essentially,  the customer had configured a 3 disk RAID 0 stripe using the default 64k stripe width like thus:

bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
       Sub                     Size                    Level
               Disk
—————————————————————-
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
               0.1.0   136.6G          GOOD
               0.2.0   136.6G          GOOD
               0.3.0   136.6G          GOOD

They had then created a single slice for which Oracle was installed and configured for Direct I/O (which is a good thing anyway if you’ve a UFS filesystem) so we were avoiding the filesystem buffer cache and double buffering. So for a 64k stripe per disk and three disks gives us a total stripe width of 192k. The throughput performance of each of this disks is between 50-60mb per second which means we have a theoretical throughput on all stripes of 150->180mb per second for reads. We can forget writes as Oracle is pwrite()’ing in 8k synchronous chunks to a non-write enabled volume and only hits one disk (because 8K is less than the 64k stripe size) and hence why we saw a 1Gb tablespace creation take 18 seconds and an average through put of 56mb per second which is what we would have expected for a single disk.

SQL> set timing on
SQL> create tablespace andy3
 2  datafile ‘/u01/oracle/oradata/SUN/andy03.dbf’
 3  size 1g;

Tablespace created.

Elapsed: 00:00:18.12

and iostat -xnz 1 shows us

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0  128.0  104.0 56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  14.0   78.0  112.0 56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0   93.0  112.0 53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0   95.0  104.0 58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0

This result was the same as the customers but things then get interesting when we start looking at full table scan parallel queries. The customer ended up with these results:

 Parallelism Time Throughput (Mb per second) db_file_multiblock_read_count
 0 719 13.8 16 (16 x 8k = 128k)
 2 195 50.9 16
 4 208 47.7 16
 0 420 23.6 32 (32 x 8k = 256k)
 2 163 60.9 32
 4 208 53.9 32

Now, they look bad especially if you think that theoretically we should be able to achieve 150mb ->180mb based on a three disk stripe (3 x 60mb).

Using the same parallel test plan as the customer:

oracle@v4v-t5140a-gmp03~$more para.sql

set timing on;

select /*+ FULL(t) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,2) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,4) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,64) */ count(*) from contact_methods t;

oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r—–   1 oracle   dba         9.7G Oct 24 08:25 test01.dbf

I got these:

SQL> @para

 COUNT(*)
———-
 15700000

Elapsed: 00:00:47.85

 COUNT(*)
———-
 15700000

Elapsed: 00:00:32.53

 COUNT(*)
———-
 15700000

Elapsed: 00:00:34.68

 COUNT(*)
———-
 15700000

Elapsed: 00:00:42.17

whilst the first full table scan is running I see the following in iostat -xnz 1:

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 108.0    0.0 93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 151.0    3.0 95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 115.0    0.0 103367.6    0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 116.0    0.0 102232.7    0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 122.0    3.0 105326.4   48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 116.0    0.0 96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0

and then

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 193.0    0.0 159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 195.5    3.0 163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 220.1    0.0 188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 192.0    0.0 168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 191.0    3.0 162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 190.0    0.0 162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 192.0    0.0 162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 186.0    3.0 153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 191.0    1.0 160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0

when the parallel jobs are running.

This is because I changed the db_file_multiblock_read_count to 128 (128 * 8k = 1M) and in fact I saw improvements using 192k (64k*3) to match the s
tripe width. I also went with some recommendations from here which also helped along with running the latest T5140 firmware and latest KJP 137111-08 to avoid some known performance issues.

It’s amazing that tuning one little option can have such a dramatic effect on the results and also shows that just because you don’t get the results you’re expecting not to assume that there is a fault with the hardware or software. For me personally, it’s always good to understand the results and how you got there, although as in all performance related issues you can sometimes get sidetracked from what’s gone on before and end going down the wrong path. To avoid that,  make sure you chat with you’re colleagues when you feel like you’re not going anywhere as a fresh set of eyes can bring you back on the path and closer to resolution.

Posted by hippy, filed under Oracle. Date: October 25, 2008, 1:02 pm | 1 Comment »

Following from my previous article about capturing TNF data on an Oracle event, I thought I’d also mention that you can do something similar with capturing snoop data.  This really came about because of trying to troubleshooting ORA-29740
issues with Oracle RAC. Most of the time you really want to have a look at these Sunsolve documents for giving hints:

  • Solution 230303 : SunCluster[TM] 3.x: Troubleshooting ORA-29740 in a RAC environment
  • Solution 205145 : Oracle error ORA-29740 (reason 3) resolved with network tuning

But, if you think that network is perhaps something worth chasing and the
actual physical switches & hubs etc don’t hint at a fault then we
could enable the lantracer script to snoop on an interconnect and terminate when Oracle receives an ORA-29740. To do this you’ll need to do the following:

1) in $ORACLE_HOME/bin, place a script called oradbg containing:

#!/bin/sh
echo "stopping Lantracer" &gt;&gt; /tmp/oradbg.$$
/usr/bin/rm /usr/tmp/lan/running
echo `/bin/date` &gt; /tmp/oradbg.$$
echo "done\." &gt;&gt; /tmp/oradbg.$$

2) in an sqlplus "/ as sysdba" session run the following:

alter system set "_oradbg_pathname"="&lt;$ORACLE_HOME&gt;/bin/oradbg" scope=spfile;
alter system set event="29740 debug" scope=spfile;

where $ORACLE_HOME is the expanded path of the $ORACLE_HOME env var

3) add execute to the oradbg script

4) restart the instance

5/ create /var/tmp/lan and copy lantracer script to it. chmod 755 lantracer

6/ start lantracer

cd /var/tmp/lan; ./lantracer &gt;/dev/null 2&gt;&amp;1 &amp;

7/ check oracle can stop lantracer by using it to detect a 942 event:

 sqlplus &lt;user&gt;/&lt;password&gt;
alter session set events '942 debug';
select * from &lt;a table that doesn't exist&gt;;

This should remove /var/tmp/lan/running and leave us with some snoop
scripts which should hopefully have captured data during the event.

Posted by hippy, filed under Oracle. Date: October 1, 2008, 7:29 am | No Comments »

« Previous Entries