Capturing TNF Data On An Oracle Event

Recently, I got involved in an interesting VOSJEC problem in which Oracle was core dumping at intermittent times with ORA-7445 on a large Solaris 8 system with no known root cause. Unfortunately the core dump from Oracle or the trace files helped either Oracle or us to working out the issue so a colleague of mine Tim Uglow managed to create a TNF package to collect TNF data constantly and then stop/dump the TNF kernel buffer out to disk at the time that the Oracle event happened. That way we’d be able to see what led up to the event.

To allow oracle processes running as user "oracle" to stop the tnf kernel tracing we need to change the permissions on the tnf driver. Currently you have ..

# ls -alL /dev/tnf*
crw-------   1 root     sys      114,  0 Apr 10 22:02 /dev/tnfctl
crw-------   1 root     sys      114,  1 Apr 10 22:02 /dev/tnfmap

and we need to change it to …

# ls -alL /dev/tnf*
crw-rw-rw-   1 root     sys      114,  0 Jun 27 15:57 /dev/tnfctl
crw-rw-rw-   1 root     sys      114,  1 Jun 27 15:57 /dev/tnfmap

This is done by the following two commands…

# rem_drv tnf
# add_drv -m '* 0666 root sys' tnf

the install script for JAVAtnfd.datastream.pkg package will do this for you and the uninstall script will put it back to its original state. This change would allow ordinary users to affect the kernel tracing but most users don’t know about the subsystem and the only information they can gather is kernel events, there is no actual data visible.

Installing the scripts

1) copy JAVAtnfd.datastream.pkg to /tmp on your machine
2) pkgadd -d  /tmp/JAVAtnfd.datastream.pkg

You may get warning messages like ..

"Device busy
Cannot unload module: tnf
Will be unloaded upon reboot."

These can be ignored.. devfsadm will have changed the permissions.

To start the tnf tracing..

as root please run …

# cd /opt/JAVAtnfd/data 
# /usr/bin/nohup /opt/JAVAtnfd/bin/tnfscript &

then check in /opt/JAVAtnfd/data/nohup.out for errors, every time you start it it will append to nohup.out.

If you needed to stop the tracing ..

looking in /opt/JAVAtnfd/data/nohup.out for the last line like "tnfscript pid XXXX starting" and then run …

# kill XXXX

should stop it.

When the oracle debug executes it will touch /opt/JAVAtnfd/data/stop and stop the tnf tracing and that will cause the tnfscript to exit at the end of its 10 minute loop – look at the end of /opt/JAVAtnfd/data/nohup.out for "tnfscript pid XXXX exiting as requested by Oracle". Before it exits it will extract the required files into /opt/JAVAtnfd/data/*.’date’ and email root the location of the four files that need to be sent to support together with the Oracle alert log and the trace file that will be mentioned in the alert.log file, something like….

Errors in file <path to trace file>
ORA-7445...
<path to trace file> is also required.

Instructions from Oracle to get Oracle processes to stop the tnf on a ORA-7445 error.

1) in init.ora define the following 2 parameters:

  event="7445 debug"
  _oradbg_pathname=<$ORACLE_HOME>/bin/oradbg

or you can set via and sqlplus session:

SQL> alter system set "_oradbg_pathname"="/u01/oracle/product/10.2.0/db/bin/oradbg" scope=both;
System altered.
SQL> alter system set event="7445 debug" scope=spfile;
System altered.

where $ORACLE_HOME is the expanded path of the $ORACLE_HOME environment variable

2) in $ORACLE_HOME/bin, place a script called oradbg containing:

  #!/bin/sh
  echo "stopping TNF collection using prex" >> /tmp/oradbg.$$
  echo "ktrace off" | /usr/bin/prex -k
  echo $$ > /opt/JAVAtnfd/data/stop
  echo `/bin/date` > /tmp/oradbg.$$
  echo "done." >> /tmp/oradbg.$$

3) add execute to the oradbg script

4) restart the instance

Upon a 7445 error being encountered (whereby oracle’s signal handler spots a sigsegv/core dump) we will disable the prex via the built-in oracle ‘debug’ event and a file will be written to /tmp containing the date and time and a note that we stopped prex. It willalso touch the file /opt/JAVAtnfd/data/stop which will signal the Sun tnf script to stop. Obviously if the customer has any path differences for any of the bins they should modify accordingly. This can easily be tested after restarting the instance by running:

sqlplus <user>/<password>
 alter session set events '942 debug';
 select * from <a table that doesn't exist>;

This will generate an error along the lines: <

ERROR at line 1:
ORA-00942: table or view does not exist

and the debug event will fire on the temporary session-level dummy version of the event and will run the prex command. They can then check /tmp for a new file with the details called oradbg. Obviously you can trace any other Oracle event by just changing the events debug in sqlplus which might make problem resolution potentially a little easier on Solaris prior to Solaris 10 and the world of Dtrace! 🙂