The biggest piece of advice I can give those of you about to upgrade with lots of custom tunables in /etc/system is: read the manual (RTFM if you’re feeling particularly vocal). No, seriously, I mean it! :) You only have to read the Solaris tunables reference manual, as it actually discusses upgrading to newer releases with older /etc/system tunables:

"We recommend that you start with an empty /etc/system file when moving to a new Solaris
release. As a first step, add only those tunables that are required by in-house or third-party
applications. Any tunables that involve System V IPC (semaphores, shared memory, and
message queues) have been modified in the Solaris 10 release and should be changed in your
environment. For more information, see “System V IPC Configuration” on page 21. After
baseline testing has been established, evaluate system performance to determine if additional
tunable settings are required."

So, that’s a case of move it out of the way and start from scratch. :) Obviously, speak to your application vendors about anything that is required to run the application, but other than that, see how things go and only change settings when and where necessary; otherwise you could run into other problems.

The only application I’ll make specific points about is Oracle: with Solaris 10 we’ve introduced resource controls, so the shared memory and semaphore settings no longer need to be defined in /etc/system. See the Oracle installation guide or Solution 208623: Solaris[TM] 10 Operating System: System V Inter-Process Communication (IPC) resource controls for further details.
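To see which entries in an old /etc/system fall into the System V IPC category described above, something like the following could be used. It is only a sketch: the function name is made up for illustration, and it assumes a POSIX grep that understands -E (on Solaris 10 that is /usr/xpg4/bin/grep rather than /usr/bin/grep).

```sh
# List legacy System V IPC tunables in an /etc/system-style file.
# Assumption: a POSIX grep with -E is first in PATH
# (on Solaris 10 that is /usr/xpg4/bin/grep, not /usr/bin/grep).

# list_ipc_tunables FILE   (hypothetical helper name)
# Prints every active "set shmsys:...", "set semsys:..." or
# "set msgsys:..." line. Comment lines (starting with "*") never
# match because the pattern is anchored on "set".
list_ipc_tunables() {
    grep -E '^[[:space:]]*set[[:space:]]+(shmsys|semsys|msgsys):' "$1"
}

# Example: list_ipc_tunables /etc/system
```

Anything this prints is a candidate for replacing with a resource control such as project.max-shm-memory; check the Oracle installation guide for the values your release actually needs.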

Posted by hippy, filed under Performance. Date: January 19, 2009, 10:05 am

Oracle "WARNING: aiowait timed out" error messages in the alert log always need further investigation, as they can indicate some kind of I/O or system resource issue that is stopping valuable I/Os from getting down to storage. There could be a whole bunch of reasons why our asynchronous I/O has got stuck. Oracle reports "WARNING: aiowait timed out" messages after 10 minutes of non-response since the I/O was queued by aioread(3) or aiowrite(3). It’s also worth being aware of the two methods by which asynchronous I/O is handled by Solaris: either via kernel asynchronous I/O (KAIO), if the underlying filesystem supports it (raw devices, or filesystems with added software that gives raw access), or via the standard userland library implementation. For further background reading, have a look at chapter 11.4, Asynchronous I/O, of the Solaris™ Internals: Core Kernel Components book by Jim Mauro and Richard McDougall.
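As a quick first check, it can help to see how often the warning has actually been logged. A minimal sketch follows; the function name is hypothetical and the alert log path is whatever your instance uses.

```sh
# count_aiowait FILE   (hypothetical helper name)
# Prints the number of lines in FILE that contain the warning text.
# Note: grep -c prints 0 and returns a non-zero status when nothing matches.
count_aiowait() {
    grep -c 'WARNING: aiowait timed out' "$1"
}

# Example: count_aiowait <full path of alert_log>
```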

So, to help understand where in the I/O life cycle we’re stuck, we’ll need to collect some data. This might also help us define a better problem statement, which is vitally important in our rational troubleshooting process. See my previous post about the importance of defining a good problem statement.

Enabling the trigger environment

A full crash dump is much preferred over a live core, as things tend to change on the fly whilst a live dump is being taken, which corrupts some of the structures in the dump so we can get strange results. In this example I’m going to assume that we don’t really want to take down the entire box, as that would affect all services on it, not just Oracle. Obviously, try the live method first and if that doesn’t yield results then try a full dump. So, here are the steps for enabling a triggered live savecore on an aiowait timed out error message in an Oracle alert log.

1/ You’ll need to set up a dedicated dump device to collect a live savecore. You’ll either need a spare raw partition or have to create a large file using mkfile; see the dumpadm(1M) man page for details on how to do this.

2/ Download the livecore_aio binary and the guds script.

3/ Create collect_data.sh:

root@hippy-1 # more collect_data.sh
#!/bin/sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5  -w0 -s<case ref> -X2 -D /var/tmp &
/usr/bin/savecore -L
echo "aiowait coredump event - please send guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root

In my case I’m going to fire off collecting guds to gather some performance stats on the box, but you could add anything else you want to run here including a "reboot -d" to take a full crash dump before rebooting the box.

4/ Change perms on binary and script before copying to /var/tmp

chmod +x livecore_aio collect_data.sh
cp collect_data.sh /var/tmp

note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly
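For step 1, the file-backed variant might look roughly like the sketch below. The path and size are placeholders, the function name is made up, and whether a plain file is accepted as a dump device depends on your filesystem and release, so check dumpadm(1M) first. By default the commands are only printed, not run.

```sh
# Sketch: prepare a file-backed dedicated dump device for "savecore -L".
# Assumptions: path and size are placeholders; the file lives on a
# filesystem that dumpadm accepts for dump files on your release.
# RUN defaults to "echo", so the commands are only printed.
# Call with RUN set to the empty string (as root) to really execute them.

# setup_dump_file PATH SIZE   (hypothetical helper name)
setup_dump_file() {
    dump_file=$1    # e.g. /export/dumpfile (hypothetical path)
    dump_size=$2    # e.g. 4g; size it for the memory in use on your box
    ${RUN-echo} mkfile "$dump_size" "$dump_file"
    ${RUN-echo} dumpadm -d "$dump_file"
}

# Dry run:  setup_dump_file /export/dumpfile 4g
# For real: RUN= setup_dump_file /export/dumpfile 4g
```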

Test run

Test the program and script (as root):

# touch dummy_file
# ./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file

This should produce a live savecore dump and kick off guds.
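If you want a feel for what is being tested here, the trigger can be approximated in plain shell. To be clear, this is not the livecore_aio source, just a rough stand-in under my own assumptions: the function names are invented, and it needs a POSIX shell and a tail that accepts -n (on Solaris 10, the /usr/xpg4/bin versions).

```sh
# Rough stand-in for a log-triggered collector; NOT the livecore_aio source.
# Assumptions: POSIX shell and tail (on Solaris 10, /usr/xpg4/bin variants).
TRIGGER='WARNING: aiowait timed out'
COLLECT=/var/tmp/collect_data.sh

# is_trigger LINE: succeed if LINE contains the trigger text
is_trigger() {
    case $1 in
        *"$TRIGGER"*) return 0 ;;
        *) return 1 ;;
    esac
}

# watch_log FILE: follow lines appended to FILE and run the collector
# once, on the first line that contains the trigger text.
watch_log() {
    # -n 0 skips lines already in the file, so old warnings do not fire.
    # After "break", tail lingers until its next write fails on the closed pipe.
    tail -n 0 -f "$1" | while read -r line; do
        if is_trigger "$line"; then
            "$COLLECT"
            break
        fi
    done
}

# Example: watch_log dummy_file &
```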

Deploy

* Execute the livecore_aio binary:
# ./livecore_aio <full path of alert_log> &

When the issue happens, upload the live core dump and guds data to Sun for analysis.

Posted by hippy, filed under Oracle. Date: January 16, 2009, 8:20 am