Capturing data on Oracle "WARNING: aiowait timed out" events

Oracle "WARNING: aiowait timed out" error messages in the alert log are always something that needs further investigation as it can indicate some kind of I/O or system resource issue that stopping valuable I/O’s down to storage. Now there could be a whole bunch of reasons why our asynchronous I/O as got stuck. Oracle reports "WARNING: aiowait timed out" messages after 10 mins of non-response since our I/O was queued by aioread(3) or aiowrite(3). It’s also worth being aware of the two method of by which Asynchronous I/O is handled by Solaris. This is either via kernel asynchronous I/O (KAIO) if the underlying filesystem supports it (raw or filesystems with added s/w which give raw access) or the standard userland library implementation. For further background reading then have a look at the Solaris™ Internals: Core Kernel Components book by Jim Mauro; Richard McDougall, chapter 11.4. Asynchronous I/O.

So, to help understand the where in our life cycle we’re stuck we’ll need to collect some data. This also might help us define a better problem statement, which is obviously vitally important in our rational troubleshooting process. See my previous post about the importance of defining a good problem statement

Enabling the trigger environment

A full crash dump is much more preferred over a live core as things tend to change on the fly whilst the dump is being taken and corrupts some of the structures in the dump so we can get strange results. In this example I’m going to assume that we don’t really want to take down the entire box as it’ll result in all services being effected other than Oracle. Obviously, try the live method first and if that doesn’t yield results then try a full dump. So, here’s the steps in enabling a triggered live savecore on an aiowait timed out error message in an Oracle alert log.

1/ You’ll need to setup a dedicated dump device to collect a live savecore. You’ll either need a spare raw partition or have to create a large file using mkfile, see the man page dumpadm(1M) for details on how to do this.

2/ Download livecore_aio and guds script.

3/ Create collect_data.sh:

root@hippy-1 # more collect_data.sh
#!/bin/sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5  -w0 -s<case ref> -X2 -D /var/tmp &
/usr/bin/savecore -L
echo "aiowait coredump event – please guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root

In my case I’m going to fire off collecting guds to gather some performance stats on the box, but you could add anything else you want to run here including a "reboot -d" to take a full crash dump before rebooting the box.

4/ Change perms on binary and script before copying to /var/tmp

chmod +x livecore_aio collect_data.sh
cp collect_data.sh to /var/tmp

note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly

Test run

Test the program and script (as root):

# touch dummy_file
#./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file

This should produce a live savecore dump and kick off guds.

Deploy

*execute the livecore_aio binary
#./livecore_aio <full path of alert_log> &

When issue happens, upload the live core dump and guds data to Sun for analysis.