Diagnosing a VxFS filesystem performance issue

I recently got involved in an interesting performance problem in which the customer was seeing the time to copy a 500MB file to two different VxFS filesystems on the same box differ by up to a factor of 4 in some cases, i.e. "timex cp <500MB file> /vxfs_filesystem". The only things which had changed (I’m actually smiling at the word *only*) between the two dates were:

* Various OS Patches
* Symantec Veritas Volume Manager Patch
* Upgraded LPFC Driver Servers
* Upgraded firmware
* Upgraded PowerPath Servers
* Upgraded SYMCLI

which doesn’t immediately help in narrowing down probable or possible causes, because far too many things changed at once! So how does one actually track down what’s happening in the lifecycle of the "cp" on a Solaris 9 box? TNF to the rescue again. Using the TNF wrapper script from one of my previous posts, we could enable and disable TNF while the cp was happening (the capture workflow is sketched after the two extracts below). Once we’d got that data from the customer and extracted and processed the TNF buffers, I first looked at the TNF truss outputs for the non-working and working filesystems:

[andharr@exdev:~/tnf/tnf/notworking]$ more tnf.truss
looking for pid  12045
Now(ms)          duration(ms)   pid     thread             call           return       errno
48335.67488000   15.83081400    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48351.53089800   254.21403800   12045    0x30101ab09a0    write          0x800000       0
48605.75773800   16.39489300    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48622.17487400   244.79986500   12045    0x30101ab09a0    write          0x800000       0
48866.98866100   16.51242700    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48883.52309100   263.08493800   12045    0x30101ab09a0    write          0x800000       0
49146.62003100   16.96265000    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49163.60188400   262.34315500   12045    0x30101ab09a0    write          0x800000       0
49425.95560000   13.35143400    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49439.32559600   256.47619000   12045    0x30101ab09a0    write          0x800000       0
49695.81298800   13.16556800    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49708.99975900   246.71828900   12045    0x30101ab09a0    write          0x800000       0
49955.73060900   15.93346900    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49971.68440100   254.45911200   12045    0x30101ab09a0    write          0x800000       0
50226.15543500   16.82535100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50242.98686700   262.94091700   12045    0x30101ab09a0    write          0x800000       0
50505.93946600   15.95571100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50521.91406000   254.57880900   12045    0x30101ab09a0    write          0x800000       0
50776.50887100   16.54139300    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50793.06922600   263.14118400   12045    0x30101ab09a0    write          0x800000       0
51056.22369200   13.62531200    12045    0x30101ab09a0    mmap64         0x7fffffff     0
51069.86692700   265.74658200   12045    0x30101ab09a0    write          0x800000       0
51335.62423100   14.17114700    12045    0x30101ab09a0    mmap64         0x7fffffff     0
51349.81458000   266.32114200   12045    0x30101ab09a0    write          0x800000       0
51616.14980400   14.28020100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
<snip>

[andharr@exdev:~/tnf/tnf/working]$ more tnf.truss
looking for pid  16545
<snip>
Now(ms)          duration(ms)   pid     thread             call           return       errno
74.48367000   0.09929400     16545    0x30129365ce0    mmap64         0x7fffffff     0
74.60224600   145.95268900   16545    0x30129365ce0    write          0x800000       0
220.56909700   15.92858800    16545    0x30129365ce0    mmap64         0x7fffffff     0
236.51624700   145.57631700   16545    0x30129365ce0    write          0x800000       0
382.11048600   17.13331300    16545    0x30129365ce0    mmap64         0x7fffffff     0
399.25084000   148.40430500   16545    0x30129365ce0    write          0x800000       0
547.67306700   15.89682400    16545    0x30129365ce0    mmap64         0x7fffffff     0
563.59125400   143.89640600   16545    0x30129365ce0    write          0x800000       0
707.50126200   16.40793300    16545    0x30129365ce0    mmap64         0x7fffffff     0
723.91583600   144.97527400   16545    0x30129365ce0    write          0x800000       0
868.90455200   15.84185600    16545    0x30129365ce0    mmap64         0x7fffffff     0
884.76377000   145.10353100   16545    0x30129365ce0    write          0x800000       0
1029.88114300   15.60142300    16545    0x30129365ce0    mmap64         0x7fffffff     0
1045.50632900   145.67129000   16545    0x30129365ce0    write          0x800000       0
1191.19362100   16.41825600    16545    0x30129365ce0    mmap64         0x7fffffff     0
1207.63660000   144.74260200   16545    0x30129365ce0    write          0x800000       0
1352.39696400   17.08674700    16545    0x30129365ce0    mmap64         0x7fffffff     0
1369.49027200   143.29896300   16545    0x30129365ce0    write          0x800000       0
1512.80107700   14.10729700    16545    0x30129365ce0    mmap64         0x7fffffff     0
1526.91309500   144.05898900   16545    0x30129365ce0    write          0x800000       0
1670.98672600   16.63788500    16545    0x30129365ce0    mmap64         0x7fffffff     0
1687.63237200   141.60897100   16545    0x30129365ce0    write          0x800000       0
1829.25798500   15.96867300    16545    0x30129365ce0    mmap64         0x7fffffff     0
1845.23249900   144.46080400   16545    0x30129365ce0    write          0x800000       0
1989.70738500   15.94683000    16545    0x30129365ce0    mmap64         0x7fffffff     0
<snip>
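
For reference, the rough capture workflow behind these extracts looks something like the following. It’s a minimal sketch using the standard Solaris TNF tools (prex, tnfxtract and tnfdump) rather than the exact wrapper script from the earlier post, and the buffer size and output paths are purely illustrative:

# Enable kernel TNF probes just before kicking off the copy (as root)
prex -k <<'EOF'
buffer alloc 4m
enable $all
trace $all
ktrace on
EOF

timex cp <500MB file> /vxfs_filesystem    # the workload being traced

# Stop tracing, then extract the kernel buffer and convert it to text
echo "ktrace off" | prex -k
tnfxtract /var/tmp/cp_trace.tnf
tnfdump /var/tmp/cp_trace.tnf > /var/tmp/cp_trace.txt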

Now that’s interesting: we see a clear difference in the time taken for write(), roughly 250ms per 8MB write on the slow filesystem versus roughly 145ms on the fast one. So what’s happening during the write() lifecycle?

<not working>
    48545.230384         0.006560 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690624  size: 0x10000  buf: 0x3004185e058  flags: 558353 0,0
    48545.277271         0.000080 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39692928  size: 0x10000  buf: 0x312eec85688  flags: 1289 0,0
    48545.416170         0.000800 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690752  size: 0x10000  buf: 0x327286721b0  flags: 558353 0,0
    48545.454655         0.002640 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693056  size: 0x10000  buf: 0x506b36947c8  flags: 1289 0,0
    48545.544348         0.008562 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690880  size: 0x10000  buf: 0x3004185f458  flags: 558353 0,0
    48545.580753         0.012962 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693184  size: 0x10000  buf: 0x312ead7be08  flags: 1289 0,0
    48545.642601         0.004721 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39691008  size: 0x10000  buf: 0x30070b4af30  flags: 558353 0,0
    48545.676366         0.003361 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693312  size: 0x10000  buf: 0x3006ca9bb20  flags: 1289 0,0
    48545.735814         0.003681 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39691136  size: 0x10000  buf: 0x327cd4835c8  flags: 558353 0,0
<snip>

So on the non-working filesystem our I/O appears to be issued in 64KB (0x10000) chunks, but how long did a single I/O actually take? Matching a strategy probe with its corresponding biodone for the same buf address gives the physical I/O time:
48545.230384  –  12045 1 0x30101ab09a0 224 strategy device: 273,4015  block: 39690624 size: 0x10000 buf: 0x3004185e058 flags: 0x88511
48549.243496  –  0 0 0x2a100477d20 64 biodone device: 273,4015  block: 39690624 buf: 0x3004185e058

48549.243496 - 48545.230384 = 4.013112ms

<working>
      236.516247         0.000720 16545     1 0x30129365ce0  65 syscall_start              sysnum: 4 write
      236.972870         0.004481 16545     1 0x30129365ce0  65 address_fault              address: 0xfe800000  fault_type: 2  access: 1
      238.215601         0.068810 16545     1 0x30129365ce0  65 strategy                   device: 0x11100000fa9  block: 24838144  size: 0x100000  buf: 0x300359fa3a8  flags: 289 0,0
      238.247765         0.002240 16545     1 0x30129365ce0  65 strategy                   device: 0x12400000022  block: 24840448  size: 0x100000  buf: 0x30038cd24a0  flags: 297 0,0
      238.461314         0.002240 16545     1 0x30129365ce0  65 thread_block               reason: 0x300359fa468  stack:  4 0x114190801048bb0 0x10f28ec01080b30 0x78a4d834789aa770 0x78a7dd6078a798b0
<snip>

238.215601  –  16545 1 0x30129365ce0 65 strategy device: 273,4009  block: 24838144 size: 0x100000 buf: 0x300359fa3a8 flags: 0x121
254.402304  –  0 0 0x2a100471d20 64 biodone device: 273,4009  block: 24838144 buf: 0x300359fa3a8

254.402304 - 238.215601 = 16.186703ms

So the time to move 8MB in 1MB chunks:

16.186703 * 8 = 129.493624ms

and 8MB in 64KB chunks:

4.013112 * 128 = 513.678336ms (in practice it will be somewhat less than that, because the smaller I/Os can be queued and completed in parallel)
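
Rather than eyeballing the strategy/biodone pairs by hand, the matching above can be scripted against the processed tnfdump text. This is only a rough sketch: the field layout is an assumption based on the output shown here (buf addresses follow a "buf:" token and the timestamp in ms is the first column), and tnf.dump is a placeholder for the processed dump file:

# Pair each strategy probe with its biodone by buffer address and print the
# elapsed time of every physical I/O in milliseconds
nawk '
/ strategy / {
        for (i = 1; i <= NF; i++)
                if ($i == "buf:") start[$(i + 1)] = $1    # remember start time per buf
}
/ biodone / {
        for (i = 1; i <= NF; i++)
                if ($i == "buf:" && $(i + 1) in start) {
                        printf("buf %s  %.6f ms\n", $(i + 1), $1 - start[$(i + 1)])
                        delete start[$(i + 1)]
                }
}' tnf.dump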

So it would seem that the difference in time can be explained by the way the I/O is broken up on each filesystem. Perhaps the way in which each filesystem has been configured explains the difference? Unfortunately, I didn’t see any differences between the filesystems’ tunables themselves, but some of the configuration details did match the physical write sizes we observed:

Filesystem i/o parameters for /filesystem
read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 1
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 8130396160
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0
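
Incidentally, a listing like the one above is what vxtunefs prints for a mounted filesystem, and individual tunables can also be changed on the fly. A hedged example follows; the mount point and value are purely illustrative, and persistent settings would normally go in /etc/vx/tunefstab rather than being set by hand:

# Display the current VxFS I/O tunables for a mounted filesystem
vxtunefs /vxfs_filesystem

# Illustrative only: bump the preferred write size on the current mount
vxtunefs -o write_pref_io=1048576 /vxfs_filesystem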

Now our write pattern ties in: the non-working cp is issuing I/O at the preferred write size of 65536 (write_pref_io above), whereas the working one is using 1048576-byte chunks, which could be due to discovered direct I/O (each write is far larger than discovered_direct_iosz, and both buf_breakup_size and max_direct_iosz are 1048576). Each write() is being asked to complete 8388608 bytes (8MB), as we can see from the truss:

29986:   0.0146 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x07000000) = 0xFE800000
29986:   0.2394 write(4, "".., 8388608) = 8388608
29986:   0.0107 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x07800000) = 0xFE800000
29986:   0.2599 write(4, "".., 8388608) = 8388608

So why is our I/O being broken into smaller chunks on one filesystem and not on the other? Ah, what if one of the filesystems was more fragmented than the other? Yes, that could explain the difference if VxFS internally chooses one I/O size or the other depending on the state of the filesystem. It’s interesting reading the VxFS administration guide, as it clearly documents the possible performance implications of fragmentation (yes, RTFM is applicable here):

Fragmentation reduces performance and availability. Regular use of fsadm’s fragmentation reporting and reorganization facilities is therefore advisable. The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs using the cron command. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the df -o s commands. There are three factors which can be used to determine the degree of fragmentation:

    * Percentage of free space in extents of less than eight blocks in length
    * Percentage of free space in extents of less than 64 blocks in length
    * Percentage of free space in extents of length 64 blocks or greater

An unfragmented file system will have the following characteristics:

    * Less than 1 percent of free space in extents of less than eight blocks in length
    * Less than 5 percent of free space in extents of less than 64 blocks in length
    * More than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

A badly fragmented file system will have one or more of the following characteristics:

    * Greater than 5 percent of free space in extents of less than 8 blocks in length
    * More than 50 percent of free space in extents of less than 64 blocks in length
    * Less than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

The optimal period for scheduling of extent reorganization runs can be determined by choosing a reasonable interval, scheduling fsadm runs at the initial interval, and running the extent fragmentation report feature of fsadm before and after the reorganization. The "before" result is the degree of fragmentation prior to the reorganization. If the degree of fragmentation is approaching the figures for bad fragmentation, reduce the interval between fsadm runs. If the degree of fragmentation is low, increase the interval between fsadm runs. The "after" result is an indication of how well the reorganizer has performed. The degree of fragmentation should be close to the characteristics of an unfragmented file system. If not, it may be a good idea to resize the file system; full file systems tend to fragment and are difficult to defragment. It is also possible that the reorganization is not being performed at a time during which the file system in question is relatively idle.
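
A hedged sketch of the report/reorganise cycle the guide describes is below; the mount point is a placeholder and the full path to the VxFS fsadm binary can differ between releases:

# Report extent fragmentation before doing anything
fsadm -F vxfs -E /vxfs_filesystem

# Reorganise (defragment) extents, then report again to gauge the improvement
fsadm -F vxfs -e /vxfs_filesystem
fsadm -F vxfs -E /vxfs_filesystem

# Example crontab entry for a weekly run on a busy filesystem (Sundays at 02:00)
0 2 * * 0 /opt/VRTS/bin/fsadm -F vxfs -e /vxfs_filesystem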

Conclusion

As with all performance-related issues, you really need to understand the lifecycle of whatever you’re interested in so you can ideally observe what is (or isn’t) happening! In this case I mainly used truss and TNF to drill down and see what was going on. However, what should happen well before any of that is understanding what the actual impact is; otherwise, who cares that there is a difference if it isn’t causing a problem? In this case it was the DBAs who noticed the difference, but in all honesty it wasn’t actually causing a business impact! 🙂 Typical! Still, it’s been another learning experience in using TNF for the odd case where DTrace (and in this case scsi.d) might also have helped, had the system been running Solaris 10.
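
For the record, on Solaris 10 the same strategy-to-biodone style of measurement could be approximated with the DTrace io provider. A rough, untested sketch keyed on the buf pointer, much the same idea as the TNF pairing earlier:

# Distribution of physical I/O completion times (microseconds) per device
dtrace -n '
io:::start { ts[arg0] = timestamp; }
io:::done /ts[arg0]/ {
        @[args[1]->dev_statname] = quantize((timestamp - ts[arg0]) / 1000);
        ts[arg0] = 0;
}'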

If you are using VxFS then you might want to look at this script, which automatically checks whether a filesystem is fragmented; if it is, it automatically kicks off a defrag. Aren’t automatic scripts great! 🙂