I recently got involved in an interesting performance problem in which the customer was experiencing differences in copying a 500mb file to 2 different VxFS filesystems on the same box by up to a factor of 4 in some cases ie "timex cp <500mb file> /vxfs_filesystem". The only thing which had changed (I’m actually smiling at the word *only*) between 2 dates were:
* Various OS Patches
* Symantec Veritas Volume Manager Patch
* Upgraded LPFC Driver Servers
* Upgraded firmware
* Upgraded Power Path Servers
* Upgraded SYMCLI
which doesn’t help immediately in thinking about probable or possible causes due to far too many things changing! So how does one actually track down what’s happening in the lifecycle of the "cp" on a Solaris 9 box…… TNF to the rescue again. Using the TNF wrapper script from one of my previous post, we could enable and disable TNF whilst the cp was happening. So once we’d got that data from the customer and extracted / processed the TNF buffers I first looked at the TNF truss outputs:
[andharr@exdev:~/tnf/tnf/notworking]$ more tnf.truss
looking for pid 12045
Now(ms) duration(ms) pid thread call return errno
48335.67488000 15.83081400 12045 0×30101ab09a0 mmap64 0×7fffffff 0
48351.53089800 254.21403800 12045 0×30101ab09a0 write 0×800000 0
48605.75773800 16.39489300 12045 0×30101ab09a0 mmap64 0×7fffffff 0
48622.17487400 244.79986500 12045 0×30101ab09a0 write 0×800000 0
48866.98866100 16.51242700 12045 0×30101ab09a0 mmap64 0×7fffffff 0
48883.52309100 263.08493800 12045 0×30101ab09a0 write 0×800000 0
49146.62003100 16.96265000 12045 0×30101ab09a0 mmap64 0×7fffffff 0
49163.60188400 262.34315500 12045 0×30101ab09a0 write 0×800000 0
49425.95560000 13.35143400 12045 0×30101ab09a0 mmap64 0×7fffffff 0
49439.32559600 256.47619000 12045 0×30101ab09a0 write 0×800000 0
49695.81298800 13.16556800 12045 0×30101ab09a0 mmap64 0×7fffffff 0
49708.99975900 246.71828900 12045 0×30101ab09a0 write 0×800000 0
49955.73060900 15.93346900 12045 0×30101ab09a0 mmap64 0×7fffffff 0
49971.68440100 254.45911200 12045 0×30101ab09a0 write 0×800000 0
50226.15543500 16.82535100 12045 0×30101ab09a0 mmap64 0×7fffffff 0
50242.98686700 262.94091700 12045 0×30101ab09a0 write 0×800000 0
50505.93946600 15.95571100 12045 0×30101ab09a0 mmap64 0×7fffffff 0
50521.91406000 254.57880900 12045 0×30101ab09a0 write 0×800000 0
50776.50887100 16.54139300 12045 0×30101ab09a0 mmap64 0×7fffffff 0
50793.06922600 263.14118400 12045 0×30101ab09a0 write 0×800000 0
51056.22369200 13.62531200 12045 0×30101ab09a0 mmap64 0×7fffffff 0
51069.86692700 265.74658200 12045 0×30101ab09a0 write 0×800000 0
51335.62423100 14.17114700 12045 0×30101ab09a0 mmap64 0×7fffffff 0
51349.81458000 266.32114200 12045 0×30101ab09a0 write 0×800000 0
51616.14980400 14.28020100 12045 0×30101ab09a0 mmap64 0×7fffffff 0
<snip>
[andharr@exdev:~/tnf/tnf/working]$ more tnf.truss
looking for pid 16545
<snip>
Now(ms) duration(ms) pid thread call return errno
74.48367000 0.09929400 16545 0×30129365ce0 mmap64 0×7fffffff 0
74.60224600 145.95268900 16545 0×30129365ce0 write 0×800000 0
220.56909700 15.92858800 16545 0×30129365ce0 mmap64 0×7fffffff 0
236.51624700 145.57631700 16545 0×30129365ce0 write &nbs
p; 0×800000 0
382.11048600 17.13331300 16545 0×30129365ce0 mmap64 0×7fffffff 0
399.25084000 148.40430500 16545 0×30129365ce0 write 0×800000 0
547.67306700 15.89682400 16545 0×30129365ce0 mmap64 0×7fffffff 0
563.59125400 143.89640600 16545 0×30129365ce0 write 0×800000 0
707.50126200 16.40793300 16545 0×30129365ce0 mmap64 0×7fffffff 0
723.91583600 144.97527400 16545 0×30129365ce0 write 0×800000 0
868.90455200 15.84185600 16545 0×30129365ce0 mmap64 0×7fffffff 0
884.76377000 145.10353100 16545 0×30129365ce0 write 0×800000 0
1029.88114300 15.60142300 16545 0×30129365ce0 mmap64 0×7fffffff 0
1045.50632900 145.67129000 16545 0×30129365ce0 write 0×800000 0
1191.19362100 16.41825600 16545 0×30129365ce0 mmap64 0×7fffffff 0
1207.63660000 144.74260200 16545 0×30129365ce0 write 0×800000 0
1352.39696400 17.08674700 16545 0×30129365ce0 mmap64 0×7fffffff 0
1369.49027200 143.29896300 16545 0×30129365ce0 write 0×800000 0
1512.80107700 14.10729700 16545 0×30129365ce0 mmap64 0×7fffffff 0
1526.91309500 144.05898900 16545 0×30129365ce0 write 0×800000 0
1670.98672600 16.63788500 16545 0×30129365ce0 mmap64 0×7fffffff 0
1687.63237200 141.60897100 16545 0×30129365ce0 write 0×800000 0
1829.25798500 15.96867300 16545 0×30129365ce0 mmap64 0×7fffffff 0
1845.23249900 144.46080400 16545 0×30129365ce0 write 0×800000 0
1989.70738500 15.94683000 16545 0×30129365ce0 mmap64 0×7fffffff 0
<snip>
Now, that’s interesting, in that we see a difference in the time taken for write(), so what’s happening in the write() lifecycle?
<not working>
48545.230384 0.006560 12045 1 0×30101ab09a0 224 strategy device: 0×11100000faf block: 39690624 size: 0×10000 buf: 0×3004185e058 flags: 558353 0,0
48545.277271 0.000080 12045 1 0×30101ab09a0 224 strategy device: 0×1240000007a block: 39692928 size: 0×10000 buf: 0×312eec85688 flags: 1289 0,0
48545.416170 0.000800 12045 1 0×30101ab09a0 224 strategy device: 0×11100000faf block: 39690752 size: 0×10000 buf: 0×327286721b0 flags: 558353 0,0
48545.454655 0.002640 12045 1 0×30101ab09a0 224 strategy device: 0×1240000007a block: 39693056 size: 0×10000 buf: 0×506b36947c8 flags: 1289 0,0
48545.544348 0.008562 12045 1 0×30101ab09a0 224 strategy device: 0×11100000faf block: 39690880 size: 0×10000 buf: 0×3004185f458 flags: 558353 0,0
48545.580753 0.012962 12045 1 0×30101ab09a0 224 strategy device: 0×1240000007a block: 39693184 size: 0×10000 buf: 0×312ead7be08 flags: 1289 0,0
48545.642601 0.004721 12045 1 0×30101ab09a0 224 strategy device: 0×11100000faf block: 39691008 size: 0×10000 buf: 0×30070b4af30 flags: 558353 0,0
48545.676366 0.003361 12045 1 0×30101ab09a0 224 strategy device: 0×1240000007a block: 39693312 size: 0×10000 buf: 0×3006ca9bb20 flags: 1289 0,0
48545.735814 0.003681 12045 1 0×30101ab09a0 224 strategy device: 0×11100000faf block: 39691136 size: 0×10000 buf: 0×327cd4835c8 flags: 558353 0,0
<snip>
So our I/O appears to be in 65k chunks but how long did 1 I/O actually take?
<
br />48545.230384 - 12045 1 0×30101ab09a0 224 strategy device: 273,4015 block: 39690624 size: 0×10000 buf: 0×3004185e058 flags: 0×88511
48549.243496 - 0 0 0×2a100477d20 64 biodone device: 273,4015 block: 39690624 buf: 0×3004185e058
4.013112ms
<working>
236.516247 0.000720 16545 1 0×30129365ce0 65 syscall_start sysnum: 4 write
236.972870 0.004481 16545 1 0×30129365ce0 65 address_fault address: 0xfe800000 fault_type: 2 access: 1
238.215601 0.068810 16545 1 0×30129365ce0 65 strategy device: 0×11100000fa9 block: 24838144 size: 0×100000 buf: 0×300359fa3a8 flags: 289 0,0
238.247765 0.002240 16545 1 0×30129365ce0 65 strategy device: 0×12400000022 block: 24840448 size: 0×100000 buf: 0×30038cd24a0 flags: 297 0,0
238.461314 0.002240 16545 1 0×30129365ce0 65 thread_block reason: 0×300359fa468 stack: 4 0×114190801048bb0 0×10f28ec01080b30 0×78a4d834789aa770 0×78
a7dd6078a798b0
<snip>
238.215601 - 16545 1 0×30129365ce0 65 strategy device: 273,4009 block: 24838144 size: 0×100000 buf: 0×300359fa3a8 flags: 0×121
254.402304 - 0 0 0×2a100471d20 64 biodone device: 273,4009 block: 24838144 buf: 0×300359fa3a8
16.186703ms
So time for 8m in 1m chunks:
16.18 * 8 = 129.493624
and 8m in 65k chunks:
4.013112 * 128 = 513.678336 (it’s going to be less anyway due to fact that we can queue iops and be completed in parallel)
So it would seem that the difference in time can be explained by the way that the I/O is broken up on each filesystem. Perhaps the way in which the filesystem has been configured can explain the difference? Unforuntately, I didn’t see differences between the filesystems themselves but some of the configuration details did match some of our physical write sizes:
Filesystem i/o parameters for /filesystem
read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 1
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 8130396160
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0
Now our write pattern ties in for our non-working cp in that our prefered write size is 65536 (as above) but for the working one we’re using 1048576 which could be due to discovered direct I/O. Now write() is trying to complete a size of 8388608 (8mb) which we can see from the truss:
29986: 0.0146 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0×07000000) = 0xFE800000
29986: 0.2394 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8388608) = 8388608
29986: 0.0107 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0×07800000) = 0xFE800000
29986: 0.2599 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8388608) = 8388608
so why is our I/O being broken into smaller chunks for one filesystem and not on the other? Ah, what if one of the filesystems was fragmented more than the other? Yes, that could explain the difference if VxFS internally make the choice of using one other the other depending on the state of the filesystem. It’s interesting reading the VxFS administration guide as it clearly documents the possible performance implications due to fragmentation (yes, RTFM is applicable here):
Fragmentation reduces performance and availability. Regular use of fsadm’s fragmentation reporting and reorganization facilities is therefore advisable. The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs using the cron command. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the df o s commands. There are three factors which can be used to determine the degree of fragmentation:
* Percentage of free space in extents of less than eight blocks in length
* Percentage of free space in extents of less than 64 blocks in length
* Percentage of free space in extents of length 64 blocks or greater
An unfragmented file system will have the following characteristics:
* Less than 1 percent of free space in extents of less than eight blocks in length
* Less than 5 percent of free space in extents of less than 64 blocks in length
* More than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks
A badly fragmented file system will have one or more of the following characteristics:
* Greater than 5 percent of free space in extents of less than 8 blocks in length
* More than 50 percent of free space in extents of less than 64 blocks in length
* Less than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks
The optimal period for scheduling of extent reorganization runs can be determined by choosing a reasonable interval, scheduling fsadm runs at the initial interval, and running the extent fragmentation report feature of fsadm before and after the reorganization. The “before” result is the degree of fragmentation prior to the reorganization. If the degree of fragmentation is approaching the figures for bad fragmentation, reduce the interval between fsadm runs. If the degree of fragmentation is low, increase the interval between fsadm runs. The “after” result is an indication of how well the reorganizer has performed. The degree of fragmentation should be close to the characteristics of an unfragmented file system. If not, it may be a good idea to resize the file system; full file systems tend to fragment and are difficult to defragment. It is also possible that the reorganization is not being performed at a time during which the file system in question is relatively idle.
—
Conclusion
As in all performance related issues, you really need to understand the lifecycle of what you’re interested it so you can ideally observe what’s happening or not happening! In this case I used mainly truss and TNF to drill down to see whats happening. However what should happen way before that is to actually understand what the impact is otherwise who actually cares if there is a difference if it’s not causing a problem. In this case it was the DBA’s who noticed the difference but in all honesty it wasn’t actually causing a business impact!
Typical! Still, it’s been another learning curve in using TNF in the odd case where Dtrace (and in this case scsi.d) might have also helped if the system was running Solaris 10.
If you are using VxFS then you might want to look at this script to automatically check if a filesystem is fragmented or not. If it is then it’ll automatically kick off a defrag…….. aren’t automatic scripts great!