Observability and Diagnosis Techniques are the way forward

I’ve just closed down another escalation involving a performance
comparison of an application (Finacle in this case) running on a 6-board
SunFire E6900 and a 6-board SunFire E25K. Obviously we’re not comparing
apples and pears here, but nevertheless the customer noticed huge
performance differences during core business hours with the *same*
setup (and I use that term loosely, as nothing can entirely occupy the
same physical space in time or be made up of identical
particles… you know what I mean… although perhaps some clever physicists can argue against that?!). Anyhow, the customer made the
argument that they were using the same application binaries on both
platforms and delivering the same user load to both (and they still
couldn’t confirm that!), yet were seeing a huge performance degradation
on the E25K domain: slow interactive response from the users’
perspective and high load. That didn’t actually surprise me, as the
E25K is a larger system and will have larger memory latencies due to its
larger interconnect compared to the E6900.

What we observed was multiple application processes consuming userland
CPU resources and no userland lock contention, via the prstat microstate
accounting output. If we’d suspected a scalability issue with the
application then we’d more than likely have observed userland lock
contention. One of the frequent flyers we see in the performance team
is caused by applications using malloc in multi-threaded SMP
environments; see this interesting read for further details.
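
For reference, the quick check for that sort of contention is prstat’s
microstate accounting view, followed by plockstat on a suspect process
if the numbers look interesting. A minimal sketch (the <pid> is just a
placeholder):

# prstat -mL -p <pid> 5     (per-thread microstates; a high LCK % means time blocked on userland locks)
# plockstat -A -p <pid>     (shows which user-level mutexes/rwlocks are actually being contended)

The LCK column is what told us there was no userland lock contention in
this case.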
Unfortunately that wasn’t to be the case here, so we needed to take
another direction in the observability game to demonstrate to the
customer the fundamental differences between the two platforms. The
answer was cpustat and its ability to show the difference between the
two platforms through cycles-per-instruction (CPI) calculations.

The following section from Solaris™ Performance and Tools: DTrace and
MDB Techniques for Solaris 10 and OpenSolaris explains the methodology (or read it on Safari Books if you’ve an account, or buy it from Amazon; the entire book is well worth a read).


8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following
example demonstrates the use of the cycle count and instruction count
on an UltraSPARC IIi to calculate the average number of cycles per
instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 |
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.034 0 tick 3554903403 3279712368 1.08 cpi
10.034 1 total 3554903403 3279712368 1.08 cpi

This single 10-second sample averaged 1.08 cycles per instruction.
During this test, the CPU was busy running an infinite loop program.
Since the same simple instructions are run over and over, the
instructions and data are found in the Level-1 cache, resulting in fast
instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 |
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.036 0 tick 205607856 34023849 6.04 cpi
10.036 1 total 205607856 34023849 6.04 cpi

Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

So, looking at the customer’s data:

[andharr@node25k]$ grep total cpu*
cpu2.out: 30.429 48 total 1453222115333 358247394908 4.06 cpi
cpu22.out: 30.523 48 total 1463632215285 347056691816 4.22 cpi
cpu222.out: 30.367 48 total 1395799585592 423393952271 3.30 cpi

[andharr@node6900]$ grep total cpu*
cpu1.out: 31.038 48 total 1209418147610 522125013039 2.32 cpi
cpu11.out: 30.311 48 total 1194302525311 573624473405 2.08 cpi
cpu111.out: 30.408 48 total 1105516225829 552190193006 2.00 cpi
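
As a quick sanity check (my own arithmetic on the figures above rather
than anything from the customer captures), you can average the
per-sample totals with awk; $(NF-1) picks out the CPI column regardless
of the filename prefix grep adds:

$ grep total cpu* | awk '{ s += $(NF-1); n++ } END { printf "average %.2f cpi\n", s/n }'

Run on each node, that works out at roughly 3.86 CPI for the E25K
samples against about 2.13 CPI for the E6900, consistent with the
per-sample figures.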

So the E25K is showing a higher cycles-per-instruction average than the
E6900, which does show a difference in performance between the two
systems, more than likely due to the memory latency difference on the
E25K. If we look at the raw data for the busy times, sorted by size for
the sample period, you can see some very big differences in the largest
CPI values between the systems:

[andharr@node6900]$ cat cpu1.out | awk '{print $6}' |sort -n | tail
6.30
6.37
7.14
7.16
7.26
7.35
8.17
8.27
8.36
8.91
[andharr@node6900]$ cat cpu11.out | awk '{print $6}' |sort -n | tail
6.67
6.71
6.77
6.80
7.21
7.70
7.72
8.40
9.21
11.92
[andharr@node6900]$ cat cpu111.out | awk '{print $6}' |sort -n | tail
6.26
6.39
6.65
6.93
6.99
7.25
7.81
8.65
8.81
9.32

[andharr@node25k]$ cat cpu2.out | awk '{print $6}' |sort -n |tail
26.65
26.86
26.99
28.71
29.48
30.06
30.87
32.93
34.05
34.36
[andharr@node25k]$ cat cpu22.out | awk '{print $6}' |sort -n |tail
31.35
31.82
32.32
33.16
34.03
35.00
38.51
47.69
50.19
51.04
[andharr@node25k]$ cat cpu222.out | awk '{print $6}' |sort -n |tail
26.03
26.71
26.90
27.31
27.45
28.29
28.42
29.28
32.30
35.40

Conclusion

So the fact that the E25K showed the higher CPI numbers doesn’t really
indicate a good match with the Finacle application running on it. It
would seem to suggest that Finacle generates a memory-latency-sensitive
workload which might not be entirely suited to the E25K platform.
Binding the Finacle application into specific processor sets (via
psrset/pbind) might reduce thread migration and some non-local memory
accesses, which may improve performance somewhat, but it won’t entirely
eradicate the memory latency issue. The reason it’s more noticeable on
the E25K platform than the E6900 is that there is a greater distance to
travel across the E25K interconnect, i.e. a greater length of copper for
the electrical signals to traverse. MPO (Memory Placement Optimization)
was introduced in Solaris 9 to help alleviate latency in large-scale
NUMA configurations by attempting to keep LWPs local to their "home"
CPU/memory board, but it cannot eliminate that latency for all
workloads, as in this case.
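
For what it’s worth, the processor set idea would look something like
the sketch below. It’s purely illustrative: the CPU IDs and PIDs are
hypothetical, and you’d want to choose CPUs that all live on the same
system board so the threads stay close to their local memory.

# psrset -c 32 33 34 35          (create a set from the CPUs on one board; prints the new set id, e.g. 1)
# psrset -b 1 12345 12346        (bind the running Finacle processes into set 1)
# psrset -e 1 <finacle_startup>  (or launch the application directly inside the set)
# psrset -i 1                    (confirm which CPUs the set contains)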

See the following documents for background information on MPO:


Solution 214954: Sun Fire[TM] Servers: Memory Placement Optimization (MPO)
Solution 216813: Sun Fire[TM] Servers: Memory Placement Optimization (MPO) Frequently Asked Questions (FAQ)

As my esteemed colleague Clive said,
"Applications drive system behaviour", so we need to look at the
application and how it interacts with the system rather than the other
way round. That always points me back to one of my first blog entries
on the importance of understanding the application architecture and
starting with a top-down approach. 🙂