What's the answer to life the universe and everything?

"42"

For those of you who have read or listened to The Hitchhiker’s
Guide to the Galaxy, the above question and answer will have more
meaning than for those of you who haven’t. Essentially, how can
you have a literal answer to such an undefined question? On an
allegorical level, it suggests that it is more important to ask the
right questions than to seek definite answers.

I sometimes think of just saying "42" to the question
"What’s the answer to our performance problem?", which usually
arrives with some kind of data, either in the form of GUDS (a script
which collects a whole bunch of Solaris OS output) or some other
spreadsheet or application output. This data usually comes with no
context, or with nothing more than "the customer has a performance
problem", which of course makes things slightly difficult for us to
answer unless the customer will accept "42".

Investigating performance related issues is usually very time
consuming because of the difficulty in defining the problem. So it
stands to reason that it’s a good idea to approach these types of
problems in a structured way. Sun has for a number of years been
using an effective troubleshooting process by Kepner-Tregoe, which
defines a problem as follows:

"Something has deviated from the normal (what you should
expect) for which you don’t know the reason and would like to know
the reason"

Still don’t get it? Well, what if you’re driving, walking,
running, hopping (you get my point) from point A to B and have
somehow ended up at X21, and you don’t know why? You’d probably want
to know why, and thus you’d have a problem: you were expecting to end
up at point B but have ended up at point X21.

Ok, so how does this relate to resolving performance issues then?
Well, in order for Sun engineers to progress performance related
issues within the services organization we need to understand the
problem, the concerns around it and how that fits into the bigger
picture. By this I mean looking at an entire application
infrastructure (top down approach) rather than examining specific
system or application statistics (bottom up approach). This can then
help us identify a possible bottleneck or specific area of interest,
on which we can use any number of OS or application tools to focus in
and identify root cause.

So perhaps we should start by informing people what performance
engineers CAN do:

1/ We can make "observations" from statically collected data
or via an interactive window into a customer’s system (Shared Shell).
Yes, that doesn’t mean we can provide root cause from this, only
comment on what we see. Observations mean NOTHING without context.

2/ We can make suggestions based on the above information, which
might progress to further data collection, but again these mean
NOTHING without context.

Wow, that’s not much, is it… so what CAN’T we do?

1/ We can’t mind read – Sorry, we can’t possibly understand your
concerns, application, business or users without you providing USEFUL
information. So what is useful information? Well, answers to these
might help get the ball rolling:

* What tells you that you have a performance issue on your system?
e.g. users complaining that the "XYZ" application is taking
longer than expected to return data/reports, a batch job taking
longer to complete, etc.

* When did this issue start happening? This should be the exact
date & time the problem started or was first noticed.

* When have you noticed the issue since? Again the exact date(s)
and time(s).

* How long should you expect the job/application to take to
run/complete? This needs to be based on previous runs or on what was
expected when the system was specified.

* What other systems also run the job/application but aren’t
affected?

* Supply an architecture diagram if applicable, describing how the
application interfaces into the system, e.g.

user -> application X on client -webquery-> application server -sqlquery-> Oracle database backend server

2/ We can’t rub a bottle and get the answer from a genie, nor wave
a magic wand for the answer – Yes, again, it’s not as simple as
supplying a couple of OS outputs and getting an answer from us. We’ll
need to understand the "bigger" picture and make
observations before suggestions can be made.

3/ We can’t fix the problem in a split second nor can applying
pressure help speed up the process – Again we need to UNDERSTAND the
bigger picture before suggestions and action plans can be advised.
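
The questions listed under point 1/ could be treated as a simple checklist to fill in before a call is raised. What follows is only a minimal sketch of that idea in Python: the `ProblemStatement` class, its field names and the `missing_answers` helper are all hypothetical names invented for illustration, not part of GUDS or any Sun tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemStatement:
    """Hypothetical checklist mirroring the questions above (one field each)."""
    symptom: Optional[str] = None             # what tells you there is an issue?
    first_seen: Optional[str] = None          # exact date and time it started
    seen_since: Optional[str] = None          # later dates and times it was seen
    expected_duration: Optional[str] = None   # how long the job should take
    unaffected_systems: Optional[str] = None  # systems running it without issue
    architecture: Optional[str] = None        # e.g. "user -> app -> database"


def missing_answers(statement: ProblemStatement) -> List[str]:
    """Return the names of checklist items that are empty or blank."""
    missing = []
    for f in fields(statement):
        value = getattr(statement, f.name)
        if value is None or not value.strip():
            missing.append(f.name)
    return missing


if __name__ == "__main__":
    # The typical starting point: a symptom and nothing else.
    report = ProblemStatement(symptom="XYZ report takes longer than expected")
    print("Still needed:", ", ".join(missing_answers(report)))
```

Anything the helper reports as missing is context we would have to go back and ask for before any observation becomes useful.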

So what kind of data can we collect to observe?

Probably one of the quickest ways of allowing us to observe is via
Shared Shell. This gives us a direct view onto a system and lets us
see what the customer actually sees. Again, we’ll need to discuss
with the customer what we’re looking at and UNDERSTAND the
"bigger" picture to make suggestions or action plans moving forward.
If Shared Shell isn’t available then we’ll need to collect GUDS data,
usually in extended mode. This collects various Solaris outputs at
various time snapshots which we can view offline; however, we do need
baseline data along with bad data to make any useful observations.
Yes, one snapshot isn’t much help, as high values could be normal!
Just because you see high userland utilization, it doesn’t
necessarily mean it’s bad or shows a performance problem. It could
just be the system being utilized well, processing those "funny"
accounting beans for the business. Again, and I’ve said this a few
times… data is USELESS without CONTEXT.
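
To illustrate why baseline data matters, here is a rough Python sketch that compares samples from a "good" period with samples from a "bad" period and reports only the metrics that actually moved. It assumes the numbers have already been pulled out of whatever was collected (that parsing is not shown), and the function name, the metric names and the 25% threshold are all assumptions made up for this example.

```python
from statistics import mean
from typing import Dict, List


def compare_to_baseline(
    baseline: Dict[str, List[float]],
    problem: Dict[str, List[float]],
    threshold: float = 0.25,  # assumed cut-off: 25% change from the baseline mean
) -> Dict[str, float]:
    """Return metrics whose mean changed by more than `threshold`,
    expressed as a fraction of the baseline mean."""
    deviations = {}
    for name, base_samples in baseline.items():
        bad_samples = problem.get(name)
        if not base_samples or not bad_samples:
            continue  # no baseline or no bad data: nothing to compare
        base_avg = mean(base_samples)
        if base_avg == 0:
            continue  # relative change is undefined against a zero baseline
        change = (mean(bad_samples) - base_avg) / base_avg
        if abs(change) > threshold:
            deviations[name] = change
    return deviations


if __name__ == "__main__":
    # Hypothetical samples: user CPU is high in BOTH periods, so it is normal
    # for this system; only the run queue has deviated from the baseline.
    good = {"user_cpu_pct": [88, 90, 92], "run_queue": [2, 2, 2]}
    bad = {"user_cpu_pct": [89, 91, 90], "run_queue": [8, 10, 12]}
    print(compare_to_baseline(good, bad))
```

With a single snapshot the high user CPU figure would have looked like the culprit; against the baseline it turns out to be business as usual, which is the point about context.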

If Oracle is involved then
you could get the Oracle DBA to provide statspack
data or AWR reports for when you see the problem and when you don’t
as that might give an indication of Oracle being a bottleneck in the
application environment.

Other application vendors might have similar statistic generating
reports which show what they are waiting for which might help
identify a potential bottleneck.

The "Grey" area

The grey area is a term used by many for an issue which breaks the
mold of conventional break-fix issues and starts entering the
performance tuning arena. Break-fix usually means that something is
clearly broken, such as a customer hitting a bug in Solaris, or a
system which has crashed or needs to be rebuilt and requires Sun’s
assistance and expertise to resolve. Performance tuning usually
happens because, for example, a customer’s business has expanded and
their application architecture can’t cope with the growth. It’s a
little difficult to gauge when a situation starts to go down that
path, as most application architectures are very complex and involve
lots of vendors. I also happen to work in the VOSJEC (Veritas Oracle
Sun Joint Escalation Centre) and deal with quite a few
interoperability issues, so I know things can get pretty complex when
trying to find the problematic area of interest. For some reason some
people call this the blame game or finger pointing, terms which I
personally hate to use. In fact, I’d rather it be a Sun issue from my
perspective, as we can then take the necessary action in raising bugs
and getting engineering involved to provide a fix and ultimately
resolve the customer’s issue. Thankfully my Symantec and Oracle
counterparts also take this approach, which makes problem resolution
a little easier.

Conclusion

I think the real point of this is that you should really grasp a
problem before asking for assistance. If you understand the problem,
then your colleagues understand the problem and, more importantly,
we (Sun) understand the problem, and that’s half the battle. The
rest is so much easier… 🙂

2 Replies to “What's the answer to life the universe and everything?”

  1. Andy – this is an excellent blog entry and ought to be MANDATORY reading for all of Services (Ian White – please note!!).

    SGR/ATS is really the right way to go in the early stages of any call, if the engineer does not understand the problem and cannot find a solution. We "few" have to keep banging the SGR drum. keep up the good work.

    🙂

  2. Awesome! Thanks for putting this out. I rarely see problems that are well defined… and spend most of the time getting a good definition before I can begin to help.

    I would also add that people need to avoid stating problems using system statistics. I don’t know how many times I have had problems defined as user cpu% is too high… or load avg is too high 🙂