We can experience poor performance in our systems for many reasons. Sometimes there is an overloaded resource, such as memory, network bandwidth, or CPU cycles. Sometimes an approach built into the software doesn’t scale: performance seems fine until the load on the system gets high, and then we suddenly fall off a performance cliff. Sometimes we have built an inefficient implementation that introduces unnecessary overhead, such as copying a piece of data many times or making many calls to recursive functions that each perform only a tiny amount of real work.
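For a concrete sense of the "unnecessary copying" kind of inefficiency, here is a minimal sketch; the function names and the list-building workload are illustrative, not taken from any particular system. Both routines produce the same result, but the first copies the entire accumulated list on every step, so its cost grows quadratically with the input size rather than linearly.

```python
import timeit

def build_slow(n):
    result = []
    for i in range(n):
        result = result + [i]   # allocates a new list and copies everything so far
    return result

def build_fast(n):
    result = []
    for i in range(n):
        result.append(i)        # amortized constant time, no full copy
    return result

if __name__ == "__main__":
    for n in (1_000, 5_000, 25_000):
        slow = timeit.timeit(lambda: build_slow(n), number=1)
        fast = timeit.timeit(lambda: build_fast(n), number=1)
        print(f"n={n:>6}: slow={slow:.4f}s  fast={fast:.4f}s")
```

Run on a typical machine, the slow version's time balloons as n grows while the fast version stays modest, even though both loops do the "same" work.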
The problem you have will affect how you go about looking for it. If you run performance measurements to uncover a scaling issue, a test that runs only a small number of iterations, or that works on a small version of the problem, may not produce useful results. If your problem is a bottleneck in your network, running tests that don’t send messages across the network will never find it. If your problem is contention in scheduling, a performance experiment embedded in a single process won’t provide much insight.

There’s an obvious chicken-and-egg problem here. You might know performance is bad, but to design an experiment that determines exactly why, you already need some idea of why performance is bad. Otherwise, you might waste your time running an irrelevant experiment that tells you nothing. So how do you get started? In practice, you can often get some clues without running any new experiments. The operating system will tell you, on request, how much memory is in use, how heavily the CPU is utilized, and many other statistics concerning the behaviour of your processes. If there’s plenty of free memory and the system is still running poorly, you’d probably be wasting your time building a performance experiment that investigates the effects of varying memory usage. If there are rarely any ready processes waiting to run, scheduling is very likely not the source of your problem. Knowledge of the general architecture and expected behaviour of the poorly performing system component can help as well. If you know that the software that is running slowly always works on pretty much the same quantity of data, it’s probably not a scaling problem.
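As a sketch of that first, cheap look at the numbers, the snippet below queries memory usage, CPU utilization, and the load average (a rough proxy for how many runnable processes are waiting). It assumes the third-party psutil package is installed; the same figures are available from tools such as top, vmstat, or /proc on Linux.

```python
import psutil

# Is memory actually scarce?
mem = psutil.virtual_memory()
print(f"memory used: {mem.percent:.1f}% ({mem.available / 2**30:.1f} GiB available)")

# Is the CPU saturated? Sample utilization over a one-second window.
print(f"CPU utilization: {psutil.cpu_percent(interval=1.0):.1f}% "
      f"across {psutil.cpu_count()} logical CPUs")

# Are many runnable processes waiting? If the load average stays well
# below the CPU count, scheduling contention is an unlikely culprit.
one, five, fifteen = psutil.getloadavg()
print(f"load average (1/5/15 min): {one:.2f} / {five:.2f} / {fifteen:.2f}")
```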
But usually these kinds of hints will only get you so far, and often they will provide indications rather than actually identifying the source of the problem. What then? Take the best knowledge you can easily obtain about your code and the observed performance problem and generate a hypothesis about why it’s happening. Design an experiment that will test that hypothesis, proving or disproving it. Run the experiment and determine whether your hypothesis is borne out. If not, generate a new hypothesis (with, one would hope, a deeper knowledge base to work from than before) and try again. Obviously, there are elements of art, experience, and even luck in this process. But you’ve seen this kind of process before. It’s much like finding a bug in a program, where you observe the erroneous behaviour, form a hypothesis about its cause, apply fixes or extract new information relevant to the hypothesis, and test it, repeating until the bug is found and repaired. Generally, finding performance problems is harder than finding bugs, since it’s harder to narrow the field in which you’re searching, but the basic approach is similar.
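As an illustration of turning a hypothesis into an experiment, the sketch below tests the hypothetical claim "this routine scales worse than linearly with input size." The under_test function is a made-up quadratic stand-in, not code from any real system, so the experiment has something to find.

```python
import time

def under_test(items):
    # Hypothetical workload: count duplicate pairs the slow way, O(n^2).
    return sum(1 for i, a in enumerate(items)
                 for b in items[i + 1:] if a == b)

def measure(n):
    data = list(range(n))
    start = time.perf_counter()
    under_test(data)
    return time.perf_counter() - start

if __name__ == "__main__":
    previous = None
    for n in (1_000, 2_000, 4_000, 8_000):
        elapsed = measure(n)
        # Doubling n should roughly double a linear routine's time;
        # ratios near 4 support the "worse than linear" hypothesis.
        note = f"x{elapsed / previous:.1f} vs previous run" if previous else "baseline"
        print(f"n={n:>5}: {elapsed:.4f}s  ({note})")
        previous = elapsed
```

An experiment like this rarely settles the question on its own, but it turns a hunch into something you can confirm or refute with numbers.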
Like finding functionality bugs, finding performance problems is a skill you are likely to develop with practice. You will come to learn the signals that point towards particular classes of performance problems and develop instincts that lead you in the right direction more often than the wrong one. However, never mistake your experience or a good hunch for the results of an actual measurement program. Ultimately, the point of performance measurement is to reveal the actual truth of what’s happening, and nothing short of genuine measurement provides that evidence.