Given that you’ve determined a good set of metrics, what next? By this point in your studies, you should be painfully aware of the complexities of large systems, particularly of operating systems. That complexity is going to have a big impact on your performance measurement studies. Consider what you already have learned about operating systems. They make heavy use of caching, with an expectation that some operations will be cache hits (probably fast) and some will be cache misses (probably slow). Interrupts will occur at unpredictable times, possibly resulting in altered performance of the code they break up. Different scheduling disciplines will insert varying delays into the performance of individual processes. Even if you have access to the operating system code, down at the hardware level there are caches and pipelines and other optimizations that you can’t see at all, except that they produce varying performance results.
All of these characteristics mean that if you measure some metric on your system once, and then repeat that measurement again, you might see very different values. So what’s the right value? The one that makes your system look best? The worst one you’ve ever seen? Something else?
If you have taken a course in statistics, you may have a pretty good idea already about how to handle this issue. Don’t measure the event of interest on your system only once. Measure it many times and treat the set of measurements as a probability distribution. You can then use the rich set of tools that this field of mathematics offers us to analyze your performance. Among the simplest of these tools are mean, median, and mode. The mean of a set of measurements is its average, the median is its middle point, and the mode is the most common value in the set. Means are useful for getting a single number that somehow captures something important about the entire data set. Medians are useful for getting a sense of where the measurements in a data set are “centered,” in some sense. Mode is typically most useful when one value occurs far more often than any other, since then it can give you an idea of what result is most probable from a given data set. Mode tends to be less useful when the measurements are pretty evenly spread out.
It’s often helpful to take a step further and work with quantities called indices of dispersion, which essentially describe how spread out a set of measurements are. Range is one such index of dispersion, describing the highest and lowest value observed. Standard deviation is another, describing the most commonly occurring range of values around the mean within the set. Confidence intervals describe the probability that a particular measurement is within a certain range.
In figure 1, we show a small set of sample data of latencies (in milliseconds) to access a disk block. Figure 2 shows various statistical properties of that data set. We calculated the mean by adding the 11 latencies and dividing by 11. We calculated the median by ordering the measurements and choosing the one in the middle. We calculated the mode by counting the number of times particular latencies occurred and choosing the one that occurred most often. To determine the range, we simply found the lowest measurement and the highest measurement. Standard deviation is calculated, typically, with a somewhat more complicated formula, but it is described in any detailed treatment of statistical properties.
These statistical properties are actually a bit more complex than they appear at first glance, and full understanding of their proper use is beyond the scope of either this chapter or an introductory course on operating systems. For example, there are actually several different ways to calculate means, and calculation of confidence intervals is typically based on assumptions about the probability distribution of the measurements. We will not further describe the many useful tools that the field of statistics offers, beyond recommending that those interested in understanding the performance of their systems really need good mastery of some of these tools.
You might wonder how many experimental runs you need to perform to fully capture the behaviour of a system you’re studying, given that there is possible statistical variation in the performance of each run. The field of statistics has useful tools for this purpose, as well, but they are beyond the scope of this chapter. In brief, the greater the degree of variability in what you’re measuring, the more independent measurements you’ll need to perform to get a pretty confident picture of its full behaviour
Comparing Alternatives
Sometimes the purpose of your performance experiment is simply to characterize how a system performs according to one or more important metrics. In other cases, the system can be built, configured, or used in several different ways, and you want to know how well it performs in those varying situations. When this kind of comparative form of performance measurement is what you need, you have to take some care in how you go about doing it..
One issue is that possibly there are a large number of different options you could compare. There may be multiple dimensions in which you could examine the system. For example, you might want to know what would happen if you increased or decreased the amount of RAM allocated to buffer spaces, or what would happen if you used several different scheduling disciplines, or what would happen if you replaced the Ext3 file system with Btrfs. You might want to know what would happen if you made several of these changes in different ways.
Things you intentionally alter in performance experiments to determine which of several alternatives to use are called factors. In the paragraph above, the amount of buffer space allocated might be one factor, the scheduling disciplined used might be a second factor, and the file system you chose might be a third factor. Factors can be set independently. For example, I might want to look at allocating 1Gbyte of buffer space, using Linux’s Completely Fair Scheduling, and Btrfs, vs. allocating .5 Gbytes of buffer space, simple round robin, and Ext3. For each choice of settings of factors you want to investigate, you’ll need to run some experiments. As mentioned earlier, you’ll probably need to perform multiple runs to capture statistical variations.
Think about this a moment. If there are four different buffer sizes I want to investigate, three different scheduling disciplines, and two file systems, and I care about all possible combinations, how many sets of experiments am I going to have to run? In this relatively limited case, we’re up to 24 sets of experiments, each of which will need to be run multiple times. One of the dangers of running performance experiments is getting too entranced with the many possibilities. If you’re not careful, you might spend the rest of your life investigating some of these issues. Or, more likely, you’ll get exhausted and give up before you get around to looking at the situations that are really important. A key element in successfully obtaining good performance results is striking a balance between investigating everything that’s very important versus avoiding a combinatorial explosion of experimental settings. Maybe, on careful consideration, looking at two buffer sizes and ignoring one of the scheduling disciplines will still tell you what you need to know
Another element controlling how much work you need to do is how many runs of each alternative you make to capture statistical variation against the number of alternatives you look at. Generally, the more you make, the higher your confidence in the statistical representativeness of your results (though there is a point of diminishing returns). Is it more important to consider more options or to have greater confidence in the performance of the options you do consider? That’s for you to determine.
One can further generalize this issue to obtain perhaps the most important piece of advice we can give you before you undertake a set of performance measurements: think first, measure second. Determine exactly what you want to know and what you need to do to learn it, then design and perform experiments targeted at that required knowledge. Otherwise, you can spend arbitrarily large amounts of times hacking your way through the performance measurement jungle, possibly emerging at the end without having learned anything of importance.
There are other important issues in comparing the performance of various alternatives. One is to treat each alternative you measure fairly. Don’t create experimental conditions that artificially favor one alternative over another. For example, in some cases things go slower the first time you do them than the second or later times. Caching often causes this effect. So if you ran alternative A first, then ran alternative B, B might appear to be faster not because it’s better, but because it benefited from caching. Merely switching the order isn’t enough, because now you’ve favored A. Running each alternative multiple times will help wash out these effects, particularly if you randomly intersperse the alternatives in different runs. (That’s not always easy, if you need to perform heavyweight reconfiguration of the system to enable different alternatives.)
That is merely one example of being fair. Using the same settings for all runs (other than those that define the alternatives themselves) is another example. Resetting the system to the same state before starting each run is a third example. Even if caching is not involved, there may be some initial work that each system might need to do. If the system is not reset, the later runs can unfairly benefit from the configuration work done by earlier runs. Isolating the system from unrelated effects (such as updates to key pieces of software, external work loads not intended to be captured by your experiment, filling up the disk drive with your own experimental results, and so on) is also important.