At the highest level, most mistakes in performance measurement can be attributed to insufficient thought by the experimenter. If you leap into a measurement program without thinking carefully about what you are going to do, it may be hard to predict which problem you will encounter, but you’re very likely to encounter one. Be clear on what issues you are investigating and what methods you are using to investigate them. You will usually benefit from writing down what you propose to do before you do it. That allows you to go back and check that you are continuing down the path you’d planned and hitting all the points you thought were important. It’s OK to alter your plans as new information arises, but do so knowingly, not because you’re blindly flailing around in a huge space of possible performance experiments.
At the next level of detail, problems tend to arise in areas like not measuring the right thing, not measuring accurately, not measuring in situations matching real world behaviour, and not understanding what your measurements are telling you. These issues are so broad, and the mistakes you can make have so many variants, often quite specific to the system you are measuring, that it is not necessarily helpful to pin down too many particular mistakes.
But there are certain more specific mistakes in measuring system performance that are sufficiently common that they are worth calling out in detail. We’ll go through a few of these.
- Measuring latency without considering utilization. Everything runs fast (or at least faster) on a lightly loaded system. Measuring the latency of an operation when absolutely nothing else is going on in the system is only worthwhile if the question to be answered is the fastest possible time in which the operation can complete. For anything else, one should measure the latency when the system has a characteristic background load. In most cases, one should also examine the latency when the system is heavily loaded, since that condition is likely to arise sooner or later in most systems.
- Not reporting the variability of measurements. Sometimes this mistake is even more egregious: a quantity is measured only once and that value is reported as the entire truth of the performance. Even if multiple measurements are taken, however, merely reporting the average of the values will often give a false impression of the performance observed. For most phenomena, one needs to understand the distribution of those values. Is it basically bi-modal? Is there one very common value and some outliers? Are the values uniformly spread across some range? What behaviour you will observe in the real world, and whether you will be happy with your system or miserable, may depend on the answers to those questions, so a good performance experiment should offer you some insight into them.
- Ignoring important special cases. This mistake comes in two varieties. In one, you ignore the fact that a few special cases distort the measurement, giving you a false sense of what happens in the more general case. In the other, while you carefully measure the ordinary case, you fail to consider that there will be some special circumstances that are very important and that are likely to display different performance.
Perhaps the most common version of the first variety is ignoring start-up effects. Computers make effective use of caching in many different ways. Programs loaded off disk may hang around in memory for a while in case they will be run again. Translations of DNS names to IP addresses are stored to avoid having to make expensive network requests multiple times. Hardware caches recently run instructions to avoid the cost of fetching them out of RAM when executing a loop. Caching is so ubiquitous and built into so many levels of a system that you are unlikely to predict all of its uses. That means you should regard the first few runs of a performance experiment as being potentially biased. They may have paid higher penalties than subsequent runs in order to warm up some caches. That does not necessarily mean you should discard them or disregard them, since, after all, every cache in a real system pays a performance penalty the first time the data is used, and that is a real element of system performance. But it does mean you should not compare different alternatives when one alternative has had the benefit of a warm cache while the other has not.
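One way to keep start-up effects visible, rather than averaging them away, is to record every run separately and report the first run apart from the later ones. The Python sketch below assumes a hypothetical workload, `lookup`, whose first call pays a simulated cache-fill cost. The function names, the dictionary cache, and the 20 ms delay are all illustrative assumptions, not taken from any real system.

```python
import time

_cache = {}  # hypothetical cache, standing in for any warm-up state

def lookup(key):
    """Hypothetical workload: slow on a cache miss, fast on a hit."""
    if key not in _cache:
        time.sleep(0.02)            # simulated cost of filling the cache
        _cache[key] = key.upper()
    return _cache[key]

def time_runs(func, arg, runs):
    """Return the elapsed time of each run, in order, so run 1 stays visible."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return timings

timings = time_runs(lookup, "example.org", 5)
cold, warm = timings[0], timings[1:]
warm_ms = ", ".join(f"{t * 1e3:.3f}" for t in warm)
print(f"cold run:  {cold * 1e3:.3f} ms")
print(f"warm runs: {warm_ms} ms")
```

Keeping the per-run list, instead of a single total, lets you decide afterwards whether the cold run belongs in the comparison, and makes it obvious when two alternatives were measured with caches in different states.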
A similar problem can work in the opposite way. Sometimes we have a data structure of a limited size, and as long as we are working within that size, things go quickly. When we have more elements than the data structure can hold, performance degrades. For example, consider a hash table that uses chaining to handle collisions. If the table is relatively empty, every read will hit the element it was looking for immediately, and performance will be fast. When the table starts to fill up, some probes will need to follow a chain of entries to find the one they are looking for in that hash bucket; performance will slow down in some cases, while remaining fast for others. If the table is very full, performance will slow down for almost everything, since most probes will require searching a chain. File systems that use various kinds of indirect blocks are another example. Accesses to the first few blocks will avoid the indirect block and will be fast. Accesses further into a file will require indirect, doubly indirect, or triply indirect access, and will be slower, depending on the access pattern. Again, these are genuine performance effects, but they are relevant only if you are measuring performance under conditions where they might occur.
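The hash table effect can be seen without a stopwatch by counting how many chain entries each lookup examines. The following Python sketch is a deliberately minimal chained table with a fixed number of buckets. The class name, the bucket count of 64, the key counts, and the seeded random keys are assumptions chosen for illustration.

```python
import random

class ChainedHashTable:
    """Minimal hash table with separate chaining and a fixed bucket count."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # replace an existing entry
                return
        chain.append((key, value))

    def probes(self, key):
        """Number of chain entries examined before key is found."""
        chain = self.buckets[hash(key) % len(self.buckets)]
        for count, (k, _) in enumerate(chain, start=1):
            if k == key:
                return count
        raise KeyError(key)

def probe_stats(num_buckets, num_keys, seed=0):
    """Fill a table with distinct random keys; return (mean, worst) probes."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), num_keys)
    table = ChainedHashTable(num_buckets)
    for key in keys:
        table.insert(key, None)
    counts = [table.probes(key) for key in keys]
    return sum(counts) / len(counts), max(counts)

for n in (16, 64, 256, 1024):
    mean, worst = probe_stats(64, n)
    print(f"{n:5d} keys in 64 buckets: mean probes {mean:.2f}, worst {worst}")
```

Reporting the worst case alongside the mean matters here: at moderate load the mean stays close to 1 while a few lookups already walk a chain, which is exactly the "slow in some cases, fast in others" behaviour described above.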
The other variant is also important, because sometimes these genuine effects are critical to what you need to measure. If you only measure a file system’s performance on short files, you may never learn that it is very slow once files exceed a certain size until, in production use, your system suddenly slows to a crawl. Special cases can be very important. Sometimes what you really need to know, for example, is how long servicing a web request will take under the worst circumstances likely to arise, or how your system will behave under extremely high load, or what will happen if a piece of hardware experiences partial failure. This issue returns to the point of understanding what you are measuring and why you are measuring it.
- Ignoring the costs of your measurement program. In a few cases, you may be measuring a system using tools that are entirely external to the system and impose little or no load on that system. Sniffing traffic on a network is one example. More commonly, especially for operating system measurement, you are using your system to measure your system. You’re not only sharing the processor, memory, network, and secondary storage devices with the system under study, but you’re sharing some of the abstractions the operating system offers. For example, if you are logging information from your measurement code for later examination, you are probably exercising the file system. Does that matter? If what you’re measuring is file system performance, it almost certainly does, and it might even if you are measuring something that does not have any obvious relationship to the file system, such as the scheduler or the memory manager. The file system is obviously not the only example. If you are running a separate process to perform your measurements, for example, it is competing for CPU and memory with the processes you are trying to measure.
- Losing your data. Never throw away experimental data, even if you think that you are finished with your experiment, nor even if you think the data in question was gathered in an erroneous way. Data has a way of proving useful for many purposes, but discarded data is never useful. Of course, you should particularly avoid carelessly losing data. One common beginner mistake is to inadvertently overwrite the data from a previous experiment with data being gathered for the next experiment. Also remember to label your data. Even if you have kept every byte of data you’ve gathered, if you can’t tell which bytes are related to which parts of your experiment, the data is as good as lost. This advice is for the long term. Ideally, you should be able to go back and look at data you gathered twenty years ago. Maybe you will never look at a lot of the data you gathered in the past again, but probably you will eventually want to look at some of it, and it’s hard to predict what’s going to prove useful in the future. So save it all, if possible. It’s also important to keep the metadata around, which in this case means information about how you set up and ran your experiments. Which version of the operating system was it that you used on that experiment you ran five years ago? Chances are you won’t remember, so make sure it’s written down somewhere you can find.
- Valuing numbers over wisdom. Remember, the point of your performance experiment is not to obtain a set of numbers. It’s to understand important performance characteristics of your system. The numbers are the means to an end, not themselves the end. Don’t bother gathering numbers that are not going to lead to wisdom, and don’t consider your task complete when you have the numbers in hand. You actually have the most important step still to go: using the numbers to understand your system performance and, if necessary, using them to guide redesign or reconfiguration of your system in ways that are likely to lead to better performance. Unfortunately, it’s hard to offer general advice on how to extract wisdom from sets of numbers. That’s a task you will need to perform on a case-by-case basis. But do remember that performing that task is the goal, the entire point of running an elaborate performance experiment. Without the resulting wisdom, the work you did to get the numbers will be wasted.
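Two of the mechanical points above, reporting the distribution rather than a single number and keeping labelled data together with its metadata, can be combined in a small harness. The Python sketch below is one possible shape for such a harness, not a recommended tool. The function names, the JSON layout, the file naming scheme, and the choice of percentiles are all assumptions made for illustration.

```python
import json
import platform
import statistics
import time
from datetime import datetime, timezone
from pathlib import Path

def measure(func, runs):
    """Time func() `runs` times; return every sample in seconds, unaveraged."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples):
    """Describe the distribution, not just its centre (needs >= 2 samples)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "count": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "p95": cuts[94],               # 95th percentile
        "p99": cuts[98],               # 99th percentile
        "max": max(samples),
        "stdev": statistics.stdev(samples),
    }

def save_results(directory, label, samples, notes):
    """Write raw samples plus metadata to a new file; never overwrite."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = directory / f"{label}-{stamp}.json"
    record = {
        "label": label,
        "recorded_at_utc": stamp,
        "os": platform.platform(),
        "python": platform.python_version(),
        "notes": notes,
        "summary": summarize(samples),
        "samples_seconds": samples,    # keep the raw data, not only the summary
    }
    with open(path, "x") as f:         # mode "x" fails if the file already exists
        json.dump(record, f, indent=2)
    return path
```

A call such as `save_results("results", "sum-10k-idle", measure(work, 200), notes="no background load")`, where `work` is whatever operation you are timing, produces a timestamped file per experiment. The harness only gathers and preserves numbers; deciding what they say about the system is still the experimenter's job.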