Author: Neftaly Malatjie

  • 114053 LG 1.10 Common Mistakes in Performance Measurements

    At the highest level, most mistakes in performance measurements can be attributed to insufficient thought by the experimenter. If you leap into a measurement program without giving some careful thought to what you are going to do, it may be hard to predict exactly which problem you are going to encounter, but you are very likely to encounter one. Be clear on what issues you are investigating and what methods you are using to investigate them. Most frequently, you will benefit from writing down what you propose to do before you do it. That allows you to go back and check that you are continuing down the path you’d planned and hitting all the points you thought were important. It’s OK to alter your plans as new information arises, but do so knowingly, not because you’re blindly flailing around in a huge space of possible performance experiments.

    At the next level of detail, problems tend to arise in areas like not measuring the right thing, not measuring accurately, not measuring in situations that match real-world behaviour, and not understanding what your measurements are telling you. These issues are so broad, and the variants of the mistakes you can make so numerous and so often specific to the system you are measuring, that it is not especially helpful to try to pin down every particular mistake.

    But there are certain more specific mistakes in measuring system performance that are sufficiently common that they are worth calling out in detail. We’ll go through a few of these.

    1. Measuring latency without considering utilization. Everything runs fast (or at least faster) on a lightly loaded system. Measuring the latency of an operation when absolutely nothing else is going on in the system is only worthwhile if the question to be answered is the fastest possible time in which that operation can complete. For anything else, one should measure the latency when the system has a characteristic background load. Most often, one should also examine the latency when the system is heavily loaded, since that condition is likely to arise sooner or later in most systems.
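
    As a concrete illustration, the sketch below times the same (placeholder) operation twice: once on an otherwise idle machine, and once while hypothetical CPU-bound background workers are running. The operation and the style of background load are stand-ins; a real experiment would use a background load characteristic of the system under study.

    ```python
    # A minimal sketch, assuming a placeholder operation() and a crude CPU-bound
    # background load. Both are illustrations, not the author's method.
    import multiprocessing
    import time

    def operation():
        # Placeholder for the operation whose latency we care about.
        sum(i * i for i in range(100_000))

    def busy_worker(stop_event):
        # Keeps one CPU busy until told to stop.
        while not stop_event.is_set():
            sum(i * i for i in range(10_000))

    def measure(runs=50):
        start = time.perf_counter()
        for _ in range(runs):
            operation()
        return (time.perf_counter() - start) / runs

    if __name__ == "__main__":
        idle_latency = measure()

        stop = multiprocessing.Event()
        workers = [multiprocessing.Process(target=busy_worker, args=(stop,))
                   for _ in range(multiprocessing.cpu_count())]
        for w in workers:
            w.start()
        loaded_latency = measure()
        stop.set()
        for w in workers:
            w.join()

        print(f"idle:   {idle_latency * 1e3:.3f} ms per operation")
        print(f"loaded: {loaded_latency * 1e3:.3f} ms per operation")
    ```
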
    2. Not reporting the variability of measurements. Sometimes this mistake is even more egregious, when a quantity is measured only once and that single value is reported as the entire truth of the performance. Even if multiple measurements are taken, however, merely reporting the average of the values will often give a false impression of the performance observed. For most phenomena, one needs to understand the distribution of those values. Is it basically bimodal? Is there one very common value and some outliers? Are the values uniformly spread across some range? What behaviour you will observe in the real world, and whether you will be happy with your system or miserable, may depend on the answers to those questions, so a good performance experiment should offer you some insight into them.
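
    As a minimal sketch of the kind of summary that is more informative than a bare average, the following fragment (standard library only, with made-up sample values standing in for real measurements) reports the median, spread, and 95th percentile alongside the mean; for the bimodal samples shown, the mean alone would be quite misleading.

    ```python
    # Summarize repeated latency measurements with more than just an average.
    # The sample values are illustrative stand-ins for real measurements.
    import statistics

    samples_ms = [1.9, 2.1, 2.0, 2.2, 2.1, 15.7, 2.0, 2.3, 16.1, 2.1]

    mean = statistics.mean(samples_ms)
    median = statistics.median(samples_ms)
    stdev = statistics.stdev(samples_ms)
    # quantiles(n=100) returns the 1st..99th percentile cut points; index 94 is the 95th.
    p95 = statistics.quantiles(samples_ms, n=100)[94]

    print(f"mean   {mean:.2f} ms")
    print(f"median {median:.2f} ms")
    print(f"stdev  {stdev:.2f} ms")
    print(f"p95    {p95:.2f} ms")
    ```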

    3. Ignoring important special cases. This mistake comes in two varieties. In one, you ignore the fact that a few special cases distort the measurement, giving you a false sense of what happens in the more general case. In the other, while you carefully measure the ordinary case, you fail to consider that there will be some special circumstances that are very important and that are likely to display different performance.

    Perhaps the most common version of the first variety is ignoring start-up effects. Computers make effective use of caching in many different ways. Programs loaded off disk may hang around in memory for a while in case they will be run again. Translations of DNS names to IP addresses are stored to avoid having to make expensive network requests multiple times. Hardware caches recently run instructions to avoid the cost of fetching them out of RAM when executing a loop. Caching is so ubiquitous and built into so many levels of a system that you are unlikely to predict all of its uses. That means you should regard the first few runs of a performance experiment as being potentially biased. They may have paid higher penalties than subsequent runs in order to warm up some caches. That does not necessarily mean you should discard them or disregard them, since, after all, every cache in a real system pays a performance penalty the first time the data is used, and that is a real element of system performance. But it does mean you should not compare different alternatives when one alternative has had the benefit of a warm cache while the other has not.
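
    The sketch below illustrates one simple way to keep this effect visible: record each run’s time separately rather than only an aggregate, so that a slow first (cold-cache) run can be seen and treated deliberately. The file read is just a convenient placeholder operation, and /etc/hosts is assumed to exist (any readable file will do).

    ```python
    # Keep per-run timings so cold-start (cache warm-up) runs are visible
    # rather than averaged away. read_file() is a placeholder operation.
    import time

    def read_file(path):
        with open(path, "rb") as f:
            return f.read()

    def timed_runs(path, runs=10):
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            read_file(path)
            timings.append(time.perf_counter() - start)
        return timings

    if __name__ == "__main__":
        timings = timed_runs("/etc/hosts")  # assumed path; any convenient file works
        for i, t in enumerate(timings, 1):
            label = "cold?" if i == 1 else ""
            print(f"run {i:2d}: {t * 1e6:8.1f} us {label}")
    ```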

    A similar problem can work in the opposite way. Sometimes we have a data structure of a limited size, and as long as we are working within that size, things go quickly. When we have more elements than the data structure can comfortably hold, performance degrades. For example, consider a hash table that uses chaining to handle collisions. If the table is relatively empty, every read will hit the element it was looking for immediately, and performance will be fast. When the table starts to fill up, some probes will need to follow a chain of entries to find the one they are looking for in that hash bucket; performance will slow down in some cases, while remaining fast for others. If the table is very full, performance will slow down for almost everything, since most probes will require searching a chain. File systems that use various kinds of indirect blocks are another example. Accesses to the first few blocks will avoid the indirect blocks and will be fast. Accesses further into a file will require indirect, doubly indirect, or triply indirect access, and will be slower, depending on the access pattern. Again, these are genuine performance effects, but they will only show up in your measurements if you test conditions under which they can occur.
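
    To make the hash-table example concrete, here is a toy chained hash table with a fixed number of buckets; it is only an illustration of the effect described above, not a production data structure. As the number of keys grows past roughly one per bucket, the average chain searched on a successful lookup grows, and lookups slow accordingly.

    ```python
    # Toy chained hash table: shows how average probes per lookup grow as it fills.
    import random

    class ChainedTable:
        def __init__(self, buckets=1024):
            self.buckets = [[] for _ in range(buckets)]

        def insert(self, key, value):
            self.buckets[hash(key) % len(self.buckets)].append((key, value))

        def lookup(self, key):
            chain = self.buckets[hash(key) % len(self.buckets)]
            probes = 0
            for k, v in chain:
                probes += 1
                if k == key:
                    return v, probes
            return None, probes

    if __name__ == "__main__":
        random.seed(0)
        table = ChainedTable()
        keys = []
        for fill in (256, 1024, 4096):       # ~0.25x, 1x, 4x the bucket count
            while len(keys) < fill:
                k = random.getrandbits(32)
                table.insert(k, None)
                keys.append(k)
            total = sum(table.lookup(k)[1] for k in keys)
            print(f"{fill:5d} keys: {total / len(keys):.2f} probes per successful lookup")
    ```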

    The other variant is also important, because sometimes these genuine effects are critical to what you need to measure. If you only measure a file system’s performance on short files, you may never learn that it is very slow once files exceed a certain size until, in production use, your system suddenly slows to a crawl. Special cases can be very important. Sometimes what you really need to know, for example, is how long servicing a web request will take under the worst circumstances likely to arise, or how your system will behave under extremely high load, or what will happen if a piece of hardware experiences partial failure. This issue returns to the point of understanding what you are measuring and why you are measuring it.

    4. Ignoring the costs of your measurement program. In a few cases, you may be measuring a system using tools that are entirely external to the system and impose little or no load on it. Sniffing traffic on a network is one example. More commonly, especially for operating system measurement, you are using your system to measure your system. You are not only sharing the processor, memory, network, and secondary storage devices with the system under study, but also sharing some of the abstractions the operating system offers. For example, if you are logging information from your measurement code for later examination, you are probably exercising the file system. Does that matter? If what you’re measuring is file system performance, it almost certainly does, and it might even if you are measuring something that has no obvious relationship to the file system, such as the scheduler or the memory manager. The file system is obviously not the only example. If you are running a separate process to perform your measurements, for instance, it is competing for CPU and memory with the processes you are trying to measure.
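
    The sketch below shows one way to get a rough handle on this: run the same placeholder workload with and without per-iteration instrumentation and compare the totals. Here the instrumentation is only a timer call and a list append; real instrumentation, such as logging to a file, usually costs considerably more, and that cost competes with the system being measured.

    ```python
    # Compare a plain run against an instrumented run of the same placeholder
    # workload to estimate what the measurement machinery itself costs.
    import time

    def workload():
        # Placeholder for the code actually under study.
        sum(i * i for i in range(10_000))

    def run_plain(iterations):
        start = time.perf_counter()
        for _ in range(iterations):
            workload()
        return time.perf_counter() - start

    def run_instrumented(iterations):
        samples = []
        start = time.perf_counter()
        for _ in range(iterations):
            t0 = time.perf_counter()
            workload()
            samples.append(time.perf_counter() - t0)
        return time.perf_counter() - start, samples

    if __name__ == "__main__":
        n = 2_000
        plain = run_plain(n)
        instrumented, _ = run_instrumented(n)
        overhead = (instrumented - plain) / plain * 100
        print(f"plain:        {plain:.3f} s")
        print(f"instrumented: {instrumented:.3f} s (~{overhead:.1f}% overhead)")
    ```
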
    5. Losing your data. Never throw away experimental data, even if you think that you are finished with your experiment, nor even if you think the data in question was gathered in an erroneous way. Data has a way of proving useful for many purposes, but discarded data is never useful. Of course, you should particularly avoid carelessly losing data. One common beginner mistake is to inadvertently overwrite the data from a previous experiment with data being gathered for the next experiment. Also remember to label your data. Even if you have kept every byte of data you’ve gathered, if you can’t tell which bytes are related to which parts of your experiment, the data is as good as lost. This advice is for the long term. Ideally, you should be able to go back and look at data you gathered twenty years ago. Maybe you will never look at a lot of the data you gathered in the past again, but you will probably eventually want to look at some of it, and it’s hard to predict what’s going to prove useful in the future. So save it all, if possible. It’s also important to keep the metadata around, which in this case means information about how you set up and ran your experiments. Which version of the operating system was it that you used for that experiment you ran five years ago? Chances are you won’t remember, so make sure it’s written down somewhere you can find it.
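
    A minimal sketch of this habit follows: results are written together with the metadata needed to interpret them later (what was run, when, on which system, with which parameters), and the file name carries a timestamp so a new run cannot silently overwrite an old one. The experiment label, parameters, and result values shown are hypothetical.

    ```python
    # Save results alongside the metadata needed to make sense of them later.
    import json
    import platform
    import time

    results = {"mean_latency_ms": 2.4, "p95_latency_ms": 7.1}  # illustrative values

    record = {
        "experiment": "file-read latency vs. background load",  # hypothetical label
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "os": platform.platform(),
        "python": platform.python_version(),
        "parameters": {"runs": 50, "background_workers": 4},    # hypothetical settings
        "results": results,
    }

    # Timestamped file name prevents overwriting data from a previous run.
    filename = time.strftime("results-%Y%m%d-%H%M%S.json")
    with open(filename, "w") as f:
        json.dump(record, f, indent=2)
    print(f"wrote {filename}")
    ```
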
    6. Valuing numbers over wisdom. Remember, the point of your performance experiment is not to obtain a set of numbers. It’s to understand important performance characteristics of your system. The numbers are the means to an end, not themselves the end. Don’t bother gathering numbers that are not going to lead to wisdom, and don’t consider your task complete when you have the numbers in hand. You actually have the most important step still to go: using the numbers to understand your system performance and, if necessary, using them to guide redesign or reconfiguration of your system in ways that are likely to lead to better performance. Unfortunately, it’s hard to offer general advice on how to extract wisdom from sets of numbers. That’s a task you will need to perform on a case-by-case basis. But do remember that performing that task is the goal, the entire point of running an elaborate performance experiment. Without the resulting wisdom, the work you did to get the numbers will be wasted.

  • 114053 LG 1.9 SYSTEM PERFORMANCE MEASURES

    One important aspect of running a performance experiment is the workload you use. In some cases, you are examining the performance of a particular program or operating system element, in which case you will tailor the workload to exercise that software. In many cases, you are looking for general performance in the face of typical overall system loads. In that situation, you need to generate a realistic workload for your system. Either way, somehow you must provide data sets, background activities, network traffic, and various other types of workload-related effects to test the performance.

    There are different aspects of workloads that you need to think about when designing performance experiments. Your system is designed to do certain things: schedule processes, lay out a file system on a flash drive, respond to web requests, and so on. Obviously, one important aspect of the workload is the set of tasks you provide to your system that are directly related to its purpose: the set of processes to be scheduled, the files and file accesses to be handled, the web requests that clients generate. An equally important aspect of the workload, however, arises from the fact that operating systems are complex and involve simultaneous interactions of many different components that can affect each other in unpredictable ways. As a rule, the question you need to answer is not how your file system would perform if the only activity on the operating system were reads and writes to that file system. The important question is how it would perform in the face of all the other ordinary activities that the operating system would be doing in a real-world setting. So your workload must also capture those background activities.

    There are several different types of workloads typically used for performance measurement.

    1. Traces – Capture or otherwise obtain a detailed trace of the workload of the system in its ordinary activities. What such a trace consists of depends on the nature of what you are testing. For a web server, the trace is likely to be a set of web requests submitted to the server. For a mail server, it is likely to be a set of messages delivered to that server. For a file system, it might be a set of opens, reads, writes, and other file system operations. For an operating system component, it might be a set of applications that are run in a particular order with specified inputs. Whatever the trace might consist of, you capture it from the running system, saving it in a form that will allow you to recreate it in a faithful manner. Then, for each experimental run, you start from the beginning and replay it to the end. Traces have good and bad properties for performance experiments. A good property is realism, since they represent realistic activities that you would actually want your system to handle well. Another good property is reproducibility. The same trace can be replayed over and over, identically for each run. There is an issue here if the performance of the system has an impact on what would have happened in the real system. For example, a trace of a network protocol that sends a message and receives an acknowledgement before sending the next message would have run differently if the acknowledgement had been produced in half the time, double the time, or at some other delay than it had been when the trace was gathered. If the system being tested is the one generating the acknowledgements, the replayed trace can produce unrealistic results.

    A disadvantage of a trace is that it is not easily reconfigurable. If your experiment needs to examine performance under controlled levels of workload, you might not be able to get a trace for each workload level you need. Merely running two copies of one trace in parallel might not realistically represent a true doubled load. Cutting out portions of a trace might not realistically represent a smaller workload, either. Scaling a trace up or down is usually hard. Another frequent disadvantage is availability. Good traces are not easy to come by, and if your system is not yet in production, you might be unable to gather your own. Except for freshly gathered traces of your own, most traces you can find will be somewhat (to very) old. Another disadvantage in some cases is that it might be difficult to gather the information needed to create the trace from the tools available to you. You might not be able to capture all the system calls applications perform, for instance. Also, any particular single trace might or might not represent the typical activity of the system. The moment at which it was gathered might have been unusual, compared to the ordinary activities of your system. Depending on exactly what you are tracing, there may be privacy implications to saving it in a trace. For certain kinds of system, such as those dealing with medical records, you may have legal obligations to handle some of the data in particular ways. Be aware of any such privacy problems before you store data for a trace.
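
    As a sketch of what replaying a trace can look like, the fragment below walks a toy trace of timestamped operations and reissues each one at its original offset from the start, preserving the recorded inter-arrival times. The trace format and the handle() placeholder are assumptions for illustration; a real replayer would issue the recorded requests against the system under test and would have to confront the feedback problem described above.

    ```python
    # Replay a captured trace while preserving its original inter-arrival times.
    import time

    # Toy trace: (seconds since trace start, operation, argument) — illustrative only.
    trace = [
        (0.00, "read",  "fileA"),
        (0.05, "write", "fileB"),
        (0.30, "read",  "fileA"),
        (0.31, "read",  "fileC"),
    ]

    def handle(op, arg):
        # Placeholder: here the replayer would perform the recorded operation.
        print(f"{time.perf_counter() - replay_start:6.3f}s  {op} {arg}")

    if __name__ == "__main__":
        replay_start = time.perf_counter()
        for offset, op, arg in trace:
            # Sleep until this record's offset relative to the start of the replay.
            delay = offset - (time.perf_counter() - replay_start)
            if delay > 0:
                time.sleep(delay)
            handle(op, arg)
    ```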

    2. Live workloads – Sometimes you can perform measurements on a working system as it goes about its normal activities. A production system can be instrumented and data gathered as it does its work. Realism is a clear advantage here. Also, provided you can continue to run tests on the system indefinitely, with enough time you can capture a very wide range of real system behaviour. You are likely to need little or no effort to establish realistic background loads, since they establish themselves, in essence.

    This approach has its own disadvantages. One is lack of control, which manifests itself both in not being able to reproduce the behavior seen in previous tests and in not being able to scale loads up and down as desired. Another is that your experimental framework usually needs to have minimal impact, in both performance and functionality, on the running system, since it is presumably more important for the system to complete its live work than to gather your measurements. Unless this impact is essentially nil, you are not likely to be able to run the performance measurements for very long on a working system, since those tasked with getting it to do its job will not appreciate your experiments getting in the way. As with traces, consider whether there are privacy implications to your observation of the live workload.

    3. Standard benchmarks – These are either sets of programs or sets of data that are intended to drive performance experiments, typically of some particular thing, such as a file system, a database, a web server, or an intrusion detection system. They may have been derived from real traces at some point, or they may be built from models of system behavior. They are typically designed to be usable by many developers, so it is often fairly easy to integrate them into your experiments, provided you are working in the same general framework they were designed for. (For example, a file system benchmark might generate POSIX-compliant file operations, so any file system that is compatible with POSIX can use it for testing.) They allow for easy comparison to other systems’ performance, since the developers of those systems can also run the same benchmark, or, indeed, you can yourself, if those other systems are also available for testing. A well-designed benchmark is likely to exercise a wide range of system behaviors, so the results you get from it may give you a fairly complete picture of your system’s performance under different realistic conditions. Widely used benchmarks have been heavily studied themselves, so they are unlikely to have many bugs and are likely to be relatively good representations of the kind of workload they are intended to mimic. Some benchmarks (though not all) are built to be inherently scalable, allowing you to adjust the workload up or down with little more than a change to a line or two in a configuration file. Since benchmarks are artificial, there are usually no privacy implications to using them.

    As you no doubt expect, though, standard benchmarks have their own set of disadvantages. First, there are a limited number of them available, and there might not be one suited for the system or situation you want to test. One aspect of this characteristic is that standard benchmarks might not include portions of the workload space that are unusual in general, but important for your case. Another aspect of this characteristic is that it’s tempting to use a standard benchmark that isn’t quite right for your situation just because it’s easy to do so. Resist such temptations. Second, since developing a good benchmark is quite a lot of work, they tend to be used for a very long time, running the risk of representing archaic workloads that no longer match what would happen on a current system.

    4. Simulated workloads – In this approach, you build models of the loads you are interested in, typically models instantiated in executable code. These models are usually parameterized, allowing them to be scaled up or down, to alter the mix of different elements of the load, and otherwise to create variations on the load. When testing a system’s performance, one decides which parameter settings are most relevant and uses the simulated workload models with suitable settings. This approach has the advantage of being easily customized to many different scenarios and possibilities, since you need merely alter the model parameters accordingly. One important aspect of this flexibility is good handling of scaling, either up or down. Assuming that there is no true randomization in the models, they are infinitely repeatable, allowing you to perform directly comparable tests of different system alternatives. As with standard benchmarks, the artificiality of simulated workloads has the benefit of avoiding privacy considerations. However, the validity of the performance results you achieve is only as good as the quality of the models. It is not easy to produce good models of complex systems and phenomena, and one can easily overlook important features of real loads in building one’s models. While parameters can be easily altered and scaled, even if the model was faithful to real load for some settings, it may prove unrealistic at others. It may also be unclear how to set the various parameters to produce simulated load that matches a particular real load. If the parameters are set incorrectly, one may get a very false picture of how a real system would behave in those situations.
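
    Below is a minimal sketch of what a parameterized simulated workload can look like: requests arrive with exponentially distributed gaps (a Poisson process) at a configurable average rate, the mix of request types is a parameter, and a fixed random seed keeps runs repeatable. The request types, rates, and the issue_request() placeholder are assumptions made for illustration.

    ```python
    # Parameterized synthetic load generator: adjustable rate and request mix.
    import random
    import time

    def issue_request(kind):
        # Placeholder: a real generator would send the request to the system under test.
        pass

    def generate_load(rate_per_sec, mix, duration_sec, seed=0):
        """rate_per_sec: average request rate; mix: {request_type: probability}."""
        rng = random.Random(seed)          # fixed seed keeps runs repeatable
        kinds, weights = zip(*mix.items())
        end = time.perf_counter() + duration_sec
        issued = 0
        while time.perf_counter() < end:
            time.sleep(rng.expovariate(rate_per_sec))   # Poisson inter-arrival gaps
            issue_request(rng.choices(kinds, weights=weights)[0])
            issued += 1
        return issued

    if __name__ == "__main__":
        # Scale the load up or down simply by changing the parameters.
        n = generate_load(rate_per_sec=50, mix={"read": 0.8, "write": 0.2},
                          duration_sec=2.0)
        print(f"issued {n} requests")
    ```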


  • 114053 LG 1.8 PERFORMANCE LIMITING FACTORS

    We can experience poor performance in our systems for many reasons. Sometimes there is an overloaded resource, such as memory or network bandwidth or CPU cycles. Sometimes a solution built into the software doesn’t scale, so performance seems fine until the load on the system gets high. Then suddenly we fall off a performance cliff. Sometimes we have built an inefficient implementation that puts in unnecessary overheads, such as copying a piece of data many times or making lots of calls to recursive functions to perform a tiny amount of real work.

    The problem you have will affect how you go about looking for it. If you run performance measurements to uncover a scaling issue, a test that only runs a small number of iterations, or only a small version of the problem, may not produce useful results. If your problem is a bottleneck in your network, running tests that don’t send messages across the network will never find it. If your problem is contention in scheduling, a performance experiment embedded in a single process won’t provide much insight. There’s an obvious chicken-and-egg problem here. You might know performance is bad, but to run an experiment to determine exactly why, you need to know why performance is bad. Otherwise, you might waste your time running an irrelevant experiment that tells you nothing. So how do you get started? In actuality, you can often get some clues without running any new experiments. The operating system will tell you, on request, how much memory is being used, how heavily the CPU is utilized, and many other statistics concerning the behaviour of your processes. If there’s plenty of free memory and the system is still running poorly, you’d probably waste your time building a performance experiment based on investigating the effects of varying memory usage. If there are rarely any ready processes waiting to run, scheduling is very likely not the source of your problem. Knowledge of the general architecture and expected behaviour of the poorly performing system component can help as well. If you know that the software that is running slowly always works on pretty much the same quantity of data, it’s probably not a scaling problem.
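
    As an example of gathering such clues programmatically, the sketch below reads a few headline statistics. It assumes the third-party psutil package is installed (pip install psutil); the same information is available from standard tools such as top, vmstat, or the files under /proc on Linux.

    ```python
    # Quick first-look statistics before designing any experiment (requires psutil).
    import psutil

    cpu_percent = psutil.cpu_percent(interval=1)   # sampled over one second
    mem = psutil.virtual_memory()
    disk = psutil.disk_io_counters()

    print(f"CPU utilization:   {cpu_percent:.1f}%")
    print(f"Memory used:       {mem.percent:.1f}% of {mem.total // (1024**2)} MiB")
    print(f"Disk reads/writes: {disk.read_count} / {disk.write_count} since boot")
    ```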

    But usually these kinds of hints will only get you so far, and often they will provide indications rather than actually identify the source of the problem. What then? Take the best knowledge you can easily obtain about your code and the observed performance problem and generate a hypothesis about why it’s happening. Design an experiment that will test that hypothesis, proving or disproving it. Run the experiment and determine whether your hypothesis is borne out. If not, generate a new hypothesis (with, one would hope, a deeper knowledge base to work from than before) and try again. Obviously, there are elements of art, experience, and even luck in this process. But you’ve seen this kind of process before. It’s much like finding a bug in a program, where you observe the erroneous behaviour, make a hypothesis about its cause, add fixes or extract new information relevant to the hypothesis, and test it, until the bug is found and repaired. Generally, finding performance problems is harder than finding bugs, since it’s harder to narrow the field in which you’re searching, but the basic approach is similar.

    Like finding functionality bugs, finding performance problems is a skill you are likely to develop with practice. You will come to learn the signals that point towards particular classes of performance problems and develop instincts that lead you in the right direction more often than the wrong one. However, never mistake your experience or a good hunch for the results of an actual measurement program. Ultimately, the point of performance measurement is to reveal the actual truth of what’s happening, and nothing short of measuring it will give you that evidence.


  • 114053 LG 1.7 INTRODUCTION

    No matter how fast our hardware gets, the performance of our system always matters. Programmers tend to add complexity and sophistication to their systems to match any hardware performance improvements, so there is always a need to design software that achieves good performance on whatever hardware is available to us.

    This instantly raises a question: what do we mean by performance? Most of us have an informal sense of what we mean. Perhaps we mean we don’t want to wait for our commands to complete. Perhaps we mean we want to run complex calculations on very large data sets fast enough for the results to be useful. Perhaps we mean we want to push as much work through a given piece of hardware as we possibly can. Perhaps we mean we want to use as little space as possible on a storage device, or send as few bits across a network as possible. Maybe we care about how long a user has to wait before he starts to see some response from the system, but maybe we care about how long it takes for the entire job to complete.

    As these potential answers to the performance question suggest, there’s a lot more to understanding performance in a computer system than one might initially think. And there are yet more complexities in actually providing a valid answer to a particular performance question, such as “how many web requests per second can my server handle,” or “will my VoIP call provide comprehensible speech to the listener if I run it over a particular network,” or “how many buffers should I allocate in my operating system to make sure that I/O is not delayed too much?”


  • 114053 LG 1.6 SESSION 1: MONITOR THE PERFORMANCE OF A MULTI-USER NETWORKED SYSTEM

    On completion of this section you will be able to monitor the performance of a multi-user networked operating system 

    1. The monitoring explains system performance measures, and outlines and justifies a monitoring strategy. 
    2. The monitoring is carried out in accordance with the monitoring strategy. 
    3. The monitoring compares the results produced from performance monitoring, with user perception of performance, and identifies discrepancies between the two measures. 
    4. The monitoring identifies performance-limiting factors. 
    5. The monitoring outlines recommendations to improve performance, and justifies them using performance analysis.