Essentially, collecting data means putting your design for collecting information into operation. You’ve decided how you’re going to get information – whether by direct observation, interviews, surveys, experiments and testing, or other methods – and now you and/or other observers have to implement your plan. There’s a bit more to collecting data, however. If you are conducting observations, for example, you’ll have to define what you’re observing and arrange to make observations at the right times, so you actually observe what you need to. You’ll have to record the observations in appropriate ways and organize them so they’re optimally useful.
Recording and organizing data may take different forms, depending on the kind of information you’re collecting. The way you collect your data should relate to how you’re planning to analyze and use it. Regardless of what method you decide to use, recording should be done concurrent with data collection if possible, or soon afterwards, so that nothing gets lost and memory doesn’t fade.
The first few tips we would like to give investigators are philosophical but have a lot of practical consequences.
1.1.1 Think About Your Data
First, think about your data in terms of what kind of data it is. At the crudest level, you can ask if it is numeric or text. For example, hormone levels are numeric; they can, theoretically, have any positive value. On the other hand, state names are text.
Statisticians classify variables into three types: continuous, discrete and ordinal. A continuous variable can take any value within a range, while a discrete variable can take only a limited set of distinct values. In the paragraph above, the hormone level would be continuous while state names would be discrete.
Ordinal data are a type of discrete data that have an inherent ordering. For example, a patient might be asked to rate his pain on a five-point scale, with 1 being no pain at all and 5 being unbearable. While the data itself is discrete, the underlying order allows the variable to be analyzed by statistical methods other than the usual discrete methods.
Some variables allow for different types of coding which can be somewhat confusing. For example, gender might be coded as either male/female (text) or 0/1 (numeric). Either way, it is still a discrete rather than a continuous variable.
While the above may seem to be trivial distinctions, they have some important consequences. In the first place, if you know what type of data you have, then you have a better idea of the types of statistical procedures to which it is amenable. If you have a continuous variable such as blood pressure, then it is meaningful to ask what its average or standard deviation is. On the other hand, trying to find the standard deviation of state names is meaningless.
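The distinction can be sketched in a few lines of Python using only the standard library. The sample values and variable names below are made up for illustration and do not come from any study.

```python
import statistics
from collections import Counter

# Hypothetical sample values, for illustration only.
systolic_bp = [118.0, 126.0, 134.0, 122.0]         # continuous, numeric
states = ["Illinois", "Ohio", "Illinois", "Iowa"]  # discrete, text

# Mean and standard deviation are meaningful for a continuous variable.
bp_mean = statistics.mean(systolic_bp)
bp_sd = statistics.stdev(systolic_bp)  # sample standard deviation

# For a discrete text variable, a frequency count is the natural summary.
state_counts = Counter(states)

# Asking for the mean of state names fails outright.
try:
    statistics.mean(states)
    mean_of_states_ok = True
except TypeError:
    mean_of_states_ok = False
```

Here the numeric summary works for blood pressure, while the state names can only be counted.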
In the second place, most modern databases require you to specify the type of a variable before you can enter any data. They will not allow you to enter data of one kind in a variable that has already been earmarked for data of another type. That is to say, you would not be allowed to enter the name of a state in a variable that has been declared to be numeric. This is one of the ways databases have of protecting you from yourself.
In the third place, a lot of data in the medical profession looks to be of one type but is actually of another. For example, medical record numbers, at first glance, seem to be numeric. They are actually character strings since it makes no sense to add two of them or to multiply one by another. Their only use is to provide a unique identifier for each patient in a hospital.
As another example, consider a question about the presence or absence of a disease. It might seem that this is a simple true/false question such as “Have you ever had disease X?” However, a simple answer might not suffice in this case because patients might answer not only yes or no but also “I don’t know” or “I was tested for that once but I do not remember the results” or even “I do not want to answer that question.”
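One way to plan for this is to decide on the full set of allowed answers before data entry starts. The sketch below is a minimal illustration in Python; the category names and the helper `is_definite` are hypothetical choices, not a standard coding scheme.

```python
from enum import Enum


class DiseaseHistory(Enum):
    """Possible answers to 'Have you ever had disease X?' (hypothetical coding)."""
    YES = "yes"
    NO = "no"
    UNKNOWN = "does not know"
    TESTED_RESULT_FORGOTTEN = "tested, result not remembered"
    DECLINED = "declined to answer"


def is_definite(answer: DiseaseHistory) -> bool:
    """Return True only when the patient gave a clear yes or no."""
    return answer in (DiseaseHistory.YES, DiseaseHistory.NO)


# Three hypothetical responses; only the first is a definite answer.
answers = [DiseaseHistory.YES, DiseaseHistory.DECLINED, DiseaseHistory.UNKNOWN]
definite = [a for a in answers if is_definite(a)]
```

Listing the categories explicitly keeps the "don't know" and "declined" answers from being forced into a yes/no column.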
You should also think about your data in terms of the number of variables you are trying to capture. You should, obviously, try to get as much data as you need to prove or disprove your hypothesis. However, if you set out to get too much data, then you might find yourself swamped in a sea of irrelevant and/or uninformative details. For example, it might be worthwhile in a study of genetics to find out about a proband’s parents. However, if you try to find out about their great-grandparents, you might be wasting your time because nobody in the family can remember anything about them.
1.1.2 Think About Your Data Handling Tools
Currently, there are several types of data handling tools available to researchers. They include plain text files such as you can get from Notepad, spreadsheets such as Excel, database programs such as Access or MySQL and statistics packages such as SAS, SPSS, Stata or Minitab. Each of these has advantages and disadvantages. Before you decide on which ones to use, do your homework and find the best one for you.
Most researchers will need to deal with a statistics program of some sort or other. For complicated designs or unusual analyses, SAS is probably best but, if the protocol is not too complicated, Minitab or SPSS could be considered.
For a small dataset with only a few patients and a few measurements, entering data into a spreadsheet might work. However, if there are more than, say, thirty patients in a study and you are seeing them multiple times, the number of data points can get very large very quickly. In that case, you might consider using a relational database such as Access or MySQL. The disadvantage of these tools is that they have a steep learning curve and you would probably need to find someone who could set one up for you. On the other hand, their advantage is that, once they are set up, they can make data entry and review much easier and more flexible.
1.1.3 Think About What Your Tools Can Do For You
In many instances, databases can be programmed to do mathematical tasks that would be a waste of time for a researcher or the members of that researcher’s team. For example, a spreadsheet or database can be programmed to automatically convert feet and inches to the metric system and vice versa. If this were done, it would save the data entry people from trotting out their calculators to do boring arithmetic. It would also save the researcher some time because he or she would not need to check the calculated data for errors.
There are a lot of other types of calculations that database programs can easily do such as BMIs, logarithms, calculating the number of years between two dates, etc. Allowing the tools to do the work will save time and frustration for a researcher.
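The calculations mentioned above are simple enough to automate. The following is a minimal sketch in Python rather than in any particular spreadsheet or database; the function names are illustrative assumptions.

```python
from datetime import date

CM_PER_INCH = 2.54  # exact by definition of the inch


def feet_inches_to_cm(feet: int, inches: float) -> float:
    """Convert a height given in feet and inches to centimeters."""
    return (feet * 12 + inches) * CM_PER_INCH


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def full_years_between(start: date, end: date) -> int:
    """Number of completed years from start to end (for example, age at a visit)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years
```

For instance, `feet_inches_to_cm(5, 10)` gives about 177.8 cm, and `full_years_between(date(1922, 10, 31), date(2004, 10, 30))` gives 81, because the birthday has not yet occurred in 2004.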
1.1.4 Think About What Your Tools Cannot Do For You
Keeping data in a plain text file might be fine if a researcher only wants to have a small collection of notes about subjects. However, if the aim of a study involves data manipulations such as means, standard deviations or tabulations, then these would be very difficult in a text editor or word processing program but quite possible in a spreadsheet or database.
1.1.5 Practical Recommendations
a) Race and ethnicity: Record race and ethnicity according to the current NIH guidelines. The ethnicity categories are “Hispanic or Latino” and “Not Hispanic or Latino.” The racial categories are “American Indian or Alaska Native”, “Asian”, “Native Hawaiian or Other Pacific Islander”, “Black or African American” and “White.”
b) Sorting in a spreadsheet: If you keep your data in an Excel spreadsheet, be very careful every time you sort it. If you select one column and sort only that selection, then that column will be sorted alone regardless of its relationship to other columns. The relationships between the sorted column and the other columns will therefore be destroyed.
c) Remember the eleventh commandment: Back Up Thy Stuff. Data files should be stored on drives that are backed up regularly. The centralized drives of the various IT departments are backed up regularly and researchers should consider keeping their data on those drives, if at all possible. Use of these drives, however, is not the be-all and end-all for data safety. Researchers should think about other ways of keeping backups of their data. It would be well to cultivate a healthy sense of paranoia in this regard.
d) Consistency: Consistency in data entry is crucial. In particular, it is important to be consistent when entering gender. Investigators should be very careful to choose one method of entry and stick to it. It is very confusing at data analysis time to see gender sometimes coded as Male/Female, sometimes coded as M/F and perhaps even coded as 1/0, all in the same column. Moreover, this will delay data analysis until the coding is made uniform. Along the same lines, it is important to be consistent when coding other kinds of data. For example, state names can be entered either as the full state name, such as Illinois, or as the postal abbreviation, IL. An investigator should, once again, pick one entry type and stick to it. Inconsistencies in data entry can lead to serious errors in later analysis. In addition, a unique identifier for each patient in the study should be used consistently throughout the spreadsheet or database. Full names should not be the only unique identifier.
e) Consistency with dates: Consistency in recording dates is so important that it deserves some discussion by itself. Most data handling programs will allow you to enter dates in several different ways. For example, they might allow you to enter a particular date as either October 31, 2004 or as 10/31/04. Mixing these can be very dangerous. It is not merely that you or a data entry clerk might become confused; it is also that various programs treat dates in different ways.
Example: If you are dealing with a geriatric patient who was born on October 31, 1922, then entering her date of birth as 10/31/22 in Excel will actually lead Excel to interpret the date as October 31, 2022. And if your patient was born on October 31, 1902, entering 10/31/02 will cause Excel to interpret this as October 31, 2002.
The moral is clear. Learn how your analysis program handles dates and enter your dates accordingly. In particular, be sure to enter four-digit years because, while software may misinterpret a two-digit year, it will not misinterpret a four-digit year.
f) More on dates: Most people will be able to tell you their date of birth. However, very few of them will be able to tell you the date they first noticed symptoms of a particular disease. They might, however, be able to tell you the month and year they noticed symptoms. For this type of data, consider entering the month, day and year in separate columns.
Example: A patient might not be able to tell you when he first came down with measles as a child. However, he might be able to limit the range of dates to, say, March of 1986. You could enter this into a database as:
Day       Month   Year
(blank)   3       1986
While the day is missing, the other parts of the date might prove to be useful.
g) Still more on dates: When dealing with data from outside the United States, remember that other countries sometimes reverse the American convention of writing month-day-year into day-month-year. This would be obvious with a date such as October 31, 2004, which a European might write as 31/10/2004. However, it would not be so obvious with a date such as October 5, 2004, which a European might write as 5/10/2004, leading an American to believe it is May 10 rather than October 5.
h) Units of measurement: Units for measurements should be clearly identified whenever possible. Take heed of the recent warnings to use better abbreviations to mark data. For example, use mcg instead of μg. In most circumstances, the entire word “microgram” is even better. The same is true for linear units of measurement. Make sure everybody involved with a study knows whether height is measured in feet, meters, or inches.
i) Software is unaware: Don’t assume that the software will make the distinctions you make. In particular, researchers sometimes adopt conventions that are obvious to humans but for which there are no adequate tools in software. For example, in one study the records for subjects who had died were highlighted in yellow. It was easy to tell on a screen which subjects were alive and which were dead. However, it was impossible for the software to print out a list of the deceased subjects because it had no innate facility to distinguish the colors of the subjects’ records. On the other hand, a column containing the words “Alive” or “Deceased” would have worked well for both humans and software.
j) Missing values – 1: A database should reflect what is known about patients in a study. In particular, a researcher should not impute data for missing values in the database. For example, it is sometimes tempting to insert average values for missing data. If a person’s blood pressure is unknown, then some researchers insert the mean blood pressure for that person’s age, gender and race. This might be a defensible strategy for a statistician under certain circumstances. However, as a general rule, it is a bad idea to put such data into a database that is supposed to reflect what the researcher knows about patients as a direct result of the study.
k) Missing values – 2: Values might be missing for several reasons, and the reasons they are missing might be relevant to your study. For example, a person might have a missing test for, say, glaucoma. It might be missing because the person forgot to take the test. It might also be missing because the person hated the way the test was done, found it painful and humiliating and refused to take it. If there is enough of this type of missing data, the researcher should consider finding another type of test for this condition, one that does not alienate so many subjects. In this and other cases, if the researcher’s approach to missing values is just to leave a blank in the data set when there is no value, then that researcher might unknowingly be throwing away a chance to collect useful data. Researchers should consider a more flexible approach. In the above case, the researcher might record a blank for the missing data but also record the reason for the missing value, such as “Patient Missed Test” or “Patient Refused Test”, in another column.
l) Missing values – 3: Another way to record missing values is to use special codes. Researchers who use this convention indicate missing values by placing an impossible number in the column for the missing data. For example, if a patient is missing a value for, say, systolic blood pressure, then the missing value might be recorded as -1, indicating that the patient missed the test. A value of -2 might then mean that the patient refused to take the test. This method has obvious drawbacks, such as the need to remove all such codes before statistical analysis. Also, should such a scheme be adopted, it is important to insist that every missing value code be a value that cannot occur naturally. For example, choosing a missing value code of, say, 99 for systolic blood pressure would be very unwise.
m) Unintended missing values: Do not use blanks to represent zero because they could easily be mistaken for missing values.
n) Longitudinal assessments: If multiple measurements are taken on each subject during the study (e.g. baseline, two weeks, four weeks, end of study), it is strongly suggested that each measurement be recorded in its own row and indexed by patient ID and visit number. The actual dates of each visit should also be included. This is preferred to the one-row-per-patient method when the number of visits is more than about two or if the number of visits varies across patients.
o) Spreadsheets can be too cooperative: Spreadsheets can be very flexible data entry tools. In some cases they can be too flexible. For example, it is possible to mix data types in data entry fields. It is thus possible to enter data from a test in the form of a number followed by a comment about the number. A researcher could enter a value such as “147 Patient had a cold.” The comment “Patient had a cold” might be informative, but it should be placed in a separate comments column and not in the same column as the actual data.
p) Code books: Every study should be accompanied by a code book that describes each variable by name, the type of data (numeric, date/time, or character), the units of measurement (grams, feet, micrograms per deciliter), the purpose of collecting it and its relationship to other data. This code book should be kept in a separate file from the data itself and should be available to all researchers on the study.
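The date recommendations above can be illustrated with Python's standard `datetime` module. This is a sketch only: Python's cutoff for two-digit years is its own convention and is not claimed to match Excel's or any other program's, and the `PartialDate` class and `parse_us_date` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Two-digit years are guessed by the software; four-digit years are not.
two_digit = datetime.strptime("10/31/22", "%m/%d/%y")     # read as 2022 by Python
four_digit = datetime.strptime("10/31/1922", "%m/%d/%Y")  # unambiguous: 1922

# The same text means different dates under different conventions.
as_american = datetime.strptime("5/10/2004", "%m/%d/%Y")  # May 10, 2004
as_european = datetime.strptime("5/10/2004", "%d/%m/%Y")  # October 5, 2004


def parse_us_date(text: str) -> datetime:
    """Parse month/day/year text; a two-digit year raises ValueError."""
    return datetime.strptime(text, "%m/%d/%Y")


@dataclass
class PartialDate:
    """A date stored as separate parts so unknown parts can stay empty."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None


# "March of 1986", with the day left unknown rather than invented.
measles_onset = PartialDate(year=1986, month=3)
```

Insisting on the four-digit format at entry time, as `parse_us_date` does, turns a silent misreading into an error that someone can notice and fix.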
1.1.6 Spreadsheets vs. Databases – Some Differences
Practically all researchers are familiar with spreadsheets and their features. Fewer, however, are familiar with databases, and that is the reason for the following paragraphs, which compare and contrast the two types of tools.
In the first place, database and spreadsheet programs are alike in that they are both tools for holding and manipulating data. Both operate on tables. Both can hold practically any kind of data. Both are quite useful. They are different, however, in their modes of operation and capabilities.
When you wish to make up a new spreadsheet, you just pull up a program such as Excel and begin entering data. You need to do very little preparation. You can, if you wish, put in column headings but these are not absolutely necessary. You can put both numeric and text data in the same column. You can add formulas easily. Every time you update a piece of data involved in a calculation, the calculation updates itself automatically. Spreadsheets can be very convenient.
Databases require much more planning before data entry can begin. Before you can enter data into a database table, you must enter names for the columns in the table and you must also enter the type of data you wish each column to hold. A database will insist that you not mix numeric and text data in the same column. Putting a formula directly in a column will not work. There is a much steeper learning curve with databases, and you begin that learning when you realize that you must not treat them like spreadsheets.
So, if databases are more difficult to work with, why would people use them in preference to spreadsheets? The answer is that, once you get beyond their initial difficulties, you will find they are much more powerful than spreadsheets. For example, it is possible to restrict entries in a database to a small set of choices. If you want people to enter “Male” or “Female” for gender, it is possible to set up a database so that they are the only possible entries. This will prevent people from making mistakes such as entering “m” for male or “emale” for “Female”. In the former mistake, you might be able to guess that “m” meant male, but in the latter you would never be quite sure whether the data entry person forgot a necessary “F” at the beginning or added an unnecessary “e.” All of these mistakes would have been caught by a database.
It is much easier to include very complicated calculations in databases than in spreadsheets. For example, it is possible, from within a database, to pass information to external programs, have them do calculations and have the results passed back to the database.
It is also much easier to get summary data from a database. For example, a cross-tabulation of gender with ethnic group is moderately easy in a database but can be very time consuming in a spreadsheet. Databases can print out far more sophisticated and informative reports, can be used to make data entry easier, and can be more easily put on web pages. It is not true that the above are impossible in spreadsheets; just that they are easier in databases.
In summary, if you have a small dataset with limited data handling problems, a spreadsheet should be fine. However, if your dataset contains more than a few dozen subjects, then you would do well to consider a database, especially if the dataset is the beginning stage of ongoing research.
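As a rough illustration of restricted entries and cross-tabulation, the sketch below uses SQLite through Python's standard `sqlite3` module instead of Access or MySQL. The table name, column names and patient IDs are hypothetical, and other database systems express the same idea with their own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Column types and allowed values are declared before any data is entered.
conn.execute(
    """
    CREATE TABLE patients (
        patient_id TEXT PRIMARY KEY,
        gender     TEXT NOT NULL CHECK (gender IN ('Male', 'Female')),
        ethnicity  TEXT NOT NULL CHECK (
            ethnicity IN ('Hispanic or Latino', 'Not Hispanic or Latino')
        )
    )
    """
)

rows = [
    ("P001", "Female", "Hispanic or Latino"),
    ("P002", "Male", "Not Hispanic or Latino"),
    ("P003", "Female", "Not Hispanic or Latino"),
]
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)

# A typo such as "emale" is rejected instead of being stored.
try:
    conn.execute(
        "INSERT INTO patients VALUES ('P004', 'emale', 'Hispanic or Latino')"
    )
    typo_accepted = True
except sqlite3.IntegrityError:
    typo_accepted = False

# Cross-tabulation of gender with ethnicity in a single query.
crosstab = conn.execute(
    "SELECT gender, ethnicity, COUNT(*) FROM patients "
    "GROUP BY gender, ethnicity ORDER BY gender, ethnicity"
).fetchall()
```

The `CHECK` constraints play the role of the restricted entry list described above, and the `GROUP BY` query produces the counts for each gender and ethnicity combination.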