Designing Quantitative Validity and Reliability Research
7 Quality Considerations
Mary S. Stewart, PhD, and John H. Hitchcock, PhD
Introduction
If you want to have an impact in academic, policy, business, and program evaluation settings, you must be able to conduct high-quality research. You must also have the skills to assess the rigor of other published research.
A way to think about quality in research is to consider certain indicators to demonstrate that research findings accurately represent the subject, phenomenon, or process being studied. Failure to meet standards of quality may result in research that is misleading or inaccurate. For example, suppose that a study examined the effectiveness of a reading intervention by analyzing the test results of first graders before and after the implementation of the intervention. However, the study was done in a way that did not allow researchers to be reasonably sure that intervention exposure was the best explanation for any observed improvement. The findings of such a study would have limited use for educators and administrators because they would not know if they should use the intervention in question. Hence, it is important to be able to develop and critique studies that yield findings that can clearly inform decision making. This chapter introduces and offers examples of commonly used quality indicators in the context of different approaches to inquiry.
The most common way for researchers and scholars to address quality is to consider whether findings (and the data and inferences that form the basis of findings) are valid; in other words, they must reflect the actual phenomenon under study rather than reflecting coincidental relationships, the biases of the researcher, or the limitations of the study design. One aspect of validity is reliability, which refers to the consistency of results from a research instrument, strategy, or approach. That is, a reliable research instrument would be one that yields the same findings when administered multiple times on the same subject. In this chapter, we provide a conceptual overview of these two specific quantitative quality considerations (validity and reliability) as well as of their qualitative counterparts (credibility, transferability, and dependability), because these are the primary yardsticks by which research quality is gauged.
There are, however, some challenges to presenting such a broad overview, because even though these are fundamental terms, they have different meanings for different types of researchers, depending on their area of expertise and methodological training. These concepts take on slightly different definitions and are represented by competing terminology across approaches to inquiry. These different definitions can yield some disagreements across various subfields, although such disagreements are not inherently problematic; indeed, scholarly arguments are a necessary ingredient for improving academic disciplines. Instead of focusing on inconsistencies or evolving definitions, the purpose of this chapter is to make you aware of the conceptual bases of quality in social science research and some of the broad debates that shape these concepts. In this chapter, we do not go into great detail on method-specific issues related to validity and reliability, or the philosophical orientations aligned with different epistemologies; rather, our hope is to describe how your increased attention to research design, execution, and analysis can yield higher quality findings.
We provide a broad introduction to validity, specifically using experimental design as an example of how validity can be undermined or enhanced as a function of design choice. We also review the topic of trustworthiness, which is often used in qualitative research. We chose to highlight these two approaches to research because they are reasonably concrete and span a wide set of studies. Although these approaches are quite different, we argue that quality considerations apply to many different types of social science research (e.g., case studies, survey work, test design, phenomenology, single-case designs, and developing models that predict given outcomes). We also introduce several basic method-specific terms and techniques for improving research quality. Finally, although validity and reliability are two of the most central quality indicators, they are certainly not the only important indicators. To that point, this chapter cannot be thought of as a one-stop source for what you need to know. Indeed, quality considerations pertaining to the research process are involved in every step of the research process (Guest & MacQueen, 2008), and you should always investigate issues of quality that are specific to your chosen research designs.
Validity and Reliability
The meaning of validity is related to the concept of truth; in research, valid findings accurately describe or reflect the phenomenon under study. The concept of truth is also reflected in the qualitative term trustworthiness, which some scholars approximate to the quantitative notion of validity. Cook and Campbell (1979), two of the best-known scholars on experimental design methodologies, wrote that validity is “the best available approximation to the truth . . . of propositions” (p. 37).
There are several considerations in the research process that are necessary to promote valid findings, and these all relate to designing a study that is appropriate to the research question. Such considerations include understanding whether (a) the method of data collection (quantitative or qualitative) enables you to answer the specific research question, (b) the type(s) of data collected (interviews, attitude surveys, standardized test results) enable(s) you to answer the question, (c) the sample of data collected enables you to address a target question (i.e., did you question or test the appropriate types of people or other subjects?), (d) you asked the participants questions that were appropriate to the research question, and (e) you included enough participants such that results can be applied beyond the study. These are just a few of the details that you must consider when thinking about the quality of a study.
Although the concept of validity broadly reflects the idea that research findings reflect the true phenomenon, causal mechanism, or attitudes under study, different types of studies and methods necessitate different approaches to ensure validity. Some methodologists, primarily from qualitative and mixed methods traditions, sometimes use different terminology for concepts related to validity, such as credibility, trustworthiness (e.g., Lincoln & Guba, 1985; see also Guest & MacQueen, 2008; Onwuegbuzie & Johnson, 2006), legitimation (e.g., Johnson & Onwuegbuzie, 2004), and inference quality (e.g., O’Cathain, 2010). Some qualitative methodologists even reject the concept of qualitative validity altogether (e.g., Wolcott, 1990). Some of these disagreements are rooted in real and honest differences in philosophy, or how one thinks about the world. For example, some researchers espouse a postmodern framework to question the assumption that there can be one reality to portray a finding, or that even primary data, such as informant interviews, test results, or survey responses, are able to fully describe that reality (cf. Lofland, Snow, Anderson, & Lofland, 2009; Onwuegbuzie & Johnson, 2006). However, for the purposes of this chapter, we avoid this disagreement and operate from the assumption that certain aspects of reality can be observed and/or measured by researchers; the slight variations in the concepts of validity and trustworthiness correspond to their relationship with various approaches to inquiry and methods.
After choosing research questions, you must consider what kind of study design and methods are appropriate to address your questions at hand. Fortunately, there is an existing framework for just about every type of design. Simply put, if you are going to pursue any project that takes several steps and careful thinking, it helps to have a set of guidelines to follow. Methodological guidelines, or frameworks, are available when doing surveys, case studies, psychometric studies (i.e., developing tests and measurement instruments), experiments, ethnographies, phenomenological studies, mixed methods investigations, and so on (see Table 7.1).
A framework is an established structure for the design and execution of a given type of study, including data collection methods, data management, and analytic methods. In addition, all frameworks include components for checking for quality, whether in terms of validity (quantitative) or trustworthiness (qualitative). Frameworks also tend to address validity as both a process, where following certain steps should help yield defensible findings, and as an outcome, where one examines the degree to which a set of findings is defensible (e.g., Kane, 2013; Onwuegbuzie & Johnson, 2006). Validity should not be considered as a one-dimensional goal. It is both a process and an outcome and requires an iterative process that continually helps us get a better understanding of whatever is being studied. On that note, notice that we refer to validity as something that you strive for; it is thus best thought of as a kind of continuum. That is, evidence for validity ranges from poor to strong, as opposed to something being either valid or not. As Cook and Campbell (1979) have stated, “We should always use the modifier ‘approximately’ when referring to validity, since one can never know what is true. At best, one can know what has not yet been ruled out as false” (p. 37).
Table 7.1. List of Different Frameworks and Suggested Readings
Type of Inquiry
Used when trying to learn about phenomena in the context of a particular case (e.g., a person, school, etc.).
To be used when a central question is causal in nature, such as when obtaining evidence that a new teaching technique might yield better learning outcomes compared to another technique.
Shadish et al. (2002)
This form of inquiry tends to be used when studying a particular cultural group.
LeCompte & Schensul (2010)
General qualitative inquiry
For studies where the aim is to explore and learn about phenomena in natural settings.
Brantlinger, Jimenez, Klingner, Pugach, and Richardson (2005) Denzin and Lincoln (2005) Lincoln and Guba (1985) Nastasi and Schensul (2005) Patton (2014)
General statistical guidance
There is a lot of guidance around application of the general linear model, which is used in most statistical analyses readers of this text are likely to run into. We offer one text because we find it to be accessible and amusing.
For use when synthesizing the results of multiple, existing studies to learn about aggregated levels of evidence.
Hedges and Olkin (1985) Lipsey and Wilson (2001)
These apply when combining both qualitative and quantitative design elements.
Creswell and Plano Clark (2010) O’Cathain (2010) Tashakkori and Teddlie (2010)
To be used when the central purpose of a study is to develop or refine a test/measurement instrument.
American Education Research Association, American Psychological Association, and National Council on Measurement in Education (2014) Crocker and Algina (1986)
This is a particular variant of qualitative inquiry that focuses on understanding the experiences of research participants.
Single-case (single-subject) designs
For studies that aim to test intervention effects on small numbers of people (e.g., ABAB and multiple baseline designs).
Horner et al. (2005) Kratochwill et al. (2010) Kratochwill et al. (2013) Kratochwill and Levin (2014)
For studies that use surveys, typically when working with a sample and the intent is to learn more about some population of interest.
Dillman, Smyth, and Christian (2009) Fowler (2009) Groves et al. (2009)
Note. This list is not meant to be complete because there is such wide variation in the types of studies and designs within the broad arena of the social sciences. We selected a few on the basis that we think they are commonly used. We also did not attempt to be comprehensive with the citation list. These can be considered as a beginning set of resources to learn more. Later in the chapter, we cover ideas from experimental and qualitative frameworks in more detail. Finally, in some cases, we impose the word framework. Some of the citations we offer use this term and others do not, but we are otherwise confident that the authors of the citations would agree that their intent was to offer guidance on how to carry out the particular form of inquiry.
Quality of Data Sources and Methods
The quality of data sources and data collection methods has implications for validity. In both qualitative and quantitative studies, there may be inconsistencies among data sources. For example, a subject’s actions may not match what the subject says he or she does, or the topic of a given survey question, such as drug use or other illegal activities, may incentivize subjects to answer items inaccurately (Cronbach, 1946; Groves et al., 2009; Lofland et al., 2009). Suppose you wanted to study the prevalence of cheating: how might you gather such information? Would you conduct interviews to ask participants to confess their tendency to cheat? The chances are fairly high that your participants would underreport their behavior. If you chose a different approach, such as allowing them to self-disclose their cheating behavior via an anonymous survey, you might get a more accurate (or valid) portrayal of their behavior. You must therefore examine your data sources and data collection methods for problems that may undermine validity. For this reason, researchers often triangulate data sources and use mixed methods to bring to light inconsistencies among qualitative sources (Denzin, 1989; Guest & MacQueen, 2008; LeCompte & Goetz, 1982; Onwuegbuzie & Johnson, 2006), and there are statistical methods for designing and checking the validity of specific survey and assessment questions (e.g., Borgers, Hox, & Sikkel, 2004).
The validity of a research instrument depends in part on its intended purpose and whether it is used for that purpose. In other words, when thinking about data quality, you should consider the evidence for using a specific instrument in a particular situation, rather than thinking of the instrument as valid or not (see Kane, 2013). Consider, for example, standard college or graduate school entrance examinations designed to assess achievement, such as the Scholastic Aptitude Test (SAT), American College Testing (ACT), and the Graduate Record Examinations (GRE). There may be some evidence that these assessments have a valid application in terms of deciding which students are likely to perform well if admitted to given schools (cf. Brewer, Knoeppel, & Lindle, 2014; Hamilton, Stecher, & Klein, 2002; Heubert & Hauser, 1999; Messick, 1994, 1995), but the evidence to support their use for assessing intelligence is far weaker because achievement and intelligence are currently understood to be two different things. Therefore, it is critical to always think of a measurement instrument as a tool, and then consider whether the tool is being used for its intended purpose (Shadish, 1995). Following the tool analogy, you might have a poorly made hammer, such that even after light work it breaks easily. In this case, the tool itself is problematic. But another consideration is the purpose for which the tool is used.
The hammer may be one of the very best ones ever made, but it still would be a poor tool to use when needing a screwdriver. This analogy applies equally well when thinking about what instruments to use when measuring psychological or educational traits. If, for example, we hoped to assess whether some new teaching technique yielded improvements in reading scores, we would not logically choose a mathematics test to use for the outcome measure. But reading is a complex skill with many subcomponents (such as fluency versus comprehension), and we must be able to distinguish the specific skill(s) being tested and which assessments will measure those specific skill(s). Above all, it is important to make sure you are using the right tool for the job.
A related aspect of understanding validity entails thinking about contextual variables of the study, including local cultures, time period, and environment (Onwuegbuzie & Johnson, 2006). Every variable, even in quantitative work, represents an attribute that is situated within a specific time and place, and depending on the focus of a given study, certain aspects of variables may be relevant to the defensibility of findings. It is therefore important to always consider the context in which data were collected and interpreted. For example, if you were to do a study on political values, you would need to note the current political climate in which you are doing the study. Or, if you were to read a study on political values, you would need to note the publication date on that study and make sure to take into consideration the political climate of that time period. Consider the two largest political parties in the United States: Democrats and Republicans. In this example, time is the specific type of context, because there has been a shift in the overarching political beliefs of these two parties over time. Decades ago, Republicans would have been thought of as being the more liberal of the two political parties. This example demonstrates that researchers, and readers of research, must understand the context of studies.
A related issue is the concept of social construction of variables, or the fact that the common understanding of a variable may be defined by the society in which it is situated, as opposed to being defined by scientific differences. Race is a well-known example of a socially constructed variable, in which racial identity has very little to do with biological differences among races. However, the experiences of people from different races tend to be systematically different within certain societies, based on how those societies consider different races. One example of how race can be socially constructed in multiple ways is the construction of White in the United States. Several immigrant groups that are now considered White, including Irish and Eastern Europeans, were once considered racially different from immigrant groups from other Western and Northern European areas (Jacobson, 1998). Racial identities can change as demographic groups assimilate with or differentiate themselves from other groups, and there are often social, economic, and political benefits and/or drawbacks to these changes. As a researcher, you must be aware of the possibility that an attribute has been socially constructed and, if so, whether and how that social construction affects the meaning of the variable. The ways in which variables are constructed will have considerable influence on study validity (cf. Reynolds et al., 2014; Spillane et al., 2010; Wells, Williams, Treweek, Coyle, & Taylor, 2012). Clear, straightforward definitions of each study variable can help to increase validity.
In summary, we reiterate that validity is somewhat synonymous with truth. And just as your definition and understanding of truth can be individually subjective as well as based on cultural and social interpretations, so it goes with the concept of validity. When thinking in terms of research quality, consider how to design studies that can yield defensible evidence that can be used to make a reasonably accurate inference or proposition. Being detailed, specific, and thoughtful in all of your design elements and analysis will help you increase validity. These efforts both help you as a researcher to keep your focus within the scope of the study and help the readers of your study to understand your specific research questions, methods, and variables.
In a broad sense, reliability refers to the extent to which findings and results are consistent across researchers using the same methods of data collection and analysis. The heart of the concept is synonymous with the notions of consistency and accuracy (Crocker & Algina, 1986), and it is related to validity. You must consider the importance of data and methodological consistency (reliability) because consistency increases the likelihood that your interpretation of data has validity. On the other hand, findings might also be reliably wrong, and this is a critical difference between reliability and validity. To illustrate reliability, consider a scale used to measure a person’s weight. If the scale yields a close approximation to a person’s actual weight, then one would say the scale’s measure is accurate, a key aspect of validity. But now consider the idea of consistency. Suppose a person is weighed once a day for a week, and the scale indicates values of 130, 150, 170, 110, 190, 145, and 155 pounds. Because the weight of one person will not vary so much in a span of a week, the conclusion to be drawn is that the scale is broken. Because of the lack of consistency, or reliability, there is little reason to trust the validity of any single measurement. In this sense, valid estimation requires consistent, or reliable, scores.
It is also instructive to see that just because measurements are reliable, they are not necessarily valid. A person may weigh 150 pounds, but suppose that the measurements produced by the scale across 1 week are 191, 189, 190, 191, 190, 190, and 189 pounds. These scores consistently indicate that the person weighs about 190 pounds, but the estimates are consistently wrong. This idea reflects an often-repeated phrase in research methods: Reliability is a necessary aspect of validity, but insufficient if used alone as a measure of quality. When using tests and surveys to measure a phenomenon, it is thus critical to understand the properties of the measurement tool and consider whether the measurement tool has been well designed and is suited for the job at hand. When engaging in qualitative tasks such as observations and interviews, think of strategies for assessing whether any conclusions to be drawn from these data collection approaches are likely to be consistent (reliable) and accurate (valid).
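The two bathroom-scale scenarios can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the person's true weight is 150 pounds in both scenarios (the text states this only for the second one), and it uses the mean error as a rough stand-in for accuracy and the standard deviation as a rough stand-in for consistency. The function and variable names are hypothetical.

```python
from statistics import mean, stdev

TRUE_WEIGHT = 150  # pounds; assumed to be the actual weight in both scenarios

# Daily readings from the two scales described in the text
erratic_scale = [130, 150, 170, 110, 190, 145, 155]
consistent_scale = [191, 189, 190, 191, 190, 190, 189]

def summarize(readings, true_value):
    """Return (bias, spread): the mean error and the sample standard deviation."""
    bias = mean(readings) - true_value   # how far off the readings are on average
    spread = stdev(readings)             # how much the readings disagree with each other
    return bias, spread

erratic_bias, erratic_spread = summarize(erratic_scale, TRUE_WEIGHT)
consistent_bias, consistent_spread = summarize(consistent_scale, TRUE_WEIGHT)

print(f"Erratic scale:    bias = {erratic_bias:.1f} lb, spread = {erratic_spread:.1f} lb")
print(f"Consistent scale: bias = {consistent_bias:.1f} lb, spread = {consistent_spread:.1f} lb")
```

Under these assumptions, the first scale happens to average out to the true weight, yet its readings scatter so widely that no single measurement can be trusted. The second scale's readings barely vary but sit about 40 pounds too high, which is what "reliably wrong" looks like in numbers.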
Reliability Issues in Data Collection and Analysis. Reliability checks can take place at two stages of research: during data collection and during analysis. To test reliability at the data collection stage, another researcher could collect data using the same sampling strategy as you to see if consistent data are being collected. An example of this strategy would be two researchers, with two different scales, each measuring the same person. If both scales indicate the same result, you have some assurance of reliability. It is also possible to have two groups analyze the same set of data with the same analytical methods to see if the two groups come to the same results.
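One way to quantify whether two researchers reach the same results on the same data is to compute their rate of agreement. The sketch below uses simple percent agreement and Cohen's kappa, a common chance-corrected agreement statistic that this chapter does not itself name. The two coders and their "on"/"off" task codes are invented purely for illustration.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Share of items on which the two coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance alone.

    Undefined (division by zero) if chance agreement is exactly 1,
    e.g. when both coders use a single identical code for every item.
    """
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    categories = set(codes_a) | set(codes_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two observers rating the same 10 classroom intervals
coder_a = ["on", "on", "off", "on", "off", "on", "off", "on", "on", "off"]
coder_b = ["on", "on", "off", "off", "off", "on", "off", "on", "on", "on"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"Percent agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

Percent agreement is easier to explain, but it can look high simply because one code dominates. Kappa discounts that chance agreement, which is why the two numbers can differ noticeably on the same data.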
The types of data collected have significant influence over the reliability and replicability of a study (Peräkylä, 1997). Quantitative data sets are often easily accessible and transferable among researchers. Furthermore, quantitative data, once collected and recorded, are not usually subject to any detailed interpretation beyond understanding what a number is supposed to represent. For example, consider a five-option response scale, where 5 = strongly disagree, 4 = disagree, 3 = neutral, 2 = agree, and 1 = strongly agree. The number 4 is understood to have one meaning (disagree), and researchers tend not to conjecture further without having special reason to do so. In contrast, qualitative data are often products of the researcher’s filtering and interpreting of information during data collection via observation and interview notes. For example, a researcher creates notes about an observed lesson, and these notes become part of the data set. However, the researcher cannot observe, or record, every detail of the lesson, due to the limited capacity of human observation as well as choices (conscious or unconscious) that the researcher makes about which details to notice and record. In contrast, data that are not filtered by the researcher at the time of collection include documents, tape-recorded interviews, and videos of observations. The researcher does not have to transfer heard or observed data into a tangible record, such as drawings or notes, because the data are already in a tangible format. Qualitative researchers generally agree that a combination of machine-recorded and interpretive data is ideal in order to achieve a full understanding of the phenomenon under study (see, e.g., Lofland et al., 2009).
Research design conceptualization, whether qualitative, quantitative, or mixed method, should entail examining the trade-offs of different kinds of data collection and should incorporate plans for increasing the reliability of the research, such as by using multiple researchers, multiple data sources, detailed data audits, or other strategies.
LeCompte and Preissle (1993) discussed the challenges inherent in trying to replicate the data collection phase of a qualitative study, comparing it with quantitative methods: “Unique situations cannot be reconstructed precisely because even the most exact replication of research methods may fail to produce identical results. Qualitative research occurs in natural settings and often is undertaken to record processes of change, so replication is only approximated, never achieved.” Furthermore, they point out that replication may not even be appropriate for qualitative research: “Researchers whose goals are generation, refinement, comparison, and validation of constructs and postulates may not need to replicate situations. Moreover, because human behavior is never static, no study is replicated exactly, regardless of the methods and designs used” (p. 332). One reason for these issues, suggest LeCompte and Preissle, is the longer history of discussions of reliability within quantitative research arenas. These discussions are still fairly new to qualitative methodologists, as demonstrated by the diversity of opinions regarding theory and standardized practices for achieving reliability in qualitative studies.
In summary, you will need to consider whether your data are collected in a consistent manner and whether the type(s) of data collected will help you develop inferences and propositions that approximate the reality of your studied phenomenon. These are not the only important considerations, however. Even when your data are of high quality and are appropriate for the research questions, there are additional considerations to keep in mind when making analytic inferences from these data. That is, you must choose the most appropriate methods to analyze and interpret data in order to reach valid conclusions about your object of study. We use the methodology of experimental design as one example to demonstrate how analytic methods can positively or negatively affect the validity of conclusions.
Validity Considerations in Experimental Design
Shadish and colleagues (2002) provided an overview of a validity framework developed for experimental designs. There are four components to the framework: internal, external, statistical-conclusion, and construct validity. Each type of validity is briefly reviewed in this section, and we focus on internal and external validity in this chapter because these offer relatively concrete examples of how the truth of a proposition can be defended or undermined. Overall, understanding these validity issues is necessary in order to create rigorous study designs as well as to enable critique when reading empirical work done by other researchers. Note that this particular validity framework was developed to help researchers specifically assess causal mechanisms; that is, it is used to determine whether a particular condition or treatment causes better outcomes compared to some alternative. This is one example of a framework, as discussed earlier, which provides researchers with standards by which to judge the validity of conclusions. The following discussion of internal validity draws on the experimental design framework because internal validity concerns causation.
Internal Validity of Experimental Findings
Consider the following statement: I took some aspirin and my headache went away; therefore, aspirin reduced my pain. This statement contains a causal inference: Taking aspirin caused the reduction in pain. The degree to which this inference is valid reflects the degree of internal validity. In experimental designs, researchers examine whether some variable (the independent variable), rather than others, produces some result or change (dependent variable; Shadish et al., 2002). Consideration of internal validity raises the question: How truthful is the proposition that a change in one variable, rather than changes in other variables, causes a change in outcome?
Causal inference, and thus internal validity, can be surprisingly tricky. For any given proposition about a causal inference, there are rival explanations; these explanations are referred to as threats to a statement’s validity. For our example, we might assume that you usually take aspirin with water; if the headache had been caused by dehydration, then it is possible that the water, not the aspirin, was the actual cause of pain relief. Alternatively, it is possible that the headache eventually subsided on its own, and thus it was the natural recovery processes, not the aspirin, that yielded the improvement. In short, just because pain subsided after aspirin ingestion does not necessarily mean that we can be sure that the drug was the causal agent. The point here is that rival explanations exist, and it becomes important to consider whether the aspirin explanation is better than the others. The same logic applies to making decisions about policies. For example, you might be interested in implementing a new teaching technique, a new type of counseling procedure, or pay-for-performance compensation models. These all represent policy options, and making the best policy necessitates having data that result in reasonably valid inferences about program or policy impacts by ruling out rival explanations.
Once you infer that a given approach resulted in the desired outcome, the quality of this inference can be judged by assessing various threats to internal validity. Each threat is a form of alternative explanation, other than the treatment, for the cause of an observed outcome. The experimental validity framework identifies a number of common threats to internal validity; the following discussion draws on examples of these threats in order to illustrate the process of identifying and eliminating rival explanations (see Shadish et al., 2002, for a complete list and description of threats; see also Table 7.2). As an example, there is the so-called history threat, or the possibility that other events may have occurred during the experiment that could explain the change in outcome. In the case of aspirin, the fact that common headaches eventually subside on their own is an example of such a threat to the inference that it was aspirin that caused pain reduction. This threat, and many of the other internal validity threats, can be addressed by including a comparison (or control) group that does not receive the treatment being studied. If a treatment effect is observed by comparing performance across both groups (e.g., students who received counseling show better outcomes than those who did not), then the independent variable (that is, treatment exposure) becomes the best overall explanation for the difference in scores between study groups. An important point here is that this and many other threats to internal validity can be addressed by adding a control group when the intent is to make a causal inference. Indeed, a basic quality indicator for studies that set out to address a causal question is to look for the presence of a control condition (Shadish et al., 2002).
By adding a control group, we also potentially introduce new threats to internal validity. One such threat is selection, which refers to how groups in a study were formed. There are many ways to form groups. People can volunteer to be treated, students might be picked by a teacher, a researcher may decide who is most in need of treatment, and so on. Of the many options, one approach to selecting who is treated and who is in a control group is to use random assignment to treatment and control groups. Such assignment is essentially based on chance: If you use random procedures (such as coin flips or computer algorithms, which are user-friendly in most statistical packages), you can expect that, on average, there will be no systematic differences between groups.
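To see what "no systematic differences, on average" means in practice, consider the following minimal sketch. The student records and baseline scores are simulated, and the helper names `randomly_assign` and `baseline_gap` are hypothetical; the example simply shuffles participants, splits them in half, and then repeats the assignment many times to check the average baseline gap between groups.

```python
import random
import statistics


def randomly_assign(participants, rng):
    """Shuffle a copy of the participant list and split it in half."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def baseline_gap(group_a, group_b):
    """Difference in mean baseline score between two groups."""
    return (statistics.mean(s["baseline"] for s in group_a)
            - statistics.mean(s["baseline"] for s in group_b))


rng = random.Random(42)

# Simulated baseline behavior scores for 200 students (not real data).
students = [{"id": i, "baseline": rng.gauss(50, 10)} for i in range(200)]

# One random assignment, as a single study would use.
treatment, control = randomly_assign(students, rng)
single_gap = baseline_gap(treatment, control)

# Repeat the assignment many times to see what happens on average.
gaps = [baseline_gap(*randomly_assign(students, rng)) for _ in range(500)]
average_gap = statistics.mean(gaps)

print(f"baseline gap in one assignment: {single_gap:.2f}")
print(f"average gap over 500 assignments: {average_gap:.2f}")
```

Any single assignment can produce a small chance difference between groups, but the average gap across repeated assignments should sit close to zero. That is the sense in which random assignment removes systematic, as opposed to chance, differences.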
When assignment is nonrandom, such as when participants in groups are purposefully selected based on certain characteristics, there may be key differences between treatment and control groups. For example, due to legal restrictions about research, students in a treatment group may have to be volunteered by their parents/guardians; the requirement that participants actively volunteer and have parent permission introduces the possibility that the characteristics of students in the treatment group are systematically different from those in the control group. Thus, the manner by which selection was done can threaten later attempts to make causal inferences. Suppose the question at hand was whether a new counseling technique yielded better outcomes than typical treatment procedures. Furthermore, suppose that the counseled group did in fact show better outcomes than their control counterparts. One might infer that these improved outcomes were caused by the treatment, but the threat to this inference is the fact that treated students came from families who desired and supported the treatment and students in the control group tended to come from families who were indifferent to the treatment. It is therefore possible that students in the treatment group might have done better than those in the control group even if the treatment was not the cause of the students’ improvements. In other words, the way groups were formed may have made it look like the new counseling technique made a difference, even if it did not. Thus, selection processes can threaten the validity of any inference made about treatment. Researchers can prevent selection bias by using random assignment to treatment and control groups, when possible. When random assignment is not an option, researchers may statistically “control” for other variables such as socioeconomic status, gender, race, age, and disability status in statistical models. These types of controls can help tease out the differential impact of these contextual factors.
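The following sketch illustrates both the selection threat and the idea of statistical control. It is a simplified stand-in: rather than a regression model, it uses stratification (comparing treated and control students within levels of a single covariate), and every number in it is an assumption made for the demonstration. The simulated treatment has no real effect, but families who are more involved are both more likely to volunteer and more likely to have children with higher scores.

```python
import random
import statistics

rng = random.Random(0)

# Assumed values for illustration only.
TRUE_EFFECT = 0.0        # the counseling has no real effect in this simulation
INVOLVEMENT_BONUS = 5.0  # involved families raise outcome scores by 5 points

students = []
for _ in range(4000):
    involved = rng.random() < 0.5
    # Nonrandom selection: involved families volunteer far more often.
    treated = rng.random() < (0.8 if involved else 0.2)
    score = (50 + INVOLVEMENT_BONUS * involved
             + TRUE_EFFECT * treated + rng.gauss(0, 5))
    students.append((involved, treated, score))


def mean_diff(rows):
    """Treated-minus-control difference in mean outcome score."""
    treated_scores = [s for _, t, s in rows if t]
    control_scores = [s for _, t, s in rows if not t]
    return statistics.mean(treated_scores) - statistics.mean(control_scores)


# Naive comparison ignores how the groups were formed.
naive = mean_diff(students)

# Stratified comparison: compare like with like, weight strata by size.
adjusted = 0.0
for level in (True, False):
    stratum = [row for row in students if row[0] == level]
    adjusted += mean_diff(stratum) * len(stratum) / len(students)

print(f"naive difference:    {naive:.2f}")
print(f"adjusted difference: {adjusted:.2f}")
```

Under these assumptions, the naive comparison suggests a sizable treatment effect that does not exist, while the adjusted comparison falls close to zero. Note that adjustment of this kind only helps for variables that were actually measured; unmeasured differences between groups remain a threat.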
Table 7.2. Overview of Internal Validity Threats
Threat: History
Definition: Other events may have occurred during the study that could explain the improved behavior.
Example: During the course of treatment, the children in the counseling treatment group were also assigned to a new teacher who is excellent at managing behavior concerns. In this scenario, was the improvement because of the treatment, the presence of a new teacher, or both?
Threat: Maturation
Definition: The fact that people, including study participants, change over time.
Example: During a 1-year study, the treatment students could have simply outgrown their initial behavior problems; their personal development, unrelated to the counseling, may have contributed to or even been solely responsible for the improved behavior scores.
Threat: Testing
Definition: The possibility that repeated exposure to a measurement instrument could, by itself, affect test-taking behavior and test scores.
Example: The children who took the baseline measurement test reflected on what the test was measuring and at posttest offered socially desirable responses that resulted in higher scores; yet, their overall classroom behavior may not have actually improved. Instead, observed score change was the result of knowing the test questions from baseline and how best to respond.
Threat: Instrumentation
Definition (1): A testing instrument may change or may be used in a way that does not correctly measure the treatment effect.
Example (1): This might happen if there are two versions of a test (Form A and Form B) that are incorrectly assumed to be equivalent. In this case, it could be that behavior as measured by Form A looks more problematic as compared to Form B. If Form B was used at posttest (the second testing session), then any apparent improvement cannot be attributed to the treatment; rather, differences in the test drove the change.
Definition (2): There may be unknown contextual factors that can impact testing.
Example (2): Perhaps baseline measurement was done in the morning and posttest measurement was done in the afternoon, and for some reason, the children in the study are more likely to report or demonstrate better behavior after lunch.
Threat: Statistical regression to the mean
Definition: The phenomenon that extreme scores tend to not be repeated.
Example: Anyone might score unusually high on a psychological measure because of a string of positive but rare events, such as winning the lottery. Taking the test 6 months later may still result in a high score, but not extremely high. Over time, extreme scores (both positive and negative) tend to move closer to the average for that particular measure.
Threat: Researcher bias
Definition: Changes in research design or analysis that are a result of the researcher’s subjective views regarding the study topic, participants, theory of change, or other relevant areas of design or execution.
Example: The researchers may be so convinced that the new counseling approach works that they unintentionally modify aspects of the original study design in order to show that the treatment makes a difference.
Threat: Selection
Definition: The process of creating participant groups in a study. Nonrandom selection (e.g., participants in groups are purposefully selected based on certain characteristics) might yield two groups that are not equivalent at the beginning of a study. If there are key differences between two groups, one cannot know if any posttest differences are because of a treatment effect or because of such differences.
Example: Due to legal restrictions about research, students in a treatment group may have to be volunteered by their parents/guardians; the requirement that participants actively volunteer and have parent permission introduces the possibility that the characteristics of students in the treatment group are systematically different from those in the control group. For example, parents who push for their children to be exposed to a new treatment might, on average, be more involved in their children’s education than ones who do not. In such an example, if we see that children who were treated appear to perform better on a posttest, is such improvement because treated children have more involved parents (i.e., they would have been better off anyway), or is it because the treatment worked? The selection threat is not a concern if study participants are assigned randomly to study groups because, on average, there should be no differences between participants in treatment and control conditions.
Threat: Overall mortality (attrition)
Definition: Loss of members in the study sample.
Example: A study compares pre- and posttest scores on an assessment in order to measure participant change. Some students drop out before completing the posttest. The loss of part of the sample creates the possibility that the students who remained in the study and completed the posttest have systematically different characteristics than the students who left the study. If this is the case, any average positive or negative change from pretest to posttest may simply be a function of the mortality threat, or the loss of students with particular characteristics.
Threat: Differential mortality (attrition)
Definition: Members of sample groups (e.g., treatment and control) drop out at different rates, and nonrandomly, in one group as compared with the other(s).
Example: Some students in the treatment sample drop out because they no longer wish to receive the counseling and miss out on other activities during the school day. These students may be differently motivated or have systematically different behavioral characteristics than the students who are willing or happy to miss other school activities.
The broader point here is that not all designs are equal in their inherent capacity to address the internal validity of target research questions. If you are conducting an experiment, it will behoove you to build in design features, such as a control group and random assignment, that strengthen internal validity, because doing so yields more trustworthy inferences. Without such design features, the improvement in behavior may indeed be due to the treatment but may also be due to any of the threats listed. Fortunately, in the context of an experiment, as with many other methodologies, there is guidance that you can consult to assist with recognizing and addressing these threats. Unless these threats can be removed as plausible explanations, the study quality must be considered questionable (see Table 7.2).
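Of the threats listed, statistical regression to the mean is often the least intuitive, and a short simulation can help. The sketch below assumes that each observed score is a stable underlying level plus an independent dose of luck on each testing occasion; all values are simulated and the variable names are hypothetical. It selects the participants with the most extreme first scores and looks at how the same people score the second time, with no treatment of any kind in between.

```python
import random
import statistics

rng = random.Random(1)
N = 5000

# Assumption: observed score = stable level + occasion-specific luck.
true_level = [rng.gauss(50, 10) for _ in range(N)]
first_test = [t + rng.gauss(0, 10) for t in true_level]
second_test = [t + rng.gauss(0, 10) for t in true_level]

# Pick the participants whose first score was in the top 10 percent.
cutoff = sorted(first_test)[int(0.9 * N)]
extreme = [i for i in range(N) if first_test[i] >= cutoff]

mean_first = statistics.mean(first_test[i] for i in extreme)
mean_second = statistics.mean(second_test[i] for i in extreme)

print(f"top scorers, first test:  {mean_first:.1f}")
print(f"same people, second test: {mean_second:.1f}")
```

Under these assumptions, the second-test mean for the extreme group stays above the overall average of 50 but drops noticeably from the first-test mean. A study that selected participants because of extreme baseline scores, and then measured them again, could mistake this drift for a treatment effect.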
External Validity of Experimental Findings
External validity (the extent to which findings hold true across contexts) and its threats are also major considerations in research design quality. Suppose you have produced a study with high internal validity; that is, none of the previously discussed threats are plausible explanations for observed improvements in a treated group. The best explanation for the outcome of the study is that the treatment worked. This high level of internal validity leads other researchers to want to know whether this finding has high external validity as well, or whether it is likely to hold true across other students, in other places, times, contexts, cultures, and so on. As with internal validity, there are several common threats to external validity (Shadish et al., 2002) (see Table 7.3).
To elaborate a little, one such threat is treatment variation; this type of threat addresses the degree to which observed treatment effects reflect variations in the treatment received by the study subject(s). Treatment variation can be a function of the human error of administering a treatment or a function of seemingly innocuous choices around implementing a program. One example would be inconsistency in dosage levels; consider two teachers ostensibly delivering the same intervention, but one teacher has excellent classroom management skills and the second teacher does not. The second teacher’s students receive less of a dosage of the intervention because one-third of class time is spent on classroom management issues. Other examples include the time of day that treatment is delivered or failure to correctly implement some element of the treatment.
Threats to external validity present a number of concerns, and researchers must find ways to address these threats. There are two broad strategies for addressing threats to external validity. The first is to engage in thorough literature reviews and to build on previous, related studies. External validity can be strengthened by limiting the research focus and by comparing new findings to existing studies in the literature. A careful review can highlight gaps in the existing literature; these gaps then justify a specific focus that is situated within an existing framework of studies. For example, a specific counseling technique may have been thoroughly studied in residential treatment settings, and so your focus might be on the first effort to try it in a public school. The design of your study will be strengthened by the evidence available from other related studies, and the threats to external validity will be minimized by limiting the focus to a very specific area.
The second strategy is to think carefully about ways in which your findings may apply, or generalize, to other settings. Shadish (1995) listed a number of principles that can help you think about generalization when doing experiments, ethnographies, or other types of studies. You must consider how applicable the findings from your study might be to another setting, such as similarities in the sample and how it was obtained, measurements used, duration, and other treatment details. Above all, claims of generalizability are most appropriate when there is evidence that a very specific aspect of a treatment yields an exact outcome. Knowing what aspects of a study are likely to generalize and what aspects are likely to be highly context specific is the key to thinking through considerations that might threaten the generalizability of a finding to some new scenario.
Table 7.3. Overview of External Validity Threats
Threat: Interactions of the observed causal relationship with sample units
Definition: The possibility that whatever was observed with one particular sample may not hold true for different samples.
Example: Simply put, the treatment may work well with one type of student and not another.
Threat: Treatment variation
Definition: The effect of a treatment reflects variations in how it was administered, as opposed to the effect of the treatment itself.
Example: Treatment variation can be a function of the human error of administering a treatment or a function of seemingly innocuous choices around implementing a program (e.g., dosage levels, time of day treatment is delivered, or failure to correctly implement some element of the treatment).
Threat: Types of outcome measures used
Definition: Treatment effects may be found with one kind of test but not another.
Example: One might see an effect with a particular type of test but not another. If two tests measure approximately the same thing (e.g., SAT and ACT), this should be less of a concern, although when differences are found across similar but different tests, one has to wonder about the external validity of any observations from a study. Logically, concerns grow when the outcome measures being compared are clearly different.
Threat: Settings in which the treatment was delivered
Definition: The possibility that observed effects are due to contextual factors, as opposed to the treatment itself.
Example: A simple example of this threat would be observing treatment effects in a school that is located in a high-income community; the same effects may or may not hold in more impoverished settings.
Threat: Context-dependent mediation
Definition: The influence of a mediating factor in one setting versus another setting.
Example: A common mediating factor is treatment dosage; others may be factors such as staff skill or availability. For example, is it possible to fully implement an intended treatment in the form of intense counseling in an overcrowded school setting where there are extensive demands on a counselor’s time?
Note. SAT = Scholastic Aptitude Test; ACT = American College Testing.
To illustrate these issues around generalizability, we use an example from a study by Paino, Renzulli, Boylan, and Bradley (2014) that did not examine the effect of a treatment, but rather charter school closings in North Carolina (this is to show that generalization can and should be pondered not only when dealing with treatment effects but also when dealing with other issues, such as state policy). The authors performed a quantitative analysis of data on charter schools and the nearby public districts. Data included financial information, local market variables, density of charters in the area, school demographics, enrollment, age of school, and academic performance information. This quantitative analysis allowed the researchers to examine the probability of a charter school closing at a given point in time. The findings suggested that charter schools were less likely to close with increases in school enrollment, compliance with federal desegregation orders, and state and federal funding of charters. However, because the location of this study is in one state, its findings may not generalize well to another state that may have different policies. Here, the authors’ inclusion of a qualitative case study analysis could help them to better understand the degree to which these findings might generalize to other states and contexts. Suppose a state has conditions similar to those of North Carolina (contextual conditions that have been rigorously analyzed in relation to the quantitative findings). As a reviewer of the study, you may feel more confident in applying the study findings to that new context. On the other hand, if the case studies in North Carolina show major differences in state charter policies, funding, or enrollment patterns, you may not feel confident in using the findings of this study to understand patterns in the other state.
Statistical-Conclusion and Construct Validity
There are two remaining types of validity from Cook and Campbell’s (1979) framework. Statistical-conclusion validity refers to the degree to which researchers are correct about the relationship between two variables. This type of validity requires not only that researchers know which kind of statistical models or techniques are appropriate for a given data set and research question but also that they can accurately test those models and apply those techniques. Shadish and colleagues (2002) identified nine distinct threats to this type of validity; if you are doing quantitative research, we highly encourage you to review this resource in depth. Other concepts and techniques that relate to statistical-conclusion validity include statistical power, data cleaning, and outlier analyses. Measurement reliability, or lack thereof, is classified as a threat to this form of validity.
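Statistical power, one of the concepts mentioned above, can be explored by simulation. The following sketch is a rough illustration and not a substitute for a formal power analysis: the effect size, standard deviation, and sample sizes are assumptions, the function name `estimated_power` is hypothetical, and the significance test is a large-sample z approximation rather than a t test. It repeatedly simulates a two-group study in which the treatment truly works and counts how often the study would detect that effect.

```python
import random
import statistics


def estimated_power(n_per_group, effect, sd=10.0, trials=1000, seed=0):
    """Share of simulated studies that detect a true effect (two-sided, alpha = .05)."""
    rng = random.Random(seed)
    critical = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    detections = 0
    for _ in range(trials):
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        # Standard error of the difference between the two group means.
        std_error = (statistics.variance(treated) / n_per_group
                     + statistics.variance(control) / n_per_group) ** 0.5
        z = (statistics.mean(treated) - statistics.mean(control)) / std_error
        if abs(z) > critical:
            detections += 1
    return detections / trials


# Assumed true effect of 5 points on a measure with a standard deviation of 10.
power_small_sample = estimated_power(n_per_group=20, effect=5.0)
power_large_sample = estimated_power(n_per_group=100, effect=5.0)

print(f"estimated power with 20 per group:  {power_small_sample:.2f}")
print(f"estimated power with 100 per group: {power_large_sample:.2f}")
```

Under these assumptions, the smaller study misses the real effect more often than it finds it, while the larger study detects it in the large majority of simulated runs. A nonsignificant result from an underpowered study is therefore weak evidence that a treatment does not work.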
Construct validity refers to the degree to which underlying ideas (e.g., treatments, behaviors, behavior problems, cooperative learning, and socioeconomic status) are properly conceptualized and operationalized in a study. Every study is based on a set of concepts that underlie the theory being tested. In our ongoing example, the theory being tested in the experiment is that a certain type of counseling intervention will improve problematic behavior issues. If the measurement of this improvement is completed through a student pre- and postintervention assessment, we must ensure that (a) the intervention addresses the behaviors under study and (b) the questions on the assessment correctly represent the behaviors under study. An intervention or measurement that does not accurately represent the constructs being studied cannot result in valid findings about the constructs.
Considerations in Qualitative Inquiry
Earlier, we presented aspects of the experimental validity framework to demonstrate the point that your design choices can affect the validity of inferences you make at the end of your study. We also demonstrated this point because causal questions tend to be of wide interest. Moving forward, we focus on another broad arena: qualitative research. Some of the challenges, or threats, to reaching validity and reliability in quantitative and qualitative research are similar, although they must be observed or measured using different techniques (e.g., Onwuegbuzie & Leech, 2007). For example, whereas quantitative researchers attempt to statistically control for variables that may influence the outcome, qualitative researchers attempt to understand the influence of variables through careful observation and recording of phenomena (Cook & Campbell, 1979; LeCompte & Goetz, 1982). In the next section, we provide an introduction to validity and reliability issues regarding qualitative research methods.
Trustworthiness is the qualitative term that is often used in place of the quantitative term validity. Trustworthiness is the degree to which you, as a researcher, can have confidence in your sources as well as the methods used to gather your sources. Steps taken in the earliest stages of research (study purpose and design) can help you decide which collection methods will result in the most relevant, trustworthy data for your questions under study. Ethnographic field notes, formal and informal interviews, formal and informal observations, video recordings, photographs, and archival records offer different strengths and weaknesses (LeCompte & Goetz, 1982). For example, Peräkylä (1997) discussed the specific benefits and drawbacks to tape-recorded and transcribed (audio and/or visual) data, as compared to ethnographic field notes. Field notes filter observations at the time of data collection through the researcher’s particular frameworks; in contrast, audio/visual recordings capture all of the data from one particular angle and/or sense (e.g., visual vs. audio). The downsides of audio and visual recordings are, respectively, the inability to see gestures and movements and the inability to see the observation from multiple angles or perspectives. Ethnographers can take in an entire observation site through all of the senses, but they are limited in what they can record in words or pictures. Using a combination of these data collection methods would allow you to compare two or more data sources; such comparisons can highlight areas of inconsistency that need further inquiry or patterns/themes that have a high degree of consistency (i.e., they surface in multiple types of sources and in ways that do not conflict).
There are a variety of ways in which you as a qualitative researcher can check the trustworthiness of emerging themes in your data (Tracy, 2010). During data collection and analysis, researchers can attend to potential observer effects, employ multiple researchers, and use member checks. Also see Lincoln and Guba (1985) and Nastasi and Schensul (2005) for more in-depth discussion on trustworthiness.
Observer Effects. As a qualitative researcher, you must address the potential influence of observer effects, that is, the possibility that collected data have been contaminated, or influenced, by your presence or your research instruments. One example of observer effects is a change in participant behavior during observations due to your presence (LeCompte & Goetz, 1982). Depending on the type of activity and individuals under observation, your demographic characteristics, and the methods by which you are recording data, participants may consciously or unconsciously change their behavior. If participants change their behavior, then you cannot report that your observations are typical of participants’ natural or normal behavior.
We can use the example of a counseling intervention to illustrate this issue. Suppose there is a qualitative observation element to the study, in which you observe a group counseling session for student participants. The majority of the students in the session speak English as a second language, and about half of them have parents who are not U.S. citizens. The majority of students in the group are also on free or reduced-price lunch. In comparison, you, the researcher, are White, well dressed, and speak only English. The demographic differences between you and the student participants include social class, first language, age, and, in some cases, citizenship. These differences may lead students to behave differently in front of you than they would behave with only the counselor present; additionally, they may behave differently in a group session with their peers than in a one-on-one session with the counselor. You can take two precautions against observer effects. First, you can note all of the potential effects that your presence may have on the participants or their behavior; getting a second opinion on these potential effects can further strengthen this precautionary strategy. Second, you can follow up with members of the group (in this case, the counselor or one of the participants) to ask whether the observed session was typical or uncommon in any way. This type of context from a regular member of a group can help you put your observations in perspective.
Multiple Researchers. Although not always feasible in qualitative studies, using multiple researchers in data collection has benefits as well as challenges for validity. When multiple researchers collect data, they are able to demonstrate that they are recording data in comparable ways; this is vital to study validity. Similar to interrater reliability (see later discussion), multiresearcher data collection procedures must be uniform in order to collect valid and trustworthy data across an entire study. One example of aligning data collection procedures relates to level of detail; all researchers should know how much detail to include in field notes or observation rubrics. This is true for all methodologies; just as tests producing quantitative data must be administered and recorded consistently, interview and observation data must be recorded using the same techniques.
Member Checks. Similar to one of the strategies involved in minimizing observer effects at the data collection level, member checking involves sharing emergent patterns and findings with members of subject groups to get feedback on the accuracy of those findings. Although the purpose of independent research is to create and implement an unbiased research design that examines the input of all relevant stakeholders or participants, there are also limitations to using outside researchers. Outside researchers rarely have the insider cultural perspective or organizational knowledge that is needed to fully understand the phenomena being observed. Member checking allows the outside researcher to compare his or her ideas with the views of an insider and develop an ongoing, increasingly accurate understanding of the phenomena (LeCompte & Goetz, 1982; Lofland et al., 2009). The dual use of insider and outsider perspectives is crucial to achieving this accuracy because both perspectives tend to have particular types of biases, such as ingrained cultural or social beliefs (Bloor, 1978; Turner & Coen, 2008). Such beliefs can include views on gender/sex, racial or ethnic groups, or age-appropriate behaviors; for example, a study participant who is a member of a diverse urban community may have different views on race and ethnicity than a White researcher working in a predominantly White, elite institution.
Reliability in Qualitative Research
The concept of reliability, sometimes called dependability, is relevant in some ways for qualitative methods and problematic in others. Specifically, the definition of reliability as “replicability” is problematic for qualitative, especially naturalistic, methodologies. As LeCompte and Goetz (1982) explained, “Because human behavior is never static, no study is replicated exactly, regardless of the methods and designs used” (p. 332). However, there are ways in which the larger concept of reliability has been adapted to apply to qualitative, naturalistic fields of study. Areas of focus within the umbrella of qualitative reliability include the replicability of data collection and analysis (e.g., understanding how much of the analysis is specific to an individual researcher’s interpretations) and intercoder reliability or interrater agreement, which refers to the degree to which multiple researchers within the same study agree on how to describe and categorize the observed data in terms of the study’s theoretical framework. These issues of reliability can be found in many qualitative studies, and researcher subjectivity plays an important, if sometimes overlooked, role in these processes. The following sections examine challenges to and strategies for strengthening qualitative reliability.
Researcher Subjectivity. Researcher subjectivity refers to the unique perspective that each researcher brings to a given study; this uniqueness poses a reliability challenge for qualitative studies at the stages of both data analysis and data collection because the interpretations of two or more unique researchers are unlikely to be identical, or replicable (Carey & Gelaude, 2008). For example, in an empirical study of qualitative thematic coding, Armstrong, Gosling, Weinman, and Marteau (1997) found that a sample of trained, experienced experts in qualitative coding, looking at the same data set, did not reach the exact same conclusions about the data. The study demonstrated that when multiple researchers analyzed the same data, the themes that emerged were similar enough to be considered common, but different enough to highlight the role of researcher discipline, training, and cultural background. The findings of this study suggested that the inherent nature of subjective analysis in qualitative methods will result in some degree of agreement and some degree of disagreement. See Glaser and Strauss’s (1967) description of the constant comparison method for a specific example of how to systematically code qualitative data.
Reflexivity. The findings of this study also point to the need for individual researchers to be reflexive, or transparent and forthcoming about their demographics, their discipline, their training, and any other characteristics that may influence their collection or analysis of data. Toward this end, you should reflect on your position in relation to the study and examine the potential for bias based on your cultural or socioeconomic background, nationality, ability status, and other factors (LeCompte & Preissle, 1993; Onwuegbuzie & Johnson, 2006). Your explanation of methodology should also include steps that you take to minimize the impact of your researcher bias on research design, data collection, and analysis (Guest & MacQueen, 2008).
Interrater Reliability. When multiple researchers are used to analyze qualitative data, reliability issues multiply as well. In addition to being reflexive about individual characteristics, the research team must also take steps to ensure that they are using the same criteria to collect or interpret the same data set. Interrater reliability refers to the rate of agreement among multiple research team members applying the same analytic methods to the same data set; these methods typically involve some degree of researcher subjectivity, such as coding text or rating observed behaviors. The benefits of determining interrater reliability are twofold: The process allows the research team to examine both the team’s shared understanding of codes and concepts and individual team members’ accuracy in using the coding or rating system (Carey & Gelaude, 2008). Like many phases of research, establishing interrater agreement is an iterative process. If the independently coded data samples end up with substantially different results, the coding system must be reviewed and clarified or individual coders must be trained further. Interrater reliability testing must continue until the desired level of agreement among researchers has been achieved (MacQueen, McLellan-Lemal, Bartholow, & Milstein, 2008).
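As a concrete illustration of a rate of agreement, the sketch below computes two figures for a pair of coders: simple percent agreement and Cohen's kappa. Kappa is not discussed in this chapter; it is included here as one commonly used statistic that corrects percent agreement for the agreement expected by chance. The code labels and the two coders' decisions are invented for the example, and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items to which both coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each coder's marginal rate, per code.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Note: undefined (division by zero) if both coders use one code only.
    return (observed - expected) / (1 - expected)


# Invented codes that two coders assigned to the same 10 interview excerpts.
coder_1 = ["support", "conflict", "support", "neutral", "conflict",
           "support", "neutral", "support", "conflict", "support"]
coder_2 = ["support", "conflict", "neutral", "neutral", "conflict",
           "support", "support", "support", "conflict", "conflict"]

agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)

print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

In this invented example the coders agree on 7 of 10 excerpts, but kappa is lower than the raw agreement rate because some of that agreement would be expected by chance alone. A team following the iterative process described above would review the excerpts on which the coders disagreed, clarify the code definitions, and recompute agreement on a fresh sample.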
Transferability. A final point, related to reliability in qualitative research, is to consider the concept of transferability. Transferability is the degree to which a set of findings from one study will transfer to another particular situation (Lincoln & Guba, 1985). The idea is largely associated with qualitative inquiry, but the principle can be applied to almost any kind of study. The general challenge in transferability is describing the setting of a study with sufficient clarity and detail so that readers of that study can make their own judgments about what does and does not apply to their particular scenarios.
Conclusion
Research quality is important in all disciplines and fields, including program development and implementation, because all knowledge—understanding human behavior, program designs, and effects of medical treatments—is influenced by the quality of the research on which it is based. If inaccurate research findings are used as the basis for products, program development, or policy improvements, these changes are unlikely to actually work as hoped, potentially wasting time and other valuable resources. Some areas of product or program development have a variety of parties with established financial or political stakes in the direction of development; here, it is especially important that cited research be independent and of high quality. Peer review is generally understood to be a hallmark in the research process, because it entails review by multiple experts in the field; the experts are looking for indicators of research quality that provide confidence in the findings. Even with basic knowledge of indicators of research quality, it is possible for a layperson to review the methodology of a given piece of research and decide for oneself whether the piece contains the necessary quality indicators.
It is also important to note that there are particular aspects of mixed methods research that lend it to increasing validity, such as the ability to take advantage of the strongest tools of each framework and discard the weaker tools (Onwuegbuzie & Johnson, 2006). One of the challenges of using mixed methods is figuring out which tools are strongest for which research questions, and Onwuegbuzie and Johnson (2006) discussed several sets of existing guidelines for making mixed methods research decisions (e.g., Collins, Onwuegbuzie, & Sutton, 2006; Greene, Caracelli, & Graham, 1989; Onwuegbuzie & Johnson, 2004). Some qualitative and quantitative methodologists, without purposefully using a mixed methods framework, have incorporated these tools organically in order to best answer their research questions (e.g., Reynolds et al., 2014; Wells et al., 2012).
In this chapter, we have offered an introduction to the range of quality issues that can arise in research studies. This introduction should help you understand that threats to validity and reliability can surface at any point of the research project: design, data collection, data analysis, or even results reporting. To handle validity and reliability concerns, you first need to be aware of them. At every step, you should be looking out for possible threats to research quality, making sure that your design minimizes these threats as much as possible, and clearly reporting the severity of existing threats. To facilitate this process, you should first have a clear understanding of your research question(s). Then, you should seek out methodological frameworks or guidance that promote thinking through designs and generating the highest quality inferences. Finally, you should identify design choices that have the capacity to answer the question well. Again, Table 7.1 is designed with that purpose in mind.
Given the space allocated for this chapter, our overriding advice for you is to appreciate the idea that design decisions can influence the quality of the data collected, later analyses, and overall inferences drawn from your work. We recommend that you investigate further the wide range of specific strategies and techniques to address the threats to validity and reliability that were briefly introduced here. The following chapter, building on these quality concerns, examines ethical considerations in research projects. Many of the same rationales for research quality support the concern for ethics in research, such as the increasing focus on using research findings to make policy and program decisions.
Designing Quantitative Validity and Reliability Research Key Sources
Brewer, C., Knoeppel, R. C., & Lindle, J. C. (2014). Consequential validity of accountability policy: Public understanding of assessments. Educational Policy, 29, 1–35. doi:10.1177/0895904813518099
LeCompte, M. D., & Preissle, J. (1993). Ethnography and qualitative design in educational research (2nd ed.). San Diego, CA: Academic Press.
O’Cathain, A. (2010). Assessing the quality of mixed methods research: Toward a comprehensive framework. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed., pp. 305–338). Thousand Oaks, CA: Sage.
Shadish, W. R., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Tracy, S. J. (2010). Qualitative quality: Eight “big-tent” criteria for excellent qualitative research. Qualitative Inquiry, 16(10), 837–851.
Designing Quantitative Validity and Reliability Research References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597–606.
Bloor, M. (1978). On the analysis of observational data: A discussion of the worth and uses of inductive techniques and respondent validation. Sociology, 12(3), 545–552.
Borgers, N., Hox, J., & Sikkel, D. (2004). Response effects in surveys on children and adolescents: The effect of number of response options, negative wording, and neutral mid-point. Quality & Quantity, 38(1), 17–33.
Brantlinger, E., Jimenez, R., Klingner, J., Pugach, M., & Richardson, V. (2005). Qualitative studies in special education. Exceptional Children, 71, 195–207. doi:10.1177/001440290507100205
Brewer, C., Knoeppel, R. C., & Lindle, J. C. (2014). Consequential validity of accountability policy: Public understanding of assessments. Educational Policy, 29, 1–35. doi:10.1177/0895904813518099
Carey, J. W., & Gelaude, D. (2008). Systematic methods for collecting and analyzing multidisciplinary team-based qualitative data. In G. Guest & K. M. MacQueen (Eds.), Handbook for team-based qualitative research (pp. 227–272). Lanham, MD: AltaMira Press.
Collins, K. M. T., Onwuegbuzie, A. J., & Sutton, I. L. (2006). A model incorporating the rationale and purpose for conducting mixed methods research in special education and beyond. Learning Disabilities: A Contemporary Journal, 4, 67–100.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston, MA: Houghton Mifflin.
Creswell, J. W., & Plano Clark, V. L. (2010). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage.
Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Holt, Rinehart, and Winston.
Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological Measurement, 6(4), 475–494.
Denzin, N. K. (1989). The research act: A theoretical introduction to sociological methods. Englewood Cliffs, NJ: Prentice Hall.
Denzin, N. K., & Lincoln, Y. S. (Eds.). (2005). The discipline and practice of qualitative research. In The Sage handbook of qualitative research (3rd ed., pp. 1–32). Thousand Oaks, CA: Sage.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail and mixed-mode surveys: The tailored design method (3rd ed.). Hoboken, NJ: Wiley.
Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). Thousand Oaks, CA: Sage.
Fowler, F. J. (2009). Survey research methods. Thousand Oaks, CA: Sage.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for qualitative research. New Brunswick, NJ: Transaction.
Greene, J. C., Caracelli, V. J., & Graham, W. F. (1989). Toward a conceptual framework for mixed-method evaluation designs. Educational Evaluation & Policy Analysis, 11, 255–274.
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). Hoboken, NJ: Wiley.
Guest, G., & MacQueen, K. M. (Eds.). (2008). Reevaluating guidelines in qualitative research. In Handbook for team-based qualitative research (pp. 205–226). Lanham, MD: AltaMira Press.
Hamilton, L. S., Stecher, B. M., & Klein, S. P. (2002). Making sense of test-based accountability in education. Washington, DC: Rand.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York, NY: Academic Press.
Heubert, J. P., & Hauser, R. M. (1999). High stakes: Testing for tracking, promotion and graduation. Washington, DC: National Academy Press.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165–179.
Jacobson, M. F. (1998). Whiteness of a different color: European immigrants and the alchemy of race. Cambridge, MA: Harvard University Press.
Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14–26.
Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457.
Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. M. (2010). Single-case design technical documentation. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/wwc_scd.pdf
Kratochwill, T. R., Hitchcock, J. H., Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. M. (2013). Single-case intervention research design standards. Remedial and Special Education, 34, 26–38. doi:10.1177/0741932512452794
Kratochwill, T. R., & Levin, J. R. (Eds.). (2014). Single-case intervention research: Methodological and statistical advances. Washington, DC: American Psychological Association.
LeCompte, M. D., & Goetz, J. P. (1982). Problems of reliability and validity in ethnographic research. Review of Educational Research, 52(1), 31–60.
LeCompte, M. D., & Preissle, J. (1993). Ethnography and qualitative design in educational research (2nd ed.). San Diego, CA: Academic Press.
LeCompte, M. D., & Schensul, J. J. (2010). Designing and conducting ethnographic research: An introduction (2nd ed.). Plymouth, United Kingdom: AltaMira Press.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lofland, J., Snow, D. A., Anderson, L., & Lofland, L. H. (2009). Analyzing social settings: A guide to qualitative observation and analysis (4th ed.). Belmont, CA: Wadsworth.
MacQueen, K. M., McLellan-Lemal, E., Bartholow, K., & Milstein, B. (2008). Team-based codebook development: Structure, process, and agreement. In G. Guest & K. M. MacQueen (Eds.), Handbook for team-based qualitative research (pp. 119–135). Lanham, MD: AltaMira Press.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23. doi:10.2307/1176219
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. doi:10.1037/0003-066X.50.9.741
Moustakas, C. (1994). Phenomenological research methods. Thousand Oaks, CA: Sage.
Nastasi, B. K., & Schensul, S. L. (2005). Contributions of qualitative research to the validity of intervention research. Journal of School Psychology, 42, 177–195. doi:10.1016/j.jsp.2005.04.003
O’Cathain, A. (2010). Assessing the quality of mixed methods research: Toward a comprehensive framework. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed., pp. 305–338). Thousand Oaks, CA: Sage.
Onwuegbuzie, A. J., & Johnson, R. B. (2004). Mixed method and mixed model research. In B. Johnson & L. Christensen (Eds.), Educational research: Quantitative, qualitative, and mixed approaches (pp. 408–431). Boston, MA: Allyn & Bacon.
Onwuegbuzie, A. J., & Johnson, R. B. (2006). The validity issue in mixed research. Research in the Schools, 13(1), 48–63.
Onwuegbuzie, A. J., & Leech, N. L. (2007). Validity and qualitative research: An oxymoron? Quality & Quantity, 41, 233–249.
Paino, M., Renzulli, L., Boylan, R., & Bradley, C. (2014). For grades or money? Charter school failure in North Carolina. Educational Administration Quarterly, 50(3), 500–536.
Patton, M. Q. (2014). Qualitative research and evaluation methods: Integrating theory and practice (4th ed.). Thousand Oaks, CA: Sage.
Peräkylä, A. (1997). Reliability and validity in research based on tapes and transcripts. In D. Silverman (Ed.), Qualitative research: Theory, method, practice (pp. 201–220). London, United Kingdom: Sage.
Reynolds, J., DiLiberto, D., Mangham-Jefferies, L., Ansah, E. K., Lal, S., Mbakilwa, H., . . . Chandler, C. I. (2014). The practice of “doing” evaluation: Lessons learned from nine complex intervention trials in action. Implementation Science, 9(75), 1–12.
Shadish, W. R. (1995). The logic of generalization: Five principles common to experiments and ethnographies. American Journal of Community Psychology, 23, 419–428.
Shadish, W. R., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Spillane, J. P., Pareja, A. S., Dorner, L., Barnes, C., May, H., Huff, J., & Camburn, E. (2010). Mixing methods in randomized controlled trials (RCTs): Validation, contextualization, triangulation, and control. Educational Assessment, Evaluation and Accountability, 22(1), 5–28.
Tashakkori, A., & Teddlie, C. (Eds.). (2010). Sage handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.
Tracy, S. J. (2010). Qualitative quality: Eight “big-tent” criteria for excellent qualitative research. Qualitative Inquiry, 16(10), 837–851.
Turner, S., & Coen, S. E. (2008). Member checking in human geography: Interpreting divergent understandings of performativity in a student space. Area, 40(2), 184–193.
Wells, M., Williams, B., Treweek, S., Coyle, J., & Taylor, J. (2012). Intervention description is not enough: Evidence from an in-depth multiple case study on the untold role and impact of context in randomised controlled trials of seven complex interventions. Trials, 13(95), 1–17.
Wolcott, H. F. (1990). On seeking—and rejecting—validity in qualitative research. In E. W. Eisner & A. Peshkin (Eds.), Qualitative inquiry in education: The continuing debate (pp. 121–152). New York, NY: Teachers College Press.
Yin, R. K. (2009). Case study research: Design and methods (4th ed.). Thousand Oaks, CA: Sage.
1 By framework, we refer to a set of ideas that can help us think through research processes and findings. Shadish, Cook, and Campbell (2002), for example, describe four facets of experimental validity and ways in which validity can be undermined during the course of an experiment. Some have called this the Campbellian validity framework. Lincoln and Guba (1985) offer one of the earlier sets of guidelines for strengthening qualitative inquiry. As another example, O’Cathain (2010) describes a framework for assessing the quality of mixed methods studies.
2 You may have been exposed to this sort of wisdom before. Consider the saying: Believe those who seek the truth; doubt those who say they have found it.
3 A broader notion of internal validity can be conceptualized as the degree to which interpretations of a particular data set are reasonable inferences, causal or otherwise, without getting into the separate question of how well a set of findings applies to new settings outside of the study. But for now, we apply the narrower idea as used in an experimental framework, where one has a research question related to causation.
Designing Quantitative Research
The results of a research study are meaningful only if they can be interpreted accurately and with confidence. That accuracy and confidence depend, in turn, on the validity of the study. Validity here refers to the degree to which the inferences drawn from a study are supported by its results. There are two primary aspects of validity: internal validity and external validity.
Internal validity is the extent to which the results of a study are a function of the variables that are systematically manipulated, measured, and observed during the study. For example, a researcher might want to determine which of two instructional approaches is superior for teaching a mathematical concept in a classroom setting (Haegele & Hodge, 2015). The researcher could ask two tutors to each use one of the instructional methods and then compare the mean test scores of the two classes afterward. The validity of this study depends on factors other than the methods themselves, such as each tutor’s efficiency and enthusiasm in using the assigned method and the interest and preparation of each class. Potential threats to internal validity include:
History refers to the occurrence of events that can alter the outcome of a research study. Before conducting a study, it is therefore essential to determine whether such events are likely to have taken place (Haegele & Hodge, 2015). For instance, a study of the effectiveness of a new approach to teaching a unit on the biology of the nervous system may be compromised by history if many students have watched a television documentary on the topic.
Maturation refers to changes that occur in the subjects of a study over the research period. These changes are not part of the study, but they can affect its results (Haegele & Hodge, 2015). For example, a researcher studying whether a lunch or breakfast program leads to weight gain or increased height must consider that biological growth would produce some of that change during the study regardless of the program.
Mitigating Potential Threats to Internal Validity
To address history, a researcher may use a control group selected from the same population as the experimental group (Haegele & Hodge, 2015). Because this group experiences the same history as the experimental group, comparing the two removes the effects of history. The duration of the experiment may also be shortened to reduce such effects. The effects of maturation are comparable to those of history and can be mitigated in the same ways: by selecting a control group from the same population as the experimental group and by shortening the study period.
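The logic of using a control group to remove a history effect can be illustrated with a small worked example. The sketch below is only an illustration under stated assumptions: all scores are invented, the two groups are assumed to come from the same population and to experience the same outside event, and the function name `mean_gain` is hypothetical.

```python
from statistics import mean


def mean_gain(pre_scores, post_scores):
    """Average change from pretest to posttest for one group."""
    return mean(post_scores) - mean(pre_scores)


# Invented test scores. In this made-up scenario an outside event (history)
# raises every student's score by 5 points, and the intervention adds 3 more.
experimental_pre = [60, 65, 70, 75]
experimental_post = [68, 73, 78, 83]  # history (+5) plus treatment (+3)
control_pre = [61, 66, 69, 74]
control_post = [66, 71, 74, 79]       # history (+5) only

# Looking at the experimental group alone mixes history with the treatment.
naive_estimate = mean_gain(experimental_pre, experimental_post)

# The control group's gain estimates the history effect by itself.
history_estimate = mean_gain(control_pre, control_post)

# Subtracting the control gain leaves the part attributable to the treatment,
# provided both groups really did share the same history.
adjusted_estimate = naive_estimate - history_estimate

print("Gain in experimental group only:", naive_estimate)
print("Gain in control group (history):", history_estimate)
print("Gain after removing history:", adjusted_estimate)
```

The subtraction is only as trustworthy as the assumption behind it: if the control group is drawn from a different population or is exposed to different events, its gain no longer stands in for the history effect in the experimental group.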
External validity refers to the extent to which a study’s results can be generalized with confidence to a group larger than the one that participated in the study (Haegele & Hodge, 2015). The researcher therefore needs to establish that the variables used in the study are similar to the conditions that exist within the larger population. Potential threats to external validity include:
Selection-Treatment Interaction
This is the possibility that characteristics of the selected participants interact with some aspect of the treatment (Haegele & Hodge, 2015). These characteristics may include the participants’ learning, prior experiences, personality, or any other attribute that may interact with the effects of the treatment.
Effects of an Experimentation Arrangement
This refers to situations in which participants become aware of their involvement in a study and, as a result, their performance and responses change from what they otherwise would have been.
Mitigating Potential Threats to External Validity
Approaches to mitigating threats to external validity include strengthening the design by adding treatment or control groups and differential waves of measurement (Haegele & Hodge, 2015). A researcher may also consider the use of statistical analysis.
Ethical Issues in Quantitative Research
Ethics can be understood as the development of sound research conduct, with the aim of making moral judgments about what constitutes good conduct. In quantitative research, one ethical issue that must be considered is obtaining participants’ consent (Haegele & Hodge, 2015). This may influence the design of a study, since researchers may have to build in effective methods for obtaining the consent of participants.
Amenability of Research to Scientific Study Using a Quantitative Approach
Amenability to scientific study enables a researcher to establish scientifically the primary causes of his or her observations, with the aim of providing unambiguous answers to the questions the study sets out to address. It is essential because, without it, the cause of an effect cannot be established and isolated.
Main Issue Post
The primary issue addressed in this post is the social construction of a variable, specifically how racial identity is defined and delimited in relation to the biological differences that exist among races (Haegele & Hodge, 2015). An individual’s experience of different races may be systematically different within particular societies, depending on how those societies recognize racial differences.
An instance of this can be seen in the way multiple races have been socially constructed among Whites in the United States. Numerous immigrant groups that are now classified as White, including Eastern Europeans and the Irish, were at first considered racially different from other groups, such as Northern and Western Europeans. Racial identities may thus be viewed as changing through the assimilation of demographic groups that differentiate themselves from other groups on the basis of political, economic, and social variables.
As a researcher, it is important to recognize how social construction arises and how such constructions affect the meaning of a study’s variables, because the way variables are constructed has a strong influence on a study’s validity (Haegele & Hodge, 2015). This requires a clear and concise definition of every variable in a study in order to increase its validity. It is also vital to establish the context in which research data are collected and interpreted.
As discussed in this post, internal and external validity play a significant role in a study because they determine the accuracy of a research design and the confidence that can be placed in its results.
Haegele, J. A., & Hodge, S. R. (2015). Quantitative methodology: A guide for emerging physical education and adapted physical education researchers. Physical Educator, 72, 59–75.