Reliability and Validity within Assessment: Reaction

Reliability and validity are necessary in assessment, as in all parts of education, if the results of the work are to be credible to peers and to those who requested it. Without reliability or validity, the results become useless. But to understand the impact of each aspect independently, it is necessary to understand the terms clearly.

Reliability has been defined differently depending on the expert consulted. Baer defined reliability as “the degree to which two observers viewing the same behavior at the same time agree on its occurrence and nonoccurrence” (Gresham, 2003). By this definition, a result is reliable only if more than one observer, watching the same behavior at the same time, records the same thing. This is perhaps the most widely accepted definition in applied behavior analysis, and it remains so to this day (Gresham, 2003).
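
To make the idea of two observers agreeing on occurrence and nonoccurrence concrete, here is a minimal sketch. It assumes each observer marks every observation interval as 1 (the behavior occurred) or 0 (it did not). The teacher records are invented for illustration, and the two indices shown, simple percent agreement and Cohen's kappa, are common choices rather than anything prescribed by Gresham (2003).

```python
def percent_agreement(obs_a, obs_b):
    """Share of intervals on which both observers recorded the same thing."""
    if len(obs_a) != len(obs_b) or not obs_a:
        raise ValueError("both records must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(obs_a, obs_b))
    return matches / len(obs_a)


def cohens_kappa(obs_a, obs_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(obs_a)
    p_observed = percent_agreement(obs_a, obs_b)
    # Chance agreement: both say "occurred" plus both say "did not occur".
    p_a = sum(obs_a) / n
    p_b = sum(obs_b) / n
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_chance == 1:
        return 1.0  # both observers gave one identical constant record
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 1 = behavior occurred in the interval, 0 = it did not.
teacher_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
teacher_2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print(percent_agreement(teacher_1, teacher_2))
print(round(cohens_kappa(teacher_1, teacher_2), 3))
```

Percent agreement is the easier number to read, but it can look high simply because a behavior is very common or very rare. Kappa is included because it discounts that chance agreement.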

Johnston and Pennypacker defined reliability very differently, as “the consistency with which measure of behavior yield the same results” (Gresham, 2003). This definition concerns the consistency of results from the same behavior, and is perhaps more applicable to individual experiments and observations. It differs from Baer’s definition, which does not take into consideration the individual bias of each observer. In an educational environment, one teacher can see a student’s behavior completely differently from another teacher, even though both are observing the same behavior. Johnston and Pennypacker’s definition addresses the actual results of the behavior, not the interpretation of the behavior that produces those results.

Validity does not rely on hypothetical constructs for description, but on actual results (Gresham, 2003). According to Johnston and Pennypacker, “if the behavior under study is directly measured, no question about validity exists” (Gresham, 2003). This leaves only indirectly measured results in need of validation, and they are validated against directly measured results. This view assumes, however, that direct methods of measurement do not contain large amounts of error (Gresham, 2003). Once the question of error is raised, it presents a problem for our definition of validity.

In answer to that problem, many behavior analysts consider accuracy to be much more important than validity (Gresham, 2003). While validity measures the results, accuracy measures the degree to which the results reflect the true state that the analysis is meant to measure (Gresham, 2003). This calls into consideration the content of the analysis, and how well it reflects the true state being measured.
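
As a rough illustration of accuracy in this sense, the sketch below compares what an observer recorded against a record that is treated as the true state. Everything here is an assumption made for the example: the counts are invented, the video record stands in for the "true" values, and mean absolute error and mean signed error are just two simple ways of summarizing the gap.

```python
def mean_absolute_error(recorded, true_values):
    """Average size of the gap between recorded and true values."""
    if len(recorded) != len(true_values) or not recorded:
        raise ValueError("both records must be non-empty and equally long")
    gaps = [abs(r - t) for r, t in zip(recorded, true_values)]
    return sum(gaps) / len(gaps)


def mean_signed_error(recorded, true_values):
    """Average direction of the gap: negative means under-recording."""
    if len(recorded) != len(true_values) or not recorded:
        raise ValueError("both records must be non-empty and equally long")
    gaps = [r - t for r, t in zip(recorded, true_values)]
    return sum(gaps) / len(gaps)


# Hypothetical data: times a behavior was counted in each of five sessions.
observer_counts = [4, 7, 5, 9, 6]   # what the observer wrote down
video_counts = [5, 7, 6, 9, 8]      # assumed here to be the true state

print(mean_absolute_error(observer_counts, video_counts))
print(mean_signed_error(observer_counts, video_counts))
```

In this made-up case the observer is off by less than one count per session on average, and always in the same direction, which would suggest a systematic tendency to miss occurrences rather than random error.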

For that reason, it seems that content validity has become more relevant than any other type of validity (Gresham, 2003). Linehan has argued that assessment procedures need to be focused on actual representative sampling before validity can be given to the results (Gresham, 2003). Others feel that multiple sources of results, as well as multiple measures, lend validity to the overall assessment, as they can give a more complete picture of what is being evaluated (Henderson-Montero et al., 2003). Both approaches provide a more comprehensive understanding of the results through valid content, while allowing for acceptable statistical error.

Reliability and Validity in Application
Now that we have a general feel for what we want in our assessments, how can we apply this knowledge to an actual assessment situation? Lane and Ziviani (2003) managed to address these particular points in their assessments of children’s mouse proficiency.

The first step was to determine exactly what results they were looking for. This was particularly difficult, since in many areas their assessment was breaking new ground. What they wanted were measurable results that could be gathered through computer interaction using only the mouse. In order to provide variability in the testing scenario, they tested their subjects one week apart for each case. They then pooled the measured results for a more accurate assessment, as opposed to assessing each group individually. Finally, they used standard measurement procedures and algorithms, giving their peers a familiar standard to relate to when the findings were published.

In order to assess the reliability of the assessment, Lane and Ziviani conducted additional studies beyond the initial one, drawing from various pools. This provided more accurate measurements of the results, and provided reliability under Johnston and Pennypacker’s definition, which concerns the consistency of results (Gresham, 2003). They also tested in environments that were mutually available, convenient, and comfortable for those being evaluated, which allows for more accurate measurement.
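
Consistency of results across two testing occasions is often summarized with a correlation between the first and second set of scores. The sketch below shows that calculation using a Pearson correlation. It is only an illustration of the idea: the scores are invented, and we are not claiming this is the statistic or the procedure that Lane and Ziviani actually used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when scores never vary")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six children, tested twice, one week apart.
week_1 = [12.0, 15.5, 9.0, 20.0, 14.0, 17.5]
week_2 = [13.0, 15.0, 10.5, 19.0, 13.5, 18.0]

r = pearson_r(week_1, week_2)
print(round(r, 3))
```

A value close to 1 would mean that children who scored high in the first week also scored high in the second, which is the kind of consistency Johnston and Pennypacker’s definition describes.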

In order to provide validity to the results, two aspects were considered: construct-related validity and criterion-related validity. Both build on Linehan’s idea of representative sampling as a source of content validity (Gresham, 2003), and are used to judge how well the actual measurements support the findings.

With construct-related validity, Lane and Ziviani focused on the ability to complete aiming, tracking, drawing, and target selection tasks with maximum speed and efficiency (Lane and Ziviani, 2003). This provides a clear idea of what is being assessed and how the results should be measured. Therefore, the actual measurements should not be affected by content that contains unpredictable errors (Gresham, 2003). Criterion-related validity focused specifically on the predictability of the results based on a coefficient of 0.5, which is fairly standard for similar assessments (Lane and Ziviani, 2003). This also provides validity, as the criteria are made valid with the expected results reaching the predictable mean in the statistical review.

And so we see that once definitions of reliability and validity are reached, and our understanding of those terms is firmly set in the assessment, the assessment itself can provide valid results that are reliable within statistical means. The definitions that you select determine the direction of your assessment, as well as its general validity and reliability as seen by your peers.

Lane, Alison, and Ziviani, Jenny. Assessing Children’s Competence in Computer Interactions: Preliminary Reliability and Validity of the Test of Mouse Proficiency. OTJR, Winter 2003, Vol. 23, Iss. 1, pg. 18.

Gresham, Frank M. Establishing the Technical Adequacy of Functional Behavioral Assessment: Conceptual and Measurement Challenges. Behavioral Disorders, Tempe, May 2003, Vol. 28, Iss. 3, pg. 282.

Henderson-Montero, Dianne, Julian, Marc W., and Yen, Wendy M. Multiple Measures: Alternative Design and Analysis Models. Educational Measurement: Issues and Practice, Washington, Summer 2003, Vol. 22, Iss. 2, pg. 7.