The Reformed Teaching Observation Protocol (RTOP) was developed as an observation instrument to provide a standardized means for detecting the degree to which K-20 classroom instruction in mathematics or science is reformed.
The developers did NOT presume that reformed instruction is necessarily quality instruction. Rather, we left that as a hypothesis to be examined and tested in and across various reformed settings.


  RTOP draws on five major sources for its validity  


  • The Horizon Research 1997-98 Local Systemic Change Revised Classroom Observation Protocol
  • The "standards" in science and mathematics education [NCTM's Curriculum and Evaluation Standards (1989), Professional Teaching Standards (1991), Assessment Standards (1995) and NRC's National Science Education Standards (1996)]
  • The principles of reform underlying the ACEPT project
  • The work of ACEPT Co-Principal Investigators, particularly that of Tony Lawson and the ASU Mathematics Education group led by Marilyn Carlson
  • Members of the Evaluation Facilitation Group (EFG)

  Development of the Instrument  

Initial Sources

The Evaluation Facilitation Group (EFG) of the Arizona Collaborative for Excellence in the Preparation of Teachers (ACEPT) developed RTOP from two instruments: the Horizon Research, Inc. protocol and a classroom observation instrument developed locally by ACEPT co-PI Dr. Anton Lawson (1995) of the ASU Biology Department.
The EFG initially combined 35 items from the Horizon protocol and 25 from Lawson's instrument, then eliminated items that did not focus strongly on reform. This produced a set of 39 reform-focused items.


The items were organized into five categories and reworded as necessary by Sawada and Piburn:

  1. Lesson Design and Implementation
  2. Content: Propositional Pedagogic Knowledge
  3. Content: Procedural Pedagogic Knowledge
  4. Classroom Culture: Communicative Interactions
  5. Classroom Culture: Student/Teacher Relationships

Five members of the EFG (Benford, Falconer, Turley, Piburn and Sawada) used the 39 items, each rated on a five-point Likert scale, to observe three videotaped lessons. Detailed discussion of each item resulted in discarding 14 items, leaving 25 items, five in each category.
Piburn and Sawada reconsidered each of the remaining 25 items, refining the wording as necessary so that the complete set would share a common vocabulary. Piburn gave the rough instrument a name, the "Reformed Teaching Observation Protocol". All members of the EFG then reviewed the three videotapes used earlier. While considerable commonality of interpretation was achieved, several items were reworded and a revised instrument was produced.
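
The structure described above (25 items, five in each of five categories, each rated on a five-point scale) can be sketched in code. This is only an illustration: the 0 to 4 coding of the scale, the item ordering, and the function name `score_observation` are assumptions, and the ratings are made up.

```python
# Hypothetical layout: 25 items, five per subscale.
SUBSCALES = [
    "Lesson Design and Implementation",
    "Content: Propositional Pedagogic Knowledge",
    "Content: Procedural Pedagogic Knowledge",
    "Classroom Culture: Communicative Interactions",
    "Classroom Culture: Student/Teacher Relationships",
]
ITEMS_PER_SUBSCALE = 5
MIN_RATING, MAX_RATING = 0, 4  # assumption: five-point scale coded 0..4


def score_observation(ratings):
    """Return (subscale_totals, total) for one observation of 25 item ratings."""
    expected = len(SUBSCALES) * ITEMS_PER_SUBSCALE
    if len(ratings) != expected:
        raise ValueError(f"expected {expected} ratings, got {len(ratings)}")
    if any(not MIN_RATING <= r <= MAX_RATING for r in ratings):
        raise ValueError("rating outside the assumed 0-4 range")
    subscale_totals = {}
    for i, name in enumerate(SUBSCALES):
        start = i * ITEMS_PER_SUBSCALE
        subscale_totals[name] = sum(ratings[start:start + ITEMS_PER_SUBSCALE])
    return subscale_totals, sum(subscale_totals.values())


# Made-up ratings for illustration only.
example = [3, 2, 4, 1, 0] * 5
subscales, total = score_observation(example)
```

Under the assumed 0 to 4 coding, each subscale total falls between 0 and 20 and the overall score between 0 and 100.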

Critique from the Math Cluster

The EFG submitted its instrument to two members of the ACEPT ASU Mathematics Cluster, Matt Isom and Apple Bloom. At this point in the development, Isom and Bloom were skeptical that the instrument, with its focus on both science and mathematics, would be sufficiently sensitive to the mathematics standards. The Mathematics Cluster informed the EFG of two major problems with the instrument:

  • the strong presence of science concepts and terminology
  • the lack of a problem solving orientation

Major Changes in the Items

Serious changes were necessary. Sawada revisited all the items. As a result, several of the 25 items were deleted or drastically reworded, and new ones were created. These changes were modified by the EFG as a whole after considerable argument. The modified instrument met with the approval of the Mathematics Cluster.
The resulting RTOP was used by all members of the EFG to assess 17 twenty-minute videotaped mathematics or science lessons. Each of the 25 items was scrutinized and reworded so that a stronger common understanding could be attained. Initial inter-rater correlations calculated on a subset of the tapes ranged from 0.45 to 0.78. While not particularly high, these were deemed strong enough to warrant a pilot.
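
A range of inter-rater correlations like the one reported above can be obtained by correlating every pair of raters over the same lessons. The following is a minimal sketch, assuming Pearson correlation on total scores (the source does not say which coefficient was used); the rater labels and scores are made up.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(scores_by_rater):
    """Correlate every pair of raters; keys are (rater, rater) tuples."""
    return {
        (a, b): pearson_r(scores_by_rater[a], scores_by_rater[b])
        for a, b in combinations(sorted(scores_by_rater), 2)
    }


# Made-up total scores for three hypothetical raters over six lessons.
scores = {
    "rater_a": [40, 55, 62, 71, 80, 33],
    "rater_b": [45, 50, 70, 68, 85, 30],
    "rater_c": [38, 60, 58, 75, 70, 41],
}
pairs = pairwise_correlations(scores)
low, high = min(pairs.values()), max(pairs.values())
print(f"inter-rater correlations range from {low:.2f} to {high:.2f}")
```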

Annotated Training Guide

The EFG began piloting the RTOP in various university and college classrooms during Spring 1999. Analysis and discussion of these ratings led to further modifications of the items that had produced questionable results. At the same time, Sawada began preparing an "Annotated RTOP Guide". The Guide documented the growing inter-rater consensus about how each item should be interpreted, and it was also being developed to facilitate the training of new observers. Informal calculation of inter-rater correlation coefficients produced estimates of 0.50 to 0.85. These were deemed sufficiently high to incorporate the RTOP into the evaluation plans for the ACEPT Summer 1999 workshops. It was hoped that from May 1999 onwards the changes to RTOP would be minimal, and this has largely been the case.

  Psychometric Properties of RTOP  


RTOP was used on all the courses included in the Fall 1999 evaluation of ACEPT. Each of the courses was observed at least two times. In order to get an early reading of inter-rater reliability, observers agreed to work in pairs for some of the initial observations. As a part of the plan, Kathleen Falconer and Daiyo Sawada paired up to do a set of observations on the same classes. The first 16 such pairs (a total of 32 independent observations) were used to calculate estimates of reliability.

Estimates of reliability were obtained by doing a best-fit linear regression of one observer's set of observations on the other's.

Figure 1 shows a scatter plot of the 16 data points, one per pair of observations (some data points fall on each other). The equation for the best-fit line and the proportion of variance accounted for by that line (R² = 0.954) are shown. This estimate of reliability is very high.
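
The regression described above can be sketched as an ordinary least-squares fit of one observer's total scores against the other's, with R-squared as the reliability estimate. The paired scores below are made up for illustration and are not the ACEPT data; the function name `fit_line` is likewise hypothetical.

```python
def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, plus R-squared."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("fit is undefined for constant scores")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-squared: share of the variance in y accounted for by the line.
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_res / syy


# Made-up paired total scores: each position is the same class session
# as scored independently by two observers.
observer_a = [20, 35, 50, 65, 80, 90]
observer_b = [22, 33, 52, 63, 78, 92]
slope, intercept, r_squared = fit_line(observer_a, observer_b)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

With a single predictor, R-squared equals the square of the Pearson correlation between the two observers' scores.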

In a similar manner, reliabilities were also estimated for the five subscales that constitute RTOP. Because each subscale consists of only 5 items, it was anticipated that the subscale reliabilities would be substantially lower than the reliability of the total score. While this was true for Subscale 2, it was not true for the others, as shown in Table 1.

Table 1: Reliability Estimates for Subscales of RTOP

  Name of Subscale
  Subscale 1: Lesson Design and Implementation
  Subscale 2: Content: Propositional Pedagogic Knowledge
  Subscale 3: Content: Procedural Pedagogic Knowledge
  Subscale 4: Classroom Culture: Communicative Interactions
  Subscale 5: Classroom Culture: Student/Teacher Relationships

Face Validity

As indicated in the introduction, the face validity of RTOP rests on the credibility of the sources consulted.

Construct Validity

Construct validity refers to the theoretical integrity of an instrument. Because the RTOP is a quantitative measure of the degree to which a classroom is in accord with the science and mathematics reforms as embodied in the ACEPT project, the theoretical relationships of interest are those underlying the ACEPT reform.

The first principles of ACEPT reform are:

  • Standards-based
  • Inquiry-oriented

Theoretically Speaking

Although there are a large number of individual mathematics and science standards, ACEPT has taken "inquiry" as a major integrating orientation: learners as inquirers in the classroom. On this basis, it would be expected that RTOP would span many standards, but that underlying these standards would be a single dimension of "inquiry-orientation."

Therefore, for a correlational analysis of the five subscales, two competing predictions follow. If the individual standards are the dominant force in the ACEPT first principles, then the intercorrelations among the five subscales should be relatively low. However, if "inquiry-oriented" is a powerful integrating force, then there should be a strong coherence in RTOP that cuts across and interconnects the subscales. The evaluators held this latter view.

To test the hypothesis that "inquiry-oriented" is a powerful integrating force in the structure of RTOP, a correlational analysis was performed on the five subscales. Each subscale was used to predict the total score. High R-squared values would support the hypothesis, while low R-squared values would serve to reject it.
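
The analysis described above, in which each subscale in turn predicts the total score, can be sketched as follows. This is an illustration under assumptions: the subscale scores are made up, the helper name `r_squared` is hypothetical, and R-squared is computed as the squared Pearson correlation, which is what a one-predictor linear regression yields.

```python
def r_squared(x, y):
    """R-squared of a one-predictor linear regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R-squared is undefined for constant scores")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Made-up data: one row per observation, one column per subscale (1..5).
observations = [
    [4, 6, 5, 3, 4],
    [10, 9, 11, 10, 12],
    [15, 12, 16, 14, 15],
    [18, 13, 19, 17, 18],
    [7, 10, 6, 8, 7],
    [12, 8, 13, 12, 11],
]
totals = [sum(row) for row in observations]

# Use each subscale column in turn as the single predictor of the total.
r2_by_subscale = {}
for k in range(5):
    column = [row[k] for row in observations]
    r2_by_subscale[f"Subscale {k + 1}"] = r_squared(column, totals)

for name, value in r2_by_subscale.items():
    print(f"{name}: R^2 = {value:.3f}")
```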


  What the Data Say  

Table 2 provides the R-squares for each subscale as a predictor of the total score. As can be seen, the R-squares approach the reliabilities of each subscale. This offers very strong support for the inquiry-based construct validity of RTOP.

Table 2: Subscales as Predictors of the RTOP Total Score

  Name of Subscale    R-Squared as Predictor of Total
  Subscale 1: Lesson Design and Implementation
  Subscale 2: Content: Propositional Pedagogic Knowledge
  Subscale 3: Content: Procedural Pedagogic Knowledge
  Subscale 4: Classroom Culture: Communicative Interactions
  Subscale 5: Classroom Culture: Student/Teacher Relationships

The graph shows an example of using RTOP to verify that the control and the reform groups differ significantly from each other with regard to reform. Being able to make such elemental distinctions has been important in understanding the nature of ACEPT reform.