More Unintended Consequences of High-Stakes Testing

Final Version

Re-submitted to: Educational Measurement: Issues and Practice
September 2001

by:
Gregory J. Cizek
Associate Professor of Educational Measurement and Evaluation
110 Peabody Hall
School of Education, CB 3500
University of North Carolina
Chapel Hill, NC 27599-3500

Tel: (919) 843-7876
Email: cizek@unc.edu

It is doubtful that any applied measurement specialist whose scope of interest includes K-12 education in the US would not be aware of the central and controversial role of testing in American schools. What has come to be called the "testing backlash" movement comprises a sentiment about the expense, extent, and content of student testing that cannot be ignored. On the one hand, there is evidence that concerns about the extent (Phelps, 1997) and cost (Phelps, 2000) of testing are overblown, and there is evidence that the depth of anti-testing sentiment in the populace has been overstated (Business Roundtable, 2001). In general, the American public has expressed broad, consistent, and strong support for measuring all students in ways that yield accurate, comparable data on student achievement (Phelps, 1998). For example, one recent survey of 1,023 parents of school-age children found that 83% of respondents believe tests provide important information about their children's educational progress, and 9 out of 10 wanted comparative data about their children and the schools they attend. Two-thirds of the parents surveyed said they would like to receive standardized test results for their children in every grade; half of those parents indicated such tests should be given twice a year and the other half said that tests should be given once a year (Driesler, 2001).

In contrast to the public's generally favorable attitude toward testing stands the strong disaffection for testing commonly expressed by those within the education profession. An impartial reader of much opinion in the field could easily conclude that the most dire threat facing postmodern education is testing. For example, in just one recent issue of the widely-read magazine for educators Phi Delta Kappan, commentator Alfie Kohn urged teachers to "make the fight against standardized tests our top priority...until we have chased this monster from our schools" (2001, p. 349). Another article in the same issue described high-stakes testing as "the evil twin" of an authentic standards movement (Thompson, 2001, p. 358). A third article singled out 22 educators and parents for their efforts to derail, resist, or sabotage testing (Ohanian, 2001).

At this juncture, it is perhaps important to be clear about exactly the kinds of testing that have stimulated such opprobrium. To be precise, it is not all testing that is at issue, only testing to which consequences are attached. Examples of such consequences include promotion/retention decisions for students, salaries for educators, or the futures of (in particular) low-performing schools. It is likely that tests with consequences have always been and will always be rejected by some critics out of philosophical or political conviction. However, current testing--having become inextricably linked with nascent accountability systems--has moved the question of consequences from the comparatively smaller cadre of those with philosophical concerns to the attention of a larger group of educators, parents, and others with more pragmatic concerns.

The pragmatic complaints about testing are surely familiar to both proponents and opponents of testing. Anecdotes abound describing the despair experienced by earnest students who were denied a diploma as a result of a high-stakes test, and illustrating how testing narrows the curriculum, frustrates the best teachers, produces gripping anxiety in even the brightest students, and makes young children vomit or cry, or both. What is distinctive about current critiques of testing, however, is the intensity of the rhetoric about the negative consequences of testing (both real and claimed) and the seeming absence of serious consideration of any positive consequences (both real and anticipated) from these public debates.

The Long History of High-Stakes Tests

There have always been high-stakes tests. Testing historians have traced high-stakes testing to civil service examinations of 200 B.C., military selection dating to 2000 B.C., and Biblical accounts of the Gilead Guards. Mehrens and Cizek (2001) relate the story of the minimum competency exam that took place when the Gilead Guards challenged the fugitives from the tribe of Ephraim who tried to cross the Jordan River.

"'Are you a member of the tribe of Ephraim?' they asked. If the man replied that he was not, then they demanded, 'Say Shibboleth.' But if he couldn't pronounce the H and said Sibboleth instead of Shibboleth he was dragged away and killed. So forty-two thousand people of Ephraim died there" (Judges 12:5-6, The Living Bible).

To observe that high-stakes tests have a long history is not to say that the character and issues surrounding high-stakes tests have remained unchanged. For example, in the scriptural account just related, nothing is mentioned regarding any professional or public debates that may have occurred regarding: what competencies should have been tested; how to measure them; how minimally-proficient performance should be defined; whether paper/pencil testing might have been cheaper and more reliable than performance assessment; whether there was any adverse impact against the people of Ephraim; or what remediation should be provided for those judged to be below the standard. Maybe the Gilead Guards should have abandoned their test altogether because it was unclear whether Ephraimites really had the opportunity to learn to pronounce "shibboleth" correctly, because the burden of so many oral examinations was a top-down mandate, or because listening to all those Ephraimites try to say "shibboleth" reduced the valuable instructional time available for teaching young members of the tribe of Gilead the real-life skills of sword fighting and tent making (Mehrens & Cizek, 2001).

While it is certain that high-stakes testing has been around for centuries, it is curious that current high-stakes tests in American education face such heated opprobrium from, primarily, educators. Ironically, for this, too, we must blame those in the field of testing. Those who know and make high-stakes tests have done the least to make known the purposes and benefits of testing. The laws of physics don't seem to apply: for every action in opposition to tests, there has been an equal and opposite silence. One recent example involves the Massachusetts Comprehensive Assessment System (MCAS). A recent controversy arose in Massachusetts over the use of, purportedly, the MCAS as a single measure used for making high school graduation decisions. The testing company responsible for portions of the MCAS, Harcourt Educational Measurement, came under severe attack. Speaking about that controversy, the President of the test publisher, Eugene Paslov, summed up the current environment and made a recommendation regarding the role that measurement specialists must play:

This anti test sentiment is often based on incorrect information, erroneous assumptions, and self-serving rhetoric designed to change the course of accountability and instructional improvement.... The fact that ... tests meet rigorous technical standards is irrelevant. The battle will rage at the state and national policy and political levels, and for those of us who are test advocates we will have to take our arguments to those levels to be effective. (personal communication, June 1, 2001)

The Nature of Current Critique

If nothing else, published commentary concerning high-stakes testing has been remarkable for its uniformity. The conclusion: high-stakes tests are uniformly bad. A colleague of mine recently performed a literature search to locate information about the effects of high-stakes tests. She found 59 entries over the last 10 years. A review of the results revealed that only 2 of the 59 could be categorized as favorably inclined toward testing. The two entries included a two-page, 1996 publication in a minor source, which bore the straightforward title, "The Case for National Standards and Assessments" (Ravitch, 1996). The other nominally-favorable article was the previously-cited review of public opinion about testing that concluded that broad support persists (Phelps, 1998). The other 57 entries reflected the accepted articles of faith concerning high-stakes tests. The titles of some of these articles leave little room for misinterpretation of the perspective that the authors take. A few representative entries include: "Excellence in Education versus High-stakes Testing" (Hilliard, 2000); "The Distortion of Teaching and Testing: High-stakes Testing and Instruction" (Madaus, 1998); "Burnt at the High-Stakes" (Kohn, 2000); "Judge's Ruling Effectively Acquits High-stakes Test to the Disadvantage of Poor and Minority Students in Texas" (Chenoweth, 2000); and "I Don't Give a Hoot If Somebody is Going to Pay Me $3600: Reactions to Kentucky's High-stakes Accountability Program" (Kannapel, 1996).

A short list of the main points made in critiques of high-stakes testing might begin with the work of Smith and Rottenberg (1991), published in Educational Measurement: Issues and Practice over a decade ago. They identified six effects--none of them positive--under the heading "Consequences of External Testing" (p. 8). The consequences included: 1) reduction of time available for ordinary instruction; 2) neglect of teaching material not covered by tests; 3) a press toward methods of instruction and assessment (frequently lower-order) in the classroom that mirror those implied by tests; 4) limits on students' instructional opportunities; 5) undesirable effects on teacher morale; and 6) imposition of "cruel and unusual punishment" on students--younger ones in particular--because of the length, difficulty, small print, time constraints, and individualistic nature of tests (p. 10). More recent critiques have raised issues such as the potential for tests to foster negative attitudes by students toward tested content (Lattimore, 2001) or diminish students' self-esteem (Meisels, 2000), and the possibility that high-stakes tests have differential effects by student ethnicity (Lomax, West, & Harmon, 1995).

As might be expected, the evidence put forth in various studies is of variable quality. The study design used to develop Smith and Rottenberg's (1991) list of consequences and tentative conclusions seems appropriate for initial descriptive work. From 49 classrooms in two elementary schools serving low-income students, the authors selected 19 teachers for interviews, and subsequently identified 4 teachers for more intensive observation. Typically, the conclusions suggested by such a study might be verified by more controlled, more representative, or larger-scale efforts.

In the case of high-stakes testing, however, subsequent evidence appears to have become even more skimpy in support of conclusions that seem even more confident. For example, the conclusions offered in the previously-cited work by Lattimore (2001) were derived from the study of three 10th graders. The work of Lomax, West, and Harmon (1995) involved a large (n=2,229) sample of teachers, although the survey methodology employed was only able to detect the educators' perceptions of effects on students, as opposed to the actual effects on students. Meisels' (2000) suggestion that high-stakes tests do "incalculable damage to students' self-esteem" (p. 16) is not well-buttressed by any data. In Meisels' swirl of objections to multiple-choice formats, the on-demand nature of testing, and the use of a single indicator for making decisions, his conclusion that "using such tests to decide how well a child is learning is absurd" (p. 19) seems insupportable even from a rhetorical perspective.

Some solid evidence is available to refute a few of the most commonly-encountered criticisms of high-stakes tests. For example, with changes in content standards and test construction practices, few state-mandated tests can be said to be "lower-order" or consist solely of recall-type questions. In fact, recent experience in states such as Washington, Arizona, and Massachusetts signals that concerns about low-level tests are being replaced by a concern that complex content is being pushed too early in students' school years, that performance expectations may be too high, and test questions too challenging (see, for examples, Bowman, 2000; Orlich, 2000; Shaw, 1999).

A thorough review of the purported negative consequences of testing was conducted by Mehrens (1998). On nearly every front, Mehrens found the evidence for making any strong conclusions about the asserted negative consequences to be lacking. For example, he concluded that:

"the evidence for a test's influence on either curricular content or instructional process is not totally clear;

the evidence regarding the effects of large scale assessments on teacher motivation, morale, stress and ethical behavior is sketchy; and

with respect to assessment impacts on the affect of students, we are again in a subarea where there is not a great deal of empirical evidence."

The sparseness of the evidence that can be used to shed light on contentious issues does not mean that criticisms of high-stakes testing are ill-founded. From an academic perspective, many of the most passionately proffered positions seem ripe for settlement based upon the evidence that can be provided by existing research (as in the case of the "lower-order" critique). Or, where research relevant to a question is not available or sufficiently well-developed, a satisfactory scientific analysis of the issue could likely be framed and conducted. From a public policy perspective, however, perhaps the most troubling aspect of the current debates is the almost total omission of any serious articulation or consideration of the positive consequences of high-stakes testing.

The Inescapable Need to Make Decisions

One assumption underlying high-stakes testing has received particularly scant attention: the need to make decisions. There is simply no way to escape making decisions about students. These decisions, by definition, create categories. If, for example, some students graduate from high school and others do not, a categorical decision has been made, even if a graduation test was not used. (The decisions were, presumably, made on some basis.) High school music teachers make decisions such as who should be first chair for the clarinets. College faculties make decisions to tenure (or not) their colleagues. We embrace decision making regarding who should be licensed to practice medicine. All of these kinds of decisions are unavoidable; each should be based on sound information; and the information should be combined in some deliberate, considered, defensible manner.

It is currently fashionable to talk as if high-stakes tests are the single bit of information used to make categorical decisions that wreak hellacious results on both people and educational systems. But simple-minded slogans like "high stakes are for tomatoes" are, quite simply, simple-minded. And, there is a straightforward, but accurate response to the oft-repeated fiction that high-stakes tests should not be the single measure used for making important decisions such as awarding high school diplomas: Tests aren't the only piece of information and it is doubtful that they ever have been. In the diploma example, multiple sources of information are used to make decisions, and success on each of them is necessary.

Certainly, many states have, over the past 25 years or so, opted to add a test performance requirement to other requirements. But, the test is just one of several criteria that most states use. For example, for this article I surveyed the posted graduation requirements for several states across the US, and I looked into what the local school district in my area, Chapel Hill, North Carolina, required. It turns out that nearly every state has, first and foremost, content requirements. Wisconsin's requirements are typical:

"(a) Except as provided in par. (d), a school board may not grant a high school diploma to any pupil unless the pupil has earned: In the high school grades, at least 4 credits of English including writing composition, 3 credits of social studies including state and local government, 2 credits of mathematics, 2 credits of science and 1.5 credits of physical education; In grades 7 to 12, at least 0.5 credit of health education." (Wisconsin Statutes, Section 118.33, a, 1-2)

In addition to course requirements, Pennsylvania--like many states--includes a test performance requirement, but also gives authority to local boards of education to add to the requirements:

"Each school district, including charter schools, shall specify requirements for graduation in the strategic plan under § 4.13... Requirements shall include course completion and grades, completion of a culminating project, and results of local assessments aligned with the academic standards. Beginning in the 2002-2003 school year, students shall demonstrate proficiency in reading, writing and mathematics on either the State assessments administered in grade 11 or 12 or local assessment aligned with academic standards and State assessments under § 4.52 (relating to local assessment system) at the proficient level or better in order to graduate." (22 PA Code, § 4.24, a).

Under Florida law, test performance is a requirement, and very specific requirements are spelled out for credits students must earn: For example, in addition to requiring American history, American government, and Florida government, Florida students must obtain:

"One credit in performing fine arts to be selected from music, dance, drama, painting, or sculpture. A course in any art form, in addition to painting or sculpture, that requires manual dexterity, or a course in speech and debate, may be taken to satisfy the high school graduation requirement for one credit in performing arts pursuant to this subparagraph..." and "one-half credit in life management skills to include consumer education, positive emotional development, marriage and relationship skill-based education, nutrition, prevention of human immunodeficiency virus infection and acquired immune deficiency syndrome and other sexually transmissible diseases, benefits of sexual abstinence and consequences of teenage pregnancy, information and instruction on breast cancer detection and breast self-examination, cardiopulmonary resuscitation, drug education, and the hazards of smoking." (Florida Statutes, Title XVI, S232.246, 2, 3, i)

Not only are highly specific course requirements and test performance mandated in order for students to obtain a high school diploma, but Florida law requires that:

"Each district school board shall establish standards for graduation from its schools, and these standards must include... achievement of a cumulative grade point average of 1.5 on a 4.0 scale, or its equivalent, for students entering 9th grade before the 1997-1998 school year; however, these students must earn a cumulative grade point average of 2.0 on a 4.0 scale, or its equivalent, in the courses required by subsection (1) that are taken after July 1, 1997, or have an overall cumulative grade point average of 2.0 or above." (Florida Statutes, Title XVI, S232.246, 5, c)

As mentioned previously, most states permit local school boards to add to the graduation requirements imposed by the state. In the school district serving the area of Chapel Hill, North Carolina, the requirements include 50 hours of community service learning experience and a limit on the number of days a senior can be absent. Though variability across states and local districts surely exists, the picture is clear: Just one too few days' attendance? No diploma. Didn't take American Government? No diploma. Not enough course credits? No diploma. Miss too many questions on a test? No diploma.

By now it should be obvious that to speak of a test as a "single measure" is nonsensical. But the obviousness of that is not at all obvious--particularly among those closest to classrooms. As was demonstrated in the Massachusetts example, such profound misunderstandings, where present, have the potential to create controversy, sustain skepticism about the value of testing, and hinder educational reforms.

What is perhaps most important to understand about the Massachusetts example, and others like it, is that the controversy arose in the context of categorical decisions--e.g., high school graduation decisions--that are being made. As has been demonstrated, across the United States similar categorical decisions are made on each of the numerous criteria enacted by legislatures and by state and local boards of education. It makes as much sense to single out a single test as the sole barrier as it does to single out civics class as the single measure that denies students an opportunity to graduate.

Given the reality that policy makers have decided that the awarding of a diploma depends on success on numerous criteria, we are faced with the statistical truism that such conjunctive models result in the denial of more diplomas, promotions, awards, and so on as each conjunctive element is added to the decisionmaking model. Of course, policy makers could decide not to require success on each of the elements. One could get a diploma by making success on, say, four out of the six elements. But which four? Why not three? The same three for everyone? That seems unfair, given that some people would be denied a diploma simply on the basis of the arbitrary three that were identified. Even if all other criteria were eliminated, and all that remained was a requirement that students must attend at least, say, 150 out of 180 days in their senior year to get a diploma, then what about the honors student who is passing all of her classes but has attended only 149 days? In the end, as long as any categorical decisions must be made, and regardless of the decisionmaking model, there is going to be subjectivity involved. And, if there is going to be subjectivity, most testing specialists--and most of the public--simply favor coming clean about the source and magnitude of the subjectivity, and trying to minimize it.
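The statistical truism about conjunctive models can be illustrated with a minimal sketch. Everything in it is hypothetical: the five students, the six criterion names, and which criteria each student meets are invented for illustration and are not drawn from any state's data. The sketch simply counts how many students satisfy every requirement as requirements are added one at a time.

```python
# Hypothetical illustration of a conjunctive decision model.
# The criteria names and student records below are invented assumptions.
CRITERIA = ["credits", "civics", "attendance", "gpa", "service", "test"]

# Each student maps to the set of criteria that student has met.
students = {
    "A": {"credits", "civics", "attendance", "gpa", "service", "test"},
    "B": {"credits", "civics", "attendance", "gpa", "service"},
    "C": {"credits", "civics", "gpa", "service", "test"},
    "D": {"credits", "attendance", "gpa", "service", "test"},
    "E": {"credits", "civics", "attendance", "gpa", "test"},
}


def diplomas_awarded(records, required):
    """Count students who meet every criterion in `required`."""
    required = set(required)
    return sum(1 for met in records.values() if required <= met)


def awarded_as_criteria_added(records, criteria):
    """Diploma counts when the first 0, 1, 2, ... criteria are required."""
    return [diplomas_awarded(records, criteria[:k])
            for k in range(len(criteria) + 1)]


counts = awarded_as_criteria_added(students, CRITERIA)
print(counts)  # [5, 5, 4, 3, 3, 2, 1]
```

In this invented example the count never rises as a criterion is added, whatever the order in which the criteria are listed. Four of the five students fall short on exactly one requirement, and each on a different one, so no single element is the sole barrier.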

Testing and Accountability

It should be evident, then, that categorical decisions will be made with or without course requirements, with or without conjunctive models, and with or without tests. Thus, it cannot be that tests of themselves are the cause of all the consternation about high-stakes assessments.

The real reasons are two-fold. One reason covers resistance to high-stakes testing within the education profession; the second explains why otherwise well-informed people would so easily succumb to simplistic rhetoric centering on testing. On the first count, the fact that high-stakes tests are increasingly used as part of accountability systems provides a sufficient rationale for resistance. Education is one of the few (only?) professions for which advancement, status, compensation, longevity, and so on are not related to personal performance. The entire accountability movement--of which testing has been the major element--has been vigorously resisted by many in the profession. The rationale is rational when there is a choice between being accountable for performance or maintaining a status quo without accountability.

There is much to be debated about professionalization of teaching and its relationship to accountability; a comprehensive analysis of that complex issue is beyond the scope of this article. The focus of this article is on the second count--the debate about testing. As mentioned previously, those who know the most about testing have been virtually absent from the public square when any criticism surfaces. As the previously-cited literature search revealed, the response to 57 bold articles nailed to the cathedral door has been 2 limp slips of paper slid under it. The benefits of high-stakes tests have been assumed, unrecognized, or unarticulated. In an attempt to provide some remedy for this and some balance to the policy debates, the following sections present 10 unanticipated consequences of high-stakes testing--consequences that are actually good things that have grown out of the increasing reliance on test data concerning student performance.

I. Professional Development

I suspect that many assessment specialists who have been involved in K-12 education painfully recall professional development in the not-too-distant past. Presentations with titles like the following aren't too far from the old reality:

Vitamins and Vocabulary: Just Coincidence that Both Begin with "V"?

Cosmetology across the Curriculum

Horoscopes in the Homeroom

The Geometry of Rap: 16 Musical Tips for Pushing Pythagoras

Multiple Intelligences in the Cafeteria

In a word, much professional development was spotty, hit-or-miss, of questionable research base, of dubious effectiveness, and thoroughly avoidable. But professional development is increasingly taking on a new face. It is more frequently focussed on what works. It is curriculum-relevant and results-oriented. The press toward professional development that helps educators hone their teaching skills and content area expertise is evident. Without doubt, recent emphases on higher standards and mandated tests to measure progress toward those standards should be credited.

As clear evidence of this link, one need only consider the language and specific mandates embodied in recent federal and state department of education directives. For example, at the federal level, there is a new emphasis on focussed professional development. The government publication, Improving America's Schools: Newsletter on Issues in School Reform, reported on the need for "rethinking professional development" and stated that:

almost every approach to school reform requires teachers to refocus their roles, responsibilities, and opportunities--and, as a result, to acquire new knowledge and skills. The success of efforts to increase and reach high standards depends largely on the success of teachers and their ability to acquire the content knowledge and instructional practices necessary to teach to high academic standards. (US Department of Education, 1996)

At about the same time, the US Department of Education was disseminating a list of "Principles of High-Quality Professional Development." Among the guidelines presented for future professional development activities in education was one calling for professional development that "enables teachers to develop further expertise in subject content, teaching strategies, uses of technologies, and other essential elements in teaching to high standards" (US Department of Education, 1995). Similar focus for educators' professional development worked its way down to the state level. The following excerpt from the New Jersey Administrative Code highlights the evidence of refined focus for professional development at the state level. The Code sets forth "Standards for Required Professional Development for Teachers" which require that professional development activities must:

assist educators in acquiring content knowledge within their own discipline(s) and in application(s) to other disciplines; enable classroom professionals to help students achieve the New Jersey Core Curriculum Content Standards (CCCS); [and] routinely review the alignment of professional development content with CCCS and with the Frameworks in all disciplines. (New Jersey Administrative Code, Section 6:11-13)

II. Accommodation

Recent federal legislation enacted to guide the implementation of high-stakes testing has been a catalyst for increased attention to students with special needs. Describing the impact of legislation such as Goals 2000: Educate America Act and the Improving America's Schools Act [IASA], Thurlow and Ysseldyke have observed that, "both Goals 2000 and the more forceful IASA indicated that high standards were to apply to all students. In very clear language, these laws defined 'all students' as including students with disabilities and students with limited English proficiency" (2001, p. 389).

Because of these regulations applied to high-stakes tests, states across the US are scurrying to adapt those tests for all students, report disaggregated results for subgroups, and implement accommodations so that tests more accurately reflect the learning of all students. The result has been a very positive diffusion of awareness. Increasingly, at the classroom level, educators are becoming more sensitive to the needs and barriers faced by special needs students when they take tests--even the ordinary assessments they face in the classroom. If not driven within the context of once-per-year, high-stakes tests, it is doubtful that such progress would have been witnessed in the daily experiences of many special needs learners.

Research in the area of high-stakes testing and students at risk has provided evidence of this positive consequence of mandated testing. One particularly noteworthy example of this beneficial effect has been observed by the researchers of the Consortium on Chicago School Research, which has monitored effects of that large, urban school district's high stakes testing and accountability program. The researchers found that students (particularly those who had some history of failure) reported that the introduction of accountability testing had induced their teachers to begin focussing more attention on them (Roderick, in press). Failure was no longer acceptable and there was a stake in helping all students succeed. In this case, necessity was the mother of intervention.

III. Knowledge about Testing

For years, testing specialists have documented a lack of knowledge about assessment on the part of many educators. The title of one such article bluntly asserted educators' "Apathy toward Testing and Grading" (Hills, 1991). Other research has chronicled the chronic lack of training in assessment for teachers and principals and has offered plans for remediation (see, for example, Impara & Plake, 1996; O'Sullivan & Chalnick, 1991; Stiggins, 1999). Unfortunately, for the most part, it has been difficult to require assessment training for pre-service teachers or administrators, and even more difficult to wedge such training into graduate programs in education.

Then along came high-stakes tests. What faculty committees could not enact has been accomplished circuitously. Granted, misperceptions about tests persist (for example, in my state there is a lingering myth that "the green test form" is harder than "the red one"), but I am discovering that more educators know more about testing than ever before. Because many tests now have stakes associated with them, it has become de rigueur for educators to inform themselves about their content, construction, and consequences. Increasingly, teachers can tell you the difference between a norm-referenced and a criterion-referenced test; they can recognize, use, or develop a high-quality rubric; they can tell you how their state's writing test is scored, and so on.

Along with this knowledge has come the secondary benefit that knowledge of sound testing practices has had positive consequences at the classroom level--a trickle-down effect. For example, one recent study (Goldberg & Roswell, 1999/2000) investigated the effects on teachers who had participated in training and scoring of tasks for the Maryland School Performance Assessment Program (MSPAP). Those teachers who were involved with the MSPAP overwhelmingly reported that their experience had made them more reflective, deliberate, and critical in terms of their own classroom instruction and assessment.

IV and V. Collection and Use of Information

Because pupil performance on high-stakes tests has become of such prominent and public interest, there has been an intensity of effort directed toward data collection and quality control that is unparalleled. As many states mandate the collection and reporting of this information (and more), unparalleled access has also resulted. Obtaining information about test performance, graduation rates, per-pupil spending, staffing, finance, and facilities is, in most states, now just a mouse-click away. How would you like your data for secondary analysis: Aggregated or disaggregated? Single year or longitudinal? PDF or Excel? Paper or plastic? Consequently, those who must respond to state mandates for data collection (i.e., school districts) have become increasingly conscientious about providing the most accurate information possible--sometimes at risk of penalties for inaccuracy or incompleteness.

This is an unqualified boon. Not only is more information about student performance available, but it is increasingly used as part of decision making. At a recent teacher recruiting event, I heard a recruiter question a teacher about how she would be able to tell that her students were learning. "I can just see it in their eyes," was the reply. Sorry, you are the weakest link.

Increasingly, from the classroom to the school board room, educators are making use of student performance data to help them refine programs, channel funding, and identify roots of success. If the data weren't so important, it is unlikely that this would be the case.

VI. Educational Options

Related to the increase in publicly-available information about student performance and school characteristics is the spawning of greater options for parents and students. Complementing a hunger for information, the public's appetite for alternatives has been whetted. In many cases, schools have responded. Charter schools, magnet schools, home schools, and increased offerings of honors, IB and AP courses, have broadened the choices available to parents. And, research is slowly accumulating which suggests that the presence of choices has not spelled doom for traditional options, but has largely raised all boats (see, for example, Finn, Manno, & Vanourek, 2000; Greene, 2001). It is almost surely the case that legislators' votes and parents' feet would not be moving in the direction of expanding alternatives if not for the information provided by high-stakes tests--the same tests are being used to gauge the success or failure of these emerging alternatives.

VII. Accountability Systems

No one would argue that current accountability systems have reached a mature state of development. On the contrary, nascent systems are for the most part crude, cumbersome, embryonic endeavors. Equally certain, though, is that even rudimentary accountability systems would not likely be around if it weren't for high-stakes tests. For better or worse, high-stakes tests are often the foundation upon which accountability systems have been built. This is not to say that this relationship between high-stakes tests and accountability is right, noble, or appropriate. It simply recognizes the reality that current accountability systems were enabled by an antecedent: mandated, high-stakes tests.

To many policy makers, professionals, and the public, however, the notion of introducing accountability--even just acknowledging that accountability is a good innovation--is an important first step. That the camel's nose took the form of high-stakes tests was (perhaps) not recognized and was (almost certainly) viewed as acceptable. Necessary and healthy debates continue about the role of tests and the form of accountability.

Although high-stakes tests have made a path in the wilderness, the controversy clearly hinges on accountability itself. The difficult fits and starts of developing sound accountability systems should be understandable given their nascent developmental state. Understanding the importance, complexity, and difficulties as the accountability infant matures will surely be trying. How--or if--high-stakes tests will fit into the mature version is hard to tell, and the devil will be in the details. But it is evident that the presence of high-stakes tests has at least served as a conversation-starter for a policy dialogue that may not have taken place in their absence.

VIII. Educators' Intimacy with Their Disciplines

Once a test has been mandated in, say, language arts, the first step in any high-stakes testing program is to circumscribe the boundaries of what will be tested. The nearly-universal strategy for accomplishing this is to empanel groups of (primarily) educators who are familiar with the ages, grades, and content to be tested. These groups are usually large, selected to be representative, and expert in the subject area. The groups first study relevant documentation (e.g. the authorizing legislation, state curriculum guides, content standards). They then begin the arduous, time-consuming task of discussing among themselves the nature of the content area, the sequence and content of typical instruction, learner characteristics and developmental issues, cross-disciplinary relationships, and relevant assessment techniques.

These extended conversations help shape the resulting high-stakes tests, to be sure. However, they also affect the discussants, and those with whom they interact when they return to their districts, buildings, and classrooms. As persons with special knowledge of the particular high-stakes testing program, the participants are sometimes asked to replicate those disciplinary and logistic discussions locally. The impact of this trickling-down is just beginning to be noticed by researchers--and the effects are beneficial. For example, at one session of the 2000 American Educational Research Association conference, scholars reported on the positive effects of a state testing program in Maine on classroom assessment practices (Beaudry, 2000) and on how educators in Florida were assimilating their involvement in large-scale testing activities at the local level (Banerji, 2000).

These local discussions mirror their large-scale counterparts in that they provide educators with an opportunity to become more intimate with the nature and structure of their own disciplines, and to contemplate interdisciplinary relationships. This is a good thing. And the impulse for this good thing is clearly the presence of a high-stakes test.

IX. Quality of Tests

Another benevolent consequence of high-stakes testing is the effect that the introduction of consequences has had on the tests themselves. Along with more serious consequences has come heightened scrutiny. The high-stakes tests of today are surely the most meticulously developed, carefully constructed, and rigorously reported. Many criticisms of tests are valid, but a complainant who suggests that today's high-stakes tests are "lower-order" or "biased" or "not relevant" is almost certainly not familiar with that which they purport to critique. If only due to their long history and ever-present watch-dogging, high-stakes tests have evolved to a point where they are: highly reliable; free from bias; relevant and age appropriate; higher order; tightly related to important, public goals; time and cost efficient; and yielding remarkably consistent decisions.

Evidence of the impulse toward heightened scrutiny of educational tests with consequences can be traced at least to the landmark case of Debra P. v. Turlington (1984). Although the central aspect of that case was the legal arguments regarding substantive and procedural due process, the abundance of evidence regarding the psychometric characteristics of Florida's graduation test was essential in terms of making the case that the process and outcomes were fundamentally fair to Florida students. Although legal challenges to such high-stakes tests still occur (see the special issue of Applied Measurement in Education, 2000, for a recent example involving a Texas test), they are remarkably infrequent. For the most part, those responsible for mandated testing programs responded to the Debra P. case with a heightened sense of the high standard that is applied to high-stakes measures. It is a fair conclusion that, in terms of legal wranglings concerning high-stakes tests, the psychometric characteristics of the test are rarely the basis of a successful challenge.

It is also fair to say that one strains the gnat in objecting to the characteristics of high-stakes tests, when the characteristics of those tests are compared to what a child will likely experience in his or her classroom the other 176 days of the school year. This point should cause all those genuinely committed to fair testing to refocus their attention on classroom assessment. Decades of evidence have been amassed to support the contention that the quality of teacher-made tests pales compared to more rigorously developed, large-scale counterparts. Such evidence begins with the classic studies of teachers' grading practice by Starch and Elliot (1912, 1913a, 1913b) and continues with more recent studies which document that weaknesses in typical classroom assessment practices have persisted (see, for example, Carter, 1984; Gullickson & Ellwein, 1985). It is not an overstatement to say that, at least on the grounds of technical quality, the typical high-stakes, state-mandated test that a student takes will--by far--be the best assessment that student will see all year.

A secondary benefit of the quality of typical high-stakes tests is that, because of their perceived importance, they become mimicked at lower levels. It is appropriate to abhor teaching to the test. However, it is also important to recognize the beneficial effects of exposing educators to high-quality writing prompts, document-based questions, constructed-response formats, and even challenging multiple-choice items. It is not cheating, but the highest form of praise when educators then rely on these exemplars to enhance their own assessment practices.

X. Increased Student Learning

It is not completely appropriate to categorize increased student learning as an unintended consequence. At least in terms of the rhetoric often accompanying high-stakes tests, there are usually strong claims made regarding the importance of testing on student learning. However, not all of those concerned about educational reform would necessarily agree that increased learning should be expected. On one side, some have suggested that high-stakes tests may increase students' test scores, though not necessarily students' learning. Even more cynical observers have suggested that no effects of testing are likely, as in the commonly-heard metaphor that frequent temperature taking has no effect on reducing a fever. Thus, at least in some quarters, increased student achievement attributable to the presence of high-stakes tests would qualify as an unexpected consequence.

As with all of the previously-mentioned consequences, the evidence bearing on this outcome is just beginning to come in. Because all research on the consequences of high-stakes tests is comparatively recent, the evidence on positive and negative consequences is necessarily skimpy (see Mehrens, 1998). Compounding the problem for uncovering the positive consequences of high-stakes tests is the same phenomenon that is routinely invoked when newspapers are criticized for printing only "negative" stories. Scholarly research on many social issues tends to focus more heavily on ascertaining the potential harmful consequences or weaknesses and critical appraisal than on cataloging the benefits of a policy or intervention.

Nonetheless, some evidence on the positive consequences of high-stakes testing has begun to emerge. A series of studies by Bishop (1998, 2000) provides some encouraging findings. In one study, Bishop compared countries and Canadian provinces that had what he termed "curriculum-based external exit examination systems" (CBEEESs; i.e., high-stakes tests) with those that did not have such tests (1998, p. 171). He found that there was a significant, positive relationship between the presence of CBEEESs and student scores on the International Assessment of Educational Progress (IAEP) and the Third International Mathematics and Science Study (TIMSS). Bishop also examined students who participated in New York state's Regents examination system and found that, after controlling for student demographic characteristics, students in a state with a high-stakes testing program performed significantly better on the National Assessment of Educational Progress (NAEP) in the 8th grade, and on the Scholastic Assessment Test (SAT) in high school (Bishop, 2000). In a more recent example, the USA Today reported on increasing student achievement in Massachusetts. Citing a review of that state's standards-based reforms by Achieve, the newspaper noted that "the state using the nation's highest regarded test is reaping some of the most impressive gains" and concluded that "testing can improve student performance, especially when states serve up high-quality education standards backed by relevant, high-quality tests" (Schools sharpen testing, 2001, p. A-14).

Keepin' It Real

It would be foolish to ignore the shortcomings and undesirable consequences of high-stakes tests. Current inquiry along these lines is essential, productive, and encouraging. However, in the context of some consternation about high-stakes tests, particularly within the education profession, it is equally essential to consider the unanticipated positive consequences, and to incorporate these into any cost-benefit calculus that should characterize sound policy decisions. Importantly, those most knowledgeable about the characteristics of high-quality assessment cannot be absent from the public square.

Vigorous debates about the nature and role of high-stakes tests and accountability systems are healthy and needed. To these frays, the stakeholders may bring differing starting points and differing conceptualizations of the meaning and status that should be afforded to issues ranging from technical adequacy, to educational goals, to social justice. It is an exhilarating time of profound questioning. High-stakes tests: we don't know how to live with them; we can't seem to live without them. The oft-quoted first sentence of Charles Dickens' A Tale of Two Cities ("It was the best of times, it was the worst of times") seems especially relevant to the juncture at which we find ourselves. The remainder of Dickens' opening paragraph merely extends the piquant metaphor:

It was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way. (1859, p. 3)

References

Applied Measurement in Education. (2000). Special issue on Texas high-school graduation test. Vol. 13, No. 4.

Banerji, M. (2000, April). Designing district-level classroom assessment systems. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Beaudry, J. (2000, April). The positive effects of administrators and teachers on classroom assessment practices and student achievement. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Bishop, J. H. (1998). The effect of curriculum-based external exit exam systems on student achievement. Journal of Economic Education, 29(2), 171-182.

Bishop, J. H. (2000). Curriculum-based external exit exam systems: Do students learn more? How? Psychology, Public Policy, and Law, 6(1), 199-215.

Bowman, D. H. (2000, November 29). Arizona poised to revisit graduation exam. Education Week. Available on-line at: http://www.edweek.org/ew/ewstory.cfm?slug=13ariz.h20

Business Roundtable. (2001). Assessing and addressing the 'testing backlash.' Washington, DC: Author.

Carter, K. (1984). Do teachers understand principles for writing tests? Journal of Teacher Education, 35(6), 57-60.

Chenoweth, K. (2000). Judge's ruling effectively acquits high-stakes test: To the disadvantage of poor and minority students. Black Issues in Higher Education, 51, 12.

Debra P. v. Turlington (1984). 730 F. 2d 1405.

Dickens, C. (1859). A tale of two cities. London: Chapman and Hall.

Driesler, S. D. (2001). Whiplash about backlash: The truth about public support for testing. NCME Newsletter, 9(3), 2-5.

Finn, Jr., C. E., Manno, B. V., & Vanourek, G. (2000). Charter schools in action: Renewing public education. Princeton, NJ: Princeton University Press.

Goldberg, G. L., & Roswell, B. S. (1999/2000). From perception to practice: the impact of teachers' scoring experience on performance-based instruction and classroom assessment. Educational Assessment, 6(4), 257-290.

Greene, J. P. (2001). An evaluation of the Florida A-plus accountability and school choice program. Tallahassee: Florida State University, Devoe L. Moore Center for the Study of Critical Issues in Economics and Government.

Gullickson, A. R., & Ellwein, M. C. (1985). Post-hoc analysis of teacher-made tests: The goodness of fit between prescription and practice. Educational Measurement: Issues and Practice, 4(1), 15-18.

Hilliard, A. (2000). Excellence in education versus high-stakes testing. Journal of Teacher Education, 51, 293-304.

Hills, J. (1991). Apathy toward testing and grading. Phi Delta Kappan, 72, 540-545.

Impara, J., & Plake, B. (1996). Professional development in student assessment for educational administrators. Educational Measurement: Issues and Practice, 15(2), 14-20.

Kannapel, P. et al. (1996, April). I don't give a hoot if somebody is going to pay me $3600: Local school district reaction to Kentucky's high-stakes accountability system. Paper presented at the Annual Meeting of the American Educational Research Association, New York. (ERIC Document No. 397 135)

Kohn, A. (2001). Fighting the tests: A practical guide to rescuing our schools. Phi Delta Kappan, 82(5), 349-357.

Kohn, A. (2000). Burnt at the high stakes. Journal of Teacher Education, 51, 315-327.

Lattimore, R. (2001). The wrath of high-stakes tests. Urban Review, 33(1), 57-67.

Lomax, R. G., West, M. M., & Harmon, M. C. (1995). The impact of mandated standardized testing on minority students. Journal of Negro Education, 64, 171-185.

Madaus, G. (1998). The distortion of teaching and testing: High-stakes testing and instruction. Peabody Journal of Education, 65, 29-46.

Mehrens, W. A. (1998). Consequences of assessment: What is the evidence? Education Policy Analysis Archives, 6(13). Available on-line at: http://olam.ed.asu.edu/epaa/v6n13.html

Mehrens, W. A., & Cizek, G. J. (2001). Standard setting and the public good: Benefits accrued and anticipated. In G. J. Cizek (ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 477-486). Mahwah, NJ: Erlbaum.

Meisels, S. J. (2000). On the side of the child. Young Children, 55(6), 16-19.

Ohanian, S. (2001). News from the test resistance trail. Phi Delta Kappan, 82(5), 363-366.

Orlich, D. C. (2000). Education reform and limits to student achievement. Phi Delta Kappan, 81, 468-472.

O'Sullivan, R., & Chalnick, M. (1991). Measurement-related course requirements for teacher certification and recertification. Educational Measurement: Issues and Practice, 10(1), 17-19, 23.

Phelps, R. P. (1997). The extent and character of system-wide student testing in the United States. Educational Assessment, 4(2), 89-122.

Phelps, R. P. (1998). The demand for standardized student testing. Educational Measurement: Issues and Practice, 17(3), 5-23.

Phelps, R. P. (2000). Estimating the cost of standardized student testing in the United States. Journal of Education Finance, 25, 343-380.

Roderick, M. (in press). The grasshopper and the ant: Motivational responses to low achieving students to high-stakes testing. Educational Evaluation and Policy Analysis.

Ravitch, D. (1996). The case for national standards and assessments. The Clearing House, 69, 134-135.

Schools sharpen testing. (2001, October 17). USA Today, p. A-14.

Shaw, L. (1999, October 10). State's 4th-grade math test too difficult for students' development level, critics say. Seattle Times. Available on-line at: http://seattletimes.nwsource.com/news/local/html98/test_19991010.html

Smith, M. L., & Rottenberg, C. (1991). Unintended consequences of external testing in elementary schools. Educational Measurement: Issues and Practice, 10(4), 7-11.

Starch, D., & Elliot, E. C. (1912). Reliability of the grading of high school work in English. School Review, 21, 442-457.

Starch, D., & Elliot, E. C. (1913a). Reliability of the grading of high school work in history. School Review, 21, 676-681.

Starch, D., & Elliot, E. C. (1913b). Reliability of grading work in mathematics. School Review, 22, 254-259.

Stiggins, R. (1999). Evaluating classroom assessment training in teacher education programs. Educational Measurement: Issues and Practice, 18(1), 23-27.

Thompson, S. (2001). The authentic testing movement and its evil twin. Phi Delta Kappan, 82(5), 358-362.

Thurlow, M. L., & Ysseldyke, J. E. (2001). Standard setting challenges for special populations. In G. J. Cizek (ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 387-410). Mahwah, NJ: Erlbaum.

United States Department of Education. (1996). Improving America's Schools: Newsletter on Issues in School Reform. Available on-line at: http://www.ed.gov/pubs/IASA/newsletters/profdev/pt1.html

United States Department of Education. (1995). Principles of high-quality professional development. Available on-line at: http://www.ed.gov/G2K/bridge.htm