Key URLs and Links from Talks

Brian Huot:
The Big Test by Nicholas Lemann
On a Scale: A Social History of Writing Assessment in America by Norbert Elliot
Standards for Educational and Psychological Testing (1999) by AERA
Assessing Writing: A Critical Sourcebook by Brian Huot and Peggy O'Neill

Bob Cummings/Ron Balthazor:
No Gr_du_te Left Behind by James Traub
EMMA, UGA's electronic and e-portfolio environment

Marti Singer:
GSU's Critical Thinking Through Writing Project



Tuesday, October 23, 2007

Assessing Writing: The Introduction

Introduction

Now more than ever, writing assessment is a critical activity in our profession—with accrediting agencies, policymakers, and government organizations demanding evidence of learning from educational institutions. While the demand for assessment often comes from those outside the field, it is still a critical component in teaching writing, creating curriculum, and developing programs. Increasingly, to be considered an effective writing teacher—not to mention a writing program director—means being able to respond to a variety of writing assessment needs confronting students, faculty, and institutions. However, for many practicing composition teachers and administrators, formal preparation in assessment was not part of their graduate education. In addition to this lack of preparation, many writing teachers and writing program administrators have negative views of assessment as a punitive force toward students, faculty, and progressive forms of instruction. These fears of assessment are not unfounded, although we believe that a powerful discourse like assessment can be harnessed for positive and productive writing instruction and administration. While many conversations about assessment seem to assume formal, standardized activities such as placement or exit testing, it is important to remember that assessment is a critical component of writing: writers self-assess and frequently seek evaluative input from readers throughout their writing processes. While our concern in this volume is not the assessment writers do as they write, that activity reminds us that assessment is not hostile to writing but, on the contrary, contributes to success (Huot 2002).

The purpose of this collection is to gather readings that can help both practicing professionals and graduate students understand the theory and practice of writing assessment. We focus on assessment that functions outside of a single course—i.e., placement, exit testing, and program evaluation. While we recognize the significance of response and classroom evaluation, and see it as closely connected to large-scale assessment, we have not included articles about it in this volume; there simply isn't enough space. Portfolios, for example, could be the topic of a stand-alone volume, but it is important for readers to see portfolios as part of the wider field, not isolated or distinct from other methods and not separate from assessment theory.

The readings collected here reflect a field of writing assessment that encompasses ideas, issues, and research from a range of scholars in college writing assessment, K-12 teaching and research, and educational measurement theory and practice. For example, we include a piece by Roberta Camp, who was an employee of the Educational Testing Service (ETS). Camp's research on portfolios in the late 1970s and early 1980s was a precursor to the wealth of portfolio activity that followed in composition and education. While it is not unusual in college assessment scholarship to represent ETS and other educational measurement scholars as being in opposition to the models, theories, and practices best suited to theoretically and pedagogically informed assessment (Lynne, White, Yancey, and others), we see theories from educational measurement, especially the development of validity theory, as a strong platform for advocating newer, more progressive forms of writing assessment. In addition to Camp, the volume also includes work from those, like Pamela Moss, who are not primarily associated with college composition but come from education and educational measurement. To illustrate how theories from outside college writing assessment can be valuable, and how this volume argues for a specific vision of writing assessment, we trace two historical threads in writing assessment and assessment in general since the 1940s that illustrate our vision for the volume and for the field. The first thread reviews the role of reliability in the development of holistic scoring of student writing, and the second looks at the development of validity as a psychometric concept. For us, these two threads—the development of reliable scoring of writing and the development of validity theory—come together to produce a rich understanding of writing assessment as a field.

Writing assessment has had reliability problems since at least 1912, when Daniel Starch and Edward Elliott reported that teachers could not agree on grades for the same student essays. Reliability, in this sense, referred to consistency of scoring across readers, what later became known as interrater reliability. By the beginning of the 1940s, the College Entrance Examination Board (CEEB) had seen great success in piloting the Scholastic Aptitude Test (SAT) for college admission for special populations of students. The death knell for essay testing as the centerpiece of the CEEB was sounded when America entered the war in late 1941, and the CEEB immediately suspended essay testing (for good) because the then-new SAT could produce admission data for students more quickly, and this would contribute to America's war effort. Colleges such as Princeton accelerated instruction for students who had deferred military service and who, upon graduation, would be called to active service (Elliot 100). The backlash from English teachers was strong and vociferous, causing John Stalnaker, the chief officer of the CEEB, to note:

The type of test so highly valued by teachers of English, which requires the candidate to write a theme or essay, is not a worthwhile testing device. Whether or not the writing of essays as a means of teaching writing deserves the place it has in the secondary school curriculum may be equally questioned. Eventually, it is hoped, sufficient evidence may be accumulated to outlaw forever "the write a theme on…" type of examination. (Fuess 158)

English teachers' protest over the termination of essay testing forced the CEEB to develop the English Composition Test (ECT), which was read and scored by teachers at their home institutions—a fact test developers found amusing considering the labor involved (see O'Neill et al.'s discussion of the labor involved in writing assessment), though it might also be taken as a sign of the commitment teachers had to the direct assessment of student writing (Elliot).

At any rate, the ECT reinforced the ongoing problems test developers had in furnishing a testing environment in which consistent scores from different readers for the same papers (interrater reliability) could be generated. Throughout the 1950s, test developers, now primarily from ETS, worked to develop methods that would ensure agreement among raters. Ultimately this work was not successful, and ETS discontinued the ECT and created tests of usage and mechanics[1] to assess student writing (Elliot; Yancey). During the 1960s, test developers made two important breakthroughs. In 1961, Diederich, French, and Carlton published a study in which fifty-three raters scored three hundred papers on a nine-point scale. Although the vast majority (over 94%) of the papers received at least seven different scores, the researchers used factor analysis to determine the main influences upon the scores given by the raters. The five main factors they identified became the basis of what would eventually become analytic scoring, in which raters read and score student writing according to their evaluation of an essay's Ideas, Form, Flavor (style), Mechanics, and Wording; the scores could be weighted depending upon the purpose of the assessment. A decade or so later, a team led by Richard Lloyd-Jones revised analytic scoring using categories germane to the writing task to create what came to be known as primary trait scoring, which was used in early writing assessment for the National Assessment of Educational Progress (NAEP). In a 1966 ETS research bulletin, Godshalk, Swineford, and Coffman published a study detailing a set of procedures, eventually known as holistic scoring, in which readers arrived at a single score based upon a set of criteria or a rubric through which they were trained to agree. By the mid-1960s, then, ETS researchers had developed the procedures that began the modern era of writing assessment, in which assessments actually contained student writing that was read and scored at reliable rates.[2] By the mid-1970s, direct writing assessment was a common practice in education and composition and a subject for the scholarly literature (Cooper and Odell).
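
To make the arithmetic concrete, here is a minimal sketch of how an analytic score might be combined from weighted factor ratings. This example is ours, not part of the original introduction: the factor names follow Diederich, French, and Carlton as described above, while the nine-point ratings and the weights are hypothetical.

```python
# Illustrative sketch only -- not from the original text.
# Combines per-factor ratings (e.g., on a nine-point scale) into a single
# weighted analytic score; all names and numbers here are hypothetical.

FACTORS = ("ideas", "form", "flavor", "mechanics", "wording")

def weighted_analytic_score(ratings, weights):
    """Return the weighted average of the five factor ratings."""
    if set(ratings) != set(FACTORS) or set(weights) != set(FACTORS):
        raise ValueError("ratings and weights must cover all five factors")
    total_weight = sum(weights.values())
    return sum(ratings[f] * weights[f] for f in FACTORS) / total_weight

# A reading that weights Ideas most heavily (hypothetical values).
ratings = {"ideas": 7, "form": 6, "flavor": 5, "mechanics": 8, "wording": 6}
weights = {"ideas": 3, "form": 2, "flavor": 1, "mechanics": 1, "wording": 1}
print(weighted_analytic_score(ratings, weights))  # 6.5
```

Reweighting the same ratings for a different purpose (say, emphasizing Mechanics for an editing-focused decision) would change the final score without changing the readers' judgments, which is the flexibility described above.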

While educational measurement test developers—largely from ETS—had seemingly solved the reliability problem for writing within a psychometric framework that demands that "reliability be a necessary but insufficient condition for validity" (Cherry and Meyer 110), validity theorists headed by Lee J. Cronbach had begun to repudiate the positivist basis for validity in educational and psychological tests. Since the creation of educational and psychological measurement and testing in the late nineteenth and early twentieth centuries, validity had been considered the most important aspect of assessment (Angoff; Ittenbach, Esters and Wainer; Mayrhauser). Nonetheless, very little scholarship about validity appeared during the first half or so of the twentieth century (Angoff). Odd as it might seem to us, the validity of a specific test was often seen as something best left to the test author(s), since he or she would have the most information and the deepest understanding of how well a specific test or measure might work. The lack of scholarship on validity in the burgeoning field of assessment, which was in fact dominated by intelligence testing,[3] was also influenced by the development of psychometrics, the statistical apparatus for measurement, as well as by the largely positivist paradigm within which most social science operated during the first half of the twentieth century.

J. P. Guilford's 1946 essay on validity is the most substantive scholarship on validity in testing from the first part of the twentieth century and the best summary of the way in which validity was conceived and used. For Guilford, "a test is valid for anything with which it correlates" (429). This definition reflects both the acontextual nature of much positivist philosophy and the psychometric orientation of the social science and testing of that era. As Cronbach notes, validity's definition at this time, as the degree to which a test measures what it purports to measure, focuses on the test's accuracy: did it do what it said it would, and how well did it do it? Subsequent notions that stipulated "that a test is valid if it serves the purpose for which it is used, raised a question about worth" (Cronbach, 1988, 5). This shift in focus from accuracy and truth (positivism is, after all, based upon a Platonic search for truth) to the value or meaning of a measure began the long road away from a positivist orientation for validity to one based upon a measure's value. Whether we focus on Samuel Messick's work, Cronbach's, or the most recent Standards for Educational and Psychological Testing (authored jointly in 1999[4] by the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education), validity as it is currently understood is about validating the decisions made based upon an assessment. Cronbach sums up this position and the logic that guides it in his memorable statement about validity and interpretation:

To call Test A or Test B invalid is illogical. Particular interpretations are what we validate. To say "the evidence to date is consistent with the interpretation" makes far better sense than saying "the test has construct validity." Validation is a lengthy, even endless process. (1989, 151)

The shift in focus for validity from the test itself to the decisions and interpretations we make based upon the test marks a movement away from a positivist orientation that looks for an ideal reality or “true score” and recognizes the contingent nature of human experience and value. This shift in focus from the test itself to the decisions made based upon a test also moves away from the form of the examination to its use. In other words, we cannot assume, assert or even argue that one form of assessment is more valid than another because validation is a local, contingent process. It is also, as Cronbach notes, an ongoing process in which every use of an assessment implies a series of inquiries into its accuracy, appropriateness and consequences for the learners and the learning environment. This local, contingent, fluid nature of validity and validity inquiry also marks a movement away from a fixed, positivist notion of truth to a more postmodern notion of reality as something in which value is constructed by individuals and groups to reflect the ongoing, changing nature of human experience. This localized positioning for validity works to deconstruct unequal power relationships in which a central authority is in control and in which European, middle class, male, heterosexual values are held above all others.

In short, once the CEEB decided to switch to the cheaper, quicker, more reliable SAT, English teachers lobbied for reliable writing assessment that included students' writing. At the same time that ETS was working toward new, reliable writing assessments that would eventually produce holistic, analytic, and primary trait scoring, educational measurement theorists, headed by Cronbach and later joined by Messick, were developing theories and practices of validity that reflect the importance and weight of decisions made upon data generated from tests and exams.

Writing assessment procedures have developed considerably beyond holistic, analytic, and primary trait scoring, with many new procedures working toward having readers make a decision (mirroring the basis for validity) rather than score a piece of writing. Reliability is still an important part of writing assessment because, without consistency in judgment, decisions about students' writing would have more to do with who read the piece of writing than with who wrote it. At the same time, a current understanding of reliability positions it within the overall umbrella of validity, so each use of a writing assessment includes a check on the consistency of decision-making among judges, if multiple judges are used for the same decision. Currently, we see and present in this volume a field of writing assessment that draws together the issues of reliability and validity reviewed above with an insistence upon using student writing or student decision-making (self-directed placement, for example, asks students to decide which course they would rather be placed in) as the basis for making decisions about students. Exploring the accuracy, appropriateness, and consequences of decision-making based upon assessment should be a part of the use of any writing assessment. The articles and book chapters we present here provide an introduction to this view of writing assessment.
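
As a concrete illustration of what a check on consistency among judges can look like, here is a minimal sketch, ours rather than the authors', that computes two common statistics for paired decisions from two readers: the exact-agreement rate and Cohen's kappa, which corrects agreement for chance. The decision labels and sample data are hypothetical.

```python
# Illustrative sketch only -- not from the original text.
# Two simple consistency checks on paired decisions from two readers:
# exact-agreement rate and Cohen's kappa (agreement corrected for chance).

from collections import Counter

def exact_agreement(rater_a, rater_b):
    """Proportion of papers on which both readers made the same decision."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Kappa = (p_o - p_e) / (1 - p_e), where p_e is agreement expected by chance."""
    n = len(rater_a)
    p_o = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical placement decisions for eight student portfolios.
reader_1 = ["basic", "regular", "regular", "honors", "regular", "basic", "regular", "honors"]
reader_2 = ["basic", "regular", "honors", "honors", "regular", "regular", "regular", "honors"]
print(exact_agreement(reader_1, reader_2))  # 0.75
print(cohens_kappa(reader_1, reader_2))     # 0.6
```

A figure like this is only one piece of an ongoing validation inquiry; as the paragraph above suggests, it tells us whether the decision depends more on who read the writing than on who wrote it, not whether the decision itself is accurate, appropriate, or fair.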

Depending upon a host of factors, the field of writing assessment can be constructed in various ways, and we certainly faced the question—or problem—of how we wanted to construct the volume itself. Beyond deciding which articles to include, we faced the question of how to organize the volume. In the end, after much discussion, negotiation, and reviewer feedback, we settled on three sections: Foundations, Models, and Issues. This structure recognizes that there is a variety of important and necessary work in writing assessment and that the people who use this volume will have different needs and will use it in different ways. For example, some will look for Models for writing assessments or focus on specific issues within writing assessment, while others will look for basic, foundational knowledge of writing assessment history, practice, and theory. Like all categories, ours have some overlap and slippage, but like the volume itself, they argue for a specific version of the field in which an understanding of foundational issues is as important as providing a range of assessment options or forays into specific issues.

We chose Foundations as one of our categories after originally using at least two separate categories, theory and history. While we believe it is important that compositionists know both the history and the theory of writing assessment, we see this knowledge as foundational to other work they might undertake. These Foundations, as we call them, span such subjects as fundamental research comparing analytic, holistic, and primary trait scoring (Veal and Hudson), theoretical work on holistic scoring (White), reliability (Cherry and Meyer; Moss, 1994; Williamson), and validity (Huot; Moss, 1998), and the history of writing assessment as a field (Camp; Yancey). This section functions as an abbreviated introduction to the field and, along with our bibliography in the appendix, should provide interested writing teachers and program administrators with an introduction to the most important historical and theoretical issues in writing assessment.

If the Foundations section recognizes the need for a theoretical and historical introduction for anyone who wants to really understand the field of writing assessment, the Models category reminds us that assessment is something you do—that is one of the reasons writing assessment appeals to us: it is not just a field in which you read, write, and discuss. Writing assessment is an activity, a practice, as well as a theoretically rich scholarly field. Our selection for this section is hardly comprehensive; in fact, we left out several model programs we are very fond of, including work by Susanmarie Harrington, David Blakesley, Lisa Johnson-Shull, and Diane Kelly-Riley (and others from the Beyond Outcomes book edited by Rich Haswell—a volume we recommend to anyone looking for models[5]) and many others (see also Assessing Writing Across the Curriculum and the Outcomes Book, edited by Susanmarie Harrington, Keith Rhodes, Ruth Overman Fischer, and Rita Malenczyk, for great models for writing assessment). We selected the ones we did to represent not just a variety of models but also, in many cases, ones that have been influential. William L. Smith was one of the first to really move beyond holistic scoring with a set of procedures that produced decisions directly, with greater accuracy and a higher degree of interrater reliability; Smith's work also furnishes a strong model for validating a writing assessment program (see also Hester et al., who adapt Smith's procedures for writing assessment and validation over a six-year period). Haswell and Wyche-Smith also provide one of the first procedures to move beyond holistic scoring and beyond having each paper read twice. Durst, Roemer, and Schultz report on a model for exit testing that includes teams of teachers reading student portfolios. Royer and Gilles' work on placement moves the decision to the student; their work has been highly influential, and we recommend that interested readers look at their edited collection. Haswell and McLeod model various strategies for reporting on writing across the curriculum assessment. Michael Carter's piece provides a clear introduction to conducting an outcomes assessment for a writing across the curriculum program, though it could easily be adapted to any writing program. (Carter's piece should be especially useful for the many writing teachers and program administrators responding to calls for outcomes assessment.) Together these models offer a solid introduction and resource for the various kinds of writing assessment. They should be useful for readers who want to understand how theories such as reliability and validity function in practice, and they demonstrate how to design site-based, context-sensitive, rhetorically informed, and accessible assessments that meet local needs. In other words, the assessment programs described in this section are not designed as models to be imported or mimicked but rather as samples of how writing assessments can be tailored for specific purposes, programs, and institutions.

The choice of Issues as our last category—and the choices within this category—illustrate how the field of writing assessment has changed over the last several years. In 1990, Brian published a bibliographic essay in Review of Educational Research in which he determined that the three main areas of research in writing assessment had been topic selection and task development, text and writing quality, and the influences on raters. Through the 1990s, however, researchers' attention shifted, or rather expanded, to include many other topics beyond the technical concerns of procedures and practices. Our Issues category attempts to capture this wide range of topics, from the ways in which teachers construct student writing (Freedman) to the machine scoring of student essays (Williamson). While we could certainly devote an entire volume to the various issues now being addressed in writing assessment, the purpose of this category is to provide a strong introduction to such important issues as programmatic and extracurricular influences (Condon and Hamp-Lyons), gender (Haswell and Haswell), race (Ball), politics and power (Huot and Williamson), teaching and assessing English as a second language (Hamp-Lyons), and the tensions inherent in many assessment programs. The content of this section identifies some of the key issues we need to attend to not just in designing and implementing writing assessments but also in reading and evaluating the results of our (or others') assessments. This section can also serve as a starting place for readers discussing writing assessments with institutional administrators or testing staff, helping them frame questions and articulate concerns that relate to a specific assessment decision. Certainly, we cannot cover everything, though we hope to give readers a sense of many of the issues being addressed and of the wide range of concerns that writing assessment touches. In addition to the articles listed here, we include an appendix with additional readings for each of the sections and an annotation in the Table of Contents for each of the pieces in the volume.

Choosing the pieces and deciding on the categories for this volume has been quite an adventure. This version is quite different from the one we sent out for review: that version was well over seven hundred pages, and we have since revised more than half of the selections and kept only one of the original categories. We learned much from the reviewers, who made several suggestions about the selections we should keep, add, or delete. The reviewers also pushed on our conceptions of this book. We could not have done the job we finally were able to do without the expert help of Linda Adler-Kassner, Eastern Michigan University; Chris Anson, North Carolina State University; Norbert Elliot, New Jersey Institute of Technology; Anne Herrington, University of Massachusetts Amherst; William Smith, retired from Oregon Health & Science University; and two anonymous reviewers. In some ways, this project was like a puzzle: "Pick X number of readings on your area of expertise that make a statement about the area within X number of pages." As with all puzzles, we learned a lot and eventually arrived at a spot in which the pieces seemed to fit. We hope they fit for you, our readers. What is even clearer to us now than when we first started is that a sourcebook on writing assessment for those who teach writing and administer writing programs at the college level is an important and needed resource. We are grateful to Bedford/St. Martin's for giving us the opportunity and support to produce such a volume and for assigning Leasa Burton to work with us. She and her assistant, Sarah Guariglia, have been insightful and supportive in too many ways for us to mention. This volume would not be as good without their help. We would also like to recognize and thank Joan Feinberg, Denise Wydra, Karen Henry, and Emily Berleth.

Brian Huot, Hudson, Ohio

Peggy O’Neill, Baltimore, Maryland

WORKS CITED

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association. 1999.

Angoff, William H. "Validity: An Evolving Concept." Test Validity. Eds. Howard Wainer and Henry I. Braun. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988. 19-32.

Cherry, Roger and Paul Meyer. "Reliability Issues in Holistic Assessment." Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations. Eds. Michael Williamson and Brian Huot. Cresskill, NJ: Hampton P, 1993. 109-141.

Cooper, Charles and Lee Odell, eds. Evaluating Writing: Describing, Measuring and Judging. Urbana, IL: NCTE, 1977.

Cronbach, Lee J. “Construct Validation after Thirty Years.” Intelligence Measurement, Theory and Public Policy: Proceedings of a Symposium in Honor of L. G. Humphreys. Ed. R. L. Linn. Urbana and Chicago: U of Illinois P, 1989. 147-171.

- - -. “Five Perspectives on Validity Argument.” Test Validity. Eds. H. Wainer and H. I. Braun. Hillsdale, NJ: Lawrence Erlbaum, 1988. 3-17.

Diederich, Paul B., John W. French, and Sydell T. Carlton. Factors in Judgments of Writing Quality. Princeton: Educational Testing Service, 1961. RB No. 61-15. ED 002 172.

Elliot, Norbert. On a Scale: A Social History of Writing Assessment in America. New York: Peter Lang, 2005.

Fuess, Claude. The College Board: Its First Fifty Years. New York: College Entrance Examination Board, 1967.

Godshalk, Fred I., Frances Swineford, and William E. Coffman. The Measurement of Writing Ability. Princeton, NJ: Educational Testing Service, 1966. CEEB RM No. 6.

Huot, Brian. "The Literature of Direct Writing Assessment: Major Concerns and Prevailing Trends." Review of Educational Research 60 (1990): 237-263.

- - -. (Re)Articulating Writing Assessment for Teaching and Learning. Logan, UT: Utah State UP, 2002.

Ittenbach, Richard F., Irvin G. Esters and Howard Wainer. "The History of Test Development." Contemporary Intellectual Assessment: Theories, Tests and Issues. Eds. Dawn P. Flanagan, Judy L. Genshaft and Patti L. Harrison. New York: Guilford, 1997. 17-31.

Lloyd-Jones, Richard. "Primary Trait Scoring." Evaluating Writing: Describing, Measuring and Judging. Eds. Charles R. Cooper and Lee Odell. Urbana, IL: NCTE, 1977.

Lynne, Patricia. Coming to Terms: A Theory of Writing Assessment. Logan, UT: Utah State UP, 2004.

Mayrhauser, Richard T. von. "The Mental Testing Community and Validity: A Prehistory." Evolving Perspectives on the History of Psychology. Eds. Wade E. Pickren and Donald A. Dewsbury. Washington, DC: American Psychological Association, 2005. 303-321.

Messick, Samuel. “Meaning and Values in Test Validation: The Science and Ethics of Assessment.” Educational Researcher 18.2 (1989): 5-11.

O'Neill, Peggy, Ellen Schendel, Michael Williamson, and Brian Huot. "Assessment as Labor and the Labor of Assessment." Labor, Writing Technologies, and the Shaping of Composition in the Academy. Eds. Pamela Takayoshi and Patricia Sullivan. Cresskill, NJ: Hampton P, 2007. 75-96.

Starch, D. and E. C. Elliott. “Reliability of the Grading of High School Work in English.” School Review 20 (1912): 442-457.

White, Edward M. Teaching and Assessing Writing. 2nd ed. San Francisco: Jossey-Bass, 1994.

Yancey, Kathleen Blake. "Looking Back as We Look Forward: Historicizing Writing Assessment." College Composition and Communication 50 (1999): 483-503.



[1] These tests of usage and mechanics used to assess writing ability, and later generations like the COMPASS test (an untimed editing test delivered on computer), were called indirect writing assessment, a term we refuse to use, since we believe that writing assessment must include students actually writing.

[2] These rates of reliability were often not high enough in a strict psychometric sense for the scores produced to stand by themselves (Camp; P. Cooper). The Godshalk study, for example, included scores from multiple-choice tests as well as holistically scored essays.

[3] For a discussion of the history of intelligence testing and its connection to the history of writing assessment, see the history chapter in our upcoming volume on writing assessment for writing program administrators and writing teachers.

[4] This version of the Standards was the fifth to be produced since the 1950s; it is regularly updated every ten years or so to reflect the professional standards for educational and psychological testing.

[5] The appendix includes the bibliographic information for this work and the others mentioned that are not included in the book's contents.
