Key URLs and Links from Talks

Brian Huot:
The Big Test by Nicholas Lemann
On a Scale: A Social History of Writing Assessment in America by Norbert Elliot
Standards for Educational and Psychological Testing (1999) by AERA, APA, and NCME
Assessing Writing: A Critical Sourcebook by Brian Huot and Peggy O'Neill

Bob Cummings/Ron Balthazor:
No Gr_du_te Left Behind by James Traub
EMMA, UGA's electronic writing and e-portfolio environment

Marti Singer:
GSU's Critical Thinking Through Writing Project



Wednesday, October 24, 2007

8:45 – 10:15: Understanding and Using Mandates for Writing Assessment as Opportunities for Improving Our Programs and Teaching

Brian Huot, Kent State University Department of English

Along with Peggy O'Neill, Brian Huot is the editor of a soon-to-be-published collection of essays on assessment from Bedford/St. Martin's called Assessing Writing: A Critical Sourcebook.

To learn more about Brian and Peggy, see Assessing Writing: About the Editors.

To see an annotated table of contents, see Assessing Writing: Contents.

To read the introduction to the collection, see Assessing Writing: The Introduction.



Brian's talk:

Raw, directly typed notes. Typos and errors are mine (Nick Carbone), not Brian's.

History
History of the search for reliability, mainly inter-rater reliability. If you had a target, reliability would be how consistently you hit the same part of the target: it's about consistency more than being dead center.

1912: Starch and Elliot on how English teachers don't agree on the grades they give students.

1930s: the SAT is new and could help with scholarships. Used for students who needed to get scholarships and whose information had to get in sooner.

12/1941, WWII: the SAT becomes part of the national security zeitgeist. The SAT grows as the CEEB gives the ECT (English Composition Test) – they write prompts, and teachers read and grade. No reliability.

Nicholas Lemann, _The Big Test_: ETS founded in 1947.

Norbert Elliot, _On a Scale_.

In the 1950s it was common to assess writing with no writing at all.

Now we have computer scoring – e-rater, ACCUPLACER – and the scores are more reliable than what you get with human readers (but at the sacrifice of validity).

We have moved from an arena where you have people who cannot agree to a machine that always agrees with itself.

Validity
Intelligence testing takes off at the turn of the century in response to laws demanding universal education. Children whose parents never went to school come to school, and they're hard to teach. What worked for the privileged didn't work for the masses. So testing began: intelligence testing starts measuring students to find out why they can't learn.
We don't hear validity defined, and we don't hear a lot about it. People trusted testmakers to affirm that their tests were valid for measuring the thing they purported to measure.

Validity: how well a particular measure correlates with other measures of the same thing. It becomes a circle of one test checking another.

Traditional approach: does a test measure what it purports to measure?

In the '50s, a test is valid if it serves the purpose for which it is used. This raises a question of worth and value: what's the value of this test? Do you have a rationale for having this assessment? How will it increase…

Three keys:

Content validity – is the test on the content and skills it purports to measure? Are writers writing?

Criterion validity – is it a consistent measure and does it match other measures?

Conformity of the measure with the phenomenon of interest (i.e., construct validity) – does it support a theory of what is being measured?


Compass Test: an indirect test of writing (yeah right). Instead of a grammar test, it's an untimed editing test on a computer, and that information is used to place you in writing courses (750,000 students use it).

To get validity evidence, they took a group of essays, and the correlation between Compass scores and scores on the essay test was close enough to count as a match (criterion validity).
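A minimal sketch of what that criterion-validity check amounts to in Python; the editing-test and essay scores below are invented for illustration (they are not data from the talk), and a high correlation is only the "criterion" part of the argument, not evidence about the decisions made with the scores:

    import math

    def pearson_r(xs, ys):
        """Pearson correlation between two equal-length lists of scores."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    # Hypothetical scores for ten students: an untimed, Compass-style
    # editing test and holistic scores on an essay test.
    editing_scores = [62, 71, 55, 80, 90, 48, 77, 66, 84, 59]
    essay_scores = [3, 4, 2, 4, 5, 2, 4, 3, 5, 3]

    r = pearson_r(editing_scores, essay_scores)
    print(f"criterion-validity correlation: r = {r:.2f}")
    # Huot's point: the correlation by itself says nothing about the
    # placement decisions made with the scores or their consequences.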

The degree to which an argument can be made for integrated…

• it's about judgment
• it's about decision making
• it's not about the measures you use, it's about the decisions you make
• validity is partial and always ongoing
• it assesses the assessment
• it's about consequences

For example, if instructors will do some things and not others, you can't force them or measure the thing they won't do.

Reliability is subsumed under validity; any argument about validity must consider reliability. Newer assessment models where folks don't give scores work better. So instead of holistic scoring, use a scheme that asks teachers which classes a student should be in.

If assessment has no benefits for teaching and learning, don't do it. But assessment is part of our jobs and can have a positive impact.

If you know a program is working well, assessment can protect it. Do your own assessment or assessment will be done to you.

_Standards for Educational and Psychological Testing_ (1999): not from NCTE or CCC; this is the measurement community saying what counts as ethical and responsible use of assessment.

Research

Kent started a new program. They moved the second course to sophomore year, moving from a literature-based CT to a process-based comp program with more computer use and multimodal texts.

Prepare a folder:

syllabus – all assignments and handouts;
samples of student writing – above, at, and below what students should be doing;
a one-page self-assessment of how the class went from the teacher's point of view.

Pay $100 per folder. Ten teams with four people on each team, and every team will read four portfolios. Pay readers $100 to read. Cost for a large program: $15,000.00.
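A quick back-of-the-envelope version of that budget, in Python. Only the $100-per-person figure and the roughly $15,000 total come from the notes; the folder count and the exact split between folder preparers and readers are assumptions for illustration:

    # Hypothetical cost sketch for one portfolio-review cycle.
    pay_per_folder_prepared = 100      # instructor paid to assemble a folder
    pay_per_reader = 100               # each reader on a review team

    teams = 10
    readers_per_team = 4               # 40 readers total
    folders_prepared = 110             # assumed size of a "large program"

    reader_cost = teams * readers_per_team * pay_per_reader   # $4,000
    folder_cost = folders_prepared * pay_per_folder_prepared  # $11,000
    total = reader_cost + folder_cost
    print(f"estimated cost: ${total:,}")                      # ~$15,000 ballpark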

This will give them a snapshot of what their program is doing, and they will get a sense of how well courses are meeting the requirements and goals of the new curriculum. It is kept separate from personnel evaluation.

People will also get a sense of cool ideas from other classes.

This will give a sense of what's going on.

They will write a report to administration on what is going on. They will see patterns; if certain goals aren't being met, they can address that.

No student performance data except for the samples. In year five they will do deeper research on student performance.

Opportunity

Assessment can be a community-building activity. People can talk about teaching and students, but you need to do it so people don't feel under the gun. Research in the second year is not evaluative but descriptive, because they are asking people to change the way they're teaching. It's a radical change for some, for example, teaching in a computer classroom. So an adjustment period is required, and it needs nurturing, not chain-ganging, of instructors.

10:30 – 11:15: Outcomes-Based Assessment and Writing Portfolios: Feedback for Students, Insights for Instructors, Guidance for WPAs, and Data for VP

Robert Cummings, Columbus State University English Department, for himself and Ron Balthazor, University of Georgia at Athens English Department

Bob added the following comment, which I want to elevate because you'll find it helpful:

Ron and I would love to hear your opinions, positive or negative. Please feel free to contact us at cummings_robert1@colstate.edu and/or rlbaltha@uga.edu.

Nick's raw notes on Bob's and Ron's presentation

Bob: Evaluating Levels of Cognitive Thought

Context: the Spellings Commission is leveraging NCLB-like practices to go after colleges via calls for accountability assessment.

quote from "No Gr_du_ate Left Behind," James Traub, NYT Magazine, 9/30/07

To anticipate NCLB-like tests of learning to "prove" education, testing companies are gearing up. For example, ETS has Major Field Tests. From what Bob and Ron could see, the English major questions were canonical, like GRE questions. These are tests designed for broad-scale assessment using multiple-choice exams.

The purpose is to demonstrate learning outcomes for a field. Administrators go to these tests because they need to meet the purpose the tests claim to serve: they need to show some constituent that learning is happening.

Meanwhile, the University System of Georgia is considering moving to outcomes rather than credit hours as a way to measure achievement. If outcomes-based testing is coming, then it's worth keeping this idea in mind:

"If the proverbial gun is being pointed at our collective heads, how can we improve our current assessment systems to meet these demands without shooting ourselves first?"

What teachers don't want:

* Bubble-sheet exams to measure outcomes that attempt to be nationally normed.

* Machine readers for student writing. (Computers are fast, efficient, and dumb.)

* Or human readers in Bangalore. (Shorthand for not wanting to outsource assessment.)

* Exams which are external to our curriculum.

* Assessment which determines our curriculum.

What teachers want:

assessment that allows us to teach to our strengths and capitalize on our passions in determining curriculum

FYC Directors want:

to see who is learning to write better according to disciplinary measures and consistent with our values

assessment that helps teachers strengthen their pedagogy.

U. Administrators want:
assessment that is quantitative, independently verified, and broad enough to allow comparisons across system institutions so every institution is a winner in some way.

We have to have some type of quantitative element to assessment because of ETS/College Board precedents.

How do we meet all these disparate wants?

One idea is to use an outcomes-based e-portfolio assessment.

At UGA the e-portfolio requires a reflection, two revised essays, a bio with an image, peer review, and so on.

The end-of-semester e-portfolio is a capstone and must be a substantial part of the course.

The reflective intro would be keyed to the Board of Regents' first-year composition outcomes, describing how writing artifacts from the course satisfied those outcomes.

Student reflective essay connects course artifacts in portfolio to the BOR outcomes

Persuade the reader the artifacts meet the outcomes.

The instructor of record uses a holistic rubric to assess the outcomes reflective essay.

The same essay is also read by another FYC instructor in the USG system, who scores the reflective essay with the same rubric as the instructor of record.
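A minimal sketch of the interrater check this double scoring makes possible, assuming a four-point holistic rubric and invented score pairs (the data and thresholds here are illustrative, not part of the UGA/USG proposal):

    # Each pair: (instructor-of-record score, second USG reader's score)
    # for the same reflective essay, on a shared 1-4 holistic rubric.
    score_pairs = [(3, 3), (4, 3), (2, 2), (4, 4), (1, 2), (3, 4), (2, 2), (3, 3)]

    n = len(score_pairs)
    exact = sum(1 for a, b in score_pairs if a == b)
    adjacent = sum(1 for a, b in score_pairs if abs(a - b) <= 1)

    print(f"exact agreement:    {exact}/{n} = {exact / n:.0%}")
    print(f"adjacent agreement: {adjacent}/{n} = {adjacent / n:.0%}")
    # Exact-plus-adjacent agreement is a common first-pass reliability
    # check; a fuller analysis would also look for one reader scoring
    # systematically higher or lower than the other.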

What would this do?

  1. Have departmental discussions about what works.
  2. Increase student transference: students know what they learned sooner and are able to apply it in other courses.
  3. Students, teachers, and depts have artifacts to share with external stakeholders: parents, employers.

Problems with a portfolio outcome assessment

Students

What if I don't want to switch to portfolios?

What's an outcome?

Teachers

Is this extra grading?

I don't want to use portfolios.

I don't use an e-writing platform. I like to hold the paper. I like to grade in bed. I don't like CompClass.

Administrators

Will reflective essays yield authentic prose?

Will my teachers suspect the panopticon?

Next step: Create a pilot.

Hardest thing to overcome after a pilot? Most likely the work, says Bob.

11:15 – 12:00 Assessing Critical Thinking through Writing: a University-wide Initiative

Marti Singer, Georgia State University's Rhetoric and Composition Program



Nick's Raw Notes From Marti Singer's Presentation



Key URLs from Marti's talk:
Critical Thinking Through Writing: http://www.gsu.edu/ctw/

Marti currently chairs the SACS accreditation committee.

Leading up to that, she was part of a committee that worked on establishing university-wide learning outcomes (LOs). In 2003 that committee came up with the following LOs:

Written communication; critical thinking; collaboration; contemporary issues; quantitative skills; technology; and, added later, oral communication.

People on campus were aware of these new outcomes, but not always involved. Thus the LOs were on paper, but not yet in the curriculum, as was learned from 2005 reports on LO implementation university-wide.

In 2006, the university bought into Weave Online – an outcomes-based software program that helps you analyze progress on outcomes.

In fall 2006, Marti asked to be chair of the university-wide assessment committee to analyze the Weave results.

The Assoc. Provost for Institutional Effectiveness had them look at one thing for SACS, and they decided everyone cares about thinking and writing, so that became the focus. WAC helped, and they developed the Critical Thinking Through Writing initiative: http://www.gsu.edu/ctw/

They are using University Senate committees to help get faculty buy-in. A motion was made to require every student at Georgia State to take two courses that offer CTTW.

They have a university-wide assessment committee (a big committee, 15 people) that approves the CTTW course for each major.

The next step was a group called coordinators who offer workshops to teach and help faculty create assignments that meet CTTW goals: what is critical thinking in a given discipline? Hard to do.
Then when you add the writing piece, it's even harder.

How do you encourage disciplines to come on board if you say their writing has to look like mine?

As of today, all of the ambassadors (46) are to put forth and share the drafts of their initiative plans and the courses they are going to use to meet the CTTW goal.

Next: look at the drafts and do follow-up workshops in November.

They have to define critical thinking
and describe why they chose the courses they did.

1:00 – 1:45: Restrict Rubrics to Regain Rhetorical Responsibility

Rebecca Burnett, Georgia Institute of Technology, School of Literature, Communication, and Culture (LCC)


Nick's Notes:

Restrict Rubrics to Regain Rhetorical Responsibility:

Rubrics have a huge benefit because of workload; they make it possible for time-challenged instructors, placement readers, and other assessors to provide quicker feedback on writing. (Rubrics are often necessary for survival.)

But there is a tension between rhetorical theory and how rubrics are applied.

Rubrics have an inherent caste system:

There is the one-size-fits-all rubric
or
the custom rubric

One-size-fits-all example: superficial commercial tools; Rebecca recalled seeing a site that boasted instructors could "create rubrics in seconds." It's not just a commercial issue, however; many higher ed sites (including lots of WAC programs) offer examples of rubrics that fit this model. Maybe not intentionally, but certainly when an "example" rubric is taken and applied without any thought to customizing it.

Custom rubrics:
let you set item and category criteria; when enacted programmatically, they enable raters to compare artifacts. This use of rubrics makes certain assessments not only easier but possible, e.g., home-grown placement exams where a program designs rubrics to match its curriculum and course goals.


Student benefits to rubrics:
• assignment specific feedback
• enable one kind of self-assessment (Herb Simon says experts can self-assess, and teaching students to self-apply rubrics can help them to do that).
• encourage peer assessment – students use language and criteria of rubrics as guide to critically (and constructively) reading peers' essays.
• identify competence and weakness


Teacher Benefits:
• focus and sharpen thinking on expectations
• facilitate movement of instructors to new courses
• provoke new thinking about a process or an artifact
• provide consistency in multi-section courses
• support Composing or Communication Across the Curriculum

Admin Benefits
• demo that courses deal w/ considerably more than "correctness"
• provide benchmarks for rater consistency, providing reliability for large-scale assessment.

Yet, for all these benefits, major and important-to-consider complications exist.

To excavate these complications, Rebecca passed out a copy of a rubric she found at the WWW site of a Research I university. The rubric is used to help faculty across the curriculum know what good writing is and to give students feedback on the criteria that go into good writing. (Table approximated here, nc):


Scale: excellent – good – fair – poor

WRITING QUALITIES (each rated on the scale above):
• accuracy of content
• adaptation to a particular audience
• identification and development of key ideas
• conformance with genre conventions
• organization of information
• supporting evidence
• spelling
• punctuation
• grammar
• usage

Some of the issues with the above, as brought out in the discussion Rebecca led:

No way to know how to apply rubric – checks, numbers, scores?

Concepts were vague and some things overlapped – e.g., genre and audience.

Forty percent of it (4 of the 10 criteria) is on mechanics and surface errors.

The reality is that this rubric form, or something akin to it, is reproduced all over the place. And people feel like they're doing a good job because they're using it.

On the plus side, at least a 4-point scale does force a choice about the writing (Brian Huot noted from the audience); with an odd number of scale options, people tend to choose the middle-of-the-road option a disproportionate number of times.

Other inherent rubric problems: what will the rubric encourage and reward?

• will it reward risk-taking
• or will it encourage conformity.

A lot depends on the content of the rubric – whether it encourages risk-taking or conformity.

NC questions: Can you write a rubric that encourages risk-taking? How do you do that and apply it?

Synergy of Communication is lost when using rubrics:

An argument is not inherently persuasive, nor is it persuasive in isolation.

Instead, an argument is persuasive for a particular situation, to a particular audience.

That synergy is lost in the way rubrics are typically presented.

Rubrics by their very nature create bounded objects of rhetorical elements. That is, they isolate qualities as distinct when in reality many of those qualities can only be inferred and judged in their synergistic relationship to other qualities. You cannot, in reality, separate a consideration of idea development from one of audience. How much development an idea needs often depends upon who the audience is, what the argument to that audience intends, how much space one has to write in, and other factors.

NC questions: is it possible to apply rubrics with synergy kept in mind? Can you assess development in light of intended audience? If so, how does the rubric scale and scoring communicate to the writer that the judgment on development was tied to an understanding/judgment on audience?

How might that look? What if a rubric was based on rhetorical elements? asks Rebecca.

Rhetorical Elements
• sufficiency and accuracy of content PLUS
• culture and context
• exigence
• roles/responsibilities
• purposes
• audiences
• argument
• organization
• kinds of support
• visual images
• design
• conventions of language, visuals, and design
• evaluation criteria

Think about these things in an interrelated way – some at the forefront of the mind, some absorbed/assumed or even native.

It's not about the specific elements, or that you list them all, but how they interact.

An alternative scale to excellent/good/fair/poor might use these terms:
• exemplary
• mature
• competent
• developing
• beginning
• basic

(These come from the Communication Across the Curriculum program at ISU, where Rebecca taught before joining Georgia Tech.)


GT is now using a WOVEN curriculum:

Written communication
Oral communication
Visual communication
Electronic communication
Nonverbal communication
…individual and collaborative
…in cross-cultural and international contexts

Assessment/Rubric Features:
• idiosyncratic (match to specific assignment)
• organic
• synergistic
• self-assessing
• methodologically legitimate
• theoretically supportable

Activity we did:
Develop an assessment plan for an assignment where students were doing a health project. The scenario: they work for a company and HR wants to do a health campaign. The students work in teams to develop five pieces:
posters, 15-second "push" phone messages, a PowerPoint with voiceover, a memo, and a booklet on better health. Students are sent to appropriate government and health WWW sites and print sources to research the necessary data.

At our workshop, tables worked on what they would do to develop a synergistic rubric for such an assignment. RB said rubrics can be for formative and/or summative assessment.
I don't have as many notes on the table reports because I was spending too much time listening. I do recall that our table had some disagreement on what we would emphasize: an overall project rubric, or rubrics for parts of the project. Brian H. noted that because each of the five elements was different in medium, purpose, and audience (the memo was to company heads as a progress report, for example), each would need its own rubric.

Others felt the whole project needed a unifying rubric. I thought the challenge was finding a way to do both.

NC final thoughts and questions: I remember a room consensus that ideally the feedback would come from seeing what worked. In a real office, the effectiveness of the campaign would be measured by changes in employee behavior and their reception of the campaign. But in fact, a lot of the feedback on such a project wouldn't be rubric-based. It would be discussion based, meeting based, and workshop based. The team and the team's managers would meet to discuss the campaign, to answer questions on why something was the way it was and why they thought it was effective. Assessment would be in the form of acceptance, rejection, or acceptance with revisions (not so dissimilar from academic journal procedures, only faster).

NC questions: Can you really create a rubric that approximates that dynamic in some way? What if the rubric were used in a feedback meeting with each team, in a teacher conference?

1:45 – 2:30 Collaboration and Transparency: Creating and Disseminating Program Assessment Materials

Karen Gardiner and Jessica Kidd, University of Alabama English Department


Nick's Notes:

UA is building a new writing program. The old program had (and this is my word, not our speakers') atrophied, fallen into a kind of autopilot where adjunct and TA training had faded and folks were teaching pretty much how and what they wanted in the FYW course.

TA training wasn't working

FYWP was inconsistent, no common goals, no common course curriculum

Carolyn Handa came in as the new WPA and has helped them make constructive changes and strides.

At about the same time as Carolyn began, the Dean of the school (a math prof by training) became interested in learning-centered teaching (because of the SACS review).

So there was a need to redesign the FYWP both for the sake of program integrity and to meet the pressure from above that the courses be more learning-centered (rather than lecture-centered, as many of the FYW courses had become).

In fall of 2005, C. Handa started a composition committee to derive goals, a mission statement, and outcomes for FYC. She reached out to all levels of the dept.: grads, adjuncts, tenure-track faculty, and so on. Using the fact that there was pressure (and support) from the top, Carolyn moved to change the program from the roots up, which is key to buy-in from the key people – those who will teach the new courses and curriculum.

People were excited and engaged to be part of the project. They started by creating a mission statement, which they wanted to have these qualities:

  1. The mission statement needed to reflect UA's and UA's A&S's mission statements.
  2. It needed to address the whole spectrum of FY students – from ESL to Honors.
  3. They wanted to avoid negative language.
  4. They didn't want to claim they were handling all writing-teaching needs for everyone.

Next, they established goals, objectives, and then outcomes for each course.

To see what they accomplished, including the mission statement, go to http://www.as.ua.edu/fwp

To see how a particular course was given Goals, Objectives and Outcomes (GOO), go to

http://www.as.ua.edu/fwp/101cg_ro.html

Happening now:

Ongoing process, not nearly done yet.

The mission this semester is to move from GOO to rubrics. The process is recursive – establishing rubrics makes them look back and see that some goals/objectives/outcomes are impossible to do or not clearly defined.

The process and discussion remains highly collaborative – lots of emailing to lots of people.

Also important to the process is keeping things very transparent, in 4 ways:

  1. Simply creating a WWW site so people can visit and see what is going on. (They couldn't get the elephant to sing the Alabama fight song.) It lets them share quickly with constituents of all kinds what they are doing, and it helps when talking to high school teachers.

  2. Created a customized handbook with the Comp Program. They use A Writer's Reference and Comment. They are trying to brainstorm and then have Comment coded in a tab and so on to make things easier for students. Comment instructions and the program's goals, objectives, outcomes, and other materials are part of the book. It is a really important tool for getting program information into the hands of the people who need it, and it can change each year to reflect program changes and new policies.

  3. The site and handbook combine to help with TA training. Before, there were prescribed texts picked by someone else; the TA was given the book and assignments to use and sent to class to work on their own. TAs can now pick their own books and use anything that works for them. TAs take a 3-credit-hour pedagogy course where they select and analyze texts, create a syllabus, and analyze and create course materials. Then, before teaching, there is a week-long orientation. There are also four formative teaching observations (2 by peers, 2 by faculty), student opinion surveys on the class and learning, and a teaching dossier that uses the Dean's form. The GOO also helps with teacher growth.

  4. The awareness that comes with all this creates opportunities for outreach. Karen's interest is the high-school-to-college transition, and she is working on alignment. They did a workshop with 25 high school teachers. The HS teachers wanted to see what the college was urging so they could compare it with what they were doing. So they worked on the fly on creating assignments that HS teachers could give their students to ready them for college.

http://www.thenexus.ws/english

Karen's personal goal is to create a site where HS teachers can find info on college expectations. GOOs are not just for one's own program, but also for others, especially HS teachers and students.

NC observation: It's also worth noting that the transparency and public information do two things that are important: they promote and advertise the program change, and they show U. admins that progress on meeting UA outcomes and learning objectives is being made. Also, rubrics and other elements will provide benchmarks for the program to measure its progress over time, setting up a way for fine-tuning, program self-assessment, and continued innovation.

High School connection helps with recruitment, but more importantly retention. All good stuff.

2:45 – 3:30 Using the Web to Enable Authentic Assessment Practices

George Pullman, Georgia State University's Rhetoric and Composition Program.


Nick's Notes:

Activity for table to set discussion stage and get people thinking:

1. What is assessment? Define it in two sentences (one would be better).

2. What is grading?

3. How do you make assessment not an add-on to the act of teaching?

4. How do you increase learning w/out increasing teaching? (NC: Some days I want to say, how do you not.)

+++++++++++++++++++++++
GP says: as WAC coordinator, I tell people about revision, and they have students go off and write a ten-page paper and then ask them to revise it. Now he is trying to ask for thought experiments in writing, because writing captures student engagement. He also wants to capture faculty response to that thinking.

He has set up a WWW site/database where thought experiments via writing can be created, assigned, and assessed via rubrics and written feedback by both instructors and students. The goal is to have lots of short thinking pieces, a kind of writing to learn, but tied more specifically to critical thinking through writing (a la Marti's talk).

He asks depts to write and attach their own rubrics for these experiments in thought. The real value for the department is the conversation involved in creating the rubric.

Another goal is to have content and data that can be used for study and review of the programs.

Trying to get people to construct assignments they are often not used to constructing.

The program is an elaborated email system with a database backend. You can set assignments, rubrics, and peer review, and the work goes out via email.

Write an assignment and attach rubrics. Rubrics offer dimensions for an assignment.

Rubrics are created by departments, not by individual instructors.

+++++++++++++
RB says: the assignment interface and use of text only make the assumption that communication is entirely about words. At Georgia Tech, RB tries to teach that design is part of communication and not separable. George agrees that the system he has designed doesn't accommodate the kinds of composition or critical thinking beyond text that Rebecca would need or want to use. For example, says GP, people in studio design don't want to use the program, because they think visually and the program is prose-based.

The goal, says GP, is attempted thought, not finished or even complete thought, and then getting feedback and re-thinking.

3:30 – 4:00: What have we learned?

We'll use this half hour to thank our discussion and workshop leaders and to summarize what we've learned.

Tuesday, October 23, 2007

Assessing Writing: About the Editors


About the Editors

Brian Huot has been working in writing assessment for nearly 20 years, publishing extensively in assessment theory and practice. His work has appeared in a range of journals including College Composition and Communication, College English and Review of Educational Research as well as numerous edited collections. He is one of the founding editors of the journal Assessing Writing, and more recently the Journal of Writing Assessment, which he continues to edit. He has co-edited several scholarly books, and in 2002 he published (Re)Articulating Writing Assessment for Teaching and Learning. He is currently at work on the Handbook of College Writing Assessment, co-authored with Peggy O’Neill and Cindy Moore. He is Professor of English and Coordinator of the Writing Program at Kent State University.

Peggy O’Neill’s scholarship focuses on writing assessment theory and practice as well as writing program administration and the disciplinarity of composition and rhetoric. Her work has appeared in journals such as College Composition and Communication, Composition Studies, and the Journal of Writing Assessment. She has edited or co-edited three books and is currently co-authoring the Handbook of College Writing Assessment with Brian Huot and Cindy Moore. She serves as co-editor of the Hampton press scholarly book series Research and Teaching in Rhetoric and Composition and on the editorial board of several journals. She is an associate professor and director of composition in the Writing Department at Loyola College in Maryland.

Assessing Writing: Contents

introduction

FOUNDATIONS

  1. Direct and Indirect Measures for Large-Scale Evaluation of Writing

Ramon Veal and Sally Hudson

Veal and Hudson summarize the differences between holistic, analytic, and primary trait scoring, and provide the research basis for comparing different commonly used writing assessment procedures.

  2. Holisticism

Edward M. White

White offers a strong theoretical argument for holistic scoring, and he reviews various theories of reading and interpretation from the scholarly literature and applies them to the reading of student writing. He furnishes a strong theoretical basis for the holistic scoring that ranges beyond the need for training readers and producing scores.

  3. Reliability Issues in Holistic Assessment

Roger Cherry and Paul Meyer

Cherry and Meyer provide detailed discussions of both instrument and interrater reliability, and the article distinguishes between different kinds of reliability and their uses while supplying relevant formula and description of how to best calculate interrater reliability in scoring sessions.

  4. The Worship of Efficiency: Untangling Theoretical and Practical Considerations in Writing Assessment

Michael M. Williamson

Williamson links the history of pedagogical approaches to the development of writing assessment and the value of efficiency throughout the twentieth century, and he pushes on the traditional importance given to reliability and its inability to drive the most important (validity) aspects of any writing assessment.

  5. Can There Be Validity Without Reliability?

Pamela A. Moss

Moss builds upon work in educational measurement on validity and performance assessment, and argues for a new, more flexible understanding of reliability as a measurement concept by challenging traditional notions of reliability in educational measurement.

  6. Portfolios as a Substitute for Proficiency Exams

Peter Elbow and Pat Belanoff

Elbow and Belanoff demonstrate the value, efficacy and practicality of using portfolios to assess student writing at the college level and ground the use of portfolios in a particular program’s need to assess student writing, providing writing teachers and program administrators with a strong model for responding to the need to assess in positive productive ways.

  7. Changing the Model for the Direct Assessment of Writing

Roberta Camp

Camp chronicles the development of writing assessment from an educational measurement perspective and draws upon work in cognitive psychology and literacy studies to make the argument that once researchers were able to furnish a more detailed and complicated picture of reading and writing, writing assessment developers were able to follow suit in developing more direct and authentic forms of writing assessment.

  8. Looking Back as We Look Forward: Historicizing Writing Assessment

Kathleen Blake Yancey

Yancey follows the development of writing assessment over a fifty-year period from a college writing assessment perspective and illustrates the ongoing importance of assessment for the teaching of college writing, even as the assessments themselves change. (This article originally appeared in a fiftieth-anniversary issue of CCC.)

  9. Testing the Test of a Test: A Response to the Multiple Inquiry in the Validation of Writing Tests

Pamela A. Moss

Moss responds to college writing assessment as a field about its use of empirical methods and its understanding of test validity and makes an argument for validity in writing assessment as ongoing reflective practice in which all test use must include a process of inquiry and reflection within which we describe the limitations of the test and decisions made on its behalf.

  10. Toward a New Theory of Writing Assessment

Brian Huot

Huot introduces college writing assessment to relevant literature from educational measurement, arguing that current theories of test validity can advance and support a new set of theories and practices for writing assessment and challenges current notions and uses of reliability and validity to foster a new agenda for writing assessment.

MODELS

  1. The Importance of Teacher Knowledge in College Composition Placement Testing

William L. Smith

Smith provides one of the first models for writing assessment that moves beyond holistic scoring, opening up the possibilities for a range of assessment models based upon local values and expertise, which is equally important for providing a strong model for validity inquiry in which each use of an assessment requires research into its accuracy, adequacy and consequences for the program, students and teachers.

  2. Adventuring into Writing Assessment

Richard Haswell and Susan Wyche Smith

Haswell and Smith move beyond holistic scoring and interrater reliability to argue that writing teachers and administrators with expertise in writing assessment can learn to change writing assessment and create a productive assessment culture at their institutions.

  3. Portfolio Negotiations: Acts in Speech

Russel K. Durst, Marjorie Roemer and Lucille Schultz

Durst, Roemer and Schultz report on a model for exit testing in which three teacher teams, or “trios,” read student portfolios. This communal approach to making important judgments about students based upon a collection of their work breaks new ground and provides a strong model of a program that makes decisions about students and helps to create an assessment culture in which teachers talk with each other about their students and teaching.

  4. Directed Self-Placement: An Attitude of Orientation

Daniel Royer and Roger Gilles

Royer and Gilles authored this breakthrough piece that establishes the possibility for students to make their own decisions about placement into first-year writing and argues that empowering students to make their own placement decisions is theoretically sound and promotes learning and responsibility, creating basic writing courses comprised of self-selected students.

  5. WAC Assessment and Internal Audiences: A Dialogue

Richard Haswell and Susan McLeod

Haswell and McLeod model a series of “mock” conversations in which a writing assessment researcher and an administrator work through various problems in presenting assessment and outcomes data to various administrators and audiences throughout the academy.

  6. A Process for Establishing Outcomes-Based Assessment Plans for Writing and Speaking in the Disciplines

Michael Carter

Carter offers a practical, hands on explanation of how to conduct outcomes assessment and argues that outcomes assessment can be valuable beyond the need for accountability and can provide institutions and writing program administrators with important information to enhance teaching and learning.

ISSUES

  1. Influences on Evaluators of Expository Essays: Beyond the Text

Sarah Warshauer Freedman

Freedman reports on a study that explores the effects on holistic scores given to college students' expository essays due to three variables—essay, reader, and environment—and finds, as Smith did years later, that raters were the chief influence on students' scores.

  2. “Portfolio Scoring”: A Contradiction in Terms

Robert L. Broad

Broad challenges the drive for interrater reliability associated with holistic scoring and procedures such as norming that prize consensus and agreement, arguing for “post-positivist” methods of assessment that are situated and located, which value context and diverse perspectives and are more theoretically aligned with writing portfolios.

  3. Questioning Assumptions about Portfolio-Based Assessment

Liz Hamp-Lyons and William Condon

Hamp-Lyons and Condon report on a study of portfolio readers who evaluated portfolios for exit from a first-year writing practicum, identify five assumptions typically made about portfolio assessment, and discuss them in light of the study’s findings and the authors’ experiences, concluding that portfolios are not inherently more accurate or better than essay testing depending on how they are used.

  4. Rethinking Portfolios for Evaluating Writing: Issues of Assessment and Power

Brian Huot and Michael M. Williamson

Huot and Williamson explore the connections between assessment and issues of power, politics and surveillance and contend that unless the power relationships in assessment situations are made explicit, the potential of portfolios to transform writing assessment—and positively influence teaching and learning—will be compromised.

  5. The Challenges of Second Language Writing Assessment

Liz Hamp-Lyons

Hamp-Lyons presents an overview of the challenges of assessing the writing of non-native English speakers and identifies some of the key differences in reading and evaluating texts by NNS and native speakers, concluding with a lengthy discussion of portfolio assessment. Since this article was published, the literature on assessing writing of NNS (or ESL) students is much more extensive; however, it provides a foundation for approaching the more recent work.

  6. Expanding the Dialogue on Culture as a Critical Component When Assessing Writing

Arnetha F. Ball

Ball addresses issues of teacher evaluation of writing produced by ethnically diverse students and reports on two studies that examine the rhetorical and linguistic features of texts and how they contribute to holistic assessment of student writing as well as a reflective discussion with African-American teachers about the evaluation of student writing.

  7. Gender Bias and Critique of Student Writing

Richard Haswell and Janis Tedesco Haswell

Haswell and Haswell provide an empirical study of how knowledge of writers' gender affects readers' evaluations of writing and discuss implications of the findings for teachers and assessors; their findings strongly suggest that readers are influenced by gender stereotypes in complex ways. The article also provides a thorough overview of gender and writing assessment.

  8. Validity of Automated Scoring: Prologue for a Continuing Discussion of Machine Scoring Student Writing

Michael M. Williamson

Williamson positions automated scoring within the broader assessment community and emerging concepts of validity and compares the way the composition and writing communities have considered these concepts to that of the educational measurement community’s position, challenging writing assessment professionals to become conversant with the field of educational measurement so that they can communicate effectively to those outside of their own community.

additional readings

about the editors

credits

index

Assessing Writing: The Introduction

Introduction

Now more than ever, writing assessment is a critical activity in our profession—with accrediting agencies, policymakers, and government organizations demanding evidence of learning from educational institutions. While the demand for assessment often comes from those outside of the field, it is still a critical component in teaching writing, creating curriculum, and developing programs. Increasingly, to be considered an effective writing teacher—not to mention a writing program director—means being able to respond to a variety of writing assessment needs confronting students, faculty, and institutions. However, for many practicing composition teachers and administrators, formal preparation in assessment was not part of their graduate education. In addition to a lack of preparation, many writing teachers and writing program administrators have negative views of assessment as a punitive force toward students, faculty and progressive forms of instruction. These fears of assessment are not unfounded although we believe that a powerful discourse like assessment can be harnessed for positive and productive writing instruction and administration. While many conversations about assessment seem to assume formal, standardized activities such as placement or exit testing, it is important to remember that assessment is a critical component of writing: writers self-assess and frequently seek evaluative input from readers throughout their writing processes. While our concern in this volume is not focused on the assessment writers do as they write, it does remind us that assessment activities are not hostile to writing but, on the contrary, contribute to success (Huot 2002).

The purpose of this collection of articles is to gather together readings that can help both practicing professionals and graduate students understand the theory and practice of writing assessment. We focus on assessment that functions outside of a single course—i.e., placement, exit testing, and program evaluation. While we realize the significance of response and classroom evaluation, and we see it as closely connected to large-scale assessment, we have not included articles about it in this volume. There just simply isn’t enough space. For example, portfolios could be the topic of a stand alone volume, but it is important for readers to see portfolios as part of the wider field, not isolated or distinct from other methods and not separate from assessment theory.

The readings collected here reflect a field of writing assessment that encompasses scholars, ideas, issues and research from a range of scholars in college writing assessment, K-12 teachers and researchers and educational measurement theory and practice. For example, we include a piece by Roberta Camp, who was an employee of the Educational Testing Service (ETS). Camp’s research on portfolios in the late 1970s and early 1980s was a precursor for the wealth of activity in portfolios later on in composition and education. While it is not unusual in college assessment scholarship to represent ETS and other educational measurement scholars in opposition to the models, theories and practices best suited to theoretically and pedagogically informed assessment practices (Lynne, White, Yancey and others), we see theories from educational measurement, especially the development of validity theory, as a strong platform for advocating for newer, more progressive forms of writing assessment. In addition to Camp, the volume also includes work from those like Pamela Moss who are not primarily associated with college composition and are from education and educational measurement. To illustrate what we mean about how theories from outside college writing assessment can be valuable and how this volume is an argument for a specific vision for writing assessment, we look at dual historical threads in writing assessment and assessment in general since the 1940s that illustrate our vision for the volume and the field of writing assessment. The first thread reviews the role of reliability in the development of holistic scoring of student writing, and the second looks at the development of validity as a psychometric concept. For us, these two threads—the development of reliable scoring of writing and the development of validity theory—come together to produce a rich understanding of writing assessment as a field.

Writing assessment has had reliability problems since at least the study by Daniel Starch and Edward Elliot in 1912 reporting that teachers couldn't agree on grades for the same student essays. Reliability, in this sense, referred to consistency of scoring across readers, what later became known as interrater reliability. By the beginning of the 1940s, the College Entrance Examination Board (CEEB) had been looking at the great success they had had piloting the Scholastic Aptitude Test (SAT) for college admission for special populations of students. The death knell for essay testing as the centerpiece of the CEEB was sounded when America entered the war in late 1941, and the CEEB immediately suspended essay testing (for good) because the then-new SAT could produce admission data for students more quickly, and this would contribute to America's war effort. Colleges, such as Princeton, accelerated instruction for students who had deferred military service and upon graduation would be called to active service (Elliot 100). The backlash from English teachers was strong and vociferous, causing John Stalnaker, the chief officer of the CEEB, to note:

The type of test so highly valued by teachers of English, which requires the candidate to write a theme or essay, is not a worthwhile testing device. Whether or not the writing of essays as a means of teaching writing deserves the place it has in the secondary school curriculum may be equally questioned. Eventually, it is hoped, sufficient evidence may be accumulated to outlaw forever "the write a theme on…" type of examination. (Fuess 158)

English teachers’ protest over the termination of essay testing forced the CEEB to develop the English Composition Test (ECT) which was read and scored by teachers at their home institutions –a fact test developers found amusing considering the labor involved (see O’Neill et al’s discussion of the labor involved in writing assessment), though it might also be taken as a sign of the commitment teachers had to the direct assessment of student writing (Elliot).

At any rate the ECT reinforced the ongoing problems with the ability of test developers to furnish a testing environment in which consistent scores from different readers for the same papers (interrater reliability) could be generated. Throughout the 1950s, test developers now primarily from ETS worked on trying to develop methods that would ensure agreement among raters. Ultimately this work was not successful, and ETS discontinued the ECT and created tests of usage and mechanics[1] to assess student writing (Elliot; Yancey). During the 1960s, test developers made two important breakthroughs. In 1961 Diederich, French and Carlton published a study in which fifty-three raters scored three hundred papers on a nine point scale. Although the vast majority (over 94%) of the papers received at least seven different scores, the researchers used factor analysis to analyze and determine the main influences upon the scores given by the raters. These five main factors became the basis of what would eventually become analytic scoring in which raters read and scored student writing according to their evaluation of an essay’s Ideas, Form, Flavor (style), Mechanics and Wording. The scores could be weighted depending upon the purpose of the assessment. A decade or so later, a team led by Richard Lloyd-Jones revised analytic scoring using categories germane to the writing task to create what came to be known as primary trait scoring, which was used in early writing assessment for the National Assessment of Educational Progress (NAEP). In a 1966 ETS research bulletin, Godshalk, Swineford and Coffman published a study detailing a set of procedures, which would eventually become known as holistic scoring, in which readers arrived at a single score based upon a set of criteria or a rubric through which they were trained to agree. By the mid 1960s, then, ETS researchers had developed the procedures to begin the modern era in writing assessment in which writing assessment actually contained student writing that was read and scored at reliable rates.[2] By the mid 1970s, direct writing assessment was a common practice in education and composition and a subject for the scholarly literature (Cooper and Odell).

While educational measurement test developers—largely from ETS—had seemingly solved the reliability problem for writing within a psychometric framework that demands "reliability be a necessary but insufficient condition for validity" (Cherry and Meyer 110), validity theorists headed by Lee J. Cronbach had begun to repudiate the positivist basis for validity in educational and psychological tests. Since the creation of educational and psychological measurement and testing in the later nineteenth and early twentieth centuries, validity had been considered the most important aspect for assessment (Angoff; Ittenbach, Esters and Wainer; Mayrhauser). Nonetheless, very little scholarship about validity appeared during the first half or so of the twentieth century (Angoff). Odd as it might seem to us, the validity of a specific test was often seen as something best left to the test author(s), since he or she would have the most information and the deepest understanding of how well a specific test or measure might work. The lack of scholarship in validity for the burgeoning field of assessment, which was in fact dominated by intelligence testing,[3] was also influenced by the development of psychometrics, the statistical apparatus for measurement, as well as the largely positivist paradigm within which most social science operated during the first half of the twentieth century.

J. P. Guilford's essay on validity in 1946 is the most substantive scholarship on validity in testing from the first part of the twentieth century and the best summary of the way in which validity was conceived and used. For Guilford, "a test is valid for anything with which it correlates" (429). This definition reflects both the acontextual nature of much positivist philosophy and the psychometric orientation of the social science and testing of that era. As Cronbach notes, validity's definition at this time as the degree to which a test measures what it purports to measure focuses on the test's accuracy – did it do what it said it would and how well did it do it? Subsequent notions that stipulated "that a test is valid if it serves the purpose for which it is used, raised a question about worth" (Cronbach, 1988, 5). This shift in focus from accuracy and truth (positivism is after all based upon a Platonic search for truth) to the value or meaning of a measure began the long road away from a positivist orientation for validity to one based upon a measure's value. Whether we focus on Samuel Messick's, Cronbach's or the most recent Standards for Educational and Psychological Testing (authored jointly in 1999[4] by the American Psychological Association, the American Educational Research Association and the National Council on Measurement in Education), validity as it is currently understood is about validating the decisions made based upon an assessment. Cronbach sums up this position and the logic that guides it in his memorable statement about validity and interpretation:

To call Test A or Test B invalid is illogical. Particular interpretations are what we validate. To say "the evidence to date is consistent with the interpretation" makes far better sense than saying "the test has construct validity." Validation is a lengthy, even endless process. (1989, 151)

The shift in focus for validity from the test itself to the decisions and interpretations we make based upon the test marks a movement away from a positivist orientation that looks for an ideal reality or “true score” and recognizes the contingent nature of human experience and value. This shift in focus from the test itself to the decisions made based upon a test also moves away from the form of the examination to its use. In other words, we cannot assume, assert or even argue that one form of assessment is more valid than another because validation is a local, contingent process. It is also, as Cronbach notes, an ongoing process in which every use of an assessment implies a series of inquiries into its accuracy, appropriateness and consequences for the learners and the learning environment. This local, contingent, fluid nature of validity and validity inquiry also marks a movement away from a fixed, positivist notion of truth to a more postmodern notion of reality as something in which value is constructed by individuals and groups to reflect the ongoing, changing nature of human experience. This localized positioning for validity works to deconstruct unequal power relationships in which a central authority is in control and in which European, middle class, male, heterosexual values are held above all others.

In short, once the CEEB decided to switch to the cheaper, quicker, more reliable SAT, English teachers lobbied for reliable writing assessment that included students' writing. At the same time that ETS was working toward new, reliable writing assessments that would eventually produce holistic, analytic and primary trait scoring, educational measurement theorists headed by Cronbach and later joined by Messick would work to develop theories and practices for validity that reflect the importance and weight of decisions made upon data generated from tests and exams.

There has been much development of writing assessment procedures beyond holistic, analytic and primary trait scoring, with many new procedures that work toward having readers make a decision (like the basis for validity) rather than score a piece of writing. Reliability is still an important part of writing assessment because without consistency in judgment, decisions about students' writing would have more to do with who read the piece of writing than who wrote it. On the other hand, a current understanding of reliability positions it within the overall umbrella of validity, so each use of a writing assessment includes a check on the consistency of the decision-making among judges, if multiple judges for the same decision are used. Currently, we see and present in this volume a field of writing assessment that draws together the issues of reliability and validity we reviewed above with its insistence upon using student writing or student decision-making (self-directed placement asks for students to decide what course they would rather be placed in) as the basis for making decisions about students. Exploring the accuracy, appropriateness and consequences of decision-making based upon assessment should be a part of the use of a writing assessment. The articles and book chapters we present here provide an introduction to this view of writing assessment.

Depending upon a host of factors, the field of writing assessment can be constructed in various ways. We certainly faced the question—or problem—of how we wanted to construct the volume itself. Outside of what articles to choose, we faced the question of how we wanted to organize the volume. In the end, after much discussion, negotiation, and reviewer feedback, we settled on three sections: Foundations, Models, and Issues. This structure recognizes that there is a variety of important and necessary work in writing assessment and that people who will be using this volume will have different needs and will use the volume in different ways. For example, some will look for Models for writing assessments or focus on specific issues within writing assessment, while others will look for basic, foundational knowledge of writing assessment history, practice and theory. Like all categories, ours have some overlap and slippage, but like the volume itself, they argue for a specific version of the field in which an understanding of foundational issues is as important as providing a range of assessment options or forays into specific issues.

We chose Foundations as one of our categories after originally considering at least two separate categories, such as theory and history. While we believe that it is important for compositionists to know both writing assessment history and theory, we see this knowledge as foundational to other work they might undertake. These Foundations, as we call them, span such subjects as fundamental research comparing analytic, holistic, and primary trait scoring (Veal and Hudson), theoretical work on holistic scoring (White), reliability (Cherry and Meyer; Moss, 1994; Williamson), and validity (Huot; Moss, 1998), and the history of writing assessment as a field (Camp; Yancey). This section functions as an abbreviated introduction to the field and, along with our bibliography in the appendix, should provide interested writing teachers and program administrators with an introduction to the most important historical and theoretical issues in writing assessment.

If the Foundations section recognizes the need for a theoretical and historical introduction for anyone who wants to really understand the field of writing assessment, the Models category reminds us that assessment is something you do – that’s one of the reasons writing assessment appeals to us: it’s not just a field in which you read, write, and discuss. Writing assessment is an activity, a practice, as well as a theoretically rich scholarly field. Our selection for this section is hardly comprehensive – in fact, we left out several model programs we are very fond of, including work by Susanmarie Harrington, David Blakesley, Lisa Johnson-Shull, and Diane Kelly-Riley (and others from the Beyond Outcomes book edited by Rich Haswell – a volume we recommend to anyone looking for models[5]) and many others (see also Assessing Writing Across the Curriculum, and the Outcomes Book edited by Susanmarie Harrington, Keith Rhodes, Ruth Overman Fischer, and Rita Malenczyk for great models for writing assessment). We selected the ones we did to represent not just a variety of models but also, in many cases, ones that have been influential. William L. Smith was one of the first to really move beyond holistic scoring with his set of procedures that produced decisions directly with greater accuracy and a higher degree of interrater reliability; Smith’s work also furnishes a strong model for validating a writing assessment program (see also Hester et al., who adapt Smith’s procedures for writing assessment and validation over a six-year period). Haswell and Wyche-Smith also provide one of the first procedures to move beyond holistic scoring and beyond having each paper read twice. Durst, Roemer, and Schultz report on a model for exit testing that includes teams of teachers reading student portfolios. Royer and Gilles’ work on placement moves the decision to the student; their work has been highly influential, and we recommend that interested readers look at their edited collection. Haswell and McLeod model various strategies for reporting on writing across the curriculum assessment. Michael Carter’s piece provides a clear introduction to conducting an outcomes assessment for a writing across the curriculum program, though it could be easily adapted to any writing program. (Carter’s piece should be especially useful for the many writing teachers and program administrators responding to calls for outcomes assessment.) Together these models offer a solid introduction and resource for the various kinds of writing assessment. They should be useful for readers who want to understand how theories such as reliability and validity function in practice, and they demonstrate how to design site-based, context-sensitive, rhetorically informed, and accessible assessments that meet local needs. In other words, the assessment programs described in this section are not designed as models to be imported or mimicked but rather as samples of how writing assessments can be tailored for specific purposes, programs, and institutions.

The choice of Issues as our last category—and the choices within this category—illustrate how the field of writing assessment has changed over the last several years. In 1990 Brian published a bibliographic essay in Review of Educational Research in which he determined that the three main areas of research in writing assessment had been topic selection and task development, text and writing quality, and the influences on raters. Through the 1990s, however, researchers’ attention shifted, or rather expanded, to include many other topics beyond the technical concerns of procedures and practices. Our Issues category attempts to capture this wide range of topics, from the ways in which teachers construct student writing (Freedman) to the machine scoring of student essays (Williamson). While we could certainly devote an entire volume to the various issues now being addressed in writing assessment, the purpose of this category is to provide a strong introduction to such important issues as programmatic and extracurricular influences (Condon and Hamp-Lyons), gender (Haswell and Haswell), race (Ball), politics and power (Huot and Williamson), teaching and assessing English as a second language (Hamp-Lyons), and the tensions inherent in many assessment programs. The content of this section identifies some of the key issues that we need to attend to not just in designing and implementing writing assessments but also in reading and evaluating the results of our (or others’) assessments. This section can also serve readers as a starting place when discussing writing assessments with institutional administrators or testing staff, helping them frame questions and articulate concerns that may relate to a specific assessment decision. Certainly, we cannot cover all issues, though we hope to give readers a sense of many of the issues being addressed and of the wide range of issues that writing assessment touches. In addition to the articles listed here, we include an appendix with additional readings for each of the sections and an annotation in the Table of Contents for each of the pieces in the volume.

Choosing the pieces and deciding on the categories for this volume has been quite an adventure. This version is quite different from the one we sent out for review: that version was well over seven hundred pages, and we have since revised more than half of the selections and kept only one of the original categories. We learned much from the reviewers, who made several suggestions about the selections we should keep, add, or delete. The reviewers also pushed on our conceptions of this book. We could not have done the job we finally were able to do without the expert help of Linda Adler-Kassner, Eastern Michigan University; Chris Anson, North Carolina State University; Norbert Elliot, New Jersey Institute of Technology; Anne Herrington, University of Massachusetts Amherst; William Smith, retired from Oregon Health & Science University; and two anonymous reviewers. In some ways, this project was like a puzzle: “Pick X number of readings on your area of expertise that make a statement about the area within X number of pages.” As with all puzzles, we learned a lot and eventually arrived at a spot in which the pieces seemed to fit. We hope they fit for you, our readers. What is even clearer to us now than when we first started is that a sourcebook on writing assessment for those who teach writing and administer writing programs at the college level is an important and needed resource. We are grateful to Bedford/St. Martin’s for giving us the opportunity and support to produce such a volume and for assigning Leasa Burton to work with us. She and her assistant, Sarah Guariglia, have been insightful and supportive in too many ways for us to mention. This volume would not be as good without their help. We would also like to recognize and thank Joan Feinberg, Denise Wydra, Karen Henry, and Emily Berleth.

Brian Huot, Hudson, Ohio

Peggy O’Neill, Baltimore, Maryland

WORKS CITED

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association, 1999.

Angoff, William H. “Validity: An Evolving Concept.” Test Validity. Eds. Howard Wainer and Henry I. Braun. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988. 19-32.

Cherry, Roger and Paul Meyer. “Reliability Issues in Holistic Assessment.” Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations. Eds. Michael Williamson and Brian Huot. Cresskill, NJ: Hampton P, 1993. 109-141.

Cooper, Charles and Lee Odell, eds. Evaluating Writing: Describing, Measuring and Judging. Urbana, IL: NCTE, 1977.

Cronbach, Lee J. “Construct Validation after Thirty Years.” Intelligence Measurement, Theory and Public Policy: Proceedings of a Symposium in Honor of L. G. Humphreys. Ed. R. L. Linn. Urbana and Chicago: U of Illinois P, 1989. 147-171.

- - -. “Five Perspectives on Validity Argument.” Test Validity. Eds. H. Wainer and H. I. Braun. Hillsdale, NJ: Lawrence Erlbaum, 1988. 3-17.

Diederich, Paul B., John W. French, and Sydell T. Carlton. Factors in Judgments of Writing Quality. Princeton, NJ: Educational Testing Service, 1961. RB No. 61-15, ED 002 172.

Elliot, Norbert. On a Scale: A Social History of Writing Assessment in America. New York, NY: Peter Lang, 2005.

Fuess, Claude. The College Board: Its First Fifty Years. New York: College Entrance Examination Board, 1967.

Godshalk, Fred I., Frances Swineford, and William E. Coffman. The Measurement of Writing Ability. Princeton, NJ: Educational Testing Service, 1966. CEEB RM No. 6.

Huot, Brian. "The Literature of Direct Writing Assessment: Major Concerns and Prevailing

Trends." Review of Educational Research 60 (1990): 237-263.

- - -. (Re)Articulating Writing Assessment for Teaching and Learning. Logan, UT: Utah State UP, 2002.

Ittenbach, Richard F., Irvin G. Esters and Howard Wainer. “The History of Test Development.” Contemporary Intellectual Assessment: Theories, Tests and Issues. Eds. Dawn P. Flanagan, Judy L. Genshaft and Patti L. Harrison. New York: Guilford, 1997. 17-31.

Lloyd-Jones, Richard. “Primary Trait Scoring.” Evaluating Writing: Describing, Measuring and Judging. Eds. Charles R. Cooper and Lee Odell. Urbana, IL: NCTE, 1977.

Lynne, Patricia. Coming to Terms: A Theory of Writing Assessment. Logan, UT: Utah State UP, 2004.

Mayrhauser, Richard T. von. “The Mental Testing Community and Validity: A Prehistory.” Evolving Perspectives on the History of Psychology. Eds. Wade E. Pickren and Donald A. Dewsbury. Washington, DC: American Psychological Association, 2005. 303-321.

Messick, Samuel. “Meaning and Values in Test Validation: The Science and Ethics of Assessment.” Educational Researcher 18.2 (1989): 5-11.

O’Neill, Peggy, Ellen Schendel, Michael Williamson, and Brian Huot. “Assessment as Labor and the Labor of Assessment.” Labor, Writing Technologies, and the Shaping of Composition in the Academy. Eds. Pamela Takayoshi and Patricia Sullivan. Cresskill, NJ: Hampton P, 2007. 75-96.

Starch, D. and E. C. Elliott. “Reliability of the Grading of High School Work in English.” School Review 20 (1912): 442-457.

White, Edward M. Teaching and Assessing Writing. 2nd ed. San Francisco: Jossey-Bass, 1994.

Yancey, Kathleen Blake. “Looking Back as We Look Forward: Historicizing Writing Assessment.” College Composition and Communication 50 (1999): 483-503.



[1] These tests of usage and mechanics used to assess writing ability, and later generations like the COMPASS test (an untimed editing test delivered on a computer), were called indirect writing assessment, a term we refuse to use, since we believe that writing assessment must include students actually writing.

[2] These rates of reliability were often not high enough in a strict psychometric sense for the scores produced to stand by themselves (Camp; P. Cooper). The Godshalk study, for example, included scores from a multiple-choice test as well as holistically scored essays.

[3] For a discussion of the history of intelligence testing and its connection to the history of writing assessment, see the history chapter in a forthcoming volume on writing assessment for writing program administrators and writing teachers.

[4] This version of the Standards was the fifth to be produced since the 1950s; it is regularly updated every ten years or so to reflect the professional standards for educational and psychological testing.

[5] The Appendix includes the bibliographic information for this work and the others mentioned that are not included in the book’s contents.