Validity of Formative Assessment
This is part 4 in the AFL in Science symposium. A lot of my thoughts should be taken as tentative, provisional claims and should always be treated with healthy skepticism. @sputniksteve has written elsewhere about self-writing and I can’t claim to summarise the whole idea here, but I think people should see this blog as my public attempt to clarify some of the ideas I hold. You can read part 1 by Adam Boxer here, part 2 by Rosalind Walker here and part 3 by Niki Kaiser here.
The central question is “what makes a valid formative assessment in science?”. This is a question that deserves a slow unpacking, much like a game of pass-the-parcel. This piece will explore the issues generally and then move to the specific applications in science. I do not think this piece claims to be the ‘correct’ way to look at formative assessment in science, or indeed, more generally, formative assessment. I do hope, however, that this is a useful argument to have heard, and having heard it, teachers can be mindful of it on Friday last period.
The focus of this is on three questions:
- What is formative assessment?
- What does ‘validity’ mean?
- How can a claim to ‘validity’ be substantiated in the context of formative assessment?
1 What is formative assessment?
Much has been said about formative assessment, both its successes and its failures (Wiliam, 2011), but this series of blog posts treats formative assessment as a good thing. The evidence base is large and largely positive. Two formative assessment projects, KMOFAP and LHTL, were successful and explored the experiences of teachers and schools implementing formative assessment (Swaffield, 2011).
Being relatively assured of its efficacy, we can now begin to define formative assessment clearly. While I was spending a little time with an awarding body, someone remarked that “in every paper that deals with formative assessment there is a redefinition of formative assessment”. Hyperbolic, yes, but the proliferation of definitions with varying emphases has, in my opinion, only functioned to distort authentic formative assessment. In attempting to trace the canonical definitions across a range of papers by either Paul Black, Dylan Wiliam or both, I’ve come up with an illuminating (but no doubt incomplete) history:
‘Assessment’ refers to all those activities undertaken by teachers, and by the students in assessing themselves, which provide information to be used as feedback to modify the teaching and learning activities in which they are engaged. Such assessment becomes ‘formative assessment’ when the evidence is actually used to adapt the teaching to meet the needs. Black & Wiliam, 1998, p. 2 (Inside the black box)
An assessment is defined as serving a formative function when it elicits evidence that yields construct-referenced interpretations that form the basis of successful action in improving performance, whereas summative functions prioritize the consistency of meanings across contexts and individuals. Wiliam and Black, 1996 (Meanings and Consequences)
Practice in a classroom is formative to the extent that evidence about student achievement is elicited, interpreted, and used by teachers, learners, or their peers, to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions they would have taken in the absence of the evidence that was elicited. Black, P. J., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation, and Accountability, 21(1), 5–31.
As far as I know, the 2009 definition is the current canonical one. There is a striking difference between the definitions in the late 1990s and the 2009 definition, namely that success is no longer a requirement. There are five other key features in the latest definition:
1 Anyone can be the agent in formative assessment
2 Focus is on the decisions
3 Focus is on the next steps
4 Is probabilistic
5 The assessment does not need to change teacher action (Andrade and Cizek, 2010, p24-25)
I’m fairly certain that it is better to go with the 2009 definition than, say, the 1996 definition partly because the latter was subject to criticism by Newton (2007):
the supposed distinction between formative and summative is not grounded in the use to which assessment judgements are put. Why? Simply because there is no meaningful distinction to be drawn.
Newton’s (2007) distinctions around “assessment purpose” are useful in resisting the temptation to talk about summative and formative assessments as being on a continuum, where more of something or less of something shifts the assessment along a line. For Newton (2007), and for me, it is useful to categorise assessment purposes into three discourse levels: the judgment level (“you have a grade 5 in science”), the decision level (what decisions, actions or processes it supports) and finally the impact level (the intended impacts of the assessment). Newton’s argument is that “summative” is only a type of assessment judgement whilst “formative” is only a type of assessment use (the decision level). His argument is convincing enough that I too think that to talk of “summative” and “formative” within a single continuum is to commit a category error. For example, it makes very little sense to say that summative things (always) occur after an instructional unit and formative things (always) occur during an instructional unit.
2 What does validity mean?
This is not an easy question to answer at any level of detail. There are two approaches that can be taken here and neither is particularly satisfying. The first approach is to assert a definition and move on, and the other is an exploration of validity and how the myriad of approaches can illuminate the issue. The first approach would leave this analysis vulnerable to criticism stemming from the validity lens I have chosen. The second approach, done properly, would extend this blog post to hundreds of pages and far beyond what I know. However, despite the risks, an exploration is demanded at this stage. I will briefly summarise the range of thought and then offer a pragmatic argument on which approach ought to be adopted.
2a A century of debate
If I were being mean I would say that the best way to describe validity is that
“there is no widespread professional consensus concerning the best way to use the term” (Newton and Baird, 2016)
The history of validity can be loosely split into five periods. Newton and Shaw (2014) suggest that validity theory follows the periods:
1 Mid-1800s to 1920 gestational period
2 1921 to 1951 crystallisation
3 1952 to 1974 fragmentation
4 1975 to 1999 (re)unification
5 2000 and onward deconstruction
1 The roots of validity are North American and psychometric in nature. The first mental capabilities scales were constructed in the early 1900s and were greatly aided by advanced statistical methods such as the correlation coefficient. The period was not entirely defined by its reverence for the correlation coefficient, as a number of theorists made inroads to validity that look remarkably modern.
2 In the second period education came to sustain and house many of these measurement instruments. Educationalists had a powerful stake in issues of quality and control. In 1921, the North American National Association of Directors of Educational Research attempted a movement towards consensus. Validity was defined as “the degree to which a test measures what it is supposed to measure”. So this had implications for validation (the substantiation of validity claims). There were two approaches:
- Logical analysis of test content
- Empirical evidence of correlation.
The first method was pretty much experts coming to an agreement that yes, this particular question measures something that we intend to measure. The second method, which many preferred, was to gather evidence of the correlation between the test and what it was supposed to measure. This leads us to a key question: what exactly is the thing the test was “supposed to measure”? What should be used as the criterion to judge the accuracy of results? For many it was “expert” opinion. Or, if you had a really massive test that measured what you wanted, you’d find the correlation between the two. This resulted in horrendous warping. It’s a warning sign that the operationalization is, perhaps, the more important of the two questions.
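To make the second method concrete, here is a minimal sketch of the sort of calculation involved: the Pearson correlation coefficient between scores on a new test and scores on a criterion measure. The scores and the variable names `new_test` and `criterion` are invented for illustration; nothing here is taken from the 1921 work itself.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of scores has no spread")
    return cov / sqrt(var_x * var_y)


# Invented scores for six pupils (assumption, for illustration only):
# a short new test and a much longer criterion test.
new_test = [4, 7, 5, 9, 6, 8]
criterion = [40, 65, 52, 88, 60, 75]

r = pearson_r(new_test, criterion)
print(f"correlation with criterion: {r:.2f}")
```

A coefficient close to 1 only says that the two sets of scores rank pupils similarly. It says nothing about whether the criterion itself was a sensible choice, which is exactly the worry raised above.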
3 Moving to 1954, validity is characterised by the attempts to establish it. The 1954 Standards document produced by the APA stated the types of validity as: content, predictive, concurrent and construct. It’s here that I want to spend a moment on construct validity. Sometimes, there was no logical analysis or empirical evidence you could gather. This is because certain kinds of test (achievement) are evaluated according to their content (so content validity) and other kinds of test (aptitude) are evaluated according to a criterion, a yardstick measure, and this criterion-based validity subsumed predictive and concurrent validity. That is, was the aptitude test a good predictor, or did its conclusions match up empirically with a similar test you ran at the same time? For many personality tests you needed to determine what psychological construct accounted for test performance. By construct they meant the postulated attribute which was presumed to be manifest in test performance. This particular approach, according to Cronbach and Meehl, embraced anything, any evidence, logical or empirical, that could shed light on the psychological meaning of the test score. So it was not wholly logical or empirical, but scientific in nature.
4 This period is pretty exciting from a validity perspective. It is pretty much dominated by Samuel Messick. He stated that all validity was construct validity. For example, he argued that validation needed to demonstrate that the proficiency that each question was presumed to tap was actually tapped: that is, the performances were neither inflated nor deflated by irrelevant factors. A validator needed to be sure that the variance observed was attributable to construct-relevant factors and not to construct-irrelevant ones. For example, people scored highly on a science test because they were better at science and not because they liked hobnob biscuits. Messick promoted all validity as construct validity in one tightly packed statement:
“Validity is an overall evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment” (Messick 1995, p. 741)
A particularly interesting section concerns interpretations. People are fond of quoting Cronbach as saying
“One validates, not a test, but an interpretation of outcomes from a measurement procedure” (Cronbach, 1971: 447).
Furthermore, unlike the way authors of popular education books describe validity, test use has equal weighting. This, for me, is a strong argument that the nuance about validity is lost in popular educational books. Assessments culminate in decisions and actions, and these need to be taken into account in any argument about assessment quality.
But this truly is a laborious task. It’s a never-ending scientific endeavour to get more and yet more information. And what’s worse is that this is ultimately unhelpful: what evidence? And how do I make that judgement? The introduction of social and ethical consequences has also deeply divided the assessment community. By 1999 the Standards had all but copied Messick.
5 At the turn of the century there was a desire to step away from the laborious scientific approach of Messick and simplify validity theory. As well as this, I think there was a push to make validity useful. Michael Kane had been developing a methodology to support validation practice, and this methodology was grounded in argumentation (if you’re interested, I’ve written a piece on this for the Chartered College of Teaching in Issue 1 of Impact).
This is all, necessarily, a caricature of the last 100 years but it serves to illustrate the difficulty of staking a claim to validity. The claim rests on theory that probably has, somewhere in the literature, an opponent.
Whilst Messick advocated a laborious and almost never-ending investigation into test validity that I dislike (it’s highly impractical if only for the fact that he never says how to do this), his other contributions provide a useful framework to now interrogate formative assessment.
3 How can a claim to ‘validity’ be substantiated in the context of formative assessment?
It would not do to give up and not formalise where I stand on validity. I will doubtless regret this choice when I am older (and hopefully, wiser). It’s fairly uncontroversial to state that I think validity is not a property of a test but of the inferences or interpretations. What is controversial is to invite the role of social consequences and value implications into validity. Yet this is what Messick did. Messick simultaneously muddied the waters by conflating the quality of validity with the process of validation. It appears the only way to read Messick’s formulation of validity is to read it as a (concise but useless) manual on validation.
Messick’s work extends to a framework that conceptualises validity as four facets (1995, p748). The four facets are contained in a 2 x 2 matrix that subdivides validity into two. The rows are concerned with how the assessment is justified and the columns are concerned with what the assessment resulted in.
| |Test Interpretation|Test Use|
|---|---|---|
|Evidential Basis|Construct validity (A)|Construct validity and relevance/utility (B)|
|Consequential Basis|Construct validity and value implications (C)|Construct validity, relevance/utility and social consequences (D)|
(From Figure 1 of Messick 1995 pg 748)
This is perhaps one of the most frustrating 2 x 2 matrices this physicist has ever encountered. Messick’s writing is thick, difficult and impenetrable. And to make matters worse, it’s arguable that Messick did not have a coherent position on what his matrix meant throughout his academic life (Newton and Shaw, 2014, p121-129). I am not sure why Dylan Wiliam (2000) relies on Messick’s progressive matrix to invoke support for consequential validity given that Messick himself (1995) admitted the distinctions were fuzzy, interlinked, messy and overlapping. Not only that, he stated “to interpret a test is to use it, and all other test uses involve interpretation either explicitly or tacitly”. In short, we have a unitary concept split into four facets, which goes against the unitary nature of the concept!
The 2009 definition does not need a Messick-inspired defence. Formative assessment operates on the decision level of discourse and it issues a statement not on the interpretations but the consequences of the decisions. In this respect, the validity of formative assessment is whether the decision really does what it is claiming.
I don’t think we need to enter into a messy argument. I think we just need to ask “does it do what it says on the tin?”. In this sense, how I (and others, I’m sure!) view the validity of formative assessment is more akin to the earlier classical definitions of test validity: that it does what it purports to do.
But what of validating formative assessment? This validity inquiry answers the question:
“Did the thing done induce decisions about the next steps in instruction that were likely to be better (at the time), or better founded, than the decisions teachers or students would have taken in the absence of the elicited evidence?”
This is an extraordinarily difficult inquiry, partly because of the counterfactual, and we run into the same problems as Messick’s never-ending scientific enquiry. It’s here that we should remind ourselves of the scale of the practice we see in classrooms day in, day out. Formative assessment is what makes the classroom teacher. It is arguably inseparable from their identity as a teacher and it is probably the business of teaching itself.
The science classroom on Friday, period 5 – personal reflections
The one thing missing in the discussion above is what is meant by a “better” decision. I spoke with an NQT once who was worried that their students were not learning. I don’t think this is the right worry for the vast majority of teachers. All students will learn something. The real question is how fast they learn that something. Time is what we should optimise for. If we were much, much, much longer lived beings then I don’t think the arguments around effect sizes and what-works would be as strong.
In light of this, the validation inquiry, at least in my mind, is centred around a key question:
Does my thing that I do get me to change the instruction faster for the students in front of me?
In other words: to what extent does the thing that I do allow me to respond?
The paper shared by Adam hits home on the disciplinary aspect of formative assessment and urges us to pay attention to what is being said. Take for instance this teacher-student interaction observed during my time (anonymised and adjusted to ensure anonymity):
Teacher: So Eve, does the moon reflect light?
Student: The Moon is bright so it’s a source and it also reflects light.
Teacher: Ok, so it is a good reflector, Abel …
Clearly the student learnt something but at the rate that was going it would be a long time before the students had the right mental schema of light. The consequence of the decision was that the student left with a substantially different concept of what a source was. And worryingly, a concept they felt was entirely consistent with what they had experienced. The arguments surrounding the need for a deeper and more substantive look at the content of teacher-student talk are better explored in the paper mentioned by Adam.
The implications of this general preoccupation with decision making could fill whole books and pamphlets. Having said that, there are specific implications for science and the science classroom. In later blog posts I will expand on a series of questions to think about when planning, but for now I think three steps inspired by my own practice are useful.
1 As the majority of science teachers teach outside their specialism (especially in the concept-heavy KS3), the quality of their decision making needs support.
2 Procedures and processes that get the most valuable data should be prioritized. Asking children to spend precious time creating a low-impact poster probably does not reach a data-rate threshold that most teachers would be comfortable with. It’s probably safe to say that an arsenal of multiple-choice questions is a must for science teachers. Automatic correct-incorrect judgments on MCQs are possible with apps such as ZipGrade.
3 Departments should invest time in creating “formative-response” sheets. These sheets are collaborative efforts in recording the discussions staff have on the best ways to respond to student errors, slips and misconceptions. The Institute of Physics has, to some extent, begun to do this with their PIPER Project.
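As a rough sketch of how steps 2 and 3 might work together, the snippet below tallies the answers to a single multiple-choice question and looks up the most common wrong answer in a “formative-response” sheet. Everything in it is hypothetical: the question, the pupil names, the wording of the sheet and the names `tally_responses` and `response_sheet` are mine, and it is not how ZipGrade or the PIPER Project actually work.

```python
from collections import Counter


def tally_responses(responses, correct_option):
    """Count how many pupils chose each option and the share who were correct."""
    counts = Counter(responses.values())
    total = len(responses)
    share_correct = counts[correct_option] / total if total else 0.0
    return counts, share_correct


# Hypothetical question: "Which of these is a source of light?"
# A = the Moon, B = the Sun, C = a mirror
CORRECT = "B"
responses = {"pupil_1": "A", "pupil_2": "B", "pupil_3": "A", "pupil_4": "B", "pupil_5": "C"}

# Hypothetical department sheet: an agreed response for each wrong option.
response_sheet = {
    "A": "Revisit source versus reflector using the Moon and the Sun.",
    "C": "Show that a mirror in a dark room gives out no light of its own.",
}

counts, share_correct = tally_responses(responses, CORRECT)

# Most frequently chosen wrong option (None if everyone was correct).
most_common_wrong = next(
    (option for option, _ in counts.most_common() if option != CORRECT), None
)

print(f"share correct: {share_correct:.0%}")
if most_common_wrong is not None:
    print(f"most common wrong answer: {most_common_wrong}")
    print(f"agreed response: {response_sheet[most_common_wrong]}")
```

The point of the sketch is speed: the tally is available the moment the answers are in, and the agreed response was thought through by the department beforehand rather than improvised on Friday, period 5.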
None of this is particularly earth-shattering or, indeed, new, but I do hope that, armed with an appreciation of validity in a formative context, the next decisions you take are better (or better founded).
Andrade, H. and Cizek, G. (2010). Handbook of formative assessment. New York [N.Y.]: Routledge.
Cronbach, L. (1971). Test Validation. In: R. Thorndike, ed., Educational Measurement. pp.443-507.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), pp.741-749.
Newton, P. (2007). Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy & Practice, 14(2), pp.149-170.
Newton, P. and Baird, J. (2016). The great validity debate. Assessment in Education: Principles, Policy & Practice, 23(2), pp.173-177.
Newton, P. and Shaw, S. (2014). Validity in educational & psychological assessment. Los Angeles, Calif: SAGE.
Swaffield, S. (2011). Getting to the heart of authentic Assessment for Learning. Assessment in Education: Principles, Policy & Practice, 18(4), pp.433-449.
Wiliam, D. (2000). Education: The meanings and consequences of educational assessments. Critical Quarterly, 42(1), pp.105-127.