What educator hasn’t sometimes felt frustrated at the overreliance on test scores to measure the success of their students and their teaching?
Today’s guest post explores alternatives to that approach.
‘Expanding Rigor’
James Soland is an associate professor of research, statistics, and evaluation at the University of Virginia School of Education and Human Development whose work focuses on assessment (psychometrics), evaluation, and data use:
Educators know the drill: A new program rolls out, someone gathers test scores before and after, and they are told whether it “worked.” But what does that really tell us? Does it help improve teaching? Does it help educators and policymakers understand why it worked here but not there? Does it help us understand what students actually gained and experienced?
As I have discussed elsewhere, the field of education evaluation is currently fixated on one question, “what works?”, narrowly defined as whether a program causes measurable improvements in things like test scores. That is certainly valuable at times, but it misses too much of what matters in schools. It privileges what we can easily measure over what we ought to understand. And it treats school contexts as interchangeable backdrops rather than vital elements of success.
As teachers and school leaders, you know that learning is messy, local, and human. It unfolds in specific classrooms with specific students and adults doing real work together. The current dominant approach, evaluating programs like black boxes and judging them by a narrow set of outcomes, doesn’t capture that reality. It leaves out teacher expertise, student experience, school culture, and the many contextual conditions that make a strategy effective here but not somewhere else.
We need evaluation that learns from you, the educators in the trenches, not just about students’ test results. You are the people who see when a strategy sparks student curiosity, when it bumps up against local realities, or when it might not work when applied to a different set of students. You also know when a program is truly moving the needle, versus being done purely out of compliance.
What’s the problem with “black box” evaluations?
Most rigorous program evaluations focus on isolating causal effects: “Did X cause Y?” That’s mostly done with experiments or quasi-experiments using standardized outcomes that policymakers and researchers can compare across time and place. But this approach has two big limitations:
- It favors outcomes that are easy to quantify, like state test scores, over equally important outcomes that are harder to measure, like critical thinking, collaboration, or students’ sense of belonging. Those latter components are often at the heart of teachers’ daily decisions and matter deeply for long-term learning.
- It treats context (the school, community norms, teacher skills, resources, and culture) as something to control away, rather than a source of insight about how and why something works.
This narrow focus leads to evaluations that feel detached from reality. They tell you whether something worked somewhere but not why it worked, how it worked, and under what conditions it might work in your own context.
So what might a better evaluation look like? Let’s use an example of a socio-emotional intervention (such as to boost growth mindset or self-management skills) designed to improve that socio-emotional competency and, thereby, also improve achievement. Here are some key elements:
1. Broaden outcomes beyond test scores.
Standardized tests capture important academic skills, but they miss socio-emotional growth, critical reasoning, cultural competence, and other dimensions of learning that teachers nurture every day. When evaluation counts these too, even if they’re harder to quantify, it aligns more closely with what matters for students.
In the socio-emotional-competency example, it would mean not only looking at achievement gains but also at changes in self-management or growth mindset, ideally using a survey measure designed to understand change over time. It would further involve asking teachers whether they think the intervention actually improved the competency or if it was more likely a measurement artifact (e.g., students better anticipating the “correct” answer on the survey after the intervention).
2. Mix numbers with narratives.
Rigorous causal work has its place, but it should sit next to rich qualitative evidence. That means intentionally gathering teacher perspectives, student voices, and descriptions of administrator experiences. Qualitative research has often been peripheral in program evaluation, but it helps us understand mechanisms (the how and why of what works), not just the if. In the case of the socio-emotional-competency intervention, interviews with teachers would ask if the teacher felt there was a valid, causal chain where the intervention increased self-management or growth mindset and that improvement then caused achievement gains.
They would talk about whether the intervention is easy enough to implement that it could be part of common practice. If the intervention did not show gains (in the competency or achievement), teachers would provide qualitative data on why not.
3. Make context part of the question, not something to control away.
Instead of treating local conditions as noise, good evaluation treats them as data. Knowing how a rural school engaged parents or how a multilingual classroom adapted a reading program can teach us about transportability and adaptation.
For socio-emotional competencies, that could look like asking teachers whether they felt there was support for the intervention (e.g., sufficient time to implement it well), if there were bureaucratic hurdles, if it did or did not work for students with particular learning challenges (e.g., students with a particular IEP), how their particular school and setting affected outcomes, etc.
4. Use mechanisms to guide improvement.
Instead of only reporting that an intervention “worked,” evaluations should articulate how it produced results. Was it because teachers had more collaboration time? Because students engaged more deeply with texts that reflected their lived experience? Because instructional coaching supported risk-taking? Because teachers recognized the value and bought into the strategy? These mechanisms, not just outcomes, are the critical lessons for replication and improvement.
In the socio-emotional-competency case, all of these mechanisms could emerge during teacher interviews, surveys, focus groups, or whatever was the most efficient use of their time. (Obviously, these additional data would be collected mainly in large-scale evaluations with sufficient resources to compensate teachers fairly.)
Putting contexts and conditions front and center with help from teachers
Teachers are constantly evaluating: They watch a lesson unfold, notice signs of engagement, troubleshoot misconceptions, and decide what to try next. That expertise, grounded in context and informed by deep knowledge of students, needs to be part of how we study educational effectiveness. By expanding our definition of evidence and privileging why and how as much as whether, we make evaluation more useful.
And I believe we can, by expanding the aperture of the questions we ask, just maybe, move past a hyper-fixation on test scores.
Moving beyond test scores and black boxes doesn’t mean abandoning rigor. It means expanding rigor to include modes of inquiry that respect complexity without losing clarity. It means building evaluation systems that help educators and policymakers learn what works for whom, why it works, and what to do next.
Thanks to James for contributing his thoughts.
Consider contributing a question to be answered in a future post. You can send one to me at lferlazzo@epe.org. When you send it in, let me know if I can use your real name if it’s selected or if you’d prefer remaining anonymous and have a pseudonym in mind.
You can also contact me on X or on Bluesky.
Just a reminder: you can subscribe and receive updates from this blog. And if you missed any of the highlights from the first 13 years of this blog, you can see a categorized list here.