» NPR: Teacher Evaluation Dispute Echoes Beyond Chicago
In my defense, I was trapped in a car for a road trip when I heard the NPR story referenced above. But one quote jumped out at me when I heard the story …
Experts concede that teacher evaluation formulas are still a work in progress. But Dan Goldhaber, director of the Center for Education Data and Research at the University of Washington, Bothell, says algorithms have now become very sophisticated. They measure student improvement, not just scores, and they adjust for everything from socioeconomic factors to class size.
Maybe I’m not as well-versed in recent advances in mathematics, but algorithms themselves aren’t new, and I have a hard time accepting that “algorithms have now become very sophisticated.” I find it far more believable that the data being fed into those algorithms has grown. So I’m not sure whether that’s what was meant by this quote or not.
But it gets to the lingering problem I have with treating this type of evaluation as the be-all, end-all for identifying quality teachers. At least in my days as a student, we were told what percent the midterm would count toward our final grade, what percent the final exam would count, what percent quizzes and homework would count, and whether the teacher was going to grade on some definition of “improvement” over the semester or on a straight mathematical accounting of those inputs. But in the drive to stuff a host of new inputs into an algorithm that defines whether a teacher is doing well or badly … I’d be willing to bet that even most math teachers couldn’t explain the formula.
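To make the contrast concrete, here is a minimal sketch of the kind of transparent grading scheme I’m describing. The component names, weights, and scores are hypothetical, not drawn from any actual syllabus; the point is only that every input and its weight is announced up front, so anyone can reproduce the result.

```python
# A hypothetical, fully transparent grading scheme: every input and its
# weight is announced at the start of the semester.
WEIGHTS = {
    "midterm": 0.25,
    "final_exam": 0.35,
    "quizzes": 0.20,
    "homework": 0.20,
}

def final_grade(scores: dict[str, float]) -> float:
    """Weighted average of the announced components (each score is 0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)

# Example student: any teacher (or student) can check this number by hand.
print(final_grade({"midterm": 82, "final_exam": 90, "quizzes": 75, "homework": 95}))
# -> 86.0
```

That is the standard a grading formula for students has long had to meet; the evaluation formulas for teachers described below don’t come close.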
That brings me to this story …
» Washington Post: New teacher evaluations start to hurt students
The shortcomings of evaluating teachers by test scores were apparent in the recent report of the American Institute for Research (AIR), which developed the New York growth score model. AIR, in its BETA report, shows how as the percentage of students with disabilities and students of poverty in a class or school increases, the average teacher or principal growth score decreases. In short, the larger the share of such students, the more the teacher and principal are disadvantaged by the model. I predict that when the state results are made public, you will see a disproportionate amount of teachers of students with serious learning disabilities and teachers in schools with high levels of poverty labeled ineffective on scores. And that label will be unfair.
Likewise, in the model used this year, teachers who have students whose prior test scores were higher were advantaged, while teachers whose students have lower prior achievement were disadvantaged. This phenomenon, known as peer effects, has been observed in the literature since the 1980s. There is no control for peer effects in the model. We will see patterns of low scores for teachers of disadvantaged students. Over time, the students who need the best teachers and principals will see them leave their schools in order to escape the ‘ineffective’ label.
Perhaps the best critique of the model comes from AIR itself. The BETA report concludes that “the model selected to estimate growth scores for New York State represents a first effort to produce fair and accurate estimates of individual teacher and principal effectiveness based on a limited set of data” (p. 35). Not “our best attempt,” not even a “good first attempt,” but rather a “first effort” at fairness.
And yet, across the state, teachers and principals have received scores telling them that they are ineffective in producing student learning growth.
I’m all for generating as much data as possible and using that data to test theories about what works and doesn’t work in schools. But there’s a tendency to look at mounds of data and assume that a sound conclusion can instantly be derived from it all.
Unless those allegedly new-and-improved algorithms fairly account for the factors outside a teacher’s control, and do so in a rigorous, proven way, building a teacher scoring system on top of them strikes me as a flawed attempt at accountability.
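For what it’s worth, the kind of adjustment Goldhaber describes is typically done with a regression: predict each student’s score from their prior score and external factors, then credit the teacher only with how far their students landed above or below the prediction. The sketch below is a toy illustration of that idea under my own assumptions, not the New York growth model or any vendor’s actual formula; the student data and teacher labels are made up.

```python
import numpy as np

# Toy illustration of a covariate-adjusted "growth" score, NOT any state's
# actual model. Each row of X: [prior_score, poverty_rate, class_size];
# y holds current-year scores; teacher labels who taught each student.
X = np.array([
    [62, 0.80, 30], [58, 0.85, 31], [65, 0.78, 29],   # teacher A's students
    [88, 0.10, 22], [91, 0.12, 21], [85, 0.15, 23],   # teacher B's students
], dtype=float)
y = np.array([64, 60, 68, 93, 95, 90], dtype=float)
teacher = np.array(["A", "A", "A", "B", "B", "B"])

# Fit an expected score from prior score and external factors (ordinary least squares).
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
expected = design @ coef

# A teacher's "growth score" is the mean residual of their students:
# how much better (or worse) they did than the model predicted.
for t in ("A", "B"):
    residual = (y - expected)[teacher == t].mean()
    print(f"teacher {t}: mean residual {residual:+.2f}")
```

The exercise shows where the fight actually lives: whether a score like this is “fair” depends entirely on which external factors make it into that design matrix. Leave out one that differs systematically between classrooms, such as poverty or disability rates, and the gap reappears in the teachers’ scores, which is exactly the pattern the AIR report describes.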
ADD-ON: Then again, do the algorithms account for per-student funding?