This theme is important to selling the plan, because even the reformyists acknowledge that teaching to the test will stifle children's learning. They also inherently acknowledge that taking a single test on a single day is going to be prone to error (especially if the test is full of nonsensical questions about rabbits and pineapples, or poorly formed math problems).
What these folks don't want to acknowledge is that even if standardized tests are a small part of the evaluation, they can still become a huge part of the high-stakes decision that follows. It doesn't really matter much if the numerical rating a teacher gets, based on her students' standardized test scores, is only 50% or 30% or 20% of her evaluation; the very nature of its phony "precision" gives that score far more weight when it comes time to make an actual decision based on the evaluation.
No metaphor is perfect, but let's try this:
Suppose we are going to rate and rank a group of restaurants. We need to come up with a system to evaluate each one, but we have to be able to show that the restaurant ranked #1 is better than the one ranked #2, because we'll only be giving a merit pay bonus to the manager of the "best" one. We'll also be firing the manager of the worst one.
We decide to take an approach that uses multiple measures. For 50% of the evaluation, we'll do an "observation" of each restaurant by having a meal there. Let's use a rubric: one that rates different parts of the restaurant experience - food, service, decor, etc. - on a scale of "Unsatisfactory, Basic, Proficient, Distinguished." We're looking to get an overall sense of the restaurant here; we're also looking to help the restaurant get better by having the manager reflect on the things we say are important for its success.
For the other 50% of the evaluation, we're going to look at something absolutely critical: the amount of time it takes to get our food served. Everyone agrees that a good restaurant must serve the food promptly; there is simply no dispute here. It is a foundational characteristic of dining that customers must be served promptly.
Here's what we'll do: for one night only, we'll track how much time it takes each customer to receive their food. It doesn't matter what they order, it doesn't matter if it's a busy night, and it doesn't matter if the staff has problems that night that distract them: we're going to time every order and come up with a precise average.
Let's say two of our restaurants are "Distinguished," but we're only going to give that merit pay bonus to one manager. How do we choose which one? Well, luckily for us, we have a highly precise variable to look at: average service time. So we look at which of the two had the better time between ordering and serving on that one particular night:
- Chez Michelle: 10 minutes, 14.7 seconds.
- Chez Diane: 10 minutes, 22.8 seconds.
Well, there you go: Chez Michelle wins. It's obviously a better restaurant than Chez Diane, right?
The problems with this are obvious. First: is the 8.1-second difference between the two service times really that important? Couldn't it be explained away by any number of factors, including pure chance? Does it matter at all?
Second: we have now given one metric all of the weight in the decision. In fact, we could make it only 10% of the evaluation, and it would still be all of the decision.
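If you like, you can simulate this. Here's a quick sketch in Python (all of the numbers and restaurant names are invented for illustration): two restaurants with identical rubric ratings and identical true service speed, "evaluated" on thousands of single nights. Whoever posts the lower time that night "wins" - and it's a coin flip.

```python
import random

random.seed(1)

# Both restaurants get the same rubric rating ("Distinguished"),
# so the categorical half of the evaluation is a tie.
rubric = {"Chez Michelle": "Distinguished", "Chez Diane": "Distinguished"}

# True underlying performance is identical: a 10-minute (600-second)
# average service time, with about a minute of night-to-night noise.
TRUE_MEAN, NOISE = 600.0, 60.0

def observed_time():
    """Average service time as measured on a single night, in seconds."""
    return random.gauss(TRUE_MEAN, NOISE)

# Re-run the one-night evaluation many times and see who wins the tie-break.
wins = {name: 0 for name in rubric}
for _ in range(10_000):
    times = {name: observed_time() for name in rubric}
    winner = min(times, key=times.get)  # lower time "wins" the bonus
    wins[winner] += 1

print(wins)  # each restaurant wins roughly half the time
```

Two identical restaurants, and the "precise" metric hands out the bonus essentially at random. The rubric never enters into it, because the rubric can't break ties and the stopwatch always can.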
Chez Michelle may be a "Distinguished" restaurant, but there is no reason to believe it is any better than Chez Diane. In fact, the food, the decor, and the service may be better at Chez Diane. Of course, that may be a matter of taste - which is kind of the point, isn't it?
Let's look at the other end: we're going to fire the manager of the worst restaurant. Two eateries are "Unsatisfactory":
- Chez Eli: 38 minutes, 12 seconds
- Chez Whitney: 38 minutes, 15 seconds
Well, that solves that, right? Congrats, Chez Eli; you live to fight another year. Of course, Chez Whitney may have decent food while Chez Eli's is toxic, but we can't really tell that, can we?
You can see how we've lulled ourselves into thinking we have a precise metric when we really don't. Because the review (the "observation") of the restaurant is so imprecise, it becomes far less important than the very precise timing of when the food is served. It may only be part of the evaluation, but the decision to fire the manager is based solely on service speed.
This is the central problem with using standardized test scores in teacher evaluations: we are making high-stakes decisions with a phony sense of precision. In our restaurant analogy, the imprecision of the review puts all of the decision into the time it takes to serve the food. In teacher evaluations, the imprecision of classroom observations puts all of the decision into standardized test scores.
And it doesn't matter if you use Value-Added Modeling (VAM) or Student Growth Percentiles (SGP) to rate the teacher: each conveys a level of precision that forces it to become all of the decision.
It is worth noting that my metaphor falls apart on one important point: the time between ordering and service at a restaurant can be precisely and accurately measured. But what if the watch we used had error rates of 25%-35%? Methods of teacher evaluation using standardized tests have these kinds of error rates, and yet we let them drive all of the decision.
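What does an error rate like that do to a ranking? Here's a hypothetical sketch (the numbers are invented to illustrate the idea, not drawn from any actual VAM study): one teacher is truly somewhat more effective than another, but the test-based measure is noisy enough that it puts them in the wrong order a large share of the time.

```python
import random

random.seed(2)

# Hypothetical: teacher A is truly a bit more effective than teacher B,
# but the measure's noise is large relative to the true gap between them.
TRUE_A, TRUE_B, NOISE = 52.0, 48.0, 7.5

TRIALS = 10_000
flips = 0
for _ in range(TRIALS):
    measured_a = random.gauss(TRUE_A, NOISE)
    measured_b = random.gauss(TRUE_B, NOISE)
    if measured_b > measured_a:  # the measure ranks them backwards
        flips += 1

print(f"misranked in about {100 * flips / TRIALS:.0f}% of trials")
```

With noise on this scale, the measure reverses the true order in roughly a third of trials - and yet a layoff protocol would treat each single measurement as gospel.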
Which brings us to seniority. Sure, everyone would like for a more-senior teacher who isn't as good to be laid off before a less-senior teacher who may be excellent. But let's not kid ourselves: we can't measure effectiveness with the same precision as seniority. It may not be ideal, but at least we can say with precision that one teacher has been on the job for 9 years and another for 10. We simply can't make those fine distinctions in effectiveness, even though the precision of the measurement falsely implies that we can.
Now, there's still going to be a great temptation for the reformyist to want to justify making the decision solely on effectiveness anyway. What she'll most likely try is to give the observation - in my analogy, the review of the restaurant that leads to one of four categories - more precision by averaging a series of observations. This is what Michelle Rhee did in Washington D.C. with the IMPACT system.
But that precision is even phonier than the one based on tests. More on this later.
EXTRA READING: This was the "Decisions for Dummies" version; now read the real work, courtesy of Bruce Baker:
Fire first, ask questions later? Comments on Recent Teacher Effectiveness Studies
Follow up on Fire First, Ask Questions Later
The problem – and a very big one – is that states (and districts) are actually mandating rigid use of these metrics including proposing that these metrics be used in layoff protocols (quality based RIF) – essentially deselection. Yes, most states are saying “use test-score based measures for 50%” and use other stuff for the other half. And political supporters are arguing – “no-one is saying to use test scores as the only measure.” The reality is that when you put a rigid metric (and policymakers will ignore those error bands) into an evaluation protocol and combine it with less rigid, less quantified other measures the rigid metric will invariably become the tipping factor. It may be 50% of the protocol, but will drive 100% of the decision.