Evaluate

At this point you should have:

A description of the problem you want to solve
A decision about who you want to help
A deep understanding of the problem and the people it affects
Knowledge of the causes of the problem and its context
A description of what success looks like
A survey of stakeholders and other influences
Your chosen approach to solving the problem
A theory of change for your solution
An understanding of the behaviors that are needed for your theory of change to work
Recognition of how your strategy might go wrong
A set of indicators and targets to assess your progress
The ultimate outcome of your program

Intro

Success means that your ultimate outcome met or exceeded your targets. Or does it?

Earlier you learned about the importance of collecting feedback and assessing progress as you implement your theory of change. Evaluation goes a step further to ask: If the ultimate outcome of your theory of change occurred, did it happen because of your work or unrelated factors?

In this step, you will discover why evaluating impact is critically important for your social enterprise, your beneficiaries, and other stakeholders. You’ll also learn some fundamental concepts—along with basic approaches to evaluation—that will help you hire and communicate with an expert evaluator.

Until now, everything you’ve learned in this module has prepared you to personally engage in developing a strategy for social change from start to finish. In contrast, program evaluation is a highly technical skill that requires considerable statistical knowledge and field experience to do well—which is why you need an outside expert.

Evaluating Your Impact

A pack of wolves howls at the moon to try to make it disappear so that the sun can rise. The moon goes down and the sun comes up, but one skeptical wolf asks the key question: did the howling have anything to do with it?

If the howling wolves only measured progress, they would conclude that they achieved their desired outcome of making the sun rise. But the skeptical wolf suggests that their activity had no impact on that outcome. This illustrates the difference between measuring outcomes and assessing impact—which is the task of program evaluation.

Measuring the Impact of a Social Enterprise

Are you wondering how the skeptical wolf’s question applies to a more human-centered program like yours?

Consider a social enterprise that prepares people released from prison for stable jobs. Without the program, an average of 30% of ex-offenders are employed one year after they are released. The organization’s outcome target is that 45% of its clients will be employed.

A year after the program is launched, 45% of its clients are indeed employed. Success? Suppose that severe immigration restrictions were imposed during the year. The businesses where the ex-offenders work were suddenly in desperate need of employees, and would have hired the ex-offenders even without the social enterprise’s job training.

For your program to have impact means that it is causing, or contributing to, an outcome that would not occur—or not occur to as great an extent—without your activities. To use a technical term, the “counterfactual” is what would have happened without your intervention. Impact means that your program has, in effect, beaten the counterfactual.
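To make the idea of beating the counterfactual concrete, here is a minimal sketch in Python. The 45% figure comes from the example above. The two counterfactual rates are assumptions for illustration only, since in practice the counterfactual can never be observed directly and has to be estimated.

```python
def estimated_impact(observed_rate: float, counterfactual_rate: float) -> float:
    """Impact is the observed outcome minus what would have happened anyway."""
    return observed_rate - counterfactual_rate


observed = 0.45  # share of clients employed after one year (from the example)

# Assumption 1: nothing else changed, so the historical 30% is the counterfactual.
naive = estimated_impact(observed, counterfactual_rate=0.30)

# Assumption 2 (hypothetical): the labor shortage alone would have lifted
# employment to 45%, so the program added nothing.
adjusted = estimated_impact(observed, counterfactual_rate=0.45)

print(f"Against the historical baseline: {naive * 100:.0f} percentage points")
print(f"Against the labor-shortage counterfactual: {adjusted * 100:.0f} percentage points")
```

The same outcome yields very different impact estimates depending on the counterfactual, which is why so much of evaluation is about estimating the counterfactual well.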

Causation or Correlation?

Social outcomes inevitably have multiple causes. So your program may be a cause of an outcome, but not necessarily the sole cause. For most programs, it’s more accurate to say that they contributed to some extent to the outcome.

For one event to cause another necessarily means that the events are correlated (or connected). But the fact that events are correlated does not mean that one caused the other.

Imagine that it’s a hot sunny day and you’re at the beach. People are getting sunburns and are eating ice cream—sometimes the same people. Perhaps getting sunburnt causes people to eat ice cream to cool off. Or perhaps eating ice cream causes people to get sunburns—they feel so cool they hang out in the sun. Of course you know that something else is at work. While sunburns and ice cream eating are correlated, both are caused by another variable: the weather.

In this example correlation and causation are easy to distinguish. But in more complex situations—such as the complicated system surrounding the social problem you’re trying to solve—it’s not so easy to determine causation. That’s the goal of program evaluation.

Why You Should Measure Impact

Program evaluation takes time, money, and a lot of work. So why bother? Because your program is consuming human and financial resources—and you, your stakeholders, and your funders will want to know whether it’s actually making a difference in improving people’s lives.

Evaluations of your program will be valuable for future social entrepreneurs considering whether to launch similar programs. In designing your own program, evaluations indicating that similar programs have or have not worked can be an important guide to developing your theory of change. Sound evaluations are the core of what people mean by strategies informed by evidence.

Who Should Measure Impact and When

Determining your program’s impact—whether the program accomplished better results than would have happened without it—is a complicated task. While monitoring the progress of a strategy can be done by an organization’s own staff, impact evaluation typically requires an outside expert. Most program staff don’t have the necessary expertise, and the findings are more credible to stakeholders (including funders) when someone outside your organization evaluates the program.

Measuring impact is an activity you do after your program is operational. The timing of evaluation is different for every program. As a general rule, you can dive into the process when you’ve operated the program long enough at scale to collect adequate data on program outcomes. But it’s a mistake to wait until after you develop and implement a program to start planning for evaluation. In most cases, it’s impossible to assemble the necessary data retroactively.

A professional evaluator will use some of the same feedback that you collect to monitor your progress along the theory of change. That’s why it’s so critical to have a good set of monitoring indicators on day one, combined with a reliable system for collecting the data. Feedback is invaluable for the evaluation process as well as for informing your ongoing decision-making.

Basic Evaluation Techniques

Now let’s explore four basic techniques that are used to evaluate a program’s impact.

1 Before/After

As the name suggests, the before/after technique assesses impact by comparing outcomes before and after a program was implemented.

When inferring causation from a before/after test, it’s important to think of all factors besides the intervention that might account for the result. In considering the spectacular results of the job placement program for ex-offenders, an evaluator would look for evidence that something other than the program—such as a sudden increase in the demand for employees—might have contributed to the outcome.

Here’s another example: Like many other U.S. cities in the 1990s, New York City had a high rate of homicides, robberies, and other serious crime. It also had many petty offenses, such as smoking marijuana on the streets and spraying graffiti on walls.

In 1994 Mayor Rudy Giuliani appointed William Bratton as Police Commissioner, and Bratton adopted the “Broken Windows” approach to policing. The basic idea is that when petty offenses go unpunished, it creates an environment that encourages serious offenses. So in an effort to prevent serious felonies, the police department made many arrests for petty offenses.

Between 1994 and 2000, the rate of serious felonies dropped drastically. Mayor Giuliani and Commissioner Bratton took credit for the decline and attributed it to the Broken Windows approach. The trend data seems to support the theory that their approach caused the reduction in crime.

But does this tell the whole story?

One thing an evaluator would want to see is the trend before Bratton became police commissioner. Looking back a few years, the data shows the crime rates steadily declining. This suggests that the rates were declining for some reason other than the commissioner’s use of the Broken Windows approach.

It’s not unusual for a new policy to be adopted at a time when the problem it seeks to address seems atypically severe and might diminish even without the policy. This is known as regression to the mean. It captures the idea that after a particularly high or low point, an indicator is likely to move back to its average or “mean.”

For example, in 1955 a record 324 people were killed in automobile accidents on Connecticut highways. The state addressed the problem by imposing harsh penalties on people who were caught speeding. A year later, the number of fatalities dropped to 284.

The governor boasted that “Connecticut has succeeded in stopping the upward surge in highway deaths … by enforcing the law.” Perhaps so, but a look at the data suggests that regression to the mean may have been at work.

In any event, the large fluctuations of fatalities during the years surrounding 1955 make it difficult to identify the effect of the policy. Such difficulties don’t mean that before/after comparisons can’t be informative. For one thing, many interventions occur when the problem is at its usual level, not a peak. The main question is whether a change in outcomes during the period can be explained by factors other than the intervention.

Are there plausible explanations for the decline in serious crimes in New York besides Broken Windows policing? One possible explanation was another major reform during that time: CompStat, the city’s first data-driven system for holding precincts accountable for crime rates.

In the case of Connecticut, it would be helpful to know whether innovations in highway safety—or perhaps just changes in weather—might have accounted for the reduction of fatalities.
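Regression to the mean, described above, can be sketched with a small simulation. The numbers here are hypothetical and are not Connecticut's data: the simulated "years" differ only by chance, with no policy change at all, yet the year after an unusually bad one still looks better on average.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 300  # hypothetical long-run average of yearly fatalities
NOISE_SD = 20    # hypothetical year-to-year random variation

# Simulate many years in which nothing changes except chance.
years = [random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(10_000)]

# Treat the worst 10% of years as "peak" years, the kind that prompt new policies.
threshold = sorted(years)[int(0.9 * len(years))]

peak_years = []
following_years = []
for this_year, next_year in zip(years, years[1:]):
    if this_year >= threshold:
        peak_years.append(this_year)
        following_years.append(next_year)

avg_peak = statistics.mean(peak_years)
avg_following = statistics.mean(following_years)

print(f"Average fatalities in peak years: {avg_peak:.0f}")
print(f"Average fatalities in the year after a peak: {avg_following:.0f}")
```

A before/after comparison that starts at a peak would credit this drop to whatever policy happened to be adopted that year, which is why an evaluator wants to see the longer trend.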

2 Matching

Matching is an evaluation technique that compares the targeted population with other similar populations that were not subject to the intervention.

If you wanted to use matching to help you decide whether Broken Windows policing was responsible for decreasing crime in New York, what information would you need?

Perhaps you’d like to see what the trend of crime rates was in similar U.S. cities during the same period. If crime in those cities fell at about the same rate as in New York, that would suggest that the Broken Windows intervention had no effect. But if crime fell far more in New York than in those cities, that would suggest that the intervention had an effect, though you would still need to account for other factors besides Broken Windows policing that were unique to New York, such as the new system to hold precincts accountable for crime rates.

The data shows a general decline in crime in U.S. cities, including in cities that did not adopt the Broken Windows approach to policing.

The fact that the national crime rate and the New York crime rate both went down during the 1990s suggests something broader at work than particular policing policies. For example, some criminologists believe that the crack cocaine epidemic—which was a major cause of violent crimes—had peaked and was on the decline nationwide during this period.

In the case of the Connecticut crackdown on speeders, researchers compared its highway fatality rate with that of neighboring and similar Northeastern states.

The comparative reduction in Connecticut’s fatalities is some evidence that its harsher penalties on speeding did make a difference.

The idea behind matching is that if the people in the two groups are, on average, identical in all relevant characteristics except that one group participates in the program and the other does not, then you have excluded—“controlled for” is the technical term—all causal factors except the program itself. The challenge is to identify all characteristics besides the program that might plausibly affect the outcome.

Under the matching approach, an evaluator of the job placement program would compare ex-offenders who participated in the program with a similar group of ex-offenders who did not. In evaluation terminology, the first group is called the treatment group and the second the comparison group. In constructing a comparison group of similar individuals, an evaluator would look for factors that might affect the outcome, such as the nature of the offense, prior offenses, educational and work experience, gender, race, and ethnicity.

The evaluator would then compare the employment outcomes of the two groups.

The key to a successful matching approach to evaluation is identifying all the characteristics that might distinguish the outcomes for people in the treatment and comparison groups. The evaluator’s problem resembles a “spot the differences” puzzle.

Being reasonably sure that there are no significant differences requires deep knowledge of the social dynamics surrounding the job placement program. It also requires avoiding one kind of selection bias—allowing the program administrator to select ex-offenders for the program based on their belief that the individuals will succeed. Also, there’s a possibility that the ex-offenders most likely to succeed will self-select into the program.
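The matching idea for the job placement program can be sketched in a few lines of Python. Everything here is hypothetical: the records are invented, and matching on only two characteristics (prior offenses and holding a diploma) is a simplifying assumption. A professional evaluator would match on many more factors and would typically use more sophisticated statistical methods.

```python
from collections import defaultdict

# Hypothetical records: (prior_offenses, has_diploma, in_program, employed)
people = [
    (0, True,  True,  True),
    (0, True,  False, True),
    (0, False, True,  True),
    (0, False, False, False),
    (2, True,  True,  True),
    (2, True,  False, False),
    (2, False, True,  False),
    (2, False, False, False),
    (5, False, False, False),  # no similar participant exists, so left unmatched
]

# Group people who share the same matching characteristics.
strata = defaultdict(lambda: {"treatment": [], "comparison": []})
for prior_offenses, has_diploma, in_program, employed in people:
    group = "treatment" if in_program else "comparison"
    strata[(prior_offenses, has_diploma)][group].append(employed)

# Keep equal numbers from each group, and only where both groups are present.
treated_outcomes, comparison_outcomes = [], []
for groups in strata.values():
    n = min(len(groups["treatment"]), len(groups["comparison"]))
    treated_outcomes += groups["treatment"][:n]
    comparison_outcomes += groups["comparison"][:n]

treated_rate = sum(treated_outcomes) / len(treated_outcomes)
comparison_rate = sum(comparison_outcomes) / len(comparison_outcomes)

print(f"Employment rate, matched treatment group: {treated_rate:.0%}")
print(f"Employment rate, matched comparison group: {comparison_rate:.0%}")
```

The gap between the two rates is only as trustworthy as the list of characteristics used for matching. Any factor left out, such as motivation to find work, can still explain the difference.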

3 Difference in Differences

The difference in differences technique combines matching with a before/after approach. It is often referred to as “DiD” or “diff-in-diff.” This approach helps compare the change experienced by the treatment group with the change experienced by a comparison group over the same period of time. If the change experienced by the group in your program is significantly different from the change experienced by other people, then it’s likely that your program is having an effect.

Under diff-in-diff, an evaluator would note the number of ex-offenders in the treatment group who were employed when they enrolled in the program, and the number employed, say, three years later. The evaluator would subtract the number at the beginning of the program from the number three years later.

Difference #1 = Number of participants in the treatment group employed at 3 years - Number employed at the start

The evaluator would do the same for those in the comparison group:

Difference #2 = Number of people in the comparison group employed at 3 years - Number employed at the start

Finally, the evaluator would subtract the difference between the two numbers in the comparison group from the difference between the two numbers in the treatment group.

Difference in differences = Difference #1 - Difference #2

The difference in differences estimates the effect of the program.

The diff-in-diff technique works even if the demographics of the two groups are different. As long as these differences would be constant over time if the groups did not receive treatment, the differences can be controlled for by subtracting each group's original metric from the outcome metric at the end of the evaluation period. The difference between these group-specific differences is likely to reflect the causal effect of the program.¹
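The diff-in-diff arithmetic can be sketched as a short Python function that subtracts the comparison group's change from the treatment group's change. The function name and the counts are hypothetical, chosen only to illustrate the calculation. Note that the two groups start at different levels, which is fine as long as they would have followed the same trend without the program.

```python
def difference_in_differences(treat_start, treat_end, comp_start, comp_end):
    """Treatment group's change minus comparison group's change."""
    difference_1 = treat_end - treat_start  # change in the treatment group
    difference_2 = comp_end - comp_start    # change in the comparison group
    return difference_1 - difference_2


# Hypothetical counts of employed people, out of 100 in each group.
effect = difference_in_differences(
    treat_start=20, treat_end=50,  # treatment group: 30 more people employed
    comp_start=30, comp_end=45,    # comparison group: 15 more people employed
)

print(f"Estimated program effect: {effect} additional people employed per 100")
```

In this sketch a simple before/after comparison would credit the program with all 30 additional jobs, while diff-in-diff attributes half of that gain to whatever also lifted the comparison group.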

4 Random Selection

In order to match people in a treatment and comparison group, you must identify characteristics that could affect the outcome. In the job placement program for ex-offenders, the characteristics might include race, prior convictions, and drug or alcohol use. But despite your best intentions, you may not think of all the factors that could affect the outcome.

A good way to avoid this problem is to assign people to the program randomly. For each eligible person, the program administrator flips a coin—heads, they participate; tails, they’re part of the comparison group. This process is the essence of a randomized controlled trial or “RCT.” In RCTs the comparison group is also called a control group.

Randomization remedies the major weakness shared by the before/after, matching, and difference-in-differences techniques: their inability to rule out the possibility that any observed difference in outcomes over time may have been caused by factors other than the intervention.

The power of RCTs is that the treatment and control groups are just as likely to have people with all the characteristics you didn’t think of as well as those you did. It’s difficult to do an RCT well, but when you do, it provides the strongest possible evidence that a program caused, or didn’t cause, its intended outcome.
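Here is a minimal sketch of the coin-flip assignment and of why it balances characteristics nobody thought to record. The pool of people and the "hidden trait" are hypothetical, and a real trial would involve far more design work than this.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical pool: 30% of people have a trait the evaluator never measured.
pool = [{"id": i, "hidden_trait": random.random() < 0.3} for i in range(10_000)]

treatment, control = [], []
for person in pool:
    # The coin flip: heads joins the program, tails joins the control group.
    if random.random() < 0.5:
        treatment.append(person)
    else:
        control.append(person)


def share_with_trait(group):
    """Fraction of a group that has the unmeasured trait."""
    return sum(person["hidden_trait"] for person in group) / len(group)


treatment_share = share_with_trait(treatment)
control_share = share_with_trait(control)

print(f"Treatment group: {len(treatment)} people, {treatment_share:.1%} with hidden trait")
print(f"Control group: {len(control)} people, {control_share:.1%} with hidden trait")
```

Because assignment ignores every characteristic, the two groups end up with nearly the same share of the hidden trait without anyone having to identify it. With small groups the balance is rougher, which is one reason sample size matters.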

Randomization avoids selection bias both by the program’s administrator and by participants, where the concern is that those most likely to succeed without the program are selected or self-select to participate in it. But it may be subject to another form of selection bias when both the treatment and control groups consist of volunteers, because those who volunteer may behave differently than those who don’t.

For example, ex-offenders are typically enrolled in job placement programs only if they volunteer. For purposes of evaluation, one can randomly assign people from a pool of volunteers to the treatment or control groups. But the pool excludes people who have no desire to be in the program (for example, because they aren’t interested in the type of job it leads to), and the program may not work for them.

More broadly, almost all social programs have criteria for who may participate, which introduces limitations on what can be inferred from the results. Additionally, the characteristics of participants and the social environment in which a program operates may differ significantly from one locale to another. These are just some examples of the broader problem of what’s called “generalizability” or “external validity.”

Jordan's Journey

Let’s check in with Jordan and see what she finds when assessing the impact of her company’s diabetes prevention program.

Statistically Significant or a Matter of Chance?

Even if there is a noticeable difference between the average outcomes of the groups compared in an evaluation, it is possible that the difference is just a matter of chance rather than the result of the program’s activities. Underlying every evaluation technique is the concept of statistical significance. Essentially, statistical significance is a measure of whether it’s appropriate to attribute the difference to the program—or just to chance.

To oversimplify, statistical significance is a function of the size of the sample groups and the size of the difference in outcomes between the two groups. The larger the sample size, the more likely it is that an observed difference between groups is statistically significant. Similarly, if the size of the difference in outcomes is large, then the difference is more likely to be significant, even if the sample size is not very large.
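To see how sample size changes the picture, here is a sketch using a two-proportion z-test, one common way to compare two rates. The choice of this particular test, the group sizes, and the conventional 0.05 cutoff are all assumptions for illustration. An evaluator would choose the test that fits the study design.

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two observed rates."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / standard_error
    # Probability of a gap at least this large arising from chance alone.
    return math.erfc(abs(z) / math.sqrt(2))


# Same rates (45% vs. 30% employed) with hypothetical small and large groups.
small_p = two_proportion_p_value(9, 20, 6, 20)
large_p = two_proportion_p_value(90, 200, 60, 200)

print(f"20 people per group: p = {small_p:.3f}")
print(f"200 people per group: p = {large_p:.3f}")
```

The 15-point gap is identical in both cases, but only the larger groups push the p-value below the conventional 0.05 threshold. With 20 people per group, a gap of this size could easily be a matter of chance.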

Wrap-up

Assessing your impact is a complicated job, which is why you need an outside expert to help.

Even though you won’t be the one evaluating your program, you still have an important role in the process. Position yourself for a successful evaluation before you launch. Lay the groundwork for a robust plan to collect feedback and monitor your progress. And when the time is right, devote the resources necessary to hire a professional evaluator.

You want to change the world. At some point, you’ll need to learn whether your program’s impact is significant enough that others can build on your experience and funders will want to continue to support your work.

¹ https://www.sciencedirect.com/topics/economics-econometrics-and-finance/difference-in-differences