Transcript of A Brookings Press Forum
Can We Make Education Policy on the Basis of Evidence?
What Constitutes High Quality Education Research and How Can it Be Incorporated Into Policymaking?
Wednesday, December 8, 1999
The Brookings Institution
Somers Room
1775 Massachusetts Avenue N.W.
Washington, DC 20036
Host:
Tom Loveless, The Brookings Institution
Chair:
Paul Peterson, Harvard University
Presenters:
Robert Boruch, University of Pennsylvania
Thomas Cook, Northwestern University
Judy Gueron, Manpower Demonstration Research Corporation
Howard Hiatt, American Academy of Arts and Sciences
Fred Mosteller, Harvard University
Mr. Loveless: Education policy. This is one of several forums that we have scheduled during the course of this year on education policy issues. Today's is on what constitutes high quality evidence in educational research. For your calendars, we have a meeting on February 24th on vouchers and charter schools; Checker Finn is one of a panel of four presenters there. And on April 11th, we will be holding another one of these on the role of teacher unions in educational reform. So thank you all for coming.
My duty today is simply to introduce the moderator of our panel, and I am quite pleased to introduce Paul Peterson. Paul is the Henry Lee Shattuck Professor of Government at Harvard University. He is also the director of the Program on Education Policy and Governance at the John F. Kennedy School of Government at Harvard. The Program on Education Policy and Governance, along with the Brown Center and the American Academy of Arts and Sciences in Cambridge, are the three sponsors of this event today.
So please welcome Paul Peterson.
Mr. Peterson: Thank you, Tom. A couple of years ago, I hosted a seminar at Harvard on "How We Can Improve the Quality of Education Research," and as I was getting this underway, Howard Hiatt at the American Academy of Arts and Sciences was beginning to plan a conference on "How We Can Improve the Quality of Research in Education." We found out that our interests were very similar. We both felt that the kind of research that is characteristic of medical research--the randomized field trial, where one person gets the pill and the other person gets the placebo, and it is all done by pulling names out of a hat at random, as we say in the research enterprise--is the best way of conducting a scientific experiment to find out whether something works or not. If two groups cannot be said to be the same at the beginning of the treatment, you do not know whether a difference between them at the end is a true difference. But by randomly assigning people to one of two groups, you know the two groups, on average, if you have a big enough number, are likely to be pretty much the same at the beginning.
So that is a very simple idea actually. It is a very simple idea. This simple idea is used dramatically, extensively, and to great effect in medical research. It is sometimes used in education. Sometimes educational research is designed in this way, but not very often. Most of the time, that's not the way that we do research in education. So we want to talk about that today.
First I want to introduce Howard Hiatt from the American Academy. He is the Director of the Initiatives for Children, and he is also a professor of medicine at Harvard; in that role he played a major part in introducing into public health practice the concept of randomized experiments. It is the vast experience that Howard and his colleagues have had that motivates what we are going to be talking about today. So Howard, would you say just a few words about what we can learn.
Mr. Hiatt: It is almost 25 years ago that Fred Mosteller, from whom you will hear shortly, then chairman of the Department of Biostatistics at the Harvard School of Public Health, and I, then dean of that school, set up something that we called a Center for the Analysis of Health Practices. There was a time in the lives of your parents when tonsillectomy was a rite of passage for all children.
One of the several areas of medical practice that we looked at was the validity of tonsillectomy. We examined evidence and found that patients who underwent tonsillectomy were approximately as well off one year, five years, and 10 years later as patients who received penicillin, and the latter group, of course, had much less in the way of financial cost and, much more important, much less in the way of any kind of risk.
This approach has now been taken with many--though not nearly enough--procedures that were part of the medical culture, and it represents what we consider the basis for evidence-based medicine. When I undertook the direction of that program that Paul just referred to, Initiatives for Children, at the American Academy, and Fred became the head of its Center for Evaluation, we decided to look at other policies affecting children, starting with education, and it was that undertaking that led Fred to do some of the work that in turn led us to become interested in this conference that Paul has helped organize with Tom Loveless today.
Mr. Peterson: Thank you, Howard. This is exactly the right moment to introduce to you Fred Mosteller, professor of statistics at Harvard University, who has played such a major role in helping nonstatisticians understand how statistics can be put to public benefit. He first made enormous contributions in the field of medicine and has most recently brought this same creative mind to the field of education. Were it not for Fred, we would not have had these activities in Cambridge, and we would not have had the energy needed to assemble here today. So Fred, could you give us your thoughts for a few minutes.
Mr. Mosteller: Thank you, Paul and Howard. I plan to report on some activities that have occurred since the conference we had in May at the American Academy in Boston. The events relevant to today's meeting I want to report on because they have led to the establishment of an organization called the Campbell Collaboration. When we think of medical research and its parallel with recent efforts in education, we think especially of the massive research done at Oxford University and the Oxford Database of Perinatal Trials, and of the work put out by the Oxford University Press with editors Iain Chalmers, Murray Enkin, and Marc Keirse. They prepared an impressive two-volume work affectionately called ECPC, which is an acronym for Effective Care in Pregnancy and Childbirth. To do this, they organized many different teams of researchers that collected and analyzed data from the literature and resolved many specific questions in the area of pregnancy and childbirth.
On completion of that effort, they decided that all of medicine needed an organization designed to carry out similar data gathering and analysis, not just pregnancy and childbirth. So they organized what is now called the Cochrane Collaboration, which consists of teams of physicians and related experts scattered over the world but joined in an effort to carry out organized studies of collections of medical papers for all of medicine.
By the way, Cochrane was the name of a physician who was very encouraging to the medical profession about the use of clinical trials in medicine. Dr. Chalmers found that some areas of medicine overlapped the social sciences and education and was impressed by the need for similar data gathering and analysis in the social sciences, education, and other areas with important social problems such as economics, juvenile delinquency, crime and punishment, and so on.
Indeed, his son, who is a school teacher in the United Kingdom, has been asking him about the availability of analyses like those in ECPC but oriented to education. In July of this year, in London, about a hundred people gathered to consider extending the Cochrane Collaboration idea to the social sciences, education, and so on. The meeting considered what ought to be developed, and they agreed that they should develop an organization parallel to the Cochrane Collaboration but in the general area of social science and economics.
They were encouraged in this effort by a British governmental organization, the Economic and Social Research Council. It is an organization roughly like the more social part of our National Science Foundation, but it is not like our Social Science Research Council, which is not a governmental agency. At any rate, after discussion of the various activities that could be undertaken, the meeting in London decided to create a counterpart to the Cochrane Collaboration, which they proposed to call the Campbell Collaboration in honor of Donald Campbell, an American social scientist who was especially interested in encouraging the use of experimental approaches to social problems.
The purpose of the Campbell Collaboration would be to develop methods of gathering and analyzing data relevant to social problems. In the United States, Professor Boruch, from whom you will hear shortly, is the leader of the Campbell Collaboration, and indeed he is also the chairman of its steering committee.
The main point of this set of remarks is just to show you that there is considerable continuity in the problems of data gathering and analysis in education and the social sciences, and I wanted to mention that the Campbell Collaboration offers one direction for improving our data gathering and analysis. And note that we do have the director of the Campbell Collaboration here. They have already begun a database called SPECTR, which is an acronym--not an anachronism--for the Social, Psychological, Educational and Criminological Trials Register, and they plan a meeting in Philadelphia in February. Thank you.
Mr. Peterson: See, that's the danger of spell check. It corrects the word but gives you the wrong word in the end--"Campbell" comes out as "Camel."
Well, I have a feeling that all my speakers are being introduced before I have the chance to do so. We would now ask Robert Boruch, University Trustee Professor of Education and Statistics at the University of Pennsylvania, to lay out the basic principles of a randomized experiment, how this concept has spread in the study of social problems, and how it might be utilized effectively in the field of education. I can think of no person who is having a larger impact in general on social problems than Bob Boruch, and it is a real pleasure to have him here today. Bob.
Mr. Boruch: Thank you, Paul. Fred is sensitive to the history of trying to generate high quality evidence in a variety of social sectors. Let me be no less sensitive by starting off with Florence Nightingale. The problem of actually trying to encourage governments or foundations or other organizations to generate evidence of the effectiveness of what we do is an old one, at least in the recorded literature. Florence was [inaudible] put out by the British government. There were some changes in the law made, as she complained, "without looking after the effects, without any inquiry after results, past or present; it is all experiment--that is to say haphazard--see-saw, doctrinaire, a shuttlecock between two battledores." The General--thank you.
The General Accounting Office usually puts this a lot more briskly, with statements like: the U.S. Department of Education or the U.S. Department of Labor does not know what works. Something less elegant than Florence, but certainly a good point. Let me give a definition of a randomized experiment--incidentally, there are handouts back there someplace. The definition of a randomized experiment here is the random assignment of individuals or entire entities to one of two alternative interventions in order to produce a fair comparison; that is to say, a statistically unbiased estimate of the relative effects of the two interventions and a legitimate statistical statement of one's confidence in the results.
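As a rough illustration of the definition just given, the sketch below simulates a lottery that assigns 200 hypothetical students to one of two interventions and then computes the difference in mean outcomes along with an approximate confidence statement. The sample size, the 3-point true effect, and the variable names are all invented for illustration; nothing here comes from the studies discussed at the forum.

```python
# A minimal sketch, with invented data, of the design Boruch describes:
# randomly assign units to one of two interventions, then report an
# unbiased difference in means plus a confidence statement.
import math
import random
import statistics

random.seed(1)

students = list(range(200))
random.shuffle(students)                    # the lottery
treatment, control = students[:100], students[100:]

def outcome(treated):
    """Hypothetical test score: baseline noise plus a 3-point true effect."""
    return random.gauss(50, 10) + (3.0 if treated else 0.0)

y_t = [outcome(True) for _ in treatment]
y_c = [outcome(False) for _ in control]

diff = statistics.mean(y_t) - statistics.mean(y_c)
se = math.sqrt(statistics.variance(y_t) / len(y_t) +
               statistics.variance(y_c) / len(y_c))
print(f"estimated effect: {diff:.2f} points "
      f"(approx. 95% CI {diff - 1.96 * se:.2f} to {diff + 1.96 * se:.2f})")
```

Because assignment is by lottery, the two groups differ only by chance at baseline, which is what licenses reading the difference in means as an estimate of the intervention's effect.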
By way of illustration, at this point the White Wing Foundation in Philadelphia is running randomized experiments on the effect of awarding small scholarships to children who achieve As and Bs in school and whose families are poor, very poor. They have been doling out money for the past 80 years and have come around to the point of view that they would like to estimate the relative effects of those scholarship awards on things like dropout rates, subsequent achievement of kids, and the like.
The situation is such that the supply of poor kids who actually achieve good grades exceeds the supply of money for scholarship awards, and so a lottery system is put into place in order to produce two groups that are comparable. One group gets the scholarships and the other group does not, in the interest of estimating the relative effectiveness of the scholarship awards.
There is also a set of large scale experiments that involve assigning entire entities to alternative regimens to understand what works. Most of these you will run into in the health sector or the health education sector. So, for example, schools are randomly assigned to systematic approaches to reducing the level of conflict, assault, and aggressiveness among middle school kids, versus control schools to which they are later compared.
With respect to the rationale for this, as you might expect, when we do these randomized experiments, our bottom line in most cases is that where the problems are severe and the evidence on the effectiveness of purported solutions is ambiguous, as it often is, generating better evidence can help elevate the quality of the debate about which programs ought to be sustained, terminated, and so on.
Certainly in bilingual education there is still lots of debate. Over the past 20 years, there has been only one sizeable randomized experiment, which produced relatively unequivocal evidence about the effects of a particular bilingual program. All the rest of the studies are correlational or ex post facto, involving matched comparison groups in which one can never know whether the matches are good, and so on. Until the STAR experiment and the reanalyses of those Tennessee class size experiments by Fred Mosteller, Alan Krueger, and others, there was lots of debate about whether reducing class size has any effect at all or whether it might even have negative effects. Those Tennessee experiments certainly helped to illuminate that argument. There are other areas that we could cover.
The second major reason why these experiments are justified is that there are indeed good examples of interventions that have been developed and tested using randomized field trials. How many of us watched Sesame Street? That program is good, excellent, only because historically it was developed on the basis of lots of little micro experiments on how kids watch: how distractible they are when they are watching, when they are spelling, when they are counting, and so on. The program was then field tested in randomized field trials to estimate its effectiveness for children, and a couple of other people did reanalyses of those trials in order to understand its quality, credibility, and the like. Certainly the preschool projects are further examples.
These are exemplary experiments in education, in a sense. One can probably find a couple dozen examples like this. It is remarkable that if one looks in the Cochrane Library on clinical trials in medicine, one finds 250 independent trials listed. Want to do the same thing in education? You would probably find, oh, 500 or so, which is exceedingly small --
Unidentified Speaker: 250,000.
Mr. Boruch: 250,000. I am sorry. A third major reason why we do these randomized experiments is that alternative approaches to estimating the effects of education programs, welfare programs, and programs in medicine can often lead to estimates that are biased in a negative direction. That is, we can make useless programs look harmful, and make programs that have relatively modest effects appear as if they have zero effects, by failing to construct the kind of control groups that you get when you do these randomized experiments.
The major example actually comes out of the labor and employment arena. One of these, I don't know, may have been your work, in which the randomized trials estimated, I think, the average effect of an employment and training program on annual wage rates for high risk young males. Basically, for the three years 1977, 1978, and 1979, the trials showed that these programs had a trivial or zero effect. Relying instead on sophisticated modeling of passive observational data--census data or manpower survey data and so forth--generated estimates that were negative and certainly embarrassing.
If we take the experiment as the gold standard, and one takes serious statistical [inaudible] modeling as another kind of approach to estimating the effectiveness of these labor or education programs, the bottom line is that negative estimates do appear when you run such parallel comparisons. They don't always appear. There are occasions where there are positive biases.
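A small simulation may make the direction-of-bias point concrete. In the sketch below, assumed purely for illustration, a hypothetical training program raises annual earnings by a fixed amount, but the people who enroll are those with the weakest earnings prospects; the naive enrollee-versus-nonenrollee comparison then comes out negative, while a randomized comparison on the same population recovers roughly the true effect. All figures and the enrollment rule are invented.

```python
# A hedged illustration (not Boruch's actual data) of how self-selection can
# push a nonrandomized estimate negative even when the true effect is
# modestly positive. High-risk people (low baseline earnings) are the ones
# who enroll in the hypothetical training program.
import random
import statistics

random.seed(2)
TRUE_EFFECT = 500.0                              # dollars of annual earnings

people = [{"baseline": random.gauss(12000, 3000)} for _ in range(5000)]

# Nonrandomized world: those with the weakest earnings prospects enroll.
for p in people:
    p["enrolls"] = p["baseline"] < 10500
    p["earnings"] = p["baseline"] + (TRUE_EFFECT if p["enrolls"] else 0.0)

naive = (statistics.mean(p["earnings"] for p in people if p["enrolls"]) -
         statistics.mean(p["earnings"] for p in people if not p["enrolls"]))

# Randomized world: a coin flip, not need, decides who gets the program.
for p in people:
    p["assigned"] = random.random() < 0.5
    p["earnings_rct"] = p["baseline"] + (TRUE_EFFECT if p["assigned"] else 0.0)

rct = (statistics.mean(p["earnings_rct"] for p in people if p["assigned"]) -
       statistics.mean(p["earnings_rct"] for p in people if not p["assigned"]))

print(f"true effect:            {TRUE_EFFECT:8.0f}")
print(f"nonrandomized estimate: {naive:8.0f}   (program looks harmful)")
print(f"randomized estimate:    {rct:8.0f}")
```

Statistical modeling can try to adjust away the selection, but as the speakers note, without a randomized benchmark one cannot tell how well the adjustment worked.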
The easiest example to find happens to come from medicine. There are also situations in which the alternatives to randomization lead to results that are stupefying in their ambiguity: there is no way of telling what is right. Now, another reason why we should be paying attention to this stuff: these are the results of a hand search of the American Educational Research Journal, every article in every issue from 1964 up until the present. There are about 1,200 research articles in all. This is the mainstream general education research journal, and it is usually high quality as these things go.
We looked only for real randomized experiments in mathematics and science education. That sounds like an important topic and one that should be covered. The total number over that 30-year-or-more period: 31 experiments out of 1,200 studies. This is a little absurd somehow. There is something wrong with that picture. This one shows the incidence of randomized trials in criminal and civil justice, on the other hand, and also welfare experiments in the economic sector, including employment and training, which show an upward trend. My student who did this stopped--he concluded his dissertation in '85; that is the reason for the drop-off date. The trend actually continues beyond that. When you look at other literatures, including medicine, one finds an accelerating use of randomized experiments in those arenas as well. Okay.
A final reason, a justification for being interested in all this, goes back to Walter Lippmann's criticism of Franklin Roosevelt, whom he accused of being dishonestly experimental--Roosevelt's claim being that some of these programs were experimental when in fact he wanted to drive them through and override objections. In talking about this, Lippmann suggested that without conscientious experimentation--in roughly the same sense of being honest about generating evidence--we will leave all these matters to the unwise, to those who bring nothing constructive to the process and who greatly imperil the future by leaving great questions to be fought out between ignorant change on the one hand and ignorant opposition to change on the other.
Mr. Peterson: Bob, you were fabulous. Let me just underline what I took to be your fundamental point: only 31 out of 1,200 articles addressing math and science education used the gold standard, the randomized field trial, to generate evidence. That is very similar to the kind of numbers you get when you look at all the research going on in the field of education and ask what percentage is conducted using the randomized field trial design. It is because of the paucity of such studies, in a field where we desperately need evidence in order to get beyond ideology, that we think it is reasonable to ask you people here in Washington, who think about how to write our laws and administer our laws, to ponder this topic.
It is a significant issue that really has not penetrated the policy debate to the extent that some of us feel is necessary. Now, Tom Cook has thought a good deal about why this is and how it came to be. He is a professor in the Department of Sociology at Northwestern University. He is a major figure in the design and analysis of evaluation research, and I won't tell you about all of his enormous accomplishments but will give him his fair share of time instead.
Mr. Cook: Thank you, Paul. I was asked to speak about the barriers to conducting experiments in education, and I want to talk about two of them. The first is the belief system of those who commission and design evaluations of educational practices, including educational innovations. One relevant belief is that experiments simply can't be done in schools. This is because experiments create focused inequalities: some students get different services from others, often services that seem more desirable, and superintendents, principals, teachers, and parents don't like these focused inequities, so it can't be done. Now, this is [inaudible] false. In a 1991 review of studies done in schools on prevention issues--preventing smoking, violence, and the like--Dollack found 190 randomized experiments. Obviously since then there are many, many more. These were done in schools. In the last few years there have been a flurry of experiments--not very many of them, but some--and all except one were done by people outside of education, in the sense of not being housed in schools of education; the people who did the experiments reported that they were all psychologists or in public health.
So there is an absence of experiments, as we saw before from Bob's graph, done by people trained in education and evaluation and coming out of schools of education. It is not an absence of experiments.
The second belief is that experiments are unethical because they involve withholding potentially beneficial educational practices from some children who might need them. Now, one only has to think a little about what is happening nowadays with the standards movement in education. We are seeing unanticipated side effects of the standards movement, in that so many children are potentially being held back that their parents don't like it. A lot of scholars and other people worry about the effects of being held back for one year or two years, and about the social consequences of this. We are seeing the possibility here of unplanned side effects, so that some people who were riding the wave of the standards movement--if you looked at last week's New York Times--are now saying we have to back up or else this movement is going over a cliff. It is almost as though not to do experiments is unethical, given how many states have climbed onto the standards bandwagon and are now about to face these unexpected side effects.
The third belief is that educational interventions can't be standardized, and experiments require a standardized treatment, a standardized package of practices. Now, the first part of that statement is true. When you look at the implementation of educational interventions, some of them get adopted by a school fairly faithfully. Others get adapted by a school. Many get ignored by schools or sometimes by teachers within schools. So there is, in fact, empirically great variation in the quality of implementation of educational innovations.
The second part of the belief--that because there is this variation in implementation, you can't do experiments--is not true. The requirement, given variation, is that experiments be larger, not that they are therefore useless. Experimenters would prefer standardized interventions, but they are not a necessary condition for experimenting. Indeed, if variation in the quality of implementation of educational practices from school to school and from teacher to teacher is the norm, then we have to identify educational practices that remain effective despite that variation in how schools implement them in our decentralized system, and despite the control that individual teachers have once they close the door of their room.
The fourth belief that one hears a lot is that better methods than the experiment already exist for evaluating educational interventions. Now, Barbara's work goes through some of the quantitative alternatives, but to me the most important alternative is qualitative rather than quantitative, and it comes from a set of beliefs that goes something like this. Schools are complex social organizations. They try to improve themselves the way any other organization does--say a business or a newspaper--and businesses rarely do experiments on themselves. They do trial and error and use consensus practices for finding out what works, or they call in management consultants, and these freshly minted MBAs come and examine you, right: they interview, they look at records, and they give prognostications about what you should be doing. You make some of the changes, the ones compatible with the culture of your organization, and then you look at your bottom line--productivity, profits--and you make a judgment as to how successful this was. Given the premise that schools are complex social organizations, the alternative model is not the randomized experiment, it is the management consultant, and many, many school districts believe this is the way to go to create change and to evaluate it. They believe that craft knowledge, knowledge that comes through the craft of being an educator, is more important than what they would call scientistic knowledge. We call it scientific knowledge.
Now, their preferred methods are not technical or statistical. The problem with this position is that if you look at the work that management consultants do, much of what they advise is predicated upon experiments and good quasi-experiments that have been done on business practices, which get into the lore of what is taught in business schools and therefore become the foundation from which the management consultants offer advice. So the advice that they give you is actually often founded on experimental and good quasi-experimental evidence.
That is not the case in education, and the most severe barrier is this belief that complex social organizations require a management consulting model and not a scientific model. The second barrier that I want to mention briefly, given my short time, is the issue of funding. In education, funding sources in government and in foundations--both of them--have not pressed for quality evaluation of what works. There is no culture of that, unlike at NIH, let's say, or the Food and Drug Administration, and even unlike in early childhood education, where there have been randomized experiments done: Perry Preschool, the Abecedarian Project, the Old Project, and even the Congressional call for an evaluation of Head Start, which set up a committee to come up with a plan to evaluate Head Start. The requirement that Congress put in was that there be a randomized experiment or else a convincing rationale for not doing one. So the presumption was that a randomized experiment would be done, and that it would be given the design the committee came up with. So in other areas that are close to education, there is almost a requirement that randomized experiments be done, but not in education per se.
Now, we've seen more and more moves towards the Office of Education putting out programs jointly with NICHD and jointly with NSF, the presumption being that perhaps different standards of what constitutes acceptable evidence of what works might come from those pairings. In any event, funders have to step forward and say that higher quality evidence is needed. Otherwise, 30 years after Comer, we still don't know whether Comer works. Thirty years after accelerated schools, we still don't know whether it works. Thirty years after vouchers, we still don't know whether it works. Thirty years after the effective schools movement, we still don't know whether it works. Kids deserve better. Thank you.
Mr. Peterson: Thank you, Tom. Your comments reminded me of the debate that went on over the war on poverty back in 1964 when Lyndon Johnson first proposed it. The interest groups fought over it--everybody knew that education was central to addressing the problems of poverty, but no one could agree on what you could do in the crucial years of education, from 6 to 16, so the legislation that was passed said, in effect, we shall have educational innovation but not for the ages 6 to 16. So that's why we got Head Start under the war on poverty, and that's why we got the manpower development and training programs that developed out of the war on poverty: the one aimed at kids below the age of 6, and the other aimed at young people beyond the age of 16.
Well, that's just about exactly what happened in terms of evaluation research and randomized experiments. We've done randomized experiments on very young children--that is thought to be ethical--and we've done randomized experiments on adults over the age of 16, but we very, very seldom do it in the crucial years between 6 and 16. Well, Judy Gueron has been an absolute leader in this respect. She is president of the Manpower Demonstration Research Corporation, which began back in the 1970s to look at what training programs were effective. Of course, training programs are educational programs; there are really no fundamental differences, as you can fully understand. But because of the culture within which these programs were being developed, it was possible to introduce the concept of randomized field trials to find out what is effective and what is not. A lot of studies have been undertaken over the years, and we have learned a lot from them, and no organization has played a more significant role than the Manpower Demonstration Research Corporation, which has since gone on to look at welfare policy, because manpower training programs began to be woven into welfare reform. Much of what we now see as the appropriate way to go in terms of welfare reform has come out of the work of this organization. So it is a pleasure to introduce Judy to talk about this question today.
Ms. Gueron: Thank you. I want to say a few words first on what I bring to this topic so that you know my perspective. As Paul said, I'm not an educational researcher. I head an organization whose mission is to improve the well-being of low income people by learning what works to improve public programs and, through that, to change policy and practice. Our primary focus has been on employment and training and welfare reform, and only more recently have we turned to education. The premise behind MDRC is that you can get better information and that it will make a difference in practice, and we believe the answers to both are yes.
We do large scale field trials. We have been pioneers in using random assignment to evaluate social programs, and my belief in the value of this approach comes from painful experience with challenges to the validity of alternative types of studies. With random assignment, you know something, and you can more confidently separate fact from advocacy. With the alternatives, this is much less true. Moreover, not only can you know something, but you can explain how you learned it in plain English that is uniquely convincing.
In 25 years, we've conducted 30 major random assignment experiments in more than 200 locations involving close to 300,000 people. We've never had a serious challenge to the credibility of a random assignment study but I do want to say that random assignment is not always the right approach. Such studies don't answer all questions and critically they are not always appropriate. Moreover, all policies don't merit such a study.
Our experience suggests eight criteria for judging when random assignment is the right tool. Bob, could you put on that transparency. The first is when the key question is: does the program or reform make a difference? That is, questions of program impact--not questions like the feasibility of something or the replicability of a program approach.
Second, when the program under study is sufficiently different from business as usual and you can maintain that distinction over time. As Bob described, under random assignment you have people in the test program and people in the control group who are placed in the two groups by a lottery. There has got to be a difference between those two conditions, and you've got to be able to maintain it over time. That is easy if the program is simple and short, if the treatments are relatively distinct, and if it is controlled by the researchers; it is harder when students are in the same school, taught by the same teachers, and when the program goes on over many years. That doesn't mean it is impossible. Just harder.
Third, when you are not denying people access to an entitlement. Random assignment studies are best used to test a new program or when there are not enough funds to serve all eligibles; that is, when the program is oversubscribed. That is, in the process of doing the study, you are not reducing the number of people served by a treatment. In that environment, it is relatively easy to convince people that random assignment is an ethical way to allocate scarce resources, and indeed in certain environments random assignment has been used when there were no research goals, simply to allocate slots--for example, in magnet schools in Cleveland or New York City, or access to Department of Labor funded slots in the summer jobs program for youth.
Fourth, when you are addressing an important or unanswered question. If you know the test program works and you can afford to provide access to all people, you shouldn't do such a study but that is rarely the case.
Fifth, when you include adequate procedures to inform program participants about the study and assure data confidentiality. Suspicions about the ethics of experimenting on people run deep. We have encountered over the years great hostility towards the idea of a random assignment experiment. Researchers need to follow the highest standards, provide adequate information to people, and, where appropriate, get informed consent from students and their parents.
Sixth, when there is no easier way to get a good answer. Before launching such a study, researchers should be sure that there is a high probability of getting reliable findings, that there is no less intrusive or less expensive way to get equally valid results, and that the study has a high probability of addressing important questions whose answers will actually be used.
Seventh, when you can get cooperation. You need the people involved in such a study to buy into it from the beginning. Random assignment is an all or nothing process; you can't be a little bit random. To make such a study work, you need buy-in from people in the school or the program, and critical to getting this is showing them that it won't be too burdensome or disruptive to what they are doing, and that there is real value--to them and to the people they care about, the participants in their program--in doing the study. I've had some very moving experiences over the years convincing teachers or intake staff in social programs that they want to be in a random assignment study. Nobody a priori wants to be in a random assignment study. Before taking a vote on whether to participate, we would do things like explain the process, what the random assignment involved, why the results were uniquely reliable and believable, and how there were examples where positive findings had actually mattered to the policy process, so that they could see a reason to participate in this endeavor and it wasn't just an academic ivory tower thing that you wanted to do so that you could write another article.
Eighth, finally, when you have the resources and ability to do a quality study. There is no point in starting if you can't deliver credible results. This means, for example, that the number of people in the study has to be large enough, or else don't bother. The data have to be of high quality, so that you believe they answer the questions you set up at the beginning. You have to address the right questions--collecting data on the things you will ask about a program, not missing the key outcomes. And the study has to follow people long enough to yield a reliable conclusion on whether the program did or did not work.
This eight point list may sound daunting, but there is a strong track record of successful random assignment studies in very diverse fields, including a few in education. Looking for opportunities to apply these techniques in educational research is one vital route to bringing new rigor to our understanding of what works for improving outcomes for children in our schools. Thank you.
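Gueron's eighth criterion--that the study be large enough to deliver credible results--can be given a rough feel with a standard back-of-the-envelope power calculation. The sketch below uses the usual normal-approximation formula for a two-arm comparison of mean test scores; the significance level, power target, and candidate effect sizes are illustrative assumptions, not figures from the forum.

```python
# A rough sketch of "the number of people in the study has to be large
# enough": approximate sample size per arm for a two-sided difference-in-
# means test, with the effect expressed in standard-deviation units.
import math

def n_per_arm(effect_size_sd, z_alpha=1.96, z_beta=0.84):
    """Sample size per group for 5% two-sided significance and 80% power."""
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size_sd) ** 2)

for d in (0.50, 0.20, 0.10):
    print(f"effect of {d:.2f} SD -> about {n_per_arm(d):,} students per group")
```

The point of the exercise is the steep curve: halving the detectable effect roughly quadruples the required sample, which is why small, plausible educational effects demand large studies.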
Mr. Peterson: I think the one point that Judy made that I would like to underline is that public policy can be influenced by randomized experiments more certainly than by almost any other research strategy. Indeed, as we know with all public policy, there are many political forces that shape it as well as research, but the randomized experiment has a capacity to influence public policy because there is much less controversy about what the finding is. When you use alternative methods, there will always be a great deal of controversy. That does not mean you will not have controversy over randomized experiments--you will--but it is substantially reduced because the scientific principle is so well established.
Well, anyhow, one randomized experiment that I did recently was to find out whether or not people liked to be talked to or whether they liked to talk back and the evidence is absolutely conclusive so we will now open this up to--yes, sir.
Mr. Loveless: Could you identify yourself and also speak very loudly. We are transcribing the event and we would like to get you on tape.
Participant: Siobhan Gorman, National Journal: How can you be random when dealing with vouchers and choice?
Mr. Peterson: Well, actually and I guess I will take that question because I'm doing a randomized experiment looking at vouchers and in the situations that I have encountered, I have been fortunate because the amount of what they call scholarship money is not large enough to be able to offer a voucher or scholarship to all those who would like one and therefore to be fair, they have handed out the scholarship or the voucher on the basis of a lottery. And therefore, you can compare those who won the lottery to those who didn't win the lottery and see what happens to them. There are lots of problems. I'm not going to say that all problems vanish just because you have a lottery here but it does certainly facilitate the process.
The same is true of charter schools. Charter schools, if they are oversubscribed, are in many cases now required by state law to admit on the basis of a lottery, and thus far we don't have a single study out there on the impact of charter schools that has been based on data that takes advantage of the random assignment design. I understand that Caroline Hoxby in the Economics Department at Harvard is undertaking a study like that and eventually will release the results, but I don't know exactly what kind of data she has, so I can't speak to that in more detail. So there are a couple of things in the works.
The Edison Project, you are quite right, none of their results are based on a randomized experiment.
Participant: John Hovan. How can you measure differences in implementation of treatments?
Mr. Peterson: Tom, you might be the best person to respond.
Mr. Cook: It can be done several different ways, preferably multiple ways in the same project. We just finished doing randomized experiments on the Comer program in Prince George's County here, in Detroit, and in Chicago, and to measure implementation in those places we have had questionnaires delivered to school staff on an annual basis in which they rate what is happening, and we have also had ethnographers in the schools observing what is happening. We use those as two primary mechanisms for getting a handle on the quality of implementation.
Now, in this case I happened to write a paper with Jim Comer beforehand outlining the theory of his program, so that he and I are on the same wavelength as to what his program is and therefore how it should be measured. Right. Now, once you've done that, you can ask the obvious questions: what is the degree of implementation of the program overall compared to the comparison schools; how much variation is there from school to school within the treatment schools; and then you can ask what the quality of implementation does for the size of the effect obtained. Now, that last question is not a question based upon randomized experimentation, because schools in a sense have self-selected themselves into the quality of the program implemented, but it is still an analysis well worth doing, and a set of papers has recently appeared on improved analyses of that question of what the effects of variation in the quality of implementation are.
Now, I just described implementation where the unit being assigned to treatments was whole schools. That is not the norm. The norm is more to assign classrooms or to assign individual children. I mean, in Paul's case of vouchers, they are assigning individual children. It can get tougher if children go to many, many schools, given what you are evaluating, because then you have to go to the expense of collecting data on how the program is implemented in various schools, but basically it is the same principle, I think, which is to look for qualitative and quantitative information to answer those three questions: did implementation vary with the treatment; what is the variation in implementation within the treatment group; and what are the likely effects of implementation on outcomes for kids, irrespective of the group to which they were formally assigned by the experiment.
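Cook's three implementation questions can be illustrated with a toy calculation. In the sketch below, the school labels, fidelity ratings, and gain scores are all invented; the point is simply the shape of the analysis--compare fidelity across arms, look at the spread within the treatment arm, and then (correlationally, as Cook cautions) relate fidelity to outcomes.

```python
# A minimal sketch, with invented fidelity ratings, of the three questions
# Cook poses about implementation. Requires Python 3.10+ for
# statistics.correlation.
import statistics

# (school, assigned to treatment?, mean fidelity rating 0-5, mean gain score)
schools = [
    ("T1", True, 4.2, 6.0), ("T2", True, 3.1, 4.5), ("T3", True, 1.8, 1.0),
    ("T4", True, 4.8, 7.2), ("C1", False, 0.9, 1.5), ("C2", False, 1.1, 0.8),
    ("C3", False, 0.7, 2.0), ("C4", False, 1.3, 1.1),
]

t = [s for s in schools if s[1]]
c = [s for s in schools if not s[1]]

# 1. Overall implementation, treatment versus comparison schools.
print("fidelity: treatment", round(statistics.mean(s[2] for s in t), 2),
      "vs. control", round(statistics.mean(s[2] for s in c), 2))

# 2. Variation in implementation within the treatment group.
print("within-treatment SD of fidelity:",
      round(statistics.stdev(s[2] for s in t), 2))

# 3. Does implementation quality track the size of the gain? (Correlational,
#    not experimental: schools choose how faithfully to adopt the program.)
print("fidelity-gain correlation (treatment schools):",
      round(statistics.correlation([s[2] for s in t], [s[3] for s in t]), 2))
```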
Mr. Peterson: Yes, Marci Kanstoroom.
Participant: One obstacle seems to be getting agreement on the desired outcomes. What are your thoughts on that? Is this different for education than other fields?
Mr. Peterson: Bob, why don't you and then I'll let Judy respond too.
Mr. Boruch: Well, certainly all areas of education are characterized by the problem that you mention. For example, in the adult literacy arena, meetings convened by the U.S. Department of Education three or four years ago, to which lots of advocates of literacy programs were invited, resulted in a great deal of confusion, because the outcomes those folks identified as important for gauging the effectiveness of a literacy program included things like health, including dental health, the ability to get on a bus, the ability to read your tax return, your subsequent education level--an enormous series. Certainly there is no way to do good research given that multiplicity, that embarrassingly large number of promiscuously chosen outcome variables. You have to settle on something. The medical arena is kind of fortunate, because it has the luxury of relying on death as a primary indicator of outcome. In the criminal justice sector, with which I deal, recidivism is kind of natural and frequently used, and so on.
In education, to the extent that intermediate outcomes like standardized achievement can be reliably measured and are good predictors of what comes after, including the likelihood of dropping out, reaching some agreement on those seems absolutely essential to developing a state of the art of accumulated knowledge. We also need to understand other outcome variables. Take, for example, the Wisconsin welfare experiments, in which the outcome variable was a kid's attendance in school. Okay. I did not quite realize until I read those reports that each principal in each school in the state of Wisconsin apparently has considerable discretion about when and where and how attendance is measured. You probably suspect as well as I do that the smart kids who know that attendance is taken early in the morning go to that particular class and then skip out, and those who know when the--you get the general idea.
We in the education sector need to understand far more about the quality of these potential indicators. Some of the quality problems are absolutely invisible. I would not trust average daily attendance for lots of the schools in the United States, for these kinds of reasons. I'm not sure what else.
Mr. Peterson: Judy.
Ms. Gueron: Disagreeing on the desired outcomes should not be an obstacle to doing an evaluation. An evaluation doesn't substitute for politics. Ultimately there are political choices in which people weigh the importance of different outcomes. The key is to know, and to have reliable estimates of, the effects on those outcomes. In the welfare field, for example, there is a great difference in what people think should be the goal of policy. Should it be reducing dependency? Should it be reducing poverty? Should it ultimately be helping children, which was the goal of creating a federal welfare program in the Social Security Act in 1935? So people will look at the results of social experiments that measure those different things and reach different conclusions about whether a program worked or not, because they value those outcomes differently. As long as you can try to get that range of outcomes into your study--and sometimes you can't--you can inform that debate. It doesn't mean that we, sitting on this side of the table, will decide what the policy implications of the results are.
I think in education one of the problems is that your ultimate goal takes longer to get to. In a welfare reform program, you might say your goal is immediate employment and reducing dependency, but if your goals relate to child well-being, you are starting down a path where you have got to wait a while to find out what is really happening. In education, you might say that your goals are ultimately related to many things--graduation, what happens after graduation, et cetera--the things that the Perry Preschool study ultimately looked at. Okay. That takes time.
The clever thing in doing such studies is to try to have some answers up front, so that you don't launch your study and say come back in 20 years and fund me for 20 years--because you are dead in the water then. Maybe, by looking at early impacts that, as Bob said, seem to be reliable keys to future outcomes, you can keep funders interested long enough so that you can get to those longer term outcomes.
Mr. Peterson: In the back.
Participant: Tom Toch. Regarding your third rationale in defense of randomized study: 1) please explain it more simply, 2) talk about how this plays out in early childhood class size and in the Milwaukee experiment.
Unidentified Speaker: Making a program whose effect is zero look harmful, or making a program that has modest positive effects look as if it has zero effects. Those negative biases don't always crop up, but they do crop up in studies where you can make comparisons between properly randomized trials and nonrandomized trials. In the case of the choice experiments in Milwaukee, the initial results, based heavily on passive observational and quasi-experimental comparisons, suggested that the effect on achievement was pretty close to zero. On the other hand, Paul should be answering part of this, because my understanding is that when one focuses only on the randomized experiment that is embedded in that larger study--when one bases the estimate on those really comparable comparison groups--one comes up with a positive estimate, that is to say, a notable, nontrivial effect of choice on the kids. In that case, and maybe a half dozen others, one can find negative biases. The fact that they appear is very distressing. Certainly, in the early analyses of 20 years ago--the Head Start evaluations, the evaluations of the Comprehensive Employment and Training Act--coming up with negative effects of Head Start was embarrassing, implausible, counterintuitive, and probably wrong. At that time we didn't have any randomized experiments to use as the gold standard. The big problem here is that sometimes those estimates are going to be positively biased and make programs look good when they are merely useless, and sometimes there will be no bias. The problem is that we can never know whether the models that we construct, the matching variables that we use in the absence of these randomized experiments, are going to be the right ones and produce an accurate estimate.
Participant: Tom Toch. So in the case of early childhood class size, Eric Hanushek's work had these biases built into it?
Unidentified Speaker: It is certainly possible. Now, one thing that one has to do is actually get the raw micro records from these studies and reanalyze the data to understand exactly how and why these biases occurred. Sometimes it is the model; sometimes it is differences in the sample, and so on. Incidentally, there is a very good piece in the January 1999 issue of the British Medical Journal, done by some people who were tied up with the Cochrane Collaboration in health that Fred mentioned earlier, in which they roll through about, let's see, 80 comparisons of this sort, in which a randomized clinical trial was done on, say, a particular surgical technique and nonrandomized trials were done on exactly the same technique, same target population, and the like. Their interest lay partly in understanding whether the biases are predictable, and the extent to which the biases are perhaps zero--whether it is sometimes the case that you can run a nonrandomized trial and not worry about the randomization. The bottom line of that carefully put together set of analyses is that if you are doing a nonrandomized trial, you cannot tell how close its estimates of the effect size will be to those from a randomized trial. There may be other areas, for certain kinds of people and certain programs, in which the bias is zero. The problem is that we have to accumulate knowledge to understand whether that is empirically the case.
Mr. Peterson: Tom.
Mr. Loveless: The mic.
Mr. Cook: You are mentioning Hanushek and the class size studies. It is very interesting, because if you are an economist, Hanushek was the predecessor to the class size experiment. If you come from education, the predecessor was in fact a series of meta-analyses done on class size, beginning with Glass, which actually showed the same effect as the randomized experiment. Now, you could see the randomized experiment that was recently done, STAR, as just another data point in that meta-analysis, one which lines up with the past meta-analyses--which is in fact empirically the case. You can also see it, though, the way that I prefer to see it: as a data point that creates a lot more certainty than the prior meta-analyses, most of which consisted of fairly junky quasi-experiments that individually carried little weight, so that the meta-analysis wasn't as convincing as it might have been. The randomized experiment was more convincing, but to me the randomized experiment was convincing because it had the context of the meta-analysis to help buttress it.
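The pooling Cook has in mind can be sketched with a standard fixed-effect, inverse-variance calculation: each study's effect size is weighted by the inverse of its squared standard error, and a single large, precise randomized trial added to a set of imprecise quasi-experiments mostly confirms the pooled estimate while tightening it considerably. Every number below is invented for illustration; none comes from the class size literature.

```python
# A hedged sketch of fixed-effect, inverse-variance meta-analysis:
# pool several imprecise effect-size estimates, then add one large
# randomized trial and watch the pooled estimate tighten.
import math

def pooled(estimates):
    """Fixed-effect pooled estimate and standard error.
    `estimates` is a list of (effect_size, standard_error) pairs."""
    weights = [1 / se ** 2 for _, se in estimates]
    est = sum(w * e for w, (e, _) in zip(weights, estimates)) / sum(weights)
    return est, math.sqrt(1 / sum(weights))

quasi_experiments = [(0.10, 0.15), (0.30, 0.20), (0.15, 0.18), (0.25, 0.22)]
print("quasi-experiments only:      %.2f (SE %.2f)" % pooled(quasi_experiments))

randomized_trial = (0.20, 0.05)      # one large, precise experiment
print("adding the randomized trial: %.2f (SE %.2f)" %
      pooled(quasi_experiments + [randomized_trial]))
```

This is the sense in which the experiment and the meta-analysis buttress each other: the experiment dominates the pooled weight, while the older studies provide the context that the new data point lands where the accumulated evidence already sat.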
Mr. Peterson: Fred, do you want to make any comments on the class size issues because I know that you worked on that?
Mr. Mosteller: I thought that the STAR experiment was very unusual, both in its origin and in its execution, in having randomization for both the students and the teachers, and I'm very happy that reanalyses by others seem to buttress the results of the original investigation, so I'm very pleased with that. On the other hand, a lot of people don't realize that some of the conditions that were put on the original experiment are being violated in trying to use the class size idea in field practice. For example, every school that was admitted into the experiment was supposed to have its own rooms for the additional classes, and every school had to have enough students to round out a good comparison and enough teachers to actually staff the enterprise. I think those features of the experiment have not always been appreciated by those who have tried to follow the leaders who did it.
Mr. Peterson: Actually, the class size experiment is instructive in another way, because it was preceded by a lot of econometric analysis on large scale data sets which routinely came to the finding that class size did not seem to have much of an effect one way or another, or at least the results were highly inconsistent from one study to the next. The Tennessee STAR study, which did use the concept of a randomized experiment, actually, I think, found the largest effects in the initial year of schooling, whether it was kindergarten or first grade, and didn't find additional effects beyond that, but it has had a lot of power in the debate over class size. You might even say that it has been critical to the President's making this an issue in the recent couple of years, because it is a randomized experiment. It is just a good example of how powerful a randomized experiment can be in the political debate, but I know Tom has another comment. Tom.
Mr. Cook: I think the class size experiment is very important for another reason, which is that, if we have truth in advertising here today about experiments, it reveals an important limitation of experiments as they are often done. When Tennessee and California borrowed the idea and brought it to scale within the states, they quickly found out that smaller classes necessarily entail more teachers. There is a national teacher shortage, so where do you get the teachers from? The richer districts poached them from the poorer districts, and those states went out poaching teachers from other states to bring them in. Then it necessarily also means more buildings. So where do you get these buildings from? Now, some of that is rehabilitating older, dilapidated buildings. This is a case study of the fact that many experiments are done as demonstrations on a smaller scale, and bringing them to scale within a state or within a nation can entail dynamics other than those represented within the scale of the experiment. You have to be very careful about that and avoid prematurely [racing] from demonstrations into national things. There have to be stages in between. Slow and leisurely.
Mr. Peterson: Judy, you know a lot about this issue of moving from small scale to larger scale programs and also how the information from experiments gets used in the policy making process. You want to just make a couple of comments on that.
Ms. Gueron: Well, when we started doing social experiments in 1974 with an employment and training program, it was within a small community based environment. This is somewhat different issue but it was the view that the public--you couldn't control the environment enough and you couldn't convince people to do an experiment at a mass scale. You had to do it small. You learn small and then implement large. And that did have the problem that you learned within small units and the question was whether the generalized ability would be there and whether the subsequent way in which the program would be tested would be in similar small units and were they replicable. One of the innovations in social experiments in the 1980s and now is that they are actually often conducted in very large scale, the real world operating context of social programs so that for example in the welfare and employment and training fields, you rarely in some of these experiments co-opted the system and the leaders of the system into innovating ways where the entire system has used its resources and allocated them in a random assignment way so that you have, you know, across the state of California, 35,000 people involved in a welfare study in eight counties so that the entire intake process, not some small version of it, but the real program is the laboratory for doing a social experiment. That is very convincing because a question always does come up about the replicability.
A social experiment promises what is called internal validity. It promises that the experimental group and the control group don't systematically differ. That's its beauty. That's what you can explain to people. It is not "trust my research" or "trust my sophisticated statistical techniques." I just ran a lottery, just like in a toothpaste taste test, I compared the groups over time, and the study wrote itself. That's internal validity. External validity means, well, you go from your study and then you try to generalize to the larger world. That is very different and it does raise questions. How did you choose your little sample? Can you replicate it? Is it typical? Were the people in the program typical? Who is it generalizable to? Some of the larger experiments, and the relationship between them and the smaller ones, can help address this, but some of the more recent experiments at large scale pretty much directly look at those questions, and I think there these challenges are less important.
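A minimal sketch, in Python, of the lottery Ms. Gueron describes (the applicant pool, the 2-point program effect, and the sample size are all hypothetical, chosen only to illustrate internal validity):

    import random
    import statistics

    random.seed(0)

    # Hypothetical applicant pool: each baseline score stands in for everything
    # unmeasured about a person (motivation, prior skills, family circumstances).
    baselines = [random.gauss(50, 10) for _ in range(10_000)]

    # The lottery: each applicant is assigned to program or control at random.
    in_program = [random.random() < 0.5 for _ in baselines]

    # Outcomes: everyone keeps their baseline; program members get a
    # hypothetical 2-point boost (the thing a real study would not know).
    outcomes = [b + (2.0 if t else 0.0) for b, t in zip(baselines, in_program)]

    program = [y for y, t in zip(outcomes, in_program) if t]
    control = [y for y, t in zip(outcomes, in_program) if not t]

    # Because assignment was random, the groups were alike at baseline on average,
    # so the difference in mean outcomes estimates the program effect (about 2).
    print(statistics.mean(program) - statistics.mean(control))

External validity is the separate question the sketch cannot answer: whether an effect estimated in this particular pool would hold in other places and for other populations.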
Mr. Peterson: So there are lots of hands up here. Tom paid for all of this, so I guess he gets the first question. Tom.
Mr. Loveless: I don't want to ask a question. I want to make two points about the California class size initiative. One is that an alternative path California could have taken, if you think about it, is to randomly assign districts to the class size experiment, collect the data over three or four years, analyze it, and find out whether or not it has been effective. The reason why that is important is really the second point I want to make. We will never know if the California class size initiative is working or not. It costs a lot of money, hundreds of millions of dollars or a billion dollars a year. The reason why we will never know is that the California class size initiative is being carried out in the context of a whole bunch of other reforms. Imagine if reading scores go up next year: the people who implemented a phonics reading program are going to say that is the reason, when it wasn't necessarily [inaudible]. We will never know. If math scores go up, we will have the people who pushed for a more traditional math program in California claiming the credit. We won't be able to untangle these things because of the way that California went about it.
Ms. Gueron: Just to pick up on that for one minute. There are examples where that kind of follow-through has occurred, and it has a particular beauty. I'm thinking, for example, of a study we did in the 1980s of the welfare-to-work program in California, statewide. It showed that one particular approach in one county seemed to be more effective than what other counties were doing. Los Angeles, which was shown not to be particularly effective, looked at what was going on in that county and said, we want to change what we are doing, adopt those practices and implement them within Los Angeles. They went the next step, however, and also said, we would like to do a subsequent random assignment study of what we are doing, to see if this adaptation, in fact, was any more effective, because it was a good question to ask whether what you learned in Riverside County, a smaller, much more suburban county with a less difficult labor market and less difficult-to-employ people, was about the county versus the practices. Subsequently a social experiment was implemented in Los Angeles. First question: could they change what they were doing? Yes. Second question: did it seem to be more effective? Yes. The study is still going on, but it showed a process of learning that could have been followed in this example and would have led to much more conclusive results.
Mr. Peterson: Back here. Yes, sir.
Participant: Richard Whitmire, Gannett. Looking at the standards and assessments movement, how much is based in random assignment?
Mr. Peterson: Anybody have a comment? Tom.
Mr. Cook: I don't know. I don't know of any.
Mr. Peterson: You want to follow up, Richard, with elaboration?
Participant: If you don't know of anything [inaudible].
Mr. Cook: None that we know of.
Mr. Peterson: I mean, it is a really excellent example. I can also give you another one. There is a foundation out on the West Coast that was prepared to fund a major phonics initiative in a school system, and in fact did do this, and then wanted to know how to evaluate it, and when the idea of a randomized experiment was proposed, the answer came back: no, this was such a valuable initiative that no child in this school district should be denied the opportunity to have access to this new training program. That's the way that people feel about high stakes testing. It is going to work and therefore we don't want to exempt anybody from this great benefit.
Yes, ma'am.
Participant: Martha Rota. How has research factored into HR2 and Title 1? What will happen in the future?
Mr. Peterson: You know, Title 1, does anyone want to talk about Title 1 and what the history of the evaluations of Title 1 is and where we are?
Mr. Boruch: How long do we have?
Mr. Peterson: Well, I think you really do need to address that. Bob, do you want to say something about that?
Mr. Boruch: Let me say more generally that the planning and evaluation office in the Department of Education is responsible for executing and trying to design high quality evaluations in the Title 1 arena, as well as in a variety of other education arenas sponsored by the federal government. In the past, folks in that office have actively advocated for doing high quality randomized controlled trials, and indeed in things like Even Start, which is a kind of variation on Head Start but for families, and the Upward Bound program, the TRIO programs and the like, high quality randomized experiments have been done. There aren't a lot of high quality randomized experiments in the Title 1 compensatory education arena, but that is not for want of agitating for it. The problem is that to do a good job on that takes a lot of money. We are talking about $5 or $10 million experiments if you want to understand how to improve that Head Start program or enhance it appreciably at roughly the same cost, to benefit the kids. These are not inexpensive things and it is hard to get money for them.
Let me mention, to answer this gentleman's question on the business of high stakes testing, there is one remarkable experiment that didn't occur in the United States but occurred in Ireland, on the introduction of standardized testing in Irish schools. The Irish, 15 or 20 years ago--do you remember that?--were sufficiently concerned that standardized testing would remarkably increase the distance between teachers and kids, would make kids nervous, paranoid and generally unsettled, that they decided to do a randomized experiment in which schools were randomly assigned to either achievement testing, standardized or not, and then they measured the socks off the kids, the teachers and so on and so forth. The negative effects were actually inconsequential. That's why Ireland now has standardized testing. That's a single experiment.
Mr. Cook: But that wasn't high stakes.
Mr. Boruch: Not too high stakes.
Mr. Cook: This is about testing per se rather than the stakes for testing.
Mr. Peterson: Let me just say that, to me, one of the concerns I have is that I haven't seen much of a will on Capitol Hill yet to see that the programs Congress is funding are evaluated using this kind of research strategy. They do call for evaluations. They do want evidence based research, but they have hesitated to say, yes, and it must be gold standard research such as the FDA demands. We don't have anything like the FDA. In medicine, the FDA says you don't get approval unless you have survived the gold standard. We have yet to have any agency of the federal government, whether it is Congress or an executive branch agency, or any of our state governments, say in the field of education, yes, before you innovate, you've got to show that you've got an effective program here that deserves implementation on a wide scale. So the FDA is the powerful instrument by which the concept of the randomized experiment has shaped our whole understanding of the appropriate way to evaluate innovative procedures in medicine. Judy.
Ms. Gueron: I want to piggyback on that, because I don't want the tone of what we are saying to sound complacent--like it can be done, it has been done elsewhere, it will be easy to do. It is a real uphill battle, and the importance of what Paul just said is really critical here. The reason that our organization was able to do many social experiments in the welfare field since 1981 was a provision in the Social Security Act. It is no longer there, and that is the reason why no major social experiment has been launched since the welfare reform bill in 1996. It was a provision called section 1115, which said that if states wanted to innovate beyond the regular Social Security Act welfare program, they could get waivers to the regular program, but they could only do it if they evaluated what they were doing. They could only do it then. Throughout Democratic and Republican administrations, people at HHS took that "only if you evaluate" not just seriously but increasingly viewed it as requiring rigorous evaluation, random assignment evaluation, and that led governors to make a whole fuss about how they had to come to the federal government on bended knee to beg to do innovations. It wasn't just begging to do innovations. It was some of that, but they also had to evaluate what they were doing, and it was that combination of clout--yes, you could innovate, but you had to do a study--that was the entry point for launching these evaluations and then letting you learn whether the innovation mattered.
I remember testifying at a welfare hearing of the Senate Finance Committee and having a senator, who will remain unnamed, say how terrific it was going to be because we were going to have a thousand flowers bloom, or whatever, and we were going to learn from it. You don't learn from blooming flowers. You learn if you go after those blooming flowers with a rigorous evaluation technique. Otherwise they bloom in silence and then they die, and you do not know whether any of it has made a difference.
Mr. Peterson: Yes, sir.
Participant: Christopher Jencks' work on the black/white gap--were there unidentified variables? Are there variables that are unidentified or difficult to measure? When designing experiments, are there particular factors which should be included or excluded?
Mr. Peterson: Your question brings up the point that some things are more easily examined by this technique than others. Now, teacher quality is a wonderful example--maybe Tom can think of a way of studying teacher quality, but you would have to randomly assign good teachers to some students and bad teachers to others, and you would have to know a priori who the good teachers are and who the bad teachers are. That is a really tough one to set up. Class size is an easy one. You know, you can say these classes are going to be smaller than those classes. It is fairly simple. Actually, school choice or vouchers is relatively simple, because you can give a voucher to some people who want them and not to others who also want them. There are some things that are easier to do with this technique than others. Tom, did you want to add something to that? Fred, you do.
Mr. Mosteller: Yes, I would offer a very specific one which is indicated from analyses of the curriculum in mathematics. Our curriculum in mathematics takes up many more topics than most of the high scoring countries do and it may well be that if we changed our curriculum to take up fewer topics and take them up in a little more depth, that we would actually make some progress. So this is a good example of something that you could change; that is, you could hope to get a lot of people to agree to change their curriculum and you could hope to do a little randomization and find out whether that particular thing would matter or not and I think it could matter but I don't know.
Participant: How much background noise is there that can't be accounted for?
Mr. Peterson: Well, that is precisely why a randomized experiment is incredibly valuable, because if you have a universe of schools, say, and you randomly assign them to the test and control conditions, then you can assume, if you have a large enough number of cases and a large enough number of schools participating in the experiment, that the two groups are going to be equivalent in other respects. I mean, that's the power of the concept of randomization; without that, you never know whether it is some other factor. You have listed a whole bunch of other factors that could very well be the cause of any result you found. So the concept of randomization is to even out all those other factors and focus in on the one factor that you are particularly interested in at this point in time. Howard, you wanted to make a comment? The time has passed to do that. Yes, sir. Let's bring in a couple more and then Tom.
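A small sketch of the "large enough number" point, again with invented numbers: randomly split a set of schools into test and control groups and check how far apart the two groups end up on a background factor nobody measured or controlled for (here a hypothetical school poverty rate).

    import random
    import statistics

    random.seed(1)

    def baseline_gap(n_schools):
        # Randomly split n_schools into test/control; return the absolute gap
        # in mean poverty rate (a stand-in for any unmeasured background factor).
        poverty = [random.uniform(0, 100) for _ in range(n_schools)]
        treated = set(random.sample(range(n_schools), n_schools // 2))
        test = [poverty[i] for i in range(n_schools) if i in treated]
        ctrl = [poverty[i] for i in range(n_schools) if i not in treated]
        return abs(statistics.mean(test) - statistics.mean(ctrl))

    for n in (10, 100, 1000):
        gaps = [baseline_gap(n) for _ in range(200)]
        # The typical chance imbalance shrinks as the number of schools grows.
        print(n, round(statistics.mean(gaps), 2))

With 10 schools the two groups can differ by a dozen points or more just by luck; with 1,000 they come out nearly identical, which is what lets the one manipulated factor stand out.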
Mr. Cook: Paul is right that the randomized experiment distributes these noise factors equally across conditions. But if the intervention you are looking at only works in certain kinds of schools--if it is a contingent causal relationship rather than a generalized one--then it's obviously very important to establish the boundaries of the causal relationship you are studying. Now, that requires strong theory, or luck, and a sampling design that allows you to get the variation you want in the kinds of schools, and that requires a lot of thought. My guess is that most of our conversation today has been about main effects of interventions rather than contingent causal relationships. They are much tougher. The stronger the theory, the easier it is.
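A sketch of the contingent-effect point (school types, effect sizes, and counts are hypothetical): the same randomized data analyzed overall and then by kind of school, showing how a main effect can average together a subgroup where the intervention works and one where it does not.

    import random
    import statistics

    random.seed(2)

    rows = []  # (kind_of_school, treated, outcome)
    for kind, true_effect in (("urban", 4.0), ("suburban", 0.0)):  # hypothetical
        for _ in range(2000):
            treated = random.random() < 0.5
            outcome = random.gauss(50, 10) + (true_effect if treated else 0.0)
            rows.append((kind, treated, outcome))

    def estimated_effect(subset):
        t = [y for _, tr, y in subset if tr]
        c = [y for _, tr, y in subset if not tr]
        return statistics.mean(t) - statistics.mean(c)

    print("overall:", round(estimated_effect(rows), 2))      # about 2, the average
    for kind in ("urban", "suburban"):
        sub = [r for r in rows if r[0] == kind]
        print(kind + ":", round(estimated_effect(sub), 2))   # about 4 and about 0

Seeing the split, of course, requires having sampled both kinds of schools in the first place, which is where the theory or luck comes in.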
Mr. Peterson: Yes.
Participant: My name is Jerry Sroufe, S-r-o-u-f-e, AERA. Given the breadth of problems, how does one establish a hierarchy of importance in the need for randomized studies?
Mr. Peterson: Those are two very good questions and the first one I don't think we should pass by because Tom Cook has written a lengthy paper on why it is that we don't have more randomized experiments in the field of education.
Mr. Cook: It is a paper on the culture of research in schools of education as it has evolved over the last 35, 40 years. I'm not going to give it to you in great detail; I'll send you a copy if you like. But basically it charts the move from quantitative to qualitative research preferences and how that was justified on ontological, [inaudible] and epistemological grounds. It documents the growth of the sense that knowledge growth should be based upon craft principles and not scientific principles. It documents the fact that the quantitative researchers within schools of education were working on some very important topics--how to conceptualize and measure individual change, hierarchical linear modeling and the like--but none of them were working on randomized experiments and doing them to show that they could be done. So the culture evolved over 30 years, with a literature led by Lee Cronbach and many, many others--Guba, Lincoln, Stake, Stufflebeam, Scriven; probably some of you know the names--all of whom dumped on the [inaudible] foundations of quantitative research. Many of them took the randomized experiment as the epitome of quantitative research, and so what you have now is a situation where, if we were to rely for randomized experiments on the people who call themselves education evaluators in schools of education, we could not get it done. They are not there. There is not the cadre of people with the knowledge base to implement these very difficult things that Judy has talked about. She and her staff have 20 years of experience in a very difficult crucible, and that experience--the craft knowledge she has acquired, if you want--is very important to getting high quality randomized experiments implemented, and schools of education do not have that.
Now, you might say it is not a problem, because we can get contract research firms to do this or schools of public policy to do that. Why do we need schools of education? Well, I suspect that one needs their knowledge of what goes on in schools, because the implementation of the innovations is crucial. So we need that knee-deep-in-the-mud knowledge of school realities. But we also need it because many of the experiments that could be done will be small local ones for which you will not hire MDRC or the Urban Institute or Abt, and those smaller ones won't get done unless the person power and the culture change somewhat.
By the way, you just accepted two of my randomized experiments in your journal. Thank you.
Mr. Peterson: Judy.
Ms. Gueron: I don't think this issue in education--the sort of hierarchy of where you put your bets--is really very different than in other fields. I mean, funding agencies, foundations, Congress, people have to make bets, and they need to think about what the important questions are, where it is feasible to do such a study, and whether the question is still going to be around at least five years from now, so that when we have the answers it will matter to people. Those decisions get made, and I wouldn't say necessarily all that rationally, but they get made. An experiment that we are doing in education, basically we marketed it. We thought this was important. We thought it was feasible. We thought it would be a question down the line. We went to the Department of Education and a bunch of foundations, and we got enough response to be able to go forward. That's one way such a thing can get launched. Others can actually come out of the Department of Education, in whatever its decisionmaking process is, where people sift through alternatives and actually issue a request for proposals to do a research study. So, this is hard, but I don't see any qualitative difference from the decisions other federal agencies make in other areas.
Unidentified Speaker: What you are really saying is you are taking advantage of the opportunity to [Inaudible]. That is a very pragmatic [inaudible].
Unidentified Speaker: Obviously we have a problem here. We would need some kind of a basis for making a case or something as opposed to just [inaudible]. I don't see that hierarchy.
Mr. Peterson: Mathematica has done one randomized study of what effect Upward Bound has on college going, and that is a well designed study funded by the Department of Education, though maybe not by OERI. You might be right about OERI, but it is by the department and it, you know, meets these criteria. But your basic point is absolutely correct. There is not enough leadership coming out of the Department of Education, and it is there where you need to have somebody say these are the topics that need to be addressed. You know, one of the extraordinary facts is that the amount of money we spend on evaluation and research in education is a pittance compared to the size of the industry and compared to the investments in research in other industries, medical research being the most outstanding example. But then you have to ask the question, should you invest more in education research if you are not going to use the kind of scientific principles that have been established elsewhere? So it is sort of a catch-22 situation, because you really can't ask for the funds until you come up with enhanced standards for the kind of research you are going to undertake, and yet you can't do that kind of research without a lot more funding than is currently available. Tom.
Mr. Cook: Can I just say something about that, because when I read the OERI reports, it is not only an absence of randomized experiments; there is an absence of really well designed quantitative alternatives to randomized experiments. Now, I write on quasi experiments and I know a little bit about it. I would say that the quality of the quasi experiments represented there is from modest to abysmal. There have been improvements over the last 10, 20 years in how to do these fallbacks, these second best options to randomized experiments, but I rarely see them used. I rarely see studies with multiple preintervention measures at different times. Right. I rarely see them with well matched, stably matched control groups, matched on the outcome variables. I rarely see sophisticated statistical analyses of them. I think the quality of the quantitative alternatives to randomized experiments--and I prefer randomized experiments, but sometimes we have to do alternatives--the quality of those alternatives is pretty bad in standard education practice, alas.
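A compact sketch of one such fallback design, with invented data: treated schools are matched to comparison schools on their pre-intervention scores, and the effect is estimated as a difference-in-differences, the treated schools' change minus their matched comparisons' change. This is only one of the second-best options Mr. Cook has in mind, not a recipe.

    import random
    import statistics

    random.seed(3)

    def school(level):
        # Three pre-intervention scores and one post score around a school's level.
        pre = [random.gauss(level, 2) for _ in range(3)]
        post = random.gauss(level, 2)
        return pre, post

    # Hypothetical data: treated schools start higher and gain 3 points afterward.
    treated = []
    for _ in range(50):
        pre, post = school(level=random.gauss(60, 5))
        treated.append((pre, post + 3.0))
    comparison_pool = [school(level=random.gauss(55, 5)) for _ in range(500)]

    def did(treated_school):
        t_pre, t_post = treated_school
        # Match on the pre-intervention outcome trajectory (here, its mean).
        match = min(comparison_pool,
                    key=lambda s: abs(statistics.mean(s[0]) - statistics.mean(t_pre)))
        m_pre, m_post = match
        # Difference-in-differences: treated change minus matched-comparison change.
        return (t_post - statistics.mean(t_pre)) - (m_post - statistics.mean(m_pre))

    print(round(statistics.mean(did(t) for t in treated), 2))  # close to 3 here

Unlike random assignment, this only recovers the true effect if the matched schools would have changed the way the treated schools would have changed without the program, an assumption the randomized design does not need.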
Mr. Peterson: Bob has a little graph for us.
Mr. Boruch: I found another slide in my bag. This gives the portfolio of contracts let by the Department of Education's planning and evaluation service, not OERI. Planning and evaluation lets contracts for things like surveys of needs or passive observational surveys--could be [inaudible] educational progress and so on. About 24 percent of the total number of contracts awarded are actually for studies of that kind, passive observational studies. Most of them, two-thirds, are program implementation, performance measurement and monitoring kinds of studies: to what extent is Title 1 being delivered to the kids to whom it should be delivered, and so on. About 37 percent of the contracts are, quote, impact evaluations. Some of them are well designed, or apparently well designed, second best quasi experiments, but only a minority--let's see, 4 percent of the total--are actually impact evaluations that are randomized experiments. These happen to be big ones: the Upward Bound experiments, which are quite good, the Even Start experiments, and maybe three or four others. But they have been in the minority because the money hasn't been there, partly because the drive from the Hill, I don't think, has been sufficient to push the department into doing still more. Again, it is this office, rather than OERI, that is responsible for these things.
Mr. Peterson: Yes.
Participant: Darcy Olson. How many times should a good study be replicated before it's used to establish policy?
Mr. Peterson: There is a biblical question like that. It is escaping me, but.
Mr. Cook: In my view, it is not the number of times it is replicated. It is the heterogeneity of the kinds of school districts, kinds of schools, kinds of teachers, kinds of children across which it has been replicated that is important for me, because then you get a sense of whether this intervention is one that works in some contexts while in others it might even be detrimental. Now, I think none of us sitting here would uphold the Perry Preschool Project as a great example of a technical randomized study. It is a great example of political impact. I know of no attempted exact replication in the United States, but the British government, my own government, is so enamored of the Perry Preschool Project that they are now replicating it in 256 communities in the United Kingdom because, they think, it works so dramatically well into early adulthood on very important outcomes. So they took a randomized experiment and ran with it. I would say they ran with it prematurely, but they certainly ran with it very, very big.
Mr. Peterson: So are any results in from that or?
Mr. Cook: No, it is just being launched right now. The money has been appropriated. The communities have got it. They are just setting it up.
Ms. Gueron: Not as an experiment, as a program.
Mr. Cook: Some of us are working to try to get it done experimentally but that is not the plan.
Mr. Peterson: Judy, did you have another comment that you wanted to make?
Ms. Gueron: It is the same point. I think one of the reasons that the welfare studies in the 80s had such a big impact on Congress and policy was that the results kept coming in from different communities--West Virginia, San Diego, the rural and urban counties in Virginia, some counties in Maryland--across different environments, and the similarity of the results, or the clear understanding of or indications of why they might differ, made those results much more convincing than a large study in one location. So I think heterogeneity is really key.
Participant: Siobhan Gorman, National Journal. Even if you get good research, what makes you think that you can translate it into good policy?
Mr. Peterson: Judy, I guess you are the one that could best help us with that.
Ms. Gueron: I think that is a very good question. If research ends with writing a report, without paying substantial attention to getting the news out and making it understandable--and I would say not at the end of a study but as you are conducting it, involving the important different constituencies in learning with you--the result is that no one will pay attention to the results, or they will misinterpret them. There are examples of research making a real difference at a state level or at the federal level.
Mr. Peterson: Why don't you give us one or two.
Ms. Gueron: I think an example where it didn't make a difference, which our organization learned from: we did this terrific study in the 1970s, our first project, and we didn't really know what we were doing politically. We knew what we were doing intellectually. We did a wonderful social experiment on supported work, but we didn't have anyone on board. We got to the end of the social experiment and we wanted to give it to someone and say, okay, department of somebody, now take these positive results and do something. They said, no, you do something. Who does something? It is like a hot potato. And in addition, it was run outside of the mainstream delivery system, so they didn't own the results. The system didn't own them. The welfare department didn't own them. It was kind of a closet event run in community based organizations, and someone told me I should leave my job and spend two years going on the road with these results. I thought that was unbelievable and inconceivable. So it kind of fell with a thud. It was out there, a beautiful thing. It set a standard for doing such studies, but it didn't have any impact. We learned from that: boy, you had better pay attention to the systems and involve them early, the advocacy groups, the federal agencies, and get results out through the press as they are coming out, so that people are learning with you, along with the key different and diverse constituencies in Washington. I say Washington--then it was more Washington; now it is states and Washington.
So an example would be the evaluation of the welfare-to-work program called the GAIN program in California, where you come from. We did this large evaluation. The state was very interested in it. The advocates, the state, state legislative staff, people in the eight counties were involved. The results came out every year. People followed those results. There was a lot of competition among the counties. When it turned out that there was a real fiscal crunch, the results were encouraging, encouraging enough so that when there were cutbacks across the state budget, the GAIN program was not cut. It was maintained. People thought they had learned that this program worked, and it made a difference in the political process, and the legislative staff and the actual legislators themselves who had written the bill knew about those results and went on to use them in refining the subsequent program that they implemented under TANF. And as I gave an example earlier, after they got over their jealousy--saying it happened in your county, it could never happen in mine, and you are wrong, and ultimately when the long term follow-up comes in, I will be right--Los Angeles looked at the results, swallowed hard, and said, we weren't running an effective program, we want to adapt our program to what seemed to be effective practice and evaluate it again. So it was a whole process that involved how you set it up, who you worked with at the very beginning, how you got constituencies on board, and then how you marketed and sold the results and how you dealt with the press, et cetera.
Mr. Peterson: Tom. Bob, you think about this, because you've looked at a lot of randomized experiments across a lot of different places, so I want to hear: do any of them ever make a difference? I want you to think about that. While you are thinking about it, Tom.
Mr. Cook: Right, you asked the question: does quality research, in this case randomized experiments, increase the probability of its use? It certainly doesn't guarantee use. I personally know--Bob knows more about this than I do--of no systematic high quality study that has probed the hypothesis that randomized experiments increase the likelihood of use. I think that, from anecdotal evidence like Judy is bringing to bear, and that I could bring to bear from the early childhood education movement, it is indeed the case. Right. If you look at the cumulative effect of the Perry Preschool Project, the Abecedarian project, the Olds home visiting project, all of which were randomized experiments being used as ammunition on the Hill to create more services for zero-to-six kids, then they were used, and they were promoted as randomized experiments that showed the programs worked and were therefore not as open to alternative interpretations as other things.
Now, in that area, the randomized experiments that have been done since have been much less positive. Right. Like the Comprehensive Child Development Program, right. Like the replications of the Olds program when done by real nurses in the real world rather than the nurses coming out of a demonstration program. So here is a case where randomization may contribute to correcting, at the margin at least, an initial change brought about by randomized experiments themselves.
Mr. Peterson: Bob, I just want you to have a chance to comment on this. Do they do any good?
Mr. Boruch: You can turn the question over and ask in what sense research is taken into account in any complex public policy enterprise. Lots of folks are seriously interested in understanding how to do the job better. They include, incidentally, private foundations like Kellogg and William T. Grant, which say that research is not used often enough in public forums at the local, municipal or state level, and that somehow we ought to foster initiatives that help facilitate that process--getting people better informed despite the ferociously large volume of information that is being thrown at us. One way of thinking about this is to take an interrogatory approach to understanding what is going on with respect to the use of experiments or any high quality evidence. That is to say, ask how a General Accounting Office report gets used on the Hill. You have to recognize that first somebody has to be aware of the report in the first instance, the staffer or the Congress person. Let's say the odds on that are one out of four; I'm being optimistic. They have to actually understand the contents of the report, or at least the executive summary, which is often quite good. Let's say the odds on that are one out of four. They have to have the capacity to use it--be on the right committee, be the right staffer, be positioned well enough to do it. Let's say the odds on that are one out of four. And finally, they have to be willing. There have to be some incentives there. Is it going to pay off? Is it accountable? That's one out of four. Once you start attaching numbers to this, then, if you figure all those odds are independent, it is one-fourth to the fourth power: one out of 256. Those are the odds on that particular GAO report getting used by the Congress, if you take that seriously. The quite interesting part here is understanding that at each level there are things that can be manipulated and themselves tested. Awareness: how do I do experiments to assure that people are better informed about randomized experiments, for example, or other forms of evidence that are credible? What about enhancing understanding? Well, there are lots of different ways to write. I can make my prose okay, or I can try to write the thing responsibly--I will try to do as well as Florence Nightingale--or I can use [inaudible] and so on. Similarly, understanding how to generate incentives for people to use better information seems to be important. In this country and in other countries, part of evidence based health policy, for example, is being driven by one kind of incentive, controlled by insurance companies: we are not going to pay for that surgery unless we know it is effective. Right. That drives up the interest in evidence on effectiveness. We don't have as many incentives of that sort in the education sector. We need them. They have to be somehow invented or thought about. It is probably not going to be monetary incentives either. It is going to have to be some other theory.
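Written out, Mr. Boruch's back-of-the-envelope arithmetic (the one-in-four figures are his illustrative guesses, not measurements) is simply

    P(\text{report is used}) = \left(\tfrac{1}{4}\right)^{4} = \tfrac{1}{256} \approx 0.4\%,

the product of four independent one-in-four hurdles: awareness, understanding, capacity, and willingness.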
Mr. Peterson: Well, we could go on and on but our time has run out. Howard, we are going to give you the last word.
Mr. Hyatt: The medical sector has suggested [inaudible]. Lest people leave here with the sense that we have achieved nirvana: the report of last week that has received so much publicity, I think, should remind us of the great obstacles that are involved in, and the distance between, an experiment and the implementation of the results of the experiment. The notion that the expert is the person, and perhaps the only person, to do the proper experiment is very, very graphic [inaudible] to me.
Several years ago, at the center I described, the Center of Public Schools for Public Health [inaudible]. The person who carried it out was a very prominent biologist who had taken a sabbatical to look at this problem because he had some ideas about it. When he demonstrated really quite conclusively that radical mastectomy was not the treatment of choice, it achieved a great deal of publicity. He was invited to appear on the "MacNeil/Lehrer NewsHour," and the other side was taken by the chief of breast surgery at a prominent cancer hospital in New York. And that breast surgeon said to my colleague, my friend, a man named Murray Fox, a man with all kinds of degrees--he said, after Mr. Fox had presented his findings, Mr. Fox, how many patients with breast cancer have you operated on? That was the way in which this issue was dealt with. With respect to a hierarchy of what is important, as Judy said, it is sometimes a matter of the persuasiveness of the advocate for a given inquiry and the interest of a foundation or a government, and there are still many, many questions--many more questions in medicine that have yet to be answered in this fashion than those we have answered.
And finally, with respect to outcomes, implementation and duration: at a meeting that we held in May at the American Academy on this topic, a number of prominent people in education said, you don't understand, we are dealing with the lives of children, and one won't know for decades whether something has worked. I've heard similar objections raised to long-range programs in medicine. The Framingham Heart Study was begun more than 50 years ago, and it was only after a number of years that it generated a great deal of the information that leads us to understand the relationship of heart disease to diet and other factors like exercise. So, you know, I think the real question is how committed we are. First of all, how persuasive are we that only serious experimentation of the sort we are describing will yield the answers we need? How committed are we to seeing that such experiments become a part of the culture?
Mr. Peterson: Thank you, Howard. Thank you, Tom--Tom Loveless of the Brookings Institution--for hosting the event, and I want to thank the panel, and especially Fred Mosteller for providing all of the leadership for this enterprise, and I thank you all for coming.
[End of Event]