Interim Report #2: Some Proposed Comparisons of Matching Algorithms

By Alvin E. Roth

March 12, 1996

Revised 3/29/96

The purpose of this memo is to open the discussion of how alternative matching algorithms ought to be compared, so that the results of the comparison will be informative for the debate about how the NRMP should be run. I will focus here primarily on what we should plan to learn from these comparisons, and only incidentally on technical issues concerning how the comparisons should be conducted in order to extract the desired information. I will focus on comparisons between a pair of matching algorithms (e.g. the current, mostly-hospital-proposing NRMP algorithm and the still-to-be-coded applicant-proposing algorithm; see my previous Interim Report, Roth 1996). However, if we decide to do a three-way comparison (current algorithm, applicant-proposing, and pure hospital-proposing), the same comparisons would apply to each pair.

The comparisons to be discussed are all computational comparisons to be conducted on the Rank Order Lists (ROLs) submitted in previous matches. Such comparisons will reveal differences in how the algorithms produce a match from the given preferences. What they cannot reveal is how the submitted ROLs might change as participants adapt to a different algorithm. However, an investigation of the incentives that each algorithm gives participants to deviate from the given ROLs can give us some indirect evidence about that.

I'll propose three kinds of comparisons: direct comparisons of the matches produced by the different algorithms; comparisons of the incentives they give to submit different rank order lists; and comparisons of how other kinds of policy changes (in particular suggestions to participants about how long ROLs should be) might influence the match produced by each algorithm. In this latter connection, the data from the 1996 match may be worth special study, if it turns out that ROLs are notably shorter than in previous matches.

Comparisons of the different matches produced:

Some of the most important questions we should want to answer can be summarized roughly as follows:

How many students and residency programs are likely to be affected each year by a change in the match algorithm; how much will they be affected; and will the changes fall predictably on some parts of the market?

The first step will be to count how many applicants, and how many residency programs, match differently under the different algorithms, for each year for which ROLs are available.
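As a concrete illustration, this first count might be computed along the following lines. This is only a sketch in Python: the applicant and program names, and the two toy matchings, are hypothetical stand-ins for the matches the two algorithms would actually produce.

```python
# Count applicants and programs that match differently under two algorithms.
# match_a and match_b map each applicant to a program (or None if unmatched);
# all names here are hypothetical.

def diff_counts(match_a, match_b):
    changed_applicants = [s for s in match_a if match_a[s] != match_b.get(s)]

    # Invert each matching to program -> set of residents, to compare
    # program outcomes as sets.
    def by_program(m):
        progs = {}
        for s, p in m.items():
            if p is not None:
                progs.setdefault(p, set()).add(s)
        return progs

    pa, pb = by_program(match_a), by_program(match_b)
    changed_programs = [p for p in set(pa) | set(pb)
                        if pa.get(p, set()) != pb.get(p, set())]
    return changed_applicants, changed_programs

# Toy data: only s2 (and hence programs p2 and p3) match differently.
match_a = {"s1": "p1", "s2": "p2", "s3": None}
match_b = {"s1": "p1", "s2": "p3", "s3": None}
apps, progs = diff_counts(match_a, match_b)
```

Run once per year of available ROLs, the lengths of these two lists give the headline counts.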

If this is a consistently small number--as informal reports of comparisons between program and applicant proposing algorithms in other matches suggest it might be (see Peranson and Randlett, 1995, p. 482; Colenbrander, 1996)--we may be able to simply make a table of all changes. For each applicant who experienced a change, the table would list the rank order of the program he or she matched to under each algorithm, and for each residency program it would list the makeup (by rank on its ROL) of the new residents it matched to.

However, if more than a few applicants and programs have different matches under the two algorithms, comparisons will have to be formalized. A lot of the information regarding applicants could be contained in a transition matrix (of size 15 by 15 or so), for each year to be studied, in which the (i, j) entry would tell us how many applicants were matched to their ith choice under the first algorithm and their jth choice under the second.

From this matrix we could determine the number of students who moved up one position, two positions, etc. on their rank order lists, and whether an applicant who received, say, her 3rd choice under the NRMP algorithm is more or less likely to receive a different outcome under the applicant-proposing algorithm than an applicant who received his 7th choice.
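The transition matrix itself is simple to build. A sketch, with hypothetical rank data (each applicant's rank received under the two algorithms):

```python
# Build the transition matrix described above: entry [i-1][j-1] counts
# applicants who got their i-th choice under algorithm 1 and their j-th
# choice under algorithm 2 (ranks are 1-indexed; data are hypothetical).

def transition_matrix(ranks_a, ranks_b, size=15):
    m = [[0] * size for _ in range(size)]
    for s in ranks_a:
        i, j = ranks_a[s], ranks_b[s]
        if 1 <= i <= size and 1 <= j <= size:
            m[i - 1][j - 1] += 1
    return m

ranks_a = {"s1": 3, "s2": 7, "s3": 1}   # rank received under algorithm 1
ranks_b = {"s1": 2, "s2": 7, "s3": 1}   # rank received under algorithm 2
m = transition_matrix(ranks_a, ranks_b)

# Entries below the diagonal (j < i) are applicants who moved up.
moved_up = sum(m[i][j] for i in range(15) for j in range(15) if j < i)
```

Summing other regions of the matrix (diagonal, above-diagonal) gives the unchanged and moved-down counts in the same way.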

If the number of applicants with different matches is quite large, then one of the most informative simple comparisons will be of cumulative distributions; i.e. under each algorithm, how many applicants are matched to their first choice, to one of their first two, first three, etc.
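The cumulative distribution is equally direct to tabulate; a sketch with hypothetical ranks:

```python
# Cumulative distribution of ranks received under one algorithm: how many
# applicants got their first choice, one of their first two, first three, etc.
from collections import Counter
from itertools import accumulate

def cumulative_by_rank(ranks, max_rank=10):
    counts = Counter(ranks)
    per_rank = [counts.get(r, 0) for r in range(1, max_rank + 1)]
    return list(accumulate(per_rank))

ranks = [1, 1, 2, 3, 3, 5]              # hypothetical ranks received
cum = cumulative_by_rank(ranks, max_rank=5)
```

Computing this once per algorithm per year lets the two distributions be laid side by side.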

The comparisons for residency programs are a little more complex, because residency programs match to multiple residents, and so their match outcome can change in complex ways. Aside from knowing how many residency programs get different matches under the two algorithms in any given year, we will want to tabulate how many residency programs will be affected in only one position, in two, etc., and to note whether some kinds of residency programs are particularly likely to be affected.
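One way to make "affected in one position, in two, etc." concrete is sketched below; the program names and resident sets are hypothetical.

```python
# For each program, count in how many positions its match changed between
# the two algorithms. progs_a and progs_b map each program to the set of
# residents it matched under the two algorithms (hypothetical data).

def positions_changed(progs_a, progs_b):
    changed = {}
    for p in set(progs_a) | set(progs_b):
        a, b = progs_a.get(p, set()), progs_b.get(p, set())
        # Number of slots with a different occupant; max() also covers a
        # program that fills a different number of positions under the two.
        changed[p] = max(len(a - b), len(b - a))
    return changed

progs_a = {"p1": {"s1", "s2"}, "p2": {"s3"}}
progs_b = {"p1": {"s1", "s4"}, "p2": {"s3"}}
delta = positions_changed(progs_a, progs_b)
```

Histogramming the values of `delta` across programs gives the tabulation described above.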

So far I have been speaking of comparisons of two matches produced from the ROLs of a given year. If there are lots of differences in the comparisons from one year to another, we will also want to see if the year to year variance in outcomes is sensitive to the algorithm used.

All of these comparisons can be made from a database consisting of all applicants and residency programs, their ROLs for a given year, and their match under each algorithm, with particular attention to those who get different matches under the two algorithms.

Comparisons of the incentives under different algorithms:

Some of the most important questions about the incentives the different algorithms give to participants can be summarized as follows:

How many applicants could have improved their match by shortening (or otherwise modifying) their ROL, at what risk of being unmatched? And how many residency programs could have improved their match by shortening their lists; by reducing the number of positions they declare; or by better targeting candidates to list; at what risk of having a worse match?

More specifically, we will want to be able to answer questions of the form: if a given applicant shortened his or her ROL by one position, by two, etc., what would be the chance of changing the resulting match, and what proportion of these changes would be positive (a more preferred match) or negative? Some of the technical aspects of designing experiments to answer such questions are described in my initial study proposal (Roth, 1995).[2]
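To fix ideas, here is the shape such a truncation experiment might take, sketched in Python around a deliberately minimal applicant-proposing deferred acceptance procedure: one position per program, no couples, and none of the NRMP match variations. It is not meant to reproduce either actual algorithm, and all names and preferences are hypothetical.

```python
# Minimal applicant-proposing deferred acceptance (one position per program,
# no match variations), used only to illustrate a truncation experiment.

def applicant_proposing_da(app_rols, prog_rols):
    """app_rols: applicant -> ordered list of programs;
    prog_rols: program -> ordered list of applicants."""
    rank = {p: {a: i for i, a in enumerate(prefs)}
            for p, prefs in prog_rols.items()}
    held = {}                              # program -> applicant currently held
    next_prop = {a: 0 for a in app_rols}   # next position on each applicant's ROL
    free = list(app_rols)
    while free:
        a = free.pop()
        while next_prop[a] < len(app_rols[a]):
            p = app_rols[a][next_prop[a]]
            next_prop[a] += 1
            if a not in rank[p]:
                continue                   # program did not list this applicant
            cur = held.get(p)
            if cur is None or rank[p][a] < rank[p][cur]:
                if cur is not None:
                    free.append(cur)       # displaced applicant resumes proposing
                held[p] = a
                break
        # if the ROL is exhausted, the applicant remains unmatched
    return {a: p for p, a in held.items()}

app_rols = {"s1": ["p1", "p2"], "s2": ["p1", "p2"]}
prog_rols = {"p1": ["s2", "s1"], "p2": ["s1", "s2"]}
full = applicant_proposing_da(app_rols, prog_rols)

# Truncation experiment: applicant s1 drops everything after the first entry,
# and we compare the resulting match to the one from the full ROLs.
truncated = dict(app_rols, s1=["p1"])
short = applicant_proposing_da(truncated, prog_rols)
```

In the actual experiments the same bookkeeping would surround the full algorithm under study, and what would be recorded is the rank of the old and new match (better, same, worse, or unmatched) for each truncation length.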

For residency programs, the incentive issues are more complex than for applicants. Because programs match to multiple applicants, their matches can change in more ways, and they also have more ways to behave strategically.

Like applicants, programs can consider how long an ROL to submit, and the experiments to determine how much encouragement each algorithm gives to strategic behavior of this sort will be like those for applicants.

In addition, residency programs can consider how many positions to declare to the match. Since I don't presently know what options programs have outside of the match, I'm not yet prepared to suggest how this issue ought to be explored.[3]

Finally, it is not necessarily the case for programs (as it is for applicants) that we can largely focus our attention on truncation strategies--programs may have other ways to strategically construct the ROL they submit. In particular, we need to explore whether it is profitable for programs, under either algorithm, to identify applicants who are likely to rank the program high on their own ROLs, and to avoid giving high ranks to too many applicants who will not. The reason that targeting the most interested candidates is an issue for programs (even in the program-proposing algorithm) is that, by listing too many candidates, a program with multiple positions may run the risk of competing against itself. By making offers to some candidates, such a program may cause those candidates to issue rejections they wouldn't otherwise have made, causing other programs to make additional offers, and these new offers may prove attractive to applicants the original program would otherwise have gotten (see pp. 145-146 in Roth and Sotomayor, 1990). What we need to determine is whether this is only a theoretical possibility, or whether this phenomenon occurs often enough in the ROLs from previous matches to make it an active issue, and whether it is more of a problem under one algorithm than the other.[4]

In evaluating the results of computational experiments aimed at discovering how the different algorithms give programs different strategic incentives, it will not always be clear how to rank the outcomes, since we can't say, for example, whether a program with 2 positions would prefer to fill them with the candidates it ranked 1 and 12 or with the candidates ranked 5 and 6. But by tabulating the changes we see, we should be able to present the range of possibilities.

Finally, very closely related to the incentive experiments will be experiments aimed at addressing policy issues such as whether some changes should be made in the lengths of ROLs which applicants should be encouraged to submit. For this kind of question, we will be interested in the effect on the match if all applicants, for example, have a maximal ROL size of 10.
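Operationally, such a policy experiment amounts to capping every ROL before running the algorithm. A small sketch (the cap of 10 follows the example above; the ROLs are hypothetical):

```python
# Impose a maximal ROL length before running either algorithm, and count
# how many applicants the cap would actually bind on (hypothetical data).

def cap_rols(rols, max_len=10):
    return {a: rol[:max_len] for a, rol in rols.items()}

rols = {"s1": ["p%d" % k for k in range(1, 13)],   # 12 entries: will be capped
        "s2": ["p1", "p2"]}                        # short list: unaffected
capped = cap_rols(rols)
n_affected = sum(1 for a in rols if len(rols[a]) > 10)
```

Running each algorithm on `capped` and on the original ROLs, and comparing as in the earlier sections, would show the effect of the policy.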

Concluding remarks:

I haven't concentrated here on statistical tests, partly because I think that the question of what kinds of tests will and will not be useful will be better resolved after we have an idea of the magnitude of the effects we are looking at. This is particularly so since the distribution of preferences from which we have data will likely be subject to continuing change as changes in health care financing unfold.

It will be helpful to remember that, even after the size of an effect has been estimated numerically, the question of whether it is small or large remains subjective. But we should get some idea of the relative sizes of different effects, and thus the policy debate may be able to focus on questions such as how to proceed if the changes in match results are relatively smaller or larger than the changes in the incentives which one algorithm or another gives to the match participants.


References:

Colenbrander, August [1996], "Matching Algorithms: A Discussion of Offer-based vs. Application-based Matching," draft, February 22, 1996.

Peranson, Elliott, and Richard R. Randlett [1995], "The NRMP Matching Algorithm Revisited: Theory versus Practice," Academic Medicine, 70, 477-484.

Roth, Alvin E. [1995], "Proposed Research Program: Evaluation of Changes to Be Considered in the NRMP Algorithm," Consultant's Report, http://www.pitt.edu/~alroth/nrmp.html.

Roth, Alvin E. [1996], "Interim Report #1: Evaluation of the Current NRMP Algorithm, and Preliminary Design of an Applicant- Proposing Algorithm," mimeo, and on the world wide web at http://www.pitt.edu/~alroth/nrmp.html.


Endnotes


2. In addition, to design the computational experiments concerning incentives efficiently, it will first be necessary to answer some technical questions about matching algorithms with couples and other match variations. In a simple match, without the NRMP match variations, if an applicant or residency program truncates its list below the (lowest ranked) match it receives, there is no effect on the outcome of the match. But in the more complicated NRMP matches, some blocking pairs need to be considered twice (see my Interim Report #1), and this may no longer be the case. We need to test whether, with all the match variations, items in ROLs below the match point influence the match. We can do this by running each algorithm on a truncated set of ROLs for each year, in which every ROL is truncated at the lowest match received when that algorithm was run on the full ROLs, so that the matches the algorithm produces using full and truncated ROLs can be compared. If the difference between these two matches is none or negligible, standard results from the literature can be used to guide the incentive experiments; in that case only truncations which exclude an individual's or program's (lowest) match will change the match. Otherwise, we will have to consider experiments which investigate longer truncations as well.
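The truncation described in this note can be written down directly. A sketch (the ROLs and matching shown are hypothetical; in the real test the matching would come from running the algorithm under study on the full ROLs):

```python
# Truncate every ROL just below the match actually received, so that runs
# on the full and truncated ROLs can be compared. ROLs of unmatched
# participants are left whole.

def truncate_at_match(rols, matching):
    out = {}
    for a, rol in rols.items():
        m = matching.get(a)
        if m in rol:
            out[a] = rol[: rol.index(m) + 1]   # keep entries down to the match
        else:
            out[a] = rol[:]                    # unmatched: keep the full list
    return out

rols = {"s1": ["p1", "p2", "p3"], "s2": ["p2", "p1"]}
matching = {"s1": "p2", "s2": None}            # hypothetical match result
short = truncate_at_match(rols, matching)
```

The analogous operation for programs would cut each program's ROL below its lowest-ranked matched resident.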

Colenbrander (1996) considered modifying ROLs by deletion rather than by truncation. Since a truncation amounts to multiple deletions, our results will give higher estimates for the probability of getting a better match, as well as for the probability of becoming unmatched, and a lower estimate for the probability of being matched to a lower ranked program, than if we considered single deletions.

3. However, preliminary theoretical results by Tayfun Sonmez suggest that the incentives for programs to withhold positions from the match may be different under the different algorithms.

4. These experiments will involve sampling in a different way than the other incentive experiments. We need to investigate the effect of deleting some applicants while continuing to highly rank others, to see if the program can, by so doing, match to some of the highly ranked candidates it otherwise would have missed.