Aspects of testing (matching, ranking and rating)


Thomas Cool
June 2000, report TC-2000-06-a



This paper explores the links between matching, ranking, rating, Elo-rating, logit modeling, 
Rasch modeling, Item-Response Theory (IRT), voting, aggregation, and the like. 
These issues arise e.g. when testing students, or rating web pages or scientific papers. 
The analysis is at a very basic level. The main results are Mathematica notebooks and a package.


This html file is a conversion of a Mathematica notebook. The 'words in blue' below are links to other Mathematica notebooks. The latter however have not been converted. Get a MathReader at Wolfram Research, and unpack all notebooks from Testing.tar.gz.


Original inspiration
The original inspiration was to see how papers could be rated, in the same way as there is an Elo-rating for chess players. Interestingly, the Tempe declaration (Baker (2000)) then appeared, in which leaders in the scientific community call for improvements in the system of scholarly publishing. The Tempe declaration for example expresses an interest in 'searchability', 'processes for evaluating quality' and 'quality rather than quantity'.

The original inspiration led to some study of Logit modeling, Rasch modeling, Item-Response Theory (IRT), etcetera.

The result of this study is (i) a Mathematica package to deal with some basics, and (ii) a set of notebooks to discuss those basics.

So the results are limited. But my impression is that the whole gives a nice discussion of aspects of testing - matching, ranking and rating - and that others could benefit from reading it. This is also the only conclusion that we will reach - and thus there is no section with the title 'conclusion'.

Testing is to score objects on criteria, or to compare objects by means of such criteria.

Consequences of this definition
There are two obvious applications: one is matching objects - like in marriages - and the other is to rank or rate them - like in determining the winner of a match (game, contest). In common language the word 'match' is used in both senses. There would be a 'matching' if the distance measure is zero.

For us, however, these two meanings of 'match' are a bit confusing, and we should avoid the confusion. We will use the expression 'find the best combination' for matching in the sense of pairing up. For such combining, we would tend to determine the distance between the objects per criterion, and aggregate these distances.

In ranking, we would aggregate the scores and only then compare these aggregates, e.g. to create classifications like Pass / Fail. Ranking would be for an ordinal scale only. If we have an interval scale, so that only the difference between variables has objective meaning, then the ranking turns into a rating.

Ranking and rating can be done deterministically or with an element of randomness. When player 1 wins against player 2, it is possible that this result is deterministic. For example, if the game is 'weight', then player 1 or player 2 is heavier, and this result will be the same in repeated trials. However, in some matches there is only a probability to win. But even with winning probabilities we still can define a 'distance measure'.

It is important to see that there are always criteria. Even if we organise pairwise duels, as in chess, the comparison of the objects (players) still relies on criteria. The criterion for winning in chess is to checkmate the opponent's King. It may be an enormous task to further develop such criteria, and hence we may skip such development and only regard the outcomes of such contests. But we should be aware that this is only a simplification.

A classic example of testing is where the criteria are exam questions. People who take an exam can be seen as being in a contest with the questions. They can also be seen as being in pairwise contests to do better on the exam. This insight links 'testing with criteria' to 'pairwise matches'.

We should be aware of at least three points of uncertainty: (1) The criteria might only be an approximation to the real objective of the test. (2) The way of aggregation might also be subject to discussion. (3) And, more in general, the scores need not be certain but can have a stochastic component. Testing quickly becomes statistical testing.

One possible type of testing is voting. Voting normally gives an ordinal scale which indicates that the object higher on the list beats the object lower on the list. This presumes certainty. Alternatively, there is only a probability of winning. We could still use an ordinal scale to express such a likelihood of winning (such as "A is likelier to win than B").

An important approach is Item Response Theory (IRT). A test consists of subjects responding to items (criteria) on the test. Both subjects and items have a rating. The rating of a subject is interpreted as competence, the rating of an item (criterion) is interpreted as the ease of the question. Then the probability of a proper response depends upon the difference between these ratings.

Another basic idea of testing is the prediction of winning. If we have three persons and we know the winning probabilities in a match between the first two persons, then we would like to make a prediction on the winning probabilities for matches with the third person. To make this prediction, we could use criteria scores on the abilities of the three persons.

Ratings have been used for IQ, sport games, bets or gambling, Social Science Citation Index, etcetera. Once you grow aware of it, it is everywhere.

Huge field
It follows that testing is a huge field. Testing is both an everyday experience, and it requires many topics of research.

We will meet the following issues:
- ranking in general - such as utility theory, where very different inputs are aggregated into one index
- voting theory and the voting paradoxes
- error correction: adjusting an existing score to actual outcomes
- the Elo rating system used in chess - is it a random walk?
- the Item Response Theory (IRT) of psychometrics (e.g. the Rasch model), used for example for school grade point averages and test banks
- Logit modeling in general, in econometrics
- and issues of measurement (log-interval scale).

There are some interconnections that at first may be surprising. For example, students doing a test, 'vote' for the answers.
(a) In voting the interest is in the winning answer.
(b) In testing, the issue is rather whether the student belongs to the winning group - so testing might be seen as inverse voting.

There is also a link to neural networks - where a neuron fires when a threshold is reached.

The recent paper of Rafiei & Mendelzon (2000) looks into rating of internet pages. I have various doubts on their approach, but anyhow the issue is obviously linked.

Compared to the huge literature, the discussion below will be introductory, preliminary, exploratory, and heuristic. Hence the title: 'Aspects'. I am still organising my thoughts. But these preliminary results may be good enough, in particular in terms of programming in Mathematica, so that it is reasonable to present these notebooks to others.

Comparison to the literature
I have basically used Nunnally & Bernstein (1994), Theil (1971), Rasch (1977), and Elo (1978). I have also looked at other material, like Hambleton et al., but more cursorily.

This discussion and programming has been guided as much by my own intuition as by this literature. Since there were no Mathematica programs available, I had the luxury of being able to proceed anyway. If I re-invented a wheel somewhere, that would not matter, since at least I created something novel: the Mathematica programs.

Proceeding like this, eventually various ideas came up, and some insights appeared more important to me than others. You may be more at home in the literature and therefore be a better judge of whether these points are really worth anything. But they are:

(a) Clarification of the relationship between the various links above - and in particular the link between testing and voting and the link between testing on criteria and pairwise matches.

(b) A heuristic estimation procedure: Translate Item Response (IR) matrices first into pairwise matches, and then estimate from the probabilities found. This could be regarded as a condition for consistency.

(c) My conjecture is that IR can be collapsed into matches (Borda) - but that some matches (some pairwise comparisons) cannot be collapsed into IR.

(d) Clarification, for sorted tests, of the relationship between correlation of scores and winning probabilities. Consider two subjects who answer a series of questions. Let subject 1 have probability p of answering correctly, and let subject 2 have probability q of answering correctly. Let p > q, and let there even be the dependence that if 2 answers correctly, then 1 surely will too. This would be a likely outcome for sorted tests, i.e. tests with items of increasing difficulty. I find that the correlation here is [Graphics:Images/index_gr_2.gif] = (1 - p) / (1 - q).

(e) Clarification that the Rasch model is equivalent to a direct relation of the probabilities, so that Logistic transformations are actually superfluous - this is what I call the 'inverse' approach. Clarification that the Rasch model has multiplicative odds that allow one to construct the whole match matrix, which might make the model less attractive.

(f) Enhancing clarity on the probability model of the matches (in relation to voting cycles) - it is very likely that the best choice is the multinomial model (that includes deadlocks).

(g) Clarification to psychometricians that taking budget shares as probabilities likely comes at the cost of neglecting econometrics (income elasticities, marginal utility, etc.). On the other hand, such models for budget shares could be used, purely as mathematical models, to design new models, in order to escape from the restrictions of the Rasch model.
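The setting of point (d) can be checked numerically. The symbol on the left-hand side of the expression in (d) is an image in the source, so the sketch below does not try to reproduce that expression. It only computes the ordinary Pearson correlation of the two 0/1 scores under the stated dependence (if subject 2 answers correctly, subject 1 does too). The function names are hypothetical, and the closed form sqrt(q (1 - p) / (p (1 - q))) follows from this sketch's own assumptions, not from the text.

```python
from math import sqrt

def nested_correlation(p, q):
    """Pearson correlation of two 0/1 scores X1, X2 with
    P(X1=1) = p, P(X2=1) = q and X2 = 1 implying X1 = 1 (so q <= p)."""
    if not 0 < q <= p < 1:
        raise ValueError("need 0 < q <= p < 1")
    # Joint distribution implied by the dependence assumption.
    joint = {(1, 1): q, (1, 0): p - q, (0, 1): 0.0, (0, 0): 1 - p}
    e1 = sum(pr * x for (x, _), pr in joint.items())
    e2 = sum(pr * y for (_, y), pr in joint.items())
    e12 = sum(pr * x * y for (x, y), pr in joint.items())
    cov = e12 - e1 * e2
    return cov / sqrt(e1 * (1 - e1) * e2 * (1 - e2))

def closed_form(p, q):
    """Closed form of the same correlation under the same assumptions."""
    return sqrt(q * (1 - p) / (p * (1 - q)))

print(nested_correlation(0.8, 0.5), closed_form(0.8, 0.5))
```

In this sketch the squared correlation equals (q / p) times (1 - p) / (1 - q), so the ratio (1 - p) / (1 - q) appears as one factor of it.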

Links to other notebooks

It is good to start with the binomial model of passing a multiple choice test by simply guessing. This gives an idea of how probability and testing are related. It is also useful to see the maximum likelihood estimator of the binomial model.
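A minimal sketch of this binomial model, assuming every item is guessed independently and uniformly among its alternatives. The function names and the example numbers (20 items, 4 options, pass mark 11) are illustrative assumptions, not taken from the notebooks.

```python
from math import comb

def prob_pass_by_guessing(n_items, n_options, pass_mark):
    """P(at least pass_mark correct answers) when each of n_items
    is guessed uniformly among n_options alternatives."""
    p = 1.0 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )

def binomial_mle(n_correct, n_items):
    """Maximum likelihood estimate of the success probability,
    which for the binomial model is the observed fraction correct."""
    return n_correct / n_items

# Chance of passing a 20-item, 4-option test by pure guessing.
print(prob_pass_by_guessing(20, 4, 11))
# Estimated success probability after 14 correct answers out of 20.
print(binomial_mle(14, 20))
```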

The Logistic function appears to be a widely used function to determine the probability of answering correctly. The simplest model doing so is the Rasch model, i.e. the Logistic with only a slope parameter. The more competent the student, the likelier it is that the proper answer is given. Likewise, the easier the question, the likelier a proper answer. Since there is no obvious 'zero' value, we get an interval scale, and the difference in ratings becomes the variable that determines the probability of a correct answer. Hence the variable x for the Logistic[x] will be the difference in ratings of the subject and the item. Actually, since the rating scale makes no real distinction between subjects and items, the difference in ratings between subjects can also be used to find the probability of winning a pairwise contest. The Rasch model is consistent, in that the sum of the probabilities of winning and losing is 1.
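The role of the rating difference can be sketched as follows. The names are hypothetical, the slope is fixed at 1, and the item rating enters with a minus sign (so it is read as a difficulty-type rating); that sign convention is an assumption of this sketch.

```python
from math import exp

def logistic(x):
    """Standard logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def p_correct(subject_rating, item_rating):
    """Probability of a correct answer; depends only on the difference."""
    return logistic(subject_rating - item_rating)

def p_win(rating_a, rating_b):
    """Probability that A beats B in a pairwise contest, same scale."""
    return logistic(rating_a - rating_b)

# Only differences matter: shifting both ratings changes nothing.
print(p_correct(1.5, 0.5), p_correct(11.5, 10.5))
# Consistency: winning and losing probabilities sum to 1.
print(p_win(1.5, 0.5) + p_win(0.5, 1.5))
```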

When students are graded, they do not get a rating but a grade on a scale from 0 to 10, which effectively gives the percentage of correct answers, i.e. a percentage on a scale from 0 to 100. Indeed, such percentages, winning probabilities, are even more informative than a rating. We thus could do without an explicit rating for a certain set of models. See the notebook on the 'inverse' approach.

It helps to understand the odds, and their meaning for winning probabilities. It appears that the Rasch model implies that the probabilities in a match matrix can be determined from one row only (multiplicative odds) - i.e. from the winning probabilities of one person only.
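The multiplicative odds can be illustrated with a short sketch. Under the Rasch model the odds that i beats j equal the odds that a reference subject beats j, divided by the odds that the reference subject beats i. The function name and the example row are assumptions for illustration, and all probabilities must lie strictly between 0 and 1.

```python
def match_matrix_from_row(first_row):
    """Build the full match matrix from one row of winning probabilities.

    first_row[j] is P(subject 0 beats subject j); first_row[0] is 0.5.
    With odds o_j = p_j / (1 - p_j), the Rasch model gives
    P(i beats j) = o_j / (o_i + o_j).
    """
    odds = [p / (1.0 - p) for p in first_row]
    n = len(odds)
    return [[odds[j] / (odds[i] + odds[j]) for j in range(n)]
            for i in range(n)]

# Subject 0 beats subject 1 with 0.75 and subject 2 with 0.9.
for row in match_matrix_from_row([0.5, 0.75, 0.9]):
    print(row)
```

The first row of the result reproduces the input, and every other entry follows from it, which is the sense in which one row determines the whole matrix.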

Hence, the Rasch model has an easy interpretation via the inverse approach, and it has easy multiplicative odds. These can be thought to be an advantage, but also a disadvantage. Searching for alternative specifications, however, we discover that these are difficult to find - see the discussion on formal conditions below.

The area of investigation is rather complex. Some papers or book sections appear less clear than one would wish. It is advised to remain critical. Mathematica appears to be useful in that we can quickly test statements.

We have to consider the definition of Item Response matrices, and random creation of these.
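As a sketch of what random creation could look like, assume an IR matrix has one row per subject and one column per item, with entry 1 for a correct response and 0 otherwise, and assume responses are drawn independently with Rasch-type probabilities. This layout, the function name and the seed parameter are assumptions of the sketch, not a description of the routines in the package.

```python
import random
from math import exp

def random_ir_matrix(subject_ratings, item_ratings, seed=None):
    """Draw a 0/1 Item Response matrix (rows: subjects, columns: items).

    Each entry is 1 with probability logistic(subject - item)."""
    rng = random.Random(seed)
    matrix = []
    for s in subject_ratings:
        row = []
        for d in item_ratings:
            p = 1.0 / (1.0 + exp(-(s - d)))
            row.append(1 if rng.random() < p else 0)
        matrix.append(row)
    return matrix

# Three subjects of decreasing competence, five items of rising difficulty.
for row in random_ir_matrix([1.0, 0.0, -1.0], [-2, -1, 0, 1, 2], seed=1):
    print(row)
```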

We can see IR as a match, and have routines for transformation of the IR matrices into match matrices. And we can quickly estimate a Rasch model for a match.
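One possible transformation is sketched below. It rests on an assumed rule: subject i scores a point against subject j on every item that i answers correctly and j does not, items on which both agree are ignored, and the winning probability is the share of decisive items won. This rule and the function name are assumptions; the routines in the package may work differently.

```python
def ir_to_match(ir):
    """Turn a 0/1 IR matrix (rows: subjects) into a match matrix.

    match[i][j] estimates P(i beats j) as the share of decisive items
    (exactly one of the two correct) won by i; 0.5 if none is decisive."""
    n = len(ir)
    match = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins_i = sum(1 for a, b in zip(ir[i], ir[j]) if a > b)
            wins_j = sum(1 for a, b in zip(ir[i], ir[j]) if a < b)
            decisive = wins_i + wins_j
            if decisive:
                match[i][j] = wins_i / decisive
    return match

ir = [[1, 1, 1, 0],
      [1, 0, 0, 0],
      [0, 1, 0, 1]]
for row in ir_to_match(ir):
    print(row)
```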

The above has covered sufficient ground to start considering the various steps for a systematic development of the Item Response model for the probability to win. Perhaps it is more instructive to first regard estimation, before proceeding. Presently we use a heuristic estimation, in which the IR matrix is first transformed into a match and then estimated as above (Rasch model). After this practical example of estimation, we might better understand the formal conditions involved.
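A heuristic estimate of ratings from a match matrix can be sketched as follows. Under the Rasch model the log-odds of i beating j equal the rating difference, so averaging the log-odds over all opponents recovers the ratings up to a common shift. The estimator shown (row means of log-odds, with probabilities clipped away from 0 and 1 by a hypothetical eps) is an assumption of this sketch and not necessarily the procedure used in the notebooks.

```python
from math import exp, log

def rasch_match(ratings):
    """Match matrix implied by given ratings: P(i beats j) = logistic(r_i - r_j)."""
    return [[1.0 / (1.0 + exp(-(ri - rj))) for rj in ratings] for ri in ratings]

def estimate_ratings(match, eps=1e-3):
    """Ratings centred at zero, as row means of the log-odds.

    Probabilities are clipped to [eps, 1 - eps] so that observed
    0 or 1 entries do not produce infinite log-odds."""
    n = len(match)
    ratings = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # log-odds of 0.5 against oneself is zero
            p = min(max(match[i][j], eps), 1.0 - eps)
            total += log(p / (1.0 - p))
        ratings.append(total / n)
    return ratings

# Exact Rasch probabilities give back the (centred) ratings.
print(estimate_ratings(rasch_match([1.0, 0.0, -1.0])))
```

With match matrices obtained from real IR data the clipping matters, since small samples often give entries of exactly 0 or 1.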

An example is the Elo rating in chess. Note that we should be aware of the 'inverse' approach. See this longer discussion and an estimation. What seems elegant at first, appears rather complex and not without its problems. Chess rating adjustment might well be a random walk.
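For orientation, the rating adjustment can be sketched with the logistic form of the Elo expected score that is in common use today. The scale of 400 points, the base 10 and the K-factor of 32 are conventional values assumed here; they are not quoted from Elo (1978) or from the notebooks.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B (common logistic form, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Adjust both ratings after one game.

    score_a is 1 for a win of A, 0.5 for a draw, 0 for a loss.
    Each rating moves by k times (actual score - expected score)."""
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Two equally rated players; A wins.
print(elo_update(1500, 1500, 1.0))
```

The adjustment is an error correction: a rating changes only to the extent that the outcome differs from what the current ratings predicted.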

In economics, we can use budget shares and income elasticities. Is it useful to turn these budget shares into probabilities?

For science: could we have a rating for scientific output?

This discussion leaves some PM points:

Correlation comes in as a direct measure for the closeness of subjects.

BeginDigits: the sample distribution from large populations and arbitrary distributions. With so many items and people tested, is this not a good prior instead of Laplace's uniform distribution?

For a conclusion, we refer to the introduction. This discussion has been explorative, and not conclusive.

Literature and links on the internet

Baker, S. et al. (2000), "Principles for emerging systems of scholarly publishing", Tempe declaration May 10

Cool, Th. (1999), "The Economics Pack, User Guide", self-published

Cool, Th. (2000), "Definition and Reality in the General Theory of Political Economy", Samuel van Houten Genootschap (in particular on my solution to Arrow's problem in voting)

Elo, A.E. (1978), "The Rating of Chess Players, Past and Present", Arco Publishing, Inc., New York.

Freeman, J. A. (1994), "Simulating neural networks with Mathematica", Addison-Wesley

Hambleton et al., "Item response theory: principles and applications"

Nunnally & Bernstein (1994), "Psychometric theory", McGraw Hill

Rafiei, D., & A. Mendelzon (2000), "What is this Page Known for? Computing Web Page Reputations", WWW9 International Conference

Rasch, G. (1977), "Specific Objectivity: An Attempt at Formalizing the Request for Generality and Validity of Scientific Statements" - memo 18 at

Theil, H. (1971), "Principles of Econometrics", Wiley  (my own site)