Relevance Learning Challenge - Collecting Labeled Data Part 2 of 3

In Part 1 we described the importance of training point collection and introduced a Binary judgment methodology. However, for most cases Binary is not good enough. Part 2 explores Multi-valued judgment methodologies, multiple judgments per point, and judge evaluation.

Multi-valued relevance judgments

Another potentially good option is to have judges give an absolute score - say 0 to 100. In effect, a judge directly states the utility of a particular result for a given query. The advantage is that you can capture even very subtle differences in judgments, unlike the overly coarse Binary approach. The problem is creating guidelines and a judgment interface that yield consistent, easy, and fast judgments. Speed and accuracy/consistency are critical - even more so when you have a small training set (< 100K points for average problems). Too much flexibility encourages judges to behave irrationally: in my experience, some judges try to “punish the system” for a bad-looking result by always giving the minimum score, while others want to “reward good behavior” and score a good-looking result overly high. The middle of the range also leaves lots of room for uncertainty, and judges tend to be fairly random (or highly quantized) there. While there are techniques to normalize judgments and judges (e.g. judge 1 tends to be too harsh and judge 2 too generous), my recommendation is to reduce the problem by reducing the number of options for a judge. With 100 choices, the instantaneous mood of a judge could cause a swing of ten or more points, which ultimately becomes error for the model.
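If you do use a wide absolute scale, one common mitigation is to normalize each judge's scores before training. Below is a minimal sketch of that idea, assuming judgments are stored as (judge_id, query_id, doc_id, score) tuples; the schema and the zero-mean/unit-variance rescaling are illustrative, not the only way to normalize.

```python
from collections import defaultdict
from statistics import mean, stdev

def normalize_by_judge(judgments):
    """Rescale each judge's raw 0-100 scores to zero mean / unit variance.

    `judgments` is a list of (judge_id, query_id, doc_id, score) tuples
    (illustrative schema). Returns the same tuples with normalized scores,
    which partially corrects for judges who are consistently harsh or generous.
    """
    scores_by_judge = defaultdict(list)
    for judge, _, _, score in judgments:
        scores_by_judge[judge].append(score)

    # Per-judge mean and standard deviation; guard against a single
    # judgment or a judge who always gives the same score.
    stats = {}
    for judge, scores in scores_by_judge.items():
        mu = mean(scores)
        sigma = stdev(scores) if len(scores) > 1 else 1.0
        stats[judge] = (mu, sigma or 1.0)

    return [
        (judge, q, d, (score - stats[judge][0]) / stats[judge][1])
        for judge, q, d, score in judgments
    ]
```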

Academic work (and most commercial systems) typically uses 3- to 5-point scales. Some use an even number (commonly 4) to force a judge to choose good vs bad, with 4 vs 3 and 2 vs 1 providing a weak preference within the good and bad results. 3 is also often used so judges can say “medium” rather than being forced to one of two extremes. 5 is very common because it gives both a weak ordering within good and bad and allows a judge to give a middle value.

Whatever method you choose, it is critical to have very clear guidelines and a powerful judgment system that presents the appropriate information and makes it easy for a judge to give a judgment. It is strongly recommended to regularly test and validate your judges and to have at least one expert who serves as the official arbiter. That role should sit very close to the product owner who defined the problem and should be directly involved in creating, updating, and evaluating the guidelines.

Multiple Judgments

One dimension is the number of judgment values; a second is the number of judges assigned to each point. Multiple judgments of the same point can be used to detect bad judges, increase compliance with the guidelines, and improve the chances of discovering problem points. There are several ways multiple judgments can be used and managed.

The minimal approach is to have a random sample of points judged a second time by a different judge. This small set of dual-judged points can be used to compare judges. Large disagreements can be an indicator of a bad judge, or possibly of poor guidelines. When you have one bad judge, you might see strong disagreement between that judge and many other judges, while the other judges generally agree with each other. When you have poor guidelines or overall low-quality judges, you might see a larger amount of disagreement across all pairs. The advantage of duplicating only a small number of judgments is that the extra cost is lower than judging every point multiple times, yet you can still learn about the judges and how consistently the guidelines are being interpreted.
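One way to act on dual-judged points is to compute, for each judge, their average disagreement with every other judge on the points they share: a judge whose number is far above the rest is a candidate bad judge, while high numbers everywhere point at the guidelines. A small sketch, assuming the dual-judged points are stored as a dict from point id to {judge_id: label}:

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

def disagreement_by_judge(dual_judged):
    """Mean absolute disagreement of each judge with the other judges
    on the points they both labeled.

    `dual_judged` maps a point id to {judge_id: label} (illustrative schema).
    One judge far above the rest suggests a bad judge; uniformly high
    disagreement suggests weak guidelines.
    """
    diffs = defaultdict(list)
    for labels in dual_judged.values():
        for (j1, l1), (j2, l2) in combinations(labels.items(), 2):
            gap = abs(l1 - l2)
            diffs[j1].append(gap)
            diffs[j2].append(gap)
    return {judge: mean(gaps) for judge, gaps in diffs.items()}
```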

In other cases, every point is judged by more than one judge. There are several ways to manage this. A common method is to first have each point judged twice and, if there is disagreement, bring in another judge (or the official arbiter). Another method is averaging the judgments, which can produce finer-grained label values without increasing the complexity for any individual judge.
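Here is a minimal sketch of one such resolution policy, assuming numeric labels on a small scale: average when the judges are close, escalate (or skip the point) when they are not. The max_gap threshold and the arbiter fallback are illustrative choices, not the only reasonable ones.

```python
from statistics import mean

def resolve_label(labels, arbiter_label=None, max_gap=1):
    """Combine multiple judgments for one point (illustrative policy).

    If the judges agree within `max_gap` on the grading scale, return the
    average, which yields finer-grained labels than any single judge gives.
    Otherwise fall back to the arbiter's label, or None to flag the point
    for escalation or skipping.
    """
    if max(labels) - min(labels) <= max_gap:
        return mean(labels)
    return arbiter_label  # None means "needs escalation"
```

For example, resolve_label([3, 4]) returns 3.5 (a finer-grained label than either judge gave), while resolve_label([1, 4]) returns None and the point is escalated.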

Commonly, there is also a “golden set” of points with known judgments, and all judges are given these points as a test (either randomly mixed in or all at once). In this case, it is used to evaluate each judge against the official answers.
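Scoring a judge against the golden set is straightforward; a small sketch, assuming both the golden labels and the judge's labels are dicts keyed by point id (an illustrative schema):

```python
def golden_set_accuracy(golden_labels, judge_labels, tolerance=0):
    """Fraction of golden-set points a judge labels within `tolerance`
    of the official (arbiter) label.

    Both arguments map point id -> label (illustrative schema); only the
    golden points the judge actually labeled are counted.
    """
    shared = set(golden_labels) & set(judge_labels)
    if not shared:
        return None
    hits = sum(1 for p in shared
               if abs(golden_labels[p] - judge_labels[p]) <= tolerance)
    return hits / len(shared)
```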

A common metric for evaluating judges is inter-judge agreement. This is usually expressed as the fraction of judgments where the judges agree. Agreement can mean exactly the same judgment or a judgment within some tolerance - e.g. on a 5-point scale, the fraction of judgments that are within 1 point of each other. I have seen within-1 agreement rates of 85% or more on 5-point scales for relevance training with clear guidelines.
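Computing this metric is simple once you have the pairs of judgments; a sketch, where pairs is a list of (label_a, label_b) tuples for the doubly-judged points:

```python
def agreement_rate(pairs, tolerance=0):
    """Fraction of doubly-judged points where the two labels are within
    `tolerance` of each other.

    `pairs` is a list of (label_a, label_b) tuples. tolerance=0 is exact
    agreement; tolerance=1 is "within 1 point" on, say, a 5-point scale.
    """
    if not pairs:
        return None
    return sum(1 for a, b in pairs if abs(a - b) <= tolerance) / len(pairs)
```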

While the topic of ignoring points will be described in more detail in Part 3, multiple judgments can be used to identify points that are better off being skipped, perhaps due to uncertainty in the guidelines or legitimate judge (and/or arbiter) disagreement. Multiple judgments are a powerful tool for analyzing judges and guidelines and should be strongly considered as part of any large-scale judgment system.

Evaluating Judges and Judgments

We have talked about the judgment system, the choice of judgment values, and handling multiple judgments. A very important part, often ignored, is evaluating the judges and the judgments themselves. Examining the judgments can be a powerful way to detect certain types of problems.

Three large causes of problems for companies doing supervised learning include:

  • Bad or low quality judges (judges might start out good and turn bad over time)
  • Poor judgment guidelines or poor questions for the judges (bad judgment interface)
  • Poor selection of points for judging

Having multiple judgments for each point is a good way to discover many problems with judge quality or judgment guidelines. Examining the resultant judgments can also yield good insight into potential problems.

While it is unreasonable to expect a perfectly equal distribution of judgments - i.e. with 5 values, exactly 1/5 of the total judgments per value - there are certain patterns that should raise concern.

This document does not deeply address methods for selecting training points, since clearly the method for selection will influence the expected distribution of judgments.

Major warning signs:

  • Certain values never or rarely occur - e.g. a 5-point system where no points were judged a 3
  • An extremely top-heavy or bottom-heavy distribution (beyond reasonable expectations)
  • Distributions of judgments that vary significantly by judge

Having no occurrences of a value is very unlikely in typical problems. Note that you need many more judgments than possible values: with a 5-point scale but only 100 judgments, the sample is too small to draw strong conclusions (unless the problem is extremely severe). For typical relevance problems, if you randomly pick queries and results as judgment points you would expect very few relevant answers. For example, with 1M documents and 1M random queries, and assuming 10 relevant results per query, a random query-result pair has roughly a 1 in 100,000 chance of being relevant (10 relevant results out of 1M documents). If your selection method instead uses an actual search system that is not yet tuned but is based on some initial strong features (say a text search with an AND query), you would expect a higher rate of relevant results - but probably not more than, say, 20% relevant, and probably less than 10% strongly relevant. Of course, different selection methods will have different expected distributions - e.g. taking the top 2 results of a very good competing service might yield 80% relevant and 5% very bad.

Given an expectation of few relevant results, you would expect a bottom-heavy distribution. Top-heavy or middle-heavy judgments could indicate a problem. When you suspect one, pick actual judgments and investigate to determine whether there really is a problem and what the cause might be.
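A sketch of the kind of automated check this implies, run over the flat list of all judgment values; the scale, the minimum sample size, and the 20% top-heavy threshold are illustrative and should be set based on how your points were selected:

```python
from collections import Counter

def distribution_warnings(labels, scale=(1, 2, 3, 4, 5),
                          min_total=1000, max_top_fraction=0.2):
    """Flag suspicious patterns in the overall label distribution.

    `labels` is the flat list of all judgment values; the thresholds are
    illustrative and should reflect how the judgment points were selected.
    """
    warnings = []
    if len(labels) < min_total:
        warnings.append("too few judgments to draw strong conclusions")
        return warnings
    counts = Counter(labels)
    # Values that never occur are very unlikely in typical problems.
    for value in scale:
        if counts.get(value, 0) == 0:
            warnings.append(f"label {value} never occurs")
    # Top-heavy relative to a random / weakly tuned selection method.
    top_fraction = counts.get(scale[-1], 0) / len(labels)
    if top_fraction > max_top_fraction:
        warnings.append(
            f"top label is {top_fraction:.0%} of judgments - "
            "top-heavy beyond the expected distribution")
    return warnings
```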

Judgment analysis can also be used to determine whether judges are reasonably consistent. If we assume points are assigned to judges randomly and with roughly the same distribution, then even without multiple judgments you would expect the distribution of judgments to be similar across judges. Different distributions could be a warning sign of one or more bad judges, or of guidelines that leave too much room for personal interpretation. Some properties to look at (a small sketch of these checks follows the list):

  • Raw distribution fraction by judge - e.g. Judge 1 gives 65% 1s, 10% 2s, 15% 3s, 7% 4s, 3% 5s
  • Average judgment - e.g. Judge 2 has an average of 1.8 while Judge 3 has an average of 3.5
  • Average time per judgment - in addition to the judgment values themselves, examining average time spent can be useful
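Here is the small sketch referenced above, assuming judgments are stored as (judge_id, label, seconds_spent) tuples (an illustrative schema); it reports the label distribution, average label, and average time per judge so that outliers stand out:

```python
from collections import Counter, defaultdict
from statistics import mean

def per_judge_summary(judgments):
    """Summarize judgments grouped by judge.

    `judgments` is a list of (judge_id, label, seconds_spent) tuples
    (illustrative schema). Returns, per judge, the label distribution,
    the average label, and the average time per judgment - large
    differences across judges are worth investigating.
    """
    by_judge = defaultdict(list)
    for judge, label, seconds in judgments:
        by_judge[judge].append((label, seconds))

    summary = {}
    for judge, rows in by_judge.items():
        labels = [label for label, _ in rows]
        counts = Counter(labels)
        summary[judge] = {
            "distribution": {v: counts[v] / len(labels) for v in sorted(counts)},
            "avg_label": mean(labels),
            "avg_seconds": mean(seconds for _, seconds in rows),
        }
    return summary
```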

See Part 3 for more about judge guidelines and judgment systems: Relevance Learning Challenge Part 3.