Relevance Learning Challenge - Collecting Labeled Data Part 3 of 3


Things I didn’t already know about guidelines and judgment systems

One pet peeve of mine is that people always force a judge to make a decision. What might this look like, why is it bad, and how do we fix it?

This can be an unfortunate consequence of a system that takes the previous sections literally and gives the judges only a fixed set of definite choices. Let's look back at the example query “soft reset”: what if the guidelines were insufficient to determine the desired result for some documents? Often a judge is only allowed to offer an explicit judgment and cannot skip. Sometimes a judge can skip, but then the “arbiter” comes in and forces an answer. Either way, that labeled point ends up in the training set, and likely with strong inconsistency (say you have many similar query-document pairs in the training set).

When you have “uncertain data”, whether uncertain because the humans are unsure or because the answer is simply not knowable (Schrodinger’s cat), adding these points with the same importance as other, less controversial points forces the learner to “learn them”, and in doing so it might produce a less than ideal model. Suppose a training point is labeled “1” (with binary labels, 1 or 0). If a model trained without that point would output 0.5 for it, that is treated as an error (though not as bad as if the model predicted 0). If the point is included in training, the optimizing, non-emotional machine tries to minimize errors and will “respond”. Maybe I have the following training data:

Data point, target (binary: 1 = relevant, 0 = not relevant)

  • q1 = “soft reset”
  • q2 = “hard reset”
  • q3 = “fix screen not responding to touch”
  • q4 = “delete an application”

The documents are the same ones used in Part 1.

Point p1 = q1,d5 -> q=“soft reset”, title = “Difference between a hard-reset and soft-reset?” Judgment j1 for point p1 = 1 (relevant)

Point p2 = q2,d1 -> q=“hard reset”, title = “How do I hard-reset my phone” Judgment j2 for point p2 = 1 (relevant)

Point p3 = q3,d2 -> q=“fix screen not responding to touch”, title = “My screen is not responding to touches - how do I fix it?” Judgment j3 for point p3 = 1 (relevant)

Point p4 = q1,d3 -> q=“soft reset”, title = “How do I delete an application?” Judgment j4 for point p4 = 0 (not relevant)

Point p5 = q4,d5 -> q=“delete an application”, title = “Difference between a hard-reset and soft-reset?” Judgment j5 for point p5 = 0 (not relevant)

Point p6 = q2,d3 -> q=“hard reset”, title = “How do I delete an application?” Judgment j6 for point p6 = 0 (not relevant)

Let's say we have point p1 where the “1” is uncertain, meaning the judge didn’t know and picked randomly, but the “official label” j1 is 1. As humans we know this is not clear given only the title, since the document could tell you how to do a soft reset or could just compare the two resets; one case is relevant, the other is not. If we had more judges, maybe half would give it a 0 and half a 1.

Let's imagine a possible model that was trained on a set of totally certain points, and that we ran p1-p6 through it as testing points:

  • p1 : 0.5
  • p2 : 0.999
  • p3 : 0.999
  • p4 : 0.001
  • p5 : 0.001
  • p6 : 0.001

Overall the system is doing a great job: 5 of the 6 testing points are basically perfect. But that darn point p1 is a pretty bad error, returning 0.5 when it was supposed to return 1 (the official label). As humans we might say that the point was not so clear, so it really is not that bad. Unfortunately, this simple system does not allow for marking some points as more or less important than others.
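For illustration only, here is a sketch of what the error calculation might look like if it did accept a per-point importance. The `weights` list and the value 0.1 are assumptions made up for this example; the system described here had no such option. Note that a weight of 0 is the same as leaving the point out.

```python
# Official labels and the model's scores for p1..p6, copied from the list above.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.5, 0.999, 0.999, 0.001, 0.001, 0.001]

# Hypothetical per-point importance: p1 is contested, so it counts less.
# The value 0.1 is arbitrary and only here to show the effect.
weights = [0.1, 1.0, 1.0, 1.0, 1.0, 1.0]


def weighted_squared_error(targets, predictions, point_weights):
    """Sum of squared errors, each scaled by that point's weight."""
    return sum(
        w * (t - p) ** 2
        for t, p, w in zip(targets, predictions, point_weights)
    )


# With all weights equal to 1 this is the plain total squared error.
unweighted = weighted_squared_error(labels, scores, [1.0] * len(labels))
weighted = weighted_squared_error(labels, scores, weights)

print(f"unweighted total: {unweighted:.6f}")
print(f"weighted total:   {weighted:.6f}")
```

With equal weights, p1 accounts for nearly all of the error; with the reduced weight its contribution shrinks by the same factor, so the optimizer has less reason to chase it.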

Now imagine we trained another model, but this time we included a bunch of points like p1 where the judges were uncertain. The first model was doing great when we excluded such points. Now there are many “hard to train” points in the training set. Since the judges themselves are uncertain, it is not reasonable to believe there is a model that can be “certain”. The net effect is that, in trying to learn the uncertain points, the system worsens the model as evaluated by an outside observer. The system needs to lower the overall mean squared error, and when something is not learnable, fitting it can “take away” from the easily learnable points. I do not explain this in detail here, but if you really want to learn more, post a comment. Below are the same testing points, scored by the model trained with some “uncertain points” included.

  • p1 : 0.75
  • p2 : 0.9
  • p3 : 0.9
  • p4 : 0.1
  • p5 : 0.1
  • p6 : 0.1

In the first model the total squared error of the 6 points was about 0.25, that is (1-0.5)^2 plus approximately 0. In the second case the total squared error is (1-.75)^2 + 2(1-.9)^2 + 3(0-.1)^2 = .1125, which is much less than .25 (just under half). However, as a human you can see that the second model is much worse: before we had 5 points essentially perfect, but the second model has a non-trivial error on those same 5 points.
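The arithmetic can be checked with a few lines of Python. This is just a sketch using the labels and scores listed above; the names `model_a` and `model_b` are mine. It also splits out the error on the five clear points (p2-p6), which is the part a human observer actually cares about.

```python
# Official labels for p1..p6 (1 = relevant, 0 = not relevant).
labels = [1, 1, 1, 0, 0, 0]

# Scores from the two hypothetical models in the lists above.
model_a = [0.5, 0.999, 0.999, 0.001, 0.001, 0.001]  # trained on certain points only
model_b = [0.75, 0.9, 0.9, 0.1, 0.1, 0.1]           # trained with uncertain points


def squared_errors(targets, predictions):
    """Per-point squared error between targets and predictions."""
    return [(t - p) ** 2 for t, p in zip(targets, predictions)]


errors_a = squared_errors(labels, model_a)
errors_b = squared_errors(labels, model_b)

# Total error over all six points: what the optimizer sees.
total_a = sum(errors_a)
total_b = sum(errors_b)

# Error over the five clear points only (everything except p1).
clear_a = sum(errors_a[1:])
clear_b = sum(errors_b[1:])

print(f"total: a={total_a:.6f}  b={total_b:.6f}")
print(f"clear: a={clear_a:.6f}  b={clear_b:.6f}")
```

The totals favor the second model (about 0.1125 versus about 0.25), while the error on the clear points alone goes from roughly 0.000005 to 0.05, which is the degradation described above.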

It should be noted that this is one possible example of what could happen. If “uncertain points” could be learned within the model and features, then there might not be extra error added to the other points. This just illustrates what could happen when “uncertain” training points (due to human-judge uncertainty) are added to an already clean set.
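Here is a minimal sketch of how unlearnable points can “take away” from learnable ones, under a strong simplifying assumption: the model's features cannot tell the uncertain points apart from some clear ones, so it has to output one shared score for the whole group. Under squared error, the best single score for a group is the mean of its labels. The group sizes and labels below are made up for illustration.

```python
def best_shared_score(labels):
    """Score that minimizes total squared error when every point in the
    group must receive the same output. For squared error this is the mean."""
    return sum(labels) / len(labels)


# Four clear points, all correctly labeled relevant.
clear_positives = [1, 1, 1, 1]

# Hypothetical uncertain points that look identical to the model.
# Their labels are guesses, so they come out mixed.
guessed_labels = [0, 1, 0]

before = best_shared_score(clear_positives)
after = best_shared_score(clear_positives + guessed_labels)

print(f"score for the clear points without guesses: {before:.3f}")
print(f"score for the clear points with guesses:    {after:.3f}")
```

Without the guessed labels the clear points get a score of 1.0; with them, the shared score drops to 5/7, so the clear points now carry an error they did not have before. If the features could separate the two groups, this would not happen, which is the caveat made above.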

If you aren’t sure, leave it out

In my many years of experience, I have seen that judges are much faster and machine-learned models are more accurate when judges can simply leave out things that are unclear (or don’t matter). Rather than force an arbiter or judge to make a judgment (a guess), allow the judge to select “skip” as distinct from “not sure”. “Not sure” is a request for the arbiter, while “skip” is a statement that the answer can’t be determined from the guidelines and judgment interface, so it is better to leave the point out and avoid confusing the model.
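A minimal sketch of how a collection pipeline might treat the two options differently. The `Judgment` record, the `SKIP` and `NOT_SURE` constants, and `route_judgments` are hypothetical names for this example, not part of any real judgment system.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

SKIP = "skip"          # cannot be determined from the guidelines and interface
NOT_SURE = "not sure"  # the judge is asking the arbiter for help


@dataclass
class Judgment:
    query: str
    title: str
    label: Union[int, str]  # 0, 1, SKIP, or NOT_SURE


def route_judgments(
    judgments: List[Judgment],
) -> Tuple[List[Judgment], List[Judgment]]:
    """Return (training points, points for the arbiter). Skips are dropped."""
    training, for_arbiter = [], []
    for j in judgments:
        if j.label == SKIP:
            continue  # leave it out rather than force a guess
        if j.label == NOT_SURE:
            for_arbiter.append(j)
        else:
            training.append(j)
    return training, for_arbiter


judgments = [
    Judgment("soft reset", "Difference between a hard-reset and soft-reset?", SKIP),
    Judgment("hard reset", "How do I hard-reset my phone", 1),
    Judgment("soft reset", "How do I delete an application?", 0),
    Judgment("hard reset", "How do I delete an application?", NOT_SURE),
]

training, for_arbiter = route_judgments(judgments)
print(len(training), "training points,", len(for_arbiter), "for the arbiter")
```

The arbiter should have the same “skip” option, so that a “not sure” point is never turned into a forced guess further down the line.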

Judgment guidelines matter and are not easy

The topic of the effect of judgment guidelines on the system could be an entire book. I am only going to give a quick story and then summarize the overall lessons from this post.

At a previous company, we were trying to train a relevance model on web documents. The judgment guidelines can be simplified as follows:

Judge: Your task is to select a judgment from 1 to 5 for the presented query and URL or select skip if you are unsure.

  • “Skip” - Skip this query-url pair
  • 1 = Very bad, embarrassing
  • 2 = Bad
  • 3 = Not great, worse than a 4, but worth showing as a last result
  • 4 = Very relevant, but not a 5
  • 5 = The URL is the official homepage of the query

We also did extensive training with the judges, who were typically full-time employees and worked closely with the rest of the team. The closeness provided a great forum for teaching them about the system as well as what the judgments meant.

We did everything right: we had clear guidelines and well-trained judges who were supervised effectively, and we collected a lot of training data with good variety. We removed training points that would likely cause problems, and so on. Unfortunately, the resulting models had a consistent problem: either the system would work great for everything except homepages, or it would get homepages right and everything else was almost random.

Thanks go to Halim (contributor to …): he discovered that there was a problem with our guidelines and how we defined the target score. We were actually asking two totally different questions. Q1: Is the result relevant for the query (or not)? Q2: Is the result an official homepage for the query? While some might think that being a homepage for a query has special status, at the end of the day trying to combine those two different questions into one judgment “confused the learner”. It effectively either optimized for homepages OR optimized for relevance; the scores were forced onto a single scale and were not necessarily comparable. The way we set up the target definition used for training, the error calculation gave a much higher penalty for errors near the top, so mixing up a 1 and a 2 was a small error, but mixing up a 5 and a 4 was more significant. As a result, a page that was highly relevant but not an official homepage was given a different target than a page that was highly relevant and was the official homepage, even though a user cares about “relevance”, not whether the page is relevant because it is an official homepage. As some might see, there could be situations where a 5 was being trained towards a 4 or vice versa, and those were viewed as tremendous errors, forcing an effect similar to what we described above.

The brilliant solution (thanks Halim) was to replace all 5’s with 4’s in the training data and make it a virtual 4-point scale (the judges were still asked for a 5-point scale). With the new usage of the old (poor) guidelines, the system significantly improved overall relevance for both homepages and non-homepages.
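As a sketch, the remapping amounts to capping the judge's score at 4 before it is used as a training target. The function name `training_target` and the sample scores are made up for illustration; the judges' 5-point scale itself is left untouched.

```python
def training_target(judge_score: int) -> int:
    """Map a judge's 1-5 score to the virtual 4-point training scale.

    Scores 1-4 are kept as they are; a 5 (official homepage) becomes a 4.
    """
    if judge_score not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a score from 1 to 5, got {judge_score!r}")
    return min(judge_score, 4)


# Hypothetical judge scores, for illustration only.
judge_scores = [5, 4, 3, 1, 5, 2]
targets = [training_target(s) for s in judge_scores]

print(targets)
```

After the remap, a highly relevant homepage and a highly relevant non-homepage share the same target, so the learner is no longer penalized for failing to tell them apart.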

It should be noted that it is fine to have judges provide extra information. Maybe there is a checkbox “is a homepage for the query”. The additional information could be used in various ways; however, do not try to solve two distinct, and not clearly related, problems with one score.
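One way to picture this is a judgment record that stores the two answers in separate fields. This is a sketch only: the `QueryUrlJudgment` record, its field names, and the example.com URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryUrlJudgment:
    query: str
    url: str
    relevance: int      # 1-4, answers only "is this result relevant?"
    is_homepage: bool   # checkbox, answers only "is this the official homepage?"


def relevance_target(judgment: QueryUrlJudgment) -> int:
    """Training target for the relevance model. The checkbox is ignored here."""
    return judgment.relevance


# Hypothetical judgments for illustration; the URLs are placeholders.
judgments = [
    QueryUrlJudgment("acme", "https://example.com/", 4, True),
    QueryUrlJudgment("acme", "https://example.com/reviews/acme", 4, False),
    QueryUrlJudgment("acme", "https://example.com/unrelated", 1, False),
]

targets = [relevance_target(j) for j in judgments]
homepage_urls = [j.url for j in judgments if j.is_homepage]

print(targets)
print(homepage_urls)
```

The homepage flag stays available for other uses (analysis, a separate model, a feature), but it never changes the relevance target.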

Also, some readers might ask: wouldn’t the official homepage always be more relevant than a non-official homepage? The problem is that this is not always true. Take a query like “buy a mac” and a URL for …: technically this is not the “official homepage” for the query, since there is no concept of a homepage for a functional query like that. Likewise, a query for a small company whose name could also mean something else that is more popular could result in confusion. The “5” is not necessarily better, and definitely not significantly better given the way we used the target values.

What did we learn - Summary

When planning to use applied machine learning for relevance applications, it is critical to properly define your problem, have clear guidelines and a great judgment collection system, and be sure to consider:

  • What specific signals are used where, e.g. do we need to know the operating system of the user when they query? Is the time of day used?
  • Select an appropriate judgment criterion, e.g. binary, a 5-point scale, or other
  • Create very clear and comprehensive judge guidelines
  • Have an “expert” or “arbiter” whose job it is to support the judges and understand the requirements from product
  • If appropriate, allow the judges to skip or indicate unsure for points that might be undefined given the guidelines
  • Ensure the judgment collection system has the right information presented sufficiently for the judges to make informed (and accurate) decisions
  • Be sure not to confuse objectives when collecting labeled data or when using it for training