Relevance Learning Challenge - Collecting Labeled Data Part 1 of 3

By: Eric Glover November 3, 2015

Skill level: ML: medium, Math: medium

The hardest task that probably has the least amount of literature (as compared to other aspects of applied ML) happens to be the one most people take for granted, and many suffer for it - problem definition. This post will explore aspects of search and ranking tasks and provide some advice and examples about the problem of problem definition, to help increase the chance of a product being successful.

Computers are cold, heartless machines that do exactly what they are instructed to do, using deterministic programs. Humans can be emotional, error prone, capricious and even inconsistent, as well as persnickety and stubborn.

The coldness of a computer algorithm is great when you have well-defined objectives such as minimize fuel consumption for a new engine design. For these problems the objective is explicit, inherently measurable and does not have significant subjectivity. However, for applications that involve human preferences such as search or ranking, it is not always as easy to adopt a “standard definition” and even worse, how do you evaluate the system to determine if it is good, or better than another.

Many people will turn to academic literature and adopt “standard definitions” such as Binary relevance, or a 5-point scale. Likewise (not covered in-depth here) adopt standard methods for evaluation such as DCG, Precision, Recall, Click-Through-Rate (CTR) or others only to end up with failed systems. In some cases (will be explored more in other blogs) the tests looked great and the product managers thought the system would work, but they are perceived bad by real users.

In effect, we are asking a computer algorithm to “learn” a model that predicts human behavior or preferences - in essence read users’ minds. This is an impossible sounding task, and requires very careful thought about the system design and the actual problem definition. As impossible as it sounds, I would like to give hope and point out there are multiple very successful commercial systems used Billions of times a month, that do an exceptional job - and many were built using the basic process described below. So yes, you can read people’s minds (well sort of).

To appreciate the challenge, let’s start with a typical machine learning process as applied to a specific product involving human subjectivity, such as for typical search or ranking problems.

  • Step 1: Define the problem/high-level goals, system etc… Inputs, outputs, evaluation methodology.
  • Step 2: Collect training data consistent with the definition of Step 1
  • Step 3: For each training data point obtain feature values
  • Step 4: Train the model
  • Step 5: Deploy the model and test

There can be more complex processes, maybe involving unlabeled data, or automatically learned training data, as well as feedback loops and interactivity - for now let’s keep this simple since the hard part is actually Step 1 which is necessary for all cases of applied machine learning.

When you read many academic papers involving relevance or ranking, it might appear as if this first step is actually extremely simple. You define your document set (the possible items that could be returned and ranked), you determine your features (often these are simply bag of words or very simply stated features like page rank or date or website popularity), and last the part that always seems to be easy you define a relevance metric.

It should be noted that there are two major classes of ranking learners, the most common (in my experience) is a regression learner where the target is an apparent objective relevance score. The less common is something often referred to as Learn To Rank or LoTR and typically involves relative preferences such as A is better than B (without necessarily any objective statement of relevant or not).

Details like the judge guidelines are often left off of academic papers. Also, it is common that the papers use a - in some cases “pre-judged” set and there is no decision to be made or need to collect human judgments.

Unfortunately, in real-world applications where we don’t have appropriate pre-judged data, determining a precise metric for evaluation and “scoring” is essential to the very core of all regression learners (and preference learners). Basically what exact question are we asking and what specific thing are we trying to learn?

Probably, some of you are thinking - wow, so much text for something so obvious - lets use a Binary judgment, either a result is relevant or it is not. The idea of Binary relevance is a commonly taught form of judgment and many relevance metrics assume Binary relevance (such as Precision and Recall). Unfortunately, the world is usually not that simple :-(

Binary Relevance

To understand the intricacies of problem definition, lets start with the simplest case of a Binary judgment where the question is: “given a specific query q, and a specific result r, is the result r relevant for the query q? (Yes or No). No other choices! No “maybe” or “sometimes” only “Yes” or “No”.

It is easy to think that the problem is solved, but it is not, there are still many things that can go wrong.

Remember Step 2 - Collecting labeled training data - this means you have some example queries and results and judgments. For now lets not focus on the nature of the training points (that is a whole other can of worms we will address in other blog posts), but instead examine how we might get judgments for the training points.

Let’s assume you are already given the pairs of queries and results and you just need a judgment. That sounds easy enough right… Well unfortunately it is not so easy.

The first problem we face is who decides this? Is it our friends, random employees, strangers from Mechanical Turk? Assuming we have some set of honest judges (this is also not so easy), we still need them to make judgments, lots of judgments. In my past companies the training set sizes were usually in excess of one million labeled training points (these are relatively small sets).

Making judgments requires creating in the judge a clear vision of what it means to “be relevant” - fortunately Binary makes it easier since they don’t need to think too hard, and only have one choice - but.. even that can be problematic.

Why asking a judge a Yes or No question is actually very hard

For this discussion lets assume a simple system - a Question/Answer matching system where the goal is to find relevant pre-written “documents” (a question and answer) given a user input query. We can imagine we have a FAQ and maybe customer service and engineering provided questions (and answers) that are used as the “documents” and also assume we have a good sample of user-provided queries (assume we have lots of good user-logs for our company that give real questions that were previously asked).

Even though it seems obvious and simple and this problem feels like it is “well defined” there are still many gotchas and subtle errors that can propagate causing a bad overall product.

Besides the problem of proper judge motivation and honesty (people motivated to make money might just give random answers to get paid more), there are some real challenges here:

  • Definition of Yes/No (relevant or not-relevant)
  • Interface used to present the “question” and collect the answer or judgment

Before you say “obviously a result will be relevant or not” let me give a few examples:

Imagine we have the following sample documents written as the “question”, answer part is not shown.

  • D1: How do I hard-reset my phone?
  • D2: My screen is not responding to touches - how do I fix it?
  • D3: How do I delete an application?
  • D4: How do I delete an application? (not the same as D3 - just the same question/title, different answer)
  • D5: Difference between a hard-reset and soft-reset?

Now for user-queries like: “email not working” none of the above are relevant. Likewise a query of “what is the difference between a soft-reset and hard-reset” D1-D4 are most likely not relevant and D5 is likely relevant.

What about a query: “soft-reset”? Is D5 relevant? Actually the question to ask is what should a Judge do, or how should they decide? Here is where clear guidelines and great examples are essential. Simply saying “you figure it out” will probably end up with pretty bad labeled data - lots of errors. Maybe there is a similar example of part of a question matching with directions on when a judge should answer “Relevant” or “Not Relevant”.

What about a case where a user query is q=“remove application” and the user making the query has Operating System Version 21, maybe D4 is only for Version 22 and higher and D5 is for Versions 15 to 21. Now it is not so obvious what a judge should do. In fact without more information about a user’s context and more details about the documents than just their title, it is not possible for a judge to decide (with an exception of accuracy). Part of the problem definition Step 1 is to include all of the details about the input from a searching user (such as is Operating System Version included).

Guidelines are the “rules” given to judges. Every major company (that I have ever known how they do ML) who does any type of supervised learning - probably the most common form of relevance learning - will produce extensive judge guideline documentation, involving pictures and many examples. I have seen 100+ page documents. Sometimes, companies will perform tests on judges to see if a judge agrees with the product manager’s definitions (and other judges). Agreement tests are a good way to find bad judges (those who don’t do what you asked) as well as identify possible problems in the guidelines or even system (i.e. forgetting to include the OS version with the query).

For more information there is an article in SearchEngineLand about Bing’s rating guidelines in 2012:

My friend and hope to be soon guest blogger, Chris Fillius, very clearly describes the subjective nature of human judgment and some directions he gives to judges.

One final note that I always tell anyone when discussing relevance judging: as a judge you are being asked to make a subjective and imperfect decision based on the best of your ability and within a standardized methodology. We are asking you to do this so it is OK for you, and even encouraged, for you to take an opinionated stand on something and be confident in your judgment. We are not asking a computer, we are asking you as a person.

The key point that I think many info scientists overlook is that search is a subjective system in general. When the search engine returns a list of ranked results it is doing what it thinks is best, but each and every human being who is searching on the system may have a different interpretation of the SERP's quality. Therefore asking a person (or multiple people) is OK because the ultimate end user of the system will be a human being, so we do the best we can.

The Judgment Interface is Essential

The judgment interface is also essential if judges are able to make accurate judgments (reasonable judgments - there is some level of subjectivity). If insufficient information is provided, judges might not make useful judgments (even when there is an obvious answer).

Now - lets make an assumption that we have an infinite amount of perfectly labeled documents - with Binary labels. Labeled documents can be used for both training (and a non-intersecting set) for testing. Lets also assume we have very good features and there is a pattern to be learned (some problems are just not solvable, or at least not with the given set of features).

This sounds great, we have everything we need for Steps 1-3 right? So what can go wrong?

Besides the fact that you will never have an infinite amount of perfectly labeled documents (I argue even the ‘perfectly labeled’ is impossible), the original problem definition might fail when applied as a product.

Remember the homepage classifier from the blog post on “Know Your Priors” how a 98% accurate classifier resulted in 2/3 results being wrong? Well there is a similar problem here…

The fundamental problem with Binary relevance is that it is Binary. More specifically, a human who is an emotional, inconsistent, and capricious being - usually doesn’t think in Binary. Especially for this example problem of matching queries to Q&A documents - it is likely there might be more than one “partially relevant” document and a human might have a preference between the multiple “relevant” documents.

If we have a Binary system that can only output 1’s and 0’s for relevant and not - there will likely be a set of “relevant” (1) results that are not equal. The whole rave when Google first started to gain popularity was how Google solved “information overload” and Google was all about finding the “best” results from all the world’s information - imagine if their ranking function were for all results that are “likely relevant” sort randomly. That would be a failed product.

To be clear, not all problems have this issue. In some cases there really is a Binary definition and the expectation would be exactly one or zero results that are relevant. In such a case if there is only one “likely relevant” result, random ordering is actually okay. In fact if we expected two or three “likely relevant” results maybe random ordering is actually pretty good given the right interface - this is a product question, sorry engineers.

I am sure some of you are saying - wait a minute… even though the training data is Binary, does that mean the output of a ML model is Binary? In fact most learners (or software packages that do machine learning) do provide floating point outputs. In fact, the theory says Naive-Bayes should output an actual probability that the input is relevant or not.

So - given this why can’t we just sort by “score” - the “most likely relevant” would be first and then we solve the ranking problem. In some cases this actually does work but… let me explain some more details and try to convince you that it is not likely from a mathematical or experiential standpoint to work as you might expect.

First, when using real-world Naive-Bayes systems if you have good data the majority of the outputs are like: 0.9999, 0.9999, 0.9998, 0.0001,0.0001 etc… Basically most outputs for a specific input (query, document) will return 1 or 0 to several significant figures. In fact, in my own personal experience the actual distribution of scores outputted from a Bayesian typically contain large numbers of exact duplicates of value. What about the points in the middle? What do they mean - they mean the system is uncertain, not that they are necessarily less relevant. What is the difference? Imagine if we had those same 5 example documents from above and a user query of “soft-reset” and we are considering if D5 is relevant. Some judges might say Yes since it mentions “soft-reset” in the title while others would say “No” since it is not entirely about soft—reset, in this case a theoretical model could return 0.5 indicating 50% chance of being relevant. Even though if you read the document there might be universal user agreement it is relevant. A 50% means more that judges can’t decide, not that it is more or less relevant than another document that mentions “soft-reset”. Likewise, it might not be that judges were indecisive on the precise case, but examples with similar features. The model could be uncertain even though there is a universally agreed upon answer.

Another way to think about it is if we assume that users have a real preference over some aspect - say date or recency. The more recent the better if all-else is equal. If the judgment criteria can’t possibly capture this, maybe the guidelines would require the judges to give the same answer regardless of document age - then it is impossible to learn. In order to recognize that two similar documents (for a given query) have a preferential relation there needs to be some difference in judgment, which is not possible with a Binary judgment system. This is one reason some people prefer pairwise judgment systems - since a judge is able to express a preference and it can be represented in a label (a label could be A > B and there might be two documents and one query). It should also be noted that in my discussions with other experts, some have implied that even with Binary judgments if you collect enough data a preference could be discovered so long as some people would make a “not relevant” judgment for a “worse” document. While in theory this is true, in practice I have not seen this (feel free to e-mail me personal experiences that either support or contradict my premise) as you need a very large number of very similar examples with consistent judgment differences, something that can only occur with very large organizations.

So - given Binary is likely insufficient what else can we do?

Jump to: Relevance Learning Challeng Part 2