*Math skill-level: medium-low
Machine Learning skill-level: low*

My first blog post hits on a topic very important to me, one that is fundamental to all of machine learning, yet one that many practitioners forget about or, even worse, don’t understand. The concept has become my go-to interview question for candidates for machine learning related jobs, and I believe that failure to properly consider it is one of the largest causes of commercial failures of applied machine learning.

While the phrase “prior probability” might cause some people to cringe in horror recalling that crazy-hard math class in college, the key lessons here are not very advanced, and I hope that even a person who doesn’t like math will embrace them. Hopefully a few minutes here will save hundreds or thousands of dollars of lost time and product launch failures.

Let’s start with a question: I have a fair coin, and I flip it. What is the probability it will land on heads? This is not a trick question, and the answer is 50% or 1/2. Basically, we all know that on average half of the times a coin is tossed, it will land on heads.

Now another question: What is the probability that an average American will have a heart attack on a given day? You might not know the answer, but my math (see below) estimates it at about 6 in 1M.

From http://www.cdc.gov/heartdisease/facts.htm: approximately 735K Americans have a heart attack per year. From http://www.census.gov/popclock/: the population was 321,216,397 as of July 4th, 2015.

Number of days in a year: let’s round to 365. So the probability = 735K / 321M / 365, or approximately 6E-6, about 6 in 1M.
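If you want to check my arithmetic, here is a minimal sketch in Python. The variable names are my own; the numbers are the CDC and Census figures quoted above.

```python
# Figures quoted above (CDC heart attack count, Census population, 2015).
heart_attacks_per_year = 735_000
us_population = 321_216_397
days_per_year = 365

# Chance that an average American has a heart attack on a given day.
daily_probability = heart_attacks_per_year / us_population / days_per_year

print(f"{daily_probability:.2e}")  # roughly 6e-06
print(f"{daily_probability * 1_000_000:.1f} per million people per day")
```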

Thinking about this, you could say that a coin toss landing heads is relatively common (of the times you flip a fair coin), but having a heart attack is very unlikely for an average American on any given day (of course some people are much more likely than others, but that is not the point). The prior probability is roughly an indication of how likely an event is before you know anything else about the specific case.

As humans we have an intuitive feel (often wrong) for how likely something is to happen. While we might not always be rational, we can understand that a coin toss landing heads is more likely than, say, winning the lottery on a single ticket.

One common theme of this blog is going to be that computers and modern-day machine learning are very advanced. The computer will learn what you teach it, and often do exceptionally well on problems you might think are very hard. Unfortunately, failure to consider prior probabilities in the system as a whole can cause undesirable results. So to see what I mean, let me summarize the basic flow of training and applying a supervised binary classifier.

You can simplify training of a supervised binary classifier as a few very simple steps:

- Step 1: Obtain labeled training data - where each data point is given a label, in this case True or False, or some equivalent say 1 or 0.
- Step 2: For each labeled training point, generate or obtain a set of features. We can represent them as a vector, such as x_i -> [Feature_1.value, Feature_2.value, Feature_3.value, …], and the last item could be the label.
- Step 3: Train the model by feeding the labeled training data into any of a variety of machine learning algorithms.
- DONE - The output of Step 3 is your model, or simply put a set of rules for a program to be able to take as input a feature vector and produce a predicted label.
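To make the steps above concrete, here is a minimal sketch in plain Python. The training rows are made up, and the "algorithm" is a toy nearest-centroid learner that I picked only because it fits in a few lines; it stands in for whatever black box you would actually use.

```python
from statistics import mean

# Steps 1 and 2: labeled training data, one row per point:
# [feature_1, feature_2, label]. The numbers are made up for illustration.
training_rows = [
    [1.0, 1.2, 1],
    [0.9, 1.0, 1],
    [1.1, 0.8, 1],
    [-1.0, -0.9, 0],
    [-1.2, -1.1, 0],
    [-0.8, -1.0, 0],
]

def train(rows):
    """Step 3: 'train' by storing the mean feature vector (centroid) per label."""
    centroids = {}
    for label in {row[-1] for row in rows}:
        features = [row[:-1] for row in rows if row[-1] == label]
        centroids[label] = [mean(column) for column in zip(*features)]
    return centroids

def predict(model, feature_vector):
    """Apply the model: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feature_vector, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(training_rows)
print(predict(model, [0.7, 0.9]))    # unlabeled point near the label-1 examples
print(predict(model, [-0.5, -1.3]))  # unlabeled point near the label-0 examples
```

The shape is the point here: labeled rows go in, a model comes out, and the model maps a new feature vector to a predicted label.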

When you want to run this model, you simply take an input (unlabeled) data point, get the feature values and apply the model.

Basically, the “magic” happens in applying some magical black-box algorithm. Many people might be surprised that the process is not more complicated, or that there is not a ton of math about what is inside the black box, but the truth is that for most applied machine learning there is no need to understand very much about the black box, and in many cases it literally is as easy as running a command line program on a .csv file. Basically zero skills! I do believe it is nice to learn more, and I strongly encourage some more study than reading the 3-step process.

Yes, you can actually do Binary Classification learning from standard .csv files generated from an Excel sheet.

Now that I have told you binary classification is super easy and really requires no human effort, why does it matter that we understand prior probabilities?

So I will look at two ways prior probabilities can affect the system: one is on the model learning (the above), and the other is on the application as part of a product.

It should be noted that a good ML practitioner does much more than take input without examination, train the model and declare it done. A key step is validation and testing. How well does it work? Does it work at all? Is this a model I would bet my company or job on?

So let’s start with an interesting experiment. I could not find the best reference, but you can see some references here: http://cogsci.stackexchange.com/questions/4846/solving-t-shaped-maze-probability-learning-differences-between-rats-and-human

The binary classification problem in this case is a simple one: there are two areas, a “left area” and a “right area”, and each has a light. At some point in time a person (or let’s say a computer) could hit the “go button” and either the left light or the right light will light up. For a human it is a game of predicting which light before the round begins. For a rat, it is a matter of going in the correct direction and either getting food or a shock (if wrong).

So let’s say we have a few features: the last three answers. These could be binary, where -1 means left and +1 means right. So we could formulate this problem as:

- F1 = value 3 times previous
- F2 = value 2 times previous
- F3 = value previous time

And the goal is to predict what it will be the next time.

We could collect some data, say:

- x0 (the first training point, after 3 previous runs) might be [-1, -1, 1, 1], where the last value is the target (what happened)
- x1 -> [-1, 1, 1, -1]
- …

We could collect lots of training points. Since we are not getting shocked, it is not a problem.
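Here is a small sketch of how those training points could be built from a recorded sequence of outcomes. The function name and the example sequence are my own, chosen so that the first two rows match x0 and x1 above.

```python
def make_training_points(outcomes, window=3):
    """Turn a list of -1/+1 outcomes into [F1, F2, F3, target] rows."""
    rows = []
    for i in range(window, len(outcomes)):
        # The previous `window` outcomes are the features,
        # and the outcome that followed them is the target.
        rows.append(outcomes[i - window:i] + [outcomes[i]])
    return rows

observed = [-1, -1, 1, 1, -1]  # -1 means left, +1 means right
for row in make_training_points(observed):
    print(row)
```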

So now let’s think about this problem for a moment and what might happen…

If I told you the black-box we were trying to learn was a random number generator with a 60% chance of it being on the left side, what happens?

The rat learned that the best solution is to always run left - that was the best possible answer.

Actual graduate students tried to solve this with complex programming, not accepting that there was no pattern, and ultimately ended up with only a 52% accuracy - which was worse than the rat.

If we actually ran a very large amount of training data through most modern ML systems, many would simply also learn “pick left always” since it would minimize the error.
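A quick sketch shows why “always left” is hard to beat. One assumption of mine that is not in the reference: I model the pattern-hunting strategy as “probability matching”, meaning you guess left 60% of the time at random. That happens to give the same 52% figure quoted above.

```python
import random

P_LEFT = 0.6  # the hidden generator picks left 60% of the time

# Expected accuracy, computed directly.
always_left = P_LEFT
probability_matching = P_LEFT * P_LEFT + (1 - P_LEFT) * (1 - P_LEFT)
print(f"always left:          {always_left:.2f}")
print(f"probability matching: {probability_matching:.2f}")

# The same comparison by simulation (fixed seed so runs are repeatable).
rng = random.Random(0)
trials = 100_000
hits_always = 0
hits_matching = 0
for _ in range(trials):
    outcome_is_left = rng.random() < P_LEFT
    guess_is_left = rng.random() < P_LEFT  # guess left 60% of the time
    hits_always += outcome_is_left         # "always left" is right when it is left
    hits_matching += (guess_is_left == outcome_is_left)

print(hits_always / trials, hits_matching / trials)
```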

This example has many lessons. First, sometimes there is just some best possible model: you can’t get more than 60% unless your system cracked the random number generator (which is not likely to happen).

Second, in some cases there might be a trivial solution: that complicated black box spits out a model that says “always choose left”.

But there is actually a much more important lesson and this will be done by exploring the first step of the problem - the selection of the training data.

Imagine we are trying to train our system and we pick 1000 training points - this might be good, but unfortunately we have a bug in our code and instead of picking randomly we end up picking 750 cases where the answer was Right, and 250 cases where the answer was Left. Given that we know the system was random, what might our model do? Given the training data, the best answer would be to always pick Right… and we would be wrong a majority of the time…

So distribution needs to be considered when selecting training data. This makes sense since an algorithm trying to pick the best, can only pick the best from what you gave it.
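The buggy sample can be sketched in a few lines. I am assuming a learner that, finding no pattern in the features, simply falls back to the most common label in its training data.

```python
from collections import Counter

# The buggy sample from the text: 750 Right and 250 Left labels.
training_labels = ["right"] * 750 + ["left"] * 250

# With no real pattern to find, the best the learner can do on this
# data is to always predict the most common training label.
learned_guess = Counter(training_labels).most_common(1)[0][0]

# True behaviour of the generator: left 60% of the time.
true_distribution = {"left": 0.6, "right": 0.4}
real_world_accuracy = true_distribution[learned_guess]

print(learned_guess, real_world_accuracy)
```

On the training data this model looks 75% accurate, yet in the real world it is right only 40% of the time.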

Unfortunately sometimes we might not want to pick training data randomly…

Let’s look at another problem (this is actually my typical ML interview question): web page classification.

Imagine we are given the challenge of creating a homepage classifier. Basically, imagine some search engine is still doing web crawling (so 20th century) and the challenge is to predict if a web page is a homepage (of a person or a business, etc., such as edition.cnn.com or facebook.com), as opposed to, say, a specific article on CNN or the privacy policy of Facebook. We can imagine keywords on the page, URL words and many other features.

So we can pretend we have lots of labeled training data and each URL we crawled is a datapoint and we have lots of features.

We put it in and run the system and get our model.

Let’s take a step back… what is the prior probability? Why do we care? Let’s just say for the sake of argument that the prior probability is 1%, meaning that of 100 totally random web pages I expect one to be a homepage, and 99 to, well, not be homepages.

Now if I don’t know what is in my black box, it might just do the same thing as above and always say NO. That would actually end up with 99% overall accuracy, but it would be a disaster as a product and, unlike the rat trying to get food, would not be useful at all…
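Here is that trap as a sketch, using 10,000 hypothetical pages at the 1% prior assumed above. The page counts are my own illustration.

```python
# 10,000 hypothetical pages with a 1% homepage prior.
labels = [True] * 100 + [False] * 9_900
predictions = [False] * len(labels)  # the "always say NO" model

correct = sum(p == y for p, y in zip(predictions, labels))
homepages_found = sum(p and y for p, y in zip(predictions, labels))

accuracy = correct / len(labels)
recall = homepages_found / sum(labels)  # share of real homepages we caught

print(accuracy, recall)
```

The accuracy is 99%, but the model never finds a single homepage.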

So now let’s say that because we know a homepage is a rare event, we instead decide to train it with half homepages and half non-homepages (this is actually not necessarily a bad thing to do, but here is where it is desirable to know something about the inside of the black box).

We run our learner and BAM out comes a model…

Let’s actually play a cool thought experiment and pretend that the model consists of an oracle (an AI term for an all-knowing thing) that always knows the right answer (we will have future blogs about judge agreement and clarity of defining the problem). But our oracle is sneaky and lies exactly 2% of the time. The probability of lying is equal regardless of whether it is a homepage or not, and the oracle has infinite memory so the same page will always get the same answer (sorry, no cheating by feeding back the same pages twice).

Now we take this very accurate model - which ironically is less accurate overall than always saying NO - and we want to test it.

So, like very good practitioners of ML, we create several experiments to determine how good our model is.

We select 1000 homepages (none of which were in the training set) and 1000 non-homepages (also none of which were in the training set), and we run them through the model…

What do we get? Well, if the oracle lies 2% of the time, then:

- Experiment #1 would have 980 homepages correctly labeled and 20 errors
- Experiment #2 would also have 980 correct (non-homepage) and 20 errors

This is a very good result in that we got 98% precision and recall. Wow, our mothers would be so proud…
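For the record, this is how those two numbers fall out of the experiment counts. The variable names are mine; the counts are the ones above.

```python
# Counts from the two experiments above.
true_positives = 980   # homepages labeled as homepages
false_negatives = 20   # homepages the oracle lied about
true_negatives = 980   # non-homepages labeled as non-homepages
false_positives = 20   # non-homepages the oracle lied about

# Precision: of everything labeled "homepage", how much really was one?
precision = true_positives / (true_positives + false_positives)
# Recall: of all real homepages, how many did we label correctly?
recall = true_positives / (true_positives + false_negatives)

print(precision, recall)
```

Notice that the precision here depends on the 50/50 mix of this particular test set, because the number of false positives grows with the number of non-homepages you test on.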

Okay… now, let’s see what happens when we apply the above model to the real world as part of a product:

Let’s pretend we worked for a search engine and a product manager heard about our super-accurate classifier that got 98% of web pages right… They had the brilliant idea that whenever a user does a search, homepages in the results would have a red box drawn around them to help the searcher identify that this is a homepage.

Let’s simplify this and say the following:

1. This only applies for “regular old web search” when there is a flat list of blue links
2. Given all queries, the probability of any shown result being a homepage is the same as its normal prior, i.e. 1%

I know that unfortunately today no one does 10 blue links any more, and it is very likely #2 would be false too, but let’s go with this for simplicity.

So given your amazing results and the product manager’s product definition the company goes ahead and codes your model into their search system and draws red-boxes around every page classified as a homepage.

What happens?

So take a moment and ask yourself what would happen for a search given the above model (2% lying Oracle) and a 1% homepage prior?

This can get tricky…

But first let me say that were this to be a real company, the product manager and the engineer who made that model would probably have been fired, and the company would have been very badly embarrassed.

So now let’s step through this. A user does a search and a result is selected. This result has a 1% chance of being a homepage and a 99% chance of not being a homepage.

- If the page is not a homepage, then there is a 98% chance the system will not draw a red box (not label it as a homepage) and a 2% chance it will incorrectly draw a red box.
- If the page is a homepage, reverse the above: a 98% chance it will correctly draw a red box, and a 2% chance of no red box.

Now… here is where the prior comes in… If the probability of a homepage were very high, then this would be a very good system since it would be right most of the time… but since homepages are very rare, there is a very large number of errors where a page is not a homepage but is labeled as one. Let’s do the math.

Q: What is the probability that if a red box is drawn the page is actually a homepage?

If you remember from your math classes, we calculate all the cases and then from them determine the aggregate answer.
So the probability that a red box is correct is the probability that a page is a homepage given that it was labeled as a homepage, or $P(\text{is\_homepage} \mid \text{labeled\_homepage})$.

Now you can solve this formally or you can just try to estimate it, but the basic equation is:

$$P(\text{is\_homepage} \mid \text{labeled\_homepage}) = P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$

- $P(A)$ = probability the page is a homepage = 0.01
- $P(B \mid A)$ = probability it is labeled a homepage given it is a homepage = 0.98
- $P(B)$ = probability it is labeled a homepage = $P(\text{is\_homepage}) \times 0.98 + P(\text{not\_homepage}) \times 0.02 = 0.01 \times 0.98 + 0.99 \times 0.02 = 0.0296$

So… the probability the red box is correct is $0.01 \times 0.98 / 0.0296$, which is about 33%, or you are WRONG about 2/3 of the time! Someone should get fired for that error if it launched.
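The same calculation as a small Python function, in case you want to plug in your own numbers. The function and parameter names are mine; the inputs are the 1% prior and the 2% lying oracle from above.

```python
def probability_red_box_is_correct(prior, hit_rate, false_alarm_rate):
    """Bayes' rule: P(is homepage | labeled homepage).

    prior            -- P(is homepage)
    hit_rate         -- P(labeled homepage | is homepage)
    false_alarm_rate -- P(labeled homepage | not homepage)
    """
    labeled = prior * hit_rate + (1 - prior) * false_alarm_rate
    return prior * hit_rate / labeled

result = probability_red_box_is_correct(prior=0.01, hit_rate=0.98,
                                        false_alarm_rate=0.02)
print(f"{result:.0%}")  # about 33%
```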

If the prior were instead, say, 50% (like a coin toss), then the math would give you $P(A) = 0.5$, $P(B \mid A) = 0.98$ and $P(B) = 0.5 \times 0.98 + 0.5 \times 0.02 = 0.5$,
so $P(A \mid B) = 0.5 \times 0.98 / 0.5$, which is 98%… a very good result for a product.

So what are the lessons here? Prior probabilities are critical in all aspects of applied machine learning:

- When selecting your training points
- When building your model
- When testing your system

The most important points: Understand the test set and be sure that the prior probabilities are considered when testing and evaluating. A system that is expected to return true infrequently should not be tested using only separate positive and negative sets.
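One way to act on this is to test on a sample that has the same mix as production. Below is a rough simulation of that idea, assuming the 1% prior and the 2% lying oracle from this post; the sample size and random seed are arbitrary choices of mine.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
PRIOR = 0.01             # share of homepages among shown results
LIE_RATE = 0.02          # how often the oracle gives the wrong label

red_boxes = 0
correct_red_boxes = 0
for _ in range(200_000):
    is_homepage = rng.random() < PRIOR
    oracle_lies = rng.random() < LIE_RATE
    # A lie flips the true label; otherwise the label is the truth.
    labeled_homepage = is_homepage != oracle_lies
    if labeled_homepage:
        red_boxes += 1
        correct_red_boxes += is_homepage

precision = correct_red_boxes / red_boxes
print(f"{red_boxes} red boxes, {precision:.0%} of them on real homepages")
```

On a test set with the real-world mix, the precision should land near one third instead of 98%, which is the number the product team needed to see before launch.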

The actual product you are building is what determines success - a 98% accuracy overall on some test set does not guarantee your product will be any good…