Reveal Help Center

Predictive Coding 101

This document is an introduction to Brainspace’s predictive coding. This document should help you understand how predictive coding can help you, and what kinds of things you’ll have to do to make predictive coding work.

This document should also help you better understand predictive coding in general, and why Brainspace’s predictive coding asks you to do certain things.

Why Use Predictive Coding

People use predictive coding in legal cases so they can review fewer documents, and still meet their legal responsibility for making a reasonable effort to produce all of the responsive documents in a court case.

The basic idea behind predictive coding is this promise: starting with 1,000,000 documents, you review fewer than 10,000 of them to train the predictive coder, and the coder identifies 100,000 to 200,000 documents that, if you review them all, are likely to contain most of the responsive documents in the set.

Predictive Coding in Context

If you are considering using predictive coding you are probably either the defendant in a legal case or someone helping a defendant in a legal case. The defendant has been asked to produce documents relevant to the case.

Finding the initial set of documents that are candidates to be produced probably went something like this: important people (custodians) and date ranges were identified for the case. For every custodian, documents from the specified date ranges were gathered. These documents could include emails, files found on hard drives, or even paper documents found in filing cabinets that were turned into electronic documents using optical character recognition (OCR).

After collecting all of the documents, you might have hundreds-of-thousands or millions of documents to sift through.

As the defendant in the case you can’t simply grab millions of documents, send them to the plaintiff, and let the plaintiff sort them out. Instead you need to review the documents to figure out the following:

Important

Is the document relevant to the case? You should only produce relevant documents to the plaintiff.

Is the document privileged communication? If a document is privileged, you should not produce it for the plaintiff.

The gold standard for determining which documents should be produced is to review every document. With that review, a person can decide if the document is relevant, and if the document is privileged. Every document that is relevant and not privileged can then be produced and made available to the plaintiff.

The problem with reviewing every document is that it is expensive and time consuming. If you have 10,000-100,000 documents, full review may not be a big deal. If you have 1,000,000-10,000,000 documents, full review can be prohibitively expensive.

What Predictive Coding Offers

The goal of predictive coding is to let you review a much smaller number of documents and be able to say that you have some level of certainty that you have produced some percentage of the relevant documents. For instance, you might pick a level of review where you are 95% certain that you have produced at least 75% of the relevant documents.

Once you have decided to use predictive coding you are no longer trying to produce 100% of the relevant documents. It is common in a legal case for counsels to agree that 100% review is prohibitively expensive, and that instead of reviewing everything, current best practices can be used to find a reasonable percentage of the relevant documents with a reasonable level of confidence at an acceptable cost.

Arguments can be made about how reliable human review is, and what the margin of errors are on a 100% reviewed case. Regardless, once you decide to use predictive coding you are using probabilities and statistics to attempt to produce a reasonable percentage of the responsive documents with a reasonable level of confidence. What levels are reasonable are driven by the cost of finding that percentage of relevant documents with that level of certainty in a particular set of documents.

Predictive Coding Statistics 101

I don’t really want to teach you statistics, but I do want to be able to talk about precision and recall. Since we’ll usually want both our precision and our recall to be pretty good, I’ll need to talk about f-measure as well.

Calculating precision, recall, and the f-measure is easy if you have already reviewed every document in your dataset and know, not only how many relevant documents there are, but also exactly which documents are relevant. In the real world you don’t know any of this up front, so you’ll also need to learn about how you can create and use a control set to help you make predictions about recall and precision when you don’t know what the real recall and precision are.

Finally, your goal is to reach some acceptable level of certainty that you have found at least some percentage of the relevant documents in your dataset and produced them. Since we are using predictive coding in order to be reasonably certain that we have found a reasonable portion of the relevant documents, I’ll finish up by discussing how control sets can be used to determine a depth for recall.

Predictive Coding Terms

Richness

The easiest way to understand richness, precision and recall is to use an example. First, here are our documents:

100 documents with 15 green documents we are trying to find.

We have 100 documents. The green documents are the 15 documents we are looking for out of the 100 documents. Since 15 out of 100 documents are documents we are looking for, the richness of this dataset is 15%.

Now we are going to bring in a steam shovel, and we’ll try to grab as many green documents as we can with a single swipe of the steam shovel’s bucket.

A steam shovel scoops up 4 documents

We used a really small steam shovel and its bucket managed to scoop up 4 documents. How good of a job did it do? To decide, let’s consider precision and recall.

Precision

Precision is a way to look at what is in the steam shovel’s bucket and decide how precise the scooping was. Since we are looking for green documents, if the steam shovel’s bucket only has green documents in it, we’ll say the precision is 1.0. If the bucket only has white documents in it, the precision is 0.0. If there is a mix of white and green documents, the precision will be the fraction of documents in the bucket that are green. In this case there are 4 documents in the bucket and 3 of them are green, so the precision is 0.75. In general the formula for precision is:

precision = (number of green documents in the bucket) / (total number of documents in the bucket)

The formula above just says count the number of documents you were looking for that are in the bucket, and divide by the total number of documents in the bucket.

The precision of 0.75 or 75% is pretty good. We did a pretty good job of trying to only scoop up green documents, but just by eyeballing the results it looks like we left a lot of documents behind.
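
Written out as a quick calculation (a minimal Python sketch; the function name is just for illustration):

```python
def precision(green_in_bucket, total_in_bucket):
    """Fraction of the scooped-up documents that are ones we were looking for."""
    return green_in_bucket / total_in_bucket

# 4 documents in the bucket, 3 of them green:
print(precision(3, 4))  # 0.75
```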

Recall

Next we’ll calculate the recall. Where precision told us how good the documents in the steam shovel’s bucket are, recall will tell us, out of all of the green documents, what portion we scooped up in the bucket. In this case we already know there is a total of 15 green documents, and we were able to scoop up 3 green documents. Scooping up 3 out of 15 documents is scooping up one fifth of the green documents, which means our recall is 0.2. The formula for recall is:

recall = (number of green documents in the bucket) / (total number of green documents)

The formula above says count the number of documents you are looking for that are in the bucket, and divide by the total number of documents you are looking for.
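
As a quick Python sketch (again, the function name is just for illustration):

```python
def recall(green_in_bucket, total_green):
    """Fraction of all the documents we were looking for that ended up in the bucket."""
    return green_in_bucket / total_green

# 3 of the 15 green documents were scooped up:
print(recall(3, 15))  # 0.2
```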

F-measure

Our precision was 75%, which is pretty good, and our recall was 20%, which isn’t so good. Often when we improve one of these measures, the other gets worse.

For example, if we carefully scooped up a single document in our bucket and made sure that it was a green document, our precision would go up to 100% (there are only green documents in the bucket), while our recall would go down to 6.67% (we found only one of the green documents).

If we want to have great recall, instead of using the small steam shovel and its tiny bucket, we could use a giant bulldozer and scoop up all of the documents. If we get all of the documents, then we know we got all of the green documents and our recall would go up to 100%, but our precision would go down to 15% since only 15 of the 100 documents we returned were green documents.

It would be nice if we had a single number that told us how well we did at scooping up documents that somehow took into account both precision and recall. We could try just averaging the two numbers, but if we did that our two extreme examples would score too high. 100% precision and 6.67% recall averages to 53.3%. 100% recall and 15% precision averages to 57.5%. In both cases these scores are probably too high.

Instead of calculating the average, for the F-measure we’ll use the harmonic mean. In this case we’ll only look at equally balancing the weight we give recall and precision. This is the more specific and simpler F1 score. The formula for F1 score is:

F1 = 2 × (precision × recall) / (precision + recall)

Notice that if precision is equal to recall we get the exact same results as we would if we averaged the two. When the two scores are very different, the results are very different.

In our example where precision is a perfect 1.0 and recall was 0.067, the F1 score is 0.13, much lower than the 0.53 average. Similarly when recall was a perfect 1.0 and precision was 0.15, the F1 score is 0.26, much lower than the 0.58 average.

Going back to our original example that was more balanced with a precision of 0.75 and recall of 0.2, the F1 score is 0.32, which outscores either of the more lopsided examples.

F1 is a score that rewards a balance between precision and recall instead of rewarding maximizing one at the other’s expense.
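
A small Python sketch that reproduces the three F1 scores from the examples above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(1.0, 0.067), 2))  # 0.13 -- one carefully scooped green document
print(round(f1(0.15, 1.0), 2))   # 0.26 -- the bulldozer that scoops everything
print(round(f1(0.75, 0.2), 2))   # 0.32 -- the original steam shovel scoop
```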

Control Set

Up to now all of our measurements assumed that you know how many green documents (the documents you are looking for) are in the document set. In the real world you don’t know how many green documents there are.

If you could just go out and review every document, we wouldn’t be talking about predictive coding. Let's assume that the document set we are working with is too large to review every document. How can you evaluate your recall and precision without reviewing every document? The answer is that you can use a control set to estimate your recall and precision. The goal is to select a control set that is significantly smaller than the full document set, and is still large enough to provide good enough estimates of recall and precision.

For now we won’t worry about what makes a control set “good enough”, and will simply go through selecting a control set for our example.

Randomly selecting a control set

In the randomly selected 10 documents control set, 2 documents were green. After reviewing every document in the control set we would guess that 20% of our document population is green documents. Since we can see every document above, we know that in reality 15% of the documents are green.
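
The richness estimate is just the fraction of the control set that came back green. A minimal sketch of the arithmetic:

```python
# The randomly drawn 10-document control set contained 2 green documents.
control_set_green = 2
control_set_size = 10

estimated_richness = control_set_green / control_set_size
print(estimated_richness)  # 0.2 -- versus the true richness of 0.15
```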

Once we have a control set, we can use the control set to evaluate recall and precision as if the control set was the entire document population.

When we pick the size of the control set we will be balancing how much it costs to review the entire control set against how accurately we can estimate recall and precision using that control set. Larger control sets are more accurate, and smaller control sets are cheaper to review.

Depth for Recall

Depth for recall is essentially the cost-of-review metric. What depth for recall tells us is how many documents do we have to review to be confident that we found our target level of recall. The lower our depth for recall, the fewer documents we need to review.

Predictive coding is all about creating a classifier that can predict how likely it is that a document is responsive or not. If you can use your classifier’s predictions to order the documents from most likely to be responsive to least likely to be responsive, depth for recall tells you how far down that list you have to go to find your target recall.

In our example with 100 documents, 15 of which are responsive (green), if our goal is to find at least 80% of the responsive documents, we need to find at least 12 documents.

In the following examples we will use a classifier to order the documents so that if we read from left to right and top to bottom, we are reading from the document predicted most likely to be responsive to the document predicted least likely to be responsive.

Random classifier shuffles documents

The above shows the result of a random classifier (a bad way to classify). If the reality is that the results of the classifier are indistinguishable from a random shuffle of the documents, then the expected depth for recall (DFR) for 80% recall is 80%. In this case we found the 12th green document when we read the 79th document in the list, so the actual DFR is 79%.

While random classification is bad, the classifier could be worse.

Worst classifier possible

The diagram above shows the result of using the worst classifier possible. What the classifier said was most likely to be responsive was actually least likely to be responsive and vice versa. Here we don’t find the 12th responsive document until we read the 97th document. The DFR in this case is 97%.

Perfect classifier

The final diagram shows what a perfect classifier can do. Here, after reading the 12th document we’ve found the 12th responsive document. This gives us a perfect 12% DFR for getting 80% recall on a 15% richness dataset.

In summary, if your goal is at least 80% recall on a dataset with 15% richness, the worst DFR possible is 97%. Just randomly shuffling your documents gets you an expected DFR of 80%, and the best possible DFR is 12%.

If you are going to review all documents that you produce, the DFR is a measure of how many documents you have to review given the classifier you are using. If you are going to produce everything the classifier says you should produce, then DFR is a measure of how many documents you will produce.

These examples showed calculating the actual DFR since we know exactly how many responsive (green) documents are in our dataset. In the real world we’ll use the control set to calculate an estimated DFR.
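
The DFR walk-through above can be sketched in Python. This is a minimal illustration, not Brainspace's implementation; the function name and list encoding are assumptions:

```python
import math

def depth_for_recall(ranked_is_green, target_recall):
    """Fraction of the ranked list we must review to hit the target recall.

    ranked_is_green: booleans ordered from the document predicted most likely
    to be responsive down to the document predicted least likely.
    """
    total_green = sum(ranked_is_green)
    needed = math.ceil(total_green * target_recall - 1e-9)  # guard float error
    found = 0
    for position, is_green in enumerate(ranked_is_green, start=1):
        found += is_green
        if found >= needed:
            return position / len(ranked_is_green)
    return 1.0

# The 100-document, 15%-richness example with an 80% recall target:
perfect = [True] * 15 + [False] * 85  # perfect classifier ranks green first
worst = [False] * 85 + [True] * 15    # worst classifier ranks green last
print(depth_for_recall(perfect, 0.8))  # 0.12
print(depth_for_recall(worst, 0.8))    # 0.97
```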

Getting Started: Creating a Control Set

A control set is a random sample of your dataset. Once that random sample has been fully reviewed, it can be used to estimate richness, recall, precision, and depth for recall (DFR). How good these estimates are depends on the size of the control set.

Your goal is probably something like “I need to be 95% certain that I have produced at least 80% of the responsive documents from this collection.”

I wish that there was an easy way to tell you how large your control set needs to be to get some level of recall with some level of certainty. What makes this hard is that the required size of your control set depends on the richness of your document set. This leaves us with a chicken-and-the-egg problem: you don’t know how big your control set needs to be until you draw a control set.

Control Set Dirty Secret: For Estimating Recall, Only the Green Documents Count

Let’s say you want to be 95% confident that when you make estimates with your control set, those estimates are within +/- 2.5% of the actual values. You can plug those numbers into an online sample size calculator like http://www.raosoft.com/samplesize.html, add in information like how many documents are in the full population (let’s say 1,000,000) and what your expected richness is (let’s say 50%). The calculator will then spit out a number: 1535.

Getting a sample with 1535 documents doesn’t sound too bad. You can use that sample to estimate richness of your dataset, and even estimate the precision of using a classifier or even a steam shovel to select some documents from that control set. For these estimates the 95% confidence and 2.5% margin of error hold.
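
For the curious, here is a sketch of the standard formula a sample size calculator like Raosoft uses: n0 = z² × p × (1 − p) / e², shrunk by the finite-population correction n = n0 / (1 + (n0 − 1) / N). The z-score 1.96 corresponds to 95% confidence. This is an assumed reconstruction of the calculator's math, not its actual code:

```python
import math

def sample_size(z, margin, p=0.5, population=None):
    """Sample size for a given z-score, margin of error, and response distribution."""
    n0 = z * z * p * (1 - p) / (margin * margin)
    if population is not None:
        # Finite population correction.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size(1.96, 0.025, population=1_000_000))  # 1535, matching the calculator
print(sample_size(1.96, 0.025))                        # 1537 for a huge population
print(sample_size(1.96, 0.05))                         # 385 for a 5% margin of error
```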

Now what happens if you use this same control set to estimate recall? You might think you will still be 95% confident with a 2.5% margin of error, and you would be wrong. The problem is that recall only cares about the relevant (green) documents.

Earlier we talked about what recall means and gave the following formula for calculating it:

recall = (number of green documents in the bucket) / (total number of green documents)

When you measure recall, you are simply comparing the number of green documents scooped up in a bucket to the total number of green documents in your dataset. When you use a control set to estimate how many green documents will be scooped up by your bucket (or really by the classifier you are going to train) the only thing that matters for the quality of that estimate is how many green documents are in the control set. Adding more and more white documents tells you nothing more about how many green documents you are going to find.

If you want a control set that will give you 95% certainty with a 2.5% margin of error for recall calculations, your control set will have to have 1535 green documents in it.

In an ideal world, if you were drawing a control set for estimating recall, you would randomly select responsive documents to get your 1535 green documents. Unfortunately you don’t already know which documents are responsive. Instead of picking 1535 responsive documents, you have to pick enough documents at random so that your randomly selected control set has the 1535 documents you need.

To figure out how many documents you need to pick at random to get 1535 responsive documents, you need to know your richness. For instance, if your richness is 10%, you will need to randomly select 15,350 documents to have a good chance at finding 1535 responsive documents. If the richness of your document set is 10%, then you expect the richness of your random sample to be 10% as well. If you have a random sample of 15,350 documents and 10% of them are responsive, you will get 1535 responsive examples.
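
The arithmetic above as a quick sketch (the function name is just for illustration):

```python
import math

def documents_to_review(responsive_needed, richness):
    """Expected number of random documents to review to collect the
    required number of responsive documents, given estimated richness."""
    return math.ceil(responsive_needed / richness)

print(documents_to_review(1535, 0.10))  # 15350 documents at 10% richness
print(documents_to_review(1535, 0.01))  # 153500 documents at 1% richness
```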

Wait a second, if my richness is 10%, doesn’t that mean I could plug 10% into the response distribution of my sample size calculator and get a much smaller number?

Unfortunately, no. That response distribution is the distribution of answers to the question you are estimating. If we were estimating richness, and you were pretty certain your richness was 10%, you could plug that number in and get a smaller sample size, but now you have a chicken-and-the-egg problem of saying you have a good estimate for richness before drawing a control set to estimate richness.

The recall control set is used to estimate recall. That means the response distribution field is the recall percentage: in other words, how many of the responsive documents in the recall control set were found by the classifier (scooped up in a bucket).

Okay, why don’t I just target a really high recall like 90%, and get a smaller control set?

This might sound like a good idea. If you put 90% in as the response distribution in a sample size calculator instead of 50%, instead of 1535 responsive documents, you now only need 553 responsive documents. At 10% richness, that saves you from reviewing almost 10,000 documents. The problem is that you will only get the 95% certainty and 2.5% margin of error if you target 90% recall. At the end of training your classifier you might find that targeting 90% recall means producing 50% or more of the document population. If your rule is to review everything you produce and you started with 1,000,000 documents, you will have to review 500,000 documents or more.

The total cost of predictive review includes the cost of the following items:

  • The cost of reviewing the control set.

  • The cost of reviewing training documents.

  • The cost of reviewing and/or producing the documents found using predictive coding.

If you plan on reviewing all documents that are produced, the third item on this list will probably be the most expensive. This cost is also the hardest to predict since it depends on how well the classifier you trained can identify responsive documents. The quality of the classifier will vary from document set to document set and case to case. The upshot is that at the end, you will want to be able to choose the recall level to target based on how many documents will be selected by the classifier at that level. If you want the freedom to control your final cost by changing the target recall, you will need a control set that supports adequate quality at that recall. In other words, stick with the 50% response distribution.

Back to the Chicken and the Egg: We Have to Draw a Control Set Before We Know How Big It Should Be

The quality of our recall estimation will only be as good as the number of responsive documents in the control set. For estimating recall, it is as if those responsive documents are the only documents in the control set. So how do we decide how large the control set should be before we have a control set?

Discovery’s answer is to draw a small control set to get an initial estimate of richness, and then keep adding to that control set until you have enough responsive documents.

Here is the process. Discovery has you first draw a small random sample of around 400 documents. Once this small sample has been coded, Discovery can estimate the richness of your document set and use that to estimate how many documents need to be in the control set for your desired level of recall, certainty, and margin of error.

Discovery might tell you that you need 1000 more documents in your control set. Keep in mind that this is an estimate. Once more documents have been randomly selected, the richness estimate will have improved and the number of documents you need in your control set might increase. It can take a few rounds of adding more randomly selected documents until you finally have enough documents.
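
The rounds described above can be simulated. This is a hypothetical sketch of the process, not Discovery's actual algorithm; the function, its parameters, and the round sizing are all assumptions for illustration:

```python
import random

def grow_control_set(documents, responsive_needed, seed_size, rng):
    """Draw a seed sample, then keep extending the control set until it
    contains the required number of responsive documents."""
    indexes = list(range(len(documents)))
    rng.shuffle(indexes)                  # random selection order
    control = indexes[:seed_size]
    remaining = indexes[seed_size:]
    while sum(documents[i] for i in control) < responsive_needed and remaining:
        responsive = sum(documents[i] for i in control)
        richness = max(responsive, 1) / len(control)   # avoid divide-by-zero
        shortfall = responsive_needed - responsive
        extra = max(1, round(shortfall / richness))    # estimated docs needed
        control += remaining[:extra]
        remaining = remaining[extra:]
    return control

rng = random.Random(42)
docs = [True] * 15 + [False] * 85  # the 100-document, 15%-richness example
control = grow_control_set(docs, responsive_needed=6, seed_size=10, rng=rng)
print(len(control), sum(docs[i] for i in control))
```

Because the draws are random, the final control set size varies from run to run, just as the number of rounds varies in practice.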

As a smaller example, let's look at the control set drawn in the section “Predictive Coding Statistics 101”. There we randomly picked 10 documents from the full set of 100 documents to be the control set:

Randomly select 10 documents for a control set.

This gives us an initial control set that looks like the following:

Initial control set.

This control set has two responsive (green) documents. If the recall certainty and margin of error require at least 6 responsive documents, we will have to extend this control set. Based on the initial control set, the richness is estimated at 20%, and to get 4 more responsive examples we need to randomly select 20 more documents.

Note

A control set MUST have at least 6 responsive documents. If the user decides to skip the control set warning, they will still have to code at least 6 responsive documents. Otherwise, when they run the next training round, the system displays an error saying `Unable to process the dataset`. This is because the system is trying to do a depth for recall calculation, which requires at least 6 responsive documents in the control set.

We randomly select 20 additional documents:

The original 10 random selection plus 20 additional random selections.

And now we have the following control set with 30 documents in it.

A 30 document control set with 5 responsive documents.

The new control set has more responsive documents, but we said that we needed the control set to have at least 6 responsive documents and we only found five. Because the control set is larger we now have a more accurate richness estimate. Before we thought the richness was 20%. With the larger sample, the richness appears to be 17%. Based on the new richness estimate, to get a control set with at least 6 responsive documents in it, we will need to add 6 more documents to the control set.

In this case we were unlucky and none of the 6 new random documents were responsive:

Control set selections extended to 36 documents.

I’ll skip showing the new control set. Now that the control set has more documents, the richness estimate has improved again and is 14%. Based on this new estimate we need to randomly select 7 more documents to find 1 more responsive document.

This time our random selection finds another responsive document and our control set is complete.

43 documents randomly selected to be in the control set.

Finally we have a control set with 6 positive examples:

A 43 document control set with 6 relevant documents.

Adding more documents to the control set improved the richness estimate some more. While the richness was bumped up slightly, it still rounds to 14%.

Each time Discovery tells you that your control set needs more documents, you are adding to what is already in your control set. You aren’t losing what you reviewed in previous attempts to come up with a control set.

In this example the control set contains 43% of the entire document set. This is a quirk of a small example where there are only 15 responsive documents and 6 of the 15 had to be randomly selected to be part of the control set. In the real world you might need a control set with around 200 responsive documents out of a document set that has at least 10,000 responsive documents. In the real world control sets are much smaller than the entire document set.

Control Set Too Large: The Low Richness Problem

Since the certainty and margin of error supported by a control set depends on having a minimum number of relevant examples, the size of your control set depends on the richness of your document set. If responsive documents make up less than 1% of your document population, you may find Discovery asking you to have a control set with 10,000-100,000 documents or more. When the richness is very low you have to randomly select lots of documents in order to get the number of responsive documents that you need.

Right now we can offer you two suggestions for dealing with low richness datasets:

  • Reduce the certainty and/or increase the margin of error supported by your control set.

  • Cull your dataset to get rid of unresponsive documents and leave you with a richer document population.

For a low richness dataset, it can be prohibitively expensive to show that you achieved what is normally a reasonable certainty and margin of error for your recall. Being less certain and/or accepting a larger margin of error will reduce the minimum number of relevant examples you need in your control set, and reduce the total size of your control set.

If reducing certainty and/or increasing the margin of error on your recall estimate is not acceptable, then you need to see if there are simple ways to cull out large numbers of unresponsive documents.

Again, it would be nice if the math didn’t depend on the richness of your dataset, but it does. Document sets with very low richness will need very large control sets for typical certainty and margin of error levels.

What do we do if we can’t cull enough to fix our richness?

When we created a control set, we picked how certain we wanted to be, probably 95% certain, and we picked what margin of error we wanted, maybe +/- 2.5%.

If we plugged 95% certainty and a 2.5% margin of error into a sample size calculator, even for huge population sizes we would only need 1537 documents in our control set.

But now we have learned the dirty secret about control sets for estimating recall: only responsive documents count as members of the recall control set. We are using predictive coding to avoid reviewing every document in our dataset, and this means we don’t know which documents are responsive and which aren’t. So instead of randomly selecting 1537 documents from the population of responsive documents, we have to randomly select enough documents from the entire document population so our control set will have 1537 responsive documents.

As we’ve discussed, if our document set has 10% richness, we will have to draw and review 15,370 documents to expect that we’ll find 1537 responsive documents. At 1% richness we have to draw and review 153,700 documents, and at 0.1% richness we have to draw and review 1,537,000. No matter what our review budget is, there is a level of richness that is low enough that the number of documents we have to draw for the desired confidence and margin of error will be too large: we simply won’t have the time or resources necessary to review that many documents.

As we’ve just discussed, we could cull the dataset to improve the richness, though there are limits on what culling can do. If we can perfectly cull half the documents from our dataset so we only remove non-responsive documents, we’ll double the richness. If we need a 10x improvement to richness to get a reasonable control set size, we’ll have to perfectly cull 90% of the dataset. For a 100x improvement, we have to perfectly cull 99% of the dataset. Assuming that we have already taken reasonable measures to gather documents about a case, we probably can’t perfectly cull 90% of the documents from that set, let alone 99%. If we can’t cull enough to get a reasonable richness, then we’ll have to relax our certainty or our error bounds so we can afford to draw and review the control set.

How certain is certain enough? And what does margin of error have to do with it?

At the end of the predictive coding process we are going to produce some number of documents, and say something like “We are 95% certain that we have produced 80% of the responsive documents.” How does a control set that promised 95% certainty with a 2.5% margin of error translate into a level of certainty that you have produced some percentage of the responsive documents?

To figure this out, let’s take a look at three different control set sizes: 100, 385, and 1537 documents. If you have played around with a sample size calculator you may recognize 385 and 1537 as the control set sizes for 95% confidence with a 5% margin of error and 95% confidence with a 2.5% margin of error. With 100 documents you get 95% confidence with a margin of error that is slightly better than 10%. All of these control sets assume a 50/50 response distribution so that the margins of error are worst case. As we’ll find out, depending on the actual results, these margins could be better.

As we’ve talked about before, we had to review a lot more documents than 100, 385, or 1537 to come up with that number of responsive documents. If the richness of our dataset is 10%, 100 responsive documents are what we would expect if we reviewed 1,000 documents. This 100-document control set was picked to be affordable, so we can compare an affordable control set to a more accurate one.

When you select a control set size, you are placing a bound on the amount of error you are willing to accept and still keep your confidence level. For recall control sets, the actual amount of error will depend on the level of recall you choose to produce at.

For example, here is a graph of the actual confidence intervals supported by a 100 responsive document control set:

95% certain upper and lower bounds for a recall point estimate for a 100 documents control set.

Looking at this graph, if you choose to produce 50% recall, you are 95% certain that you are producing between 40% and 60% recall. If you want to be 95% certain that you are producing 80% recall, look on the graph for where the lower bound reaches 80%. You will have to produce what you estimate is slightly less than 90% recall to know that you are at least 95% certain that you at least produced 80%.
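
A sketch of the normal-approximation bounds behind a graph like this one: with k responsive documents in the control set and an estimated recall r, the interval is roughly r ± z × sqrt(r × (1 − r) / k). This is an assumed reconstruction for illustration, not the exact method behind the graphs:

```python
import math

def recall_bounds(estimated_recall, responsive_in_control_set, z=1.96):
    """95% confidence bounds (z=1.96) around an estimated recall."""
    half_width = z * math.sqrt(
        estimated_recall * (1 - estimated_recall) / responsive_in_control_set)
    return estimated_recall - half_width, estimated_recall + half_width

low, high = recall_bounds(0.5, 100)
print(round(low, 2), round(high, 2))  # 0.4 0.6 -- the 40%-60% band in the text
```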

If you have 1537 responsive documents in your control set, the error bounds are much narrower. Take a look at the following graph to see how much narrower:

95% certain upper and lower bounds for a recall point estimate for a 1537 document control set.

We can see that at any estimated recall the upper and lower bounds are so close to the estimate that we can barely distinguish them by eye.

Another way to narrow the error bounds is to reduce our confidence in the outcome. If we are willing to accept being only 70% certain in the results instead of 95% certain in the results, we get the following graph for a 100 document control set:

70% certain upper and lower bounds for a recall point estimate for a 100-document control set.

When we were 95% certain, we were saying that if we repeated this experiment 20 times, only 1 out of those twenty times would the observed recall be outside of the error bounds. Now that we are only 70% certain, 6 out of the twenty times we repeated this experiment, the observed recall would be outside of the error bounds. The large reduction in certainty only improved our error bounds at 50% recall to around 5% on either side.
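The trade-off between confidence level and margin of error can be checked with standard normal critical values (again a normal approximation):

```python
import math
from statistics import NormalDist

def margin_of_error(confidence: float, n_responsive: int,
                    estimate: float = 0.5) -> float:
    """Two-sided margin of error around a recall estimate at a given confidence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * math.sqrt(estimate * (1 - estimate) / n_responsive)

print(round(margin_of_error(0.95, 100), 3))   # ~0.098: about 10% either side
print(round(margin_of_error(0.70, 100), 3))   # ~0.052: only about 5% either side
```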

Control Set Creation Conclusions

Creating control sets can be confusing because of the chicken-and-the-egg problem of needing a control set to decide how big of a control set you need. Discovery will ask you to draw and code the first 400 or so documents you need for your initial control set.

Once this first set of documents is coded, Discovery will tell you how many more documents you need. Adding more documents to your control set will not only help you reach the certainty and margin of error you want for recall, it will also improve your estimate of richness.

Once you are done adding to your control set, the same control set can be used to estimate richness, recall, precision, and depth for recall.

When using a control set to predict recall, only the responsive documents in the control set count toward its effective size. In low richness document sets, you may have to review a large number of documents to find enough responsive documents for your control set.

The good news is that even with only 100 responsive documents in your control set, you can still find the 95% confidence level around your target recall. You will simply get a larger confidence interval.

Control Sets and Rolling Productions

Many e-discovery projects have rolling productions. While you might start predictive coding when you have 100,000 documents, by the end because of rolling productions you might have 1,000,000 documents. What does this do to the validity of the control set?

Your control set is valid if you can’t differentiate how you selected your control set from randomly selecting a QA set at the end of your predictive coding process. Randomly selecting a QA set at the end of your predictive coding process means that all rolling productions have happened, and any clawbacks have happened as well. The full candidate set of documents is now known, and you pick a random set of documents out of that full set.

What does this mean for rolling productions?

Let’s say that when you started with 100,000 documents you picked a random control set containing 1000 documents. Now, because you received additional documents along the way, you have a document set with 1,000,000 documents. If you picked a control set with 1000 documents out of a set of 1,000,000, what are the odds that you picked a control set that only has documents from the first 100,000 documents?

Each time you randomly pick a document, there is a 1 in 10 chance that the document came from the first 100,000 documents. Once you pick the second document, there is only a 1 in 100 chance that both documents were in the first 100,000 (1/10 * 1/10 = 1/100). By the time you have picked 9 documents randomly, there is only a 1 in 1,000,000,000 chance that all 9 documents were found in the first 100,000 documents. By the time you pick the 1000th document, the odds that all of those documents came from the first 100,000 documents is amazingly tiny.

Let's say that you started with 100,000 documents and at the end of your rolling productions you have 200,000 documents. Now for each document selected there is a 1 in 2 chance that it was picked from the first 100,000. In this case, by the time you pick the 30th random document there is less than a 1 in 1,000,000,000 chance that all thirty documents came from the first 100,000 documents.
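The arithmetic above can be checked directly; treating each pick as independent is a close approximation for large populations:

```python
def prob_all_from_original(original: int, total: int, n_picked: int) -> float:
    """Approximate chance that every randomly picked control set document
    came from the original subset of the population."""
    return (original / total) ** n_picked

print(prob_all_from_original(100_000, 1_000_000, 9))    # about 1e-09
print(prob_all_from_original(100_000, 200_000, 30))     # below 1e-09
```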

In other words, as soon as you have a rolling production your control set is biased and can no longer be used for making predictions about richness, recall, precision, or anything else about your data set.

Repairing versus Replacing

Adding more documents to your population after you have drawn your control set breaks the validity of your control set. At this point you can choose to either repair the control set or replace the control set.

Let’s go back to our example and say that when we started predictive coding, 20 out of the 100 documents had been delivered. We then selected a control set that gave us adequate coverage of those 20 documents:

20 documents with 6 control set documents circled.

Now 80 additional documents are added so we have a total of 100 documents:

20 + 80 documents with the biased control set documents circled.

Now we have to choose between repairing the control set and selecting a new control set.

Repair by Same Rate Sampling

To repair the control set we need to pick additional control set documents in a way that, if we were picking a completely new control set now, would be likely to have as many documents coming from the first 20 documents as the current control set has selected from the first 20 documents.

The current control set selected 6 control set documents out of the first 20 documents. This outcome is most likely if on average we pick 6 control set documents out of every 20 documents. Since we are adding 80 (4 x 20) more documents, we need to select 24 (4 x 6) additional control set documents out of those 80 documents:

20 + 80 documents with 6 + 24 documents circled to be part of the control set.

We removed the bias of having 6 of the first 20 documents in the control set by selecting additional documents to the control set until the control set was large enough that 6 out of the first 20 documents is expected.
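The repair arithmetic generalizes to any amount of population growth; a minimal sketch:

```python
def repair_additions(original_docs: int, original_control: int,
                     added_docs: int) -> int:
    """Number of extra control set documents to draw from newly added documents
    so the overall sampling rate matches the original control set's rate."""
    rate = original_control / original_docs
    return round(rate * added_docs)

# 6 control documents out of the first 20, then 80 documents arrive:
print(repair_additions(20, 6, 80))   # 24
```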

Repairing control sets is cost effective when the entire population will no more than double from the time you select your first control set to the time you receive your last rolling production. In the above example, where the document population more than doubled, the cost of repairing the control set by extending it was high.

Replace the Control Set

In a case where the document population will more than double, you are usually better off replacing the control set. Since the certainty and margin of error of a control set depend more on the number of positive examples in the control set than on its total size, you can review fewer documents by replacing the control set with a similarly sized control set selected over the entire final population.

100 documents with 6 + 6 documents in the control set.

In the above example a new 6-document control set was selected. The new control set had only 1 responsive (green) document in it, so the control set was extended by selecting 6 additional documents. After adding the additional documents, the control set had two responsive documents just like the original control set.

Rolling productions can add document sets with different richnesses than the original set. Just selecting a control set that is the same size as the one it is replacing does not guarantee that the resulting set will have the same number of responsive documents and the same statistical properties as the original control set. In the real world the control sets are larger and the results won’t vary as much unless the richness of the document population was significantly changed by the rolling productions.

In other words, the doubling in size of the control set seen in this example will be very rare in practice unless the rolling productions significantly reduced the richness of the document population.

Which is better, repairing or replacing?

The biggest difference between repairing and replacing is the total number of documents that must be reviewed for the control set, and the total cost of reviewing those documents.

One other difference is that if you repair, you can choose to repair your control set after each rolling production. This means you always have an up to date view of the richness of your document population and you always have a control set that can accurately estimate recall and precision for your entire document population.

If each rolling production is the same size as the original production of documents, then you will have to replace your control set less frequently than you receive rolling productions if you want to save on the total number of documents reviewed for the control set. If you don’t generate a new control set each time there is a rolling production, then for some amount of time you will be working with a control set that does not accurately estimate the richness of your document set and can’t accurately predict the recall and precision of a classifier’s performance on that document set.

In practice, you will usually decide how to handle rolling productions based on how much you expect your document population to grow. If you expect the document population to significantly more than double in size, then replacing the control set one or more times will probably cost less than repairing the control set each time.

If you don’t know how much your document population will grow, then we recommend you simply replace the control set each time the document population doubles. This way you keep your control set from getting too far out of sync with your document population to be useful, while at the same time avoiding an unacceptable increase in review costs.

Control Sets and Training Documents

Discovery uses a control set as a way to evaluate the predictive coding training. To keep the control set effective, it is important to avoid using knowledge of the control set to guide your selection of training documents.

A wrong way to do it

First let’s imagine the worst case possible: after drawing and reviewing a control set, you submit the control set as a training set. For any classification algorithm, training on the same documents you will be tested on makes measured performance look much better than it really is.

It’s like getting to see all the questions on a final exam before you study for the exam. If you know the exam questions, you only have to study the answers to those questions, instead of everything covered in class. If you know which documents will be used to test the classifier, you only have to train to recognize which of those documents are responsive or not, instead of trying to recognize what is and is not responsive in the entire document population.

A classifier trained on the same documents that will be used to test the classifier should pass the test with flying colors. Training with the control set is cheating, and means the control set is no longer a control.

How to tell if you are doing it right

Using the control set as a training set is wrong, so what is the right way to train classifiers and how can you tell if you are doing it right?

The best way to test would be to wait until all your training rounds are done and then draw a random set of documents to be the control. Then you could compare manually coding the after-the-fact control set to using your classifier to code that control set and see how well you perform.

The most important thing is that when selecting the random set to be the final exam for your classifier, you should completely ignore whether or not the document was used to train the classifier. You would select the control set as if you know nothing about how the classifier was trained and instead are coming up with a random sample representative of the entire population of documents.

Waiting until after you have done your training work is the best way to generate a fair test, but it also would mean that you would have no way to tell how you are doing after each step of training. A very important part of predictive coding is figuring out when the classifier is good enough and is ready for its final exam.

The next best thing would be if, after each round of training, you selected a random set of documents, coded it, and tested your classifier against it. Now you can see how the classifier is progressing, but you are paying a lot to review many times more documents than you would otherwise.

We want to be able to see how training is progressing, so we want to select a control set at the beginning instead of at the end. Reviewing a control set costs enough that you won’t want to do that over and over again, so the question is how can you select a control set at the beginning and keep it effective so it is still a fair test at the end of the training process.

The way to tell if the up-front control set is good enough is to ask “Given the way I have trained the classifier, is the control set as fair as blindly selecting a new control set right now?”

So do I just avoid training with control set documents?

Some people say that the way to keep the up-front control set usable as a control set is to make a rule that documents in the control set can’t be used as training examples.

That sounds reasonable, but let's test this against the gold standard: selecting the control set documents after all training has been performed completely blind to which documents were used for training and which weren’t.

If I select a control set of 1000 documents out of a population of 100,000 documents, I’ve selected a 1% set. That means I expect roughly 1% of the training documents to be in that control set. If you end up with 3000 documents used for training, you would expect roughly 30 of those documents to also be in the control set. Avoiding using control set documents for training is a kind of bias that does not reach the gold standard.

Worse, avoiding training with control set documents can give you a false sense of security that you are not cheating on the control set test when that is exactly what you are doing. What if, instead of training with the control set documents, you found documents that are as similar to the control set documents as you can find and trained with those? You aren’t training with control set documents, but you are still cheating on using that particular control set as a fair test of your training.

So avoiding using control set documents as training documents does not live up to the gold standard for what should be in the control set, and does not guarantee that you have avoided cheating on the control set test.

Let Discovery Select Training Documents

Discovery supports automated ways of selecting documents to train with. Discovery does not simply rely on random sampling for training. Discovery learns as much as it can about your entire document population as well as what you have learned through your training set up to any point. Discovery can then select documents balancing out what the classifier thinks it knows about your documents with how well you have covered different kinds of documents.

None of the information that Discovery uses to select training documents comes from the control set. As long as you let Discovery select your training documents, it doesn’t matter if the control set was selected yesterday or tomorrow. The training documents selected by Discovery were not influenced in any way by the control set, and the control set remains a fair test of the training results.

The Gray Area: Manually Selecting Training Documents

If you always let Discovery select training documents, you don’t have to worry about invalidating the control set. That’s the easy way out. But what happens if you manually select documents to train from? As your reviewers read more documents, they will start to learn what relevant documents look like, and might be able to find examples that no computer algorithm would think of.

So far there is nothing wrong with this example. But what if the reviewer who is finding relevant examples to train with is the same reviewer who reviewed the control set? How can you keep that reviewer from using the knowledge that they gained from reviewing the control set to find training examples?

What if the reviewer who reviewed the control set isn’t recommending training documents, but that reviewer asks another reviewer if they have found certain kinds of relevant documents yet?

Manually selecting training documents can be a great way to find relevant examples to train with, and it can be a way of inadvertently training to the control set. Remember, training to the control set means cheating on the control set test.

If you are afraid that your team has accidentally cheated on the control set test, or if someone accuses you of deliberately cheating to make your training results look better than they are there is a simple fix: draw a new control set.

You do have to pay the cost of reviewing that new control set. The good news is that the cost is predictable. The new control set will be about the same size as the old control set. You might get unlucky and have to select a few more documents to reach your statistical goals, but it will still be close.

This means that, given your own misgivings or given the concerns that someone else raises, you know what the cost will be to get a definitive answer. For the cost of reviewing a set of documents roughly the size of your current control set, you can create a brand new control set and know that there was no way that training documents were selected based on knowledge of what is in the control set.

Conclusions About Control Sets and Training Sets

Simple rules saying that you should not use control set documents for training do not meet the gold standard for a control set, nor do they guarantee that you have avoided cheating on the control set test.

The easiest thing to do is let Discovery select your training documents and then you are guaranteed that the control set did not influence the documents selected for training.

If you are afraid that some training documents were selected based on knowledge of the control set, you can always select a new control set. That is more review to pay for, and you can weigh the cost of reviewing another control set against the certainty of knowing that your control set lives up to the gold standard for control sets.

Seed Sets

Once you have a control set you can finally start training a classifier for predictive coding. Training a classifier means providing the classifier examples of responsive and non-responsive documents so the classifier can start learning how to tell the difference between the two. During each round, a big part of the work is deciding which documents to use for training.

The first training round is called the seed round. This is the trickiest round because you theoretically don’t know anything about what is in the dataset before this round. I emphasize theoretically because you just went through creating a control set. A lot of documents have been reviewed—enough that you have found a significant number of responsive documents. If you reviewed the control set documents you probably have a pretty good idea about what the document set looks like, and in particular what responsive documents look like in this document set.

These things you already know and have already learned from the control set are exactly the things you will have to ignore while creating your seed set if you want to keep the control set valid.

Here is what makes this round extra tricky. You have not started to train your classifier yet, and your classifier will learn more if it has both responsive and non-responsive documents to learn from. As a rule of thumb you want this set to include at least 5 to 10 responsive documents to make it more likely that the classifier sees some different examples of what responsive looks like. At the same time, you probably want to keep the total size of the seed set small. The easiest way to keep the seed set small is to use what you’ve learned from the control set to find some relevant examples.

Use Random or N-Seeds

Using knowledge gained from the control set you could easily create a small seed set with the 5 to 10 responsive examples, but you would invalidate your control set. If you invalidate your control set, you would have to randomly select and review a new control set, and that is probably a lot more review than is required for a larger-than-desired seed set.

To avoid invalidating your control set, Discovery offers you two methods for automatically selecting seed round training documents: random and n-seeds.

Random selection is exactly how the control set was selected. Now that you have a control set you have a good estimate of the richness of your document population. Based on this richness estimate you can pick a good number of documents for your seed set. For example, if you have 1% richness (1% of the documents are responsive), you would expect that you would have to randomly pick 500 documents to get at least 5 responsive examples. If you want to increase the odds of finding 5 responsive documents, it is a good idea to randomly select 600 documents and aim for finding 6 responsive documents.
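The odds of actually hitting the 5-responsive target can be checked with a binomial calculation, using the 1% richness and seed set sizes from the example above:

```python
from math import comb

def prob_at_least(k: int, n: int, richness: float) -> float:
    """Binomial probability of finding at least k responsive documents
    among n randomly picked documents."""
    return 1 - sum(comb(n, i) * richness**i * (1 - richness)**(n - i)
                   for i in range(k))

# At 1% richness, 500 picks hit >= 5 responsive only slightly more than half
# the time; 600 picks improve the odds noticeably.
p500 = prob_at_least(5, 500, 0.01)
p600 = prob_at_least(5, 600, 0.01)
```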

An alternative to randomly selecting documents is to use the n-seeds algorithm. The n-seeds algorithm is all about maximizing what you can learn about the entire population from a small sample of documents. Before getting into how n-seeds does it, the end result is that seed sets selected using n-seeds help you learn more about your dataset than randomly selected seed sets. N-seeds-selected seed sets train a classifier with better depth for recall and maximum F1 scores than random selection. The downside is that n-seeds takes longer to select the seed documents.

The short story is that if you can wait a few minutes, or for large cases an hour or two, in order to get a set of documents you can learn a lot more from, use n-seeds instead of random.

When you use n-seeds, pick as many documents as you would have if using random selection, and don’t worry about how many responsive examples you get.

How N-Seeds Does It

N-seeds consistently picks document sets that exhibit better learning metrics than randomly selected documents. How does it do that?

The full gory details for n-seeds are in a separate white paper. At a high level, n-seeds uses the following two-step process:

1. Pick a few candidates that are probabilistically likely to teach you something new about the document set.

2. Evaluate which of the candidates helps you learn the most about the document set, and pick that candidate as the next to be added to the seed set.

The n-seeds algorithm repeats these two steps over and over again, each time picking one more seed document to add to the seed set.

If there were time to do a perfect job of finding the documents that tell you the most about a document set, there would be no need for the first step of picking a few candidates. Instead every document would be a candidate, and step 2 would pick the best next document for the seed set.

Since evaluating every document is very expensive, n-seeds considers a small number of candidates picked in step 1 instead. The candidate documents are not picked purely at random. Instead every document is given a weight based on how different that document is from the document most like it in the seed set. The more different a document is from a seed set document, the more likely it is to be picked.

The end result is that at the end of a pass through these two steps, n-seeds picks a document to add to the seed set that is close to the best document to add to the seed set. The document that is added won’t be like any other document in the seed set, and will be like enough documents that aren’t in the seed set that you will learn more about the whole document population.
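As an illustration only, and not Brainspace’s actual implementation, the weighted candidate selection in step 1 might be sketched like this, where `distance` is an assumed dissimilarity function over document representations:

```python
import random

def pick_candidates(docs, seed_set, distance, k=5, rng=random):
    """Sample k candidate documents, weighting each unselected document by its
    distance to the nearest document already in the seed set, so documents
    unlike anything in the seed set are more likely to be considered."""
    pool = [d for d in docs if d not in seed_set]
    weights = [min(distance(d, s) for s in seed_set) for d in pool]
    return rng.choices(pool, weights=weights, k=k)
```

In this sketch a document identical to a seed document gets zero weight, and a document far from every seed document is picked often, which matches the behavior described above.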

Summary for Selecting Seed Round Documents

  • Don’t use the knowledge gained from reviewing control set documents to pick seed set documents.

  • Use the richness estimate from the control set to pick a seed set size that will give you at least 5 to 10 responsive examples.

  • Use n-seeds to maximize what you learn about the overall document set from your seed set.

Training Rounds

Why we are Training

Before getting into recommendations for how to train a classifier, it is important to keep in mind why we are doing this. The goal of Discovery’s predictive coding process is to help you reach a desired level of recall for a lower cost than you could without using predictive coding.

Your goal might be to find at least 90% of the responsive documents. If you didn’t have any tools to help you, you would have to review at least 90% of your document population to be confident that you found 90% of the responsive documents.

At this point you have selected and reviewed a control set. Based on the size of that control set, and more importantly how many responsive examples are in your control set, your control set supports a level of certainty and a margin of error on what your actual recall is.

The control set also gives you an estimate of the richness of your document set. If your entire document set has 100,000 documents, and the control set estimates 1% richness, then the estimate is that there are 1,000 responsive documents in the dataset. The control set lets you say that you have some certainty that there are 1,000 responsive documents in your dataset, plus or minus some margin of error.

After each training round Discovery will estimate the depth for recall that the latest classifier supports. If your recall goal is 90%, the depth for recall tells you how much of your document set you would have to review, using the classifier to prioritize your review, to find 90% of the responsive documents.

After the seed round the first depth for recall is calculated. As you go through more training rounds, the hope is that the depth for recall will get lower, and the number of documents you need to review will be significantly reduced.
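Depth for recall itself is easy to define given a ranked review order. Here is a sketch of the calculation; in practice Discovery estimates this from the control set rather than from full labels:

```python
import math

def depth_for_recall(ranked_responsive, target_recall=0.9):
    """Fraction of the ranked document list that must be reviewed, top score
    first, to find target_recall of all responsive documents.
    ranked_responsive: 1/0 (or True/False) per document, best score first."""
    needed = math.ceil(target_recall * sum(ranked_responsive))
    found = 0
    for depth, responsive in enumerate(ranked_responsive, start=1):
        found += responsive
        if found >= needed:
            return depth / len(ranked_responsive)
    return 1.0

# 4 responsive docs in 10; the 4th responsive doc sits at position 7,
# so 90% recall requires reviewing 70% of the list.
print(depth_for_recall([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], 0.9))   # 0.7
```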

Keep the Control Set Usable

Like the seed round, the goal of each training round is to teach the classifier as much as possible about what is and is not a responsive document, and like the seed round you need to avoid using knowledge gained from reviewing the control set to guide selecting training documents.

Unlike the seed round, all other training rounds have previous training rounds to use as guides when picking training documents. This allows Discovery to offer active learning algorithms that can take into account what the classifier has learned in previous rounds when selecting training documents, and this allows you to use what you have learned from reviewing previous training rounds to guide manual training document selections.

You can manually select documents to review for a training round. The danger is that if you use knowledge you gained from reviewing the control set to pick these training documents, you invalidate the control set. The control set is supposed to be an honest test of how well the classifier you are training will perform against the entire document population. If you specifically train the classifier for the control set, the control set will only measure how well the classifier performs against the control set.

Pick as Many Documents as Convenient for You

After the seed round you no longer have to strive for a specific number of relevant documents. After the seed round, the number of documents to review each round is more about what is convenient for your review process.

The hope for each round is that the depth for recall (DFR) will improve more than the number of training documents you reviewed for that round. Whether you reviewed 10, 100, or 1000 training documents, the hope is that the classifier’s DFR will improve enough that significantly more documents can be removed from the final review.

Since Discovery’s active learning considers more than the uncertainty of the classifier, you have a lot more flexibility in how many documents you use for training each round. When classifier uncertainty is the only thing used to select training documents, you want to update this measure as frequently as possible. When uncertainty is balanced with diversity, density, and other measures for good training documents, you can get useful learning regardless of the size of the training round.

If you are still unsure how many documents to select in a training round, 200 documents is a good ballpark for a training round. Again, you are free to pick the number of documents that helps your team be productive. Your training rounds can be larger or smaller depending on what works for your team.

Use Fast or Accurate Active Learning

Most implementations of active learning for predictive coding simply return the documents the classifier is most uncertain about. Discovery’s active learning looks more deeply than this. The Discovery active learning algorithms also consider the diversity of the training set, and the content of the overall dataset when selecting training documents. The end result is that Discovery learns more from every document it asks you to review.

Since the Discovery active learning does more, there are two options for active learning: a fast algorithm, and a more accurate algorithm. The more accurate algorithm fully considers all documents that have already been used for training, what those documents tell you about the overall document population, which documents the classifier is most uncertain about, and how much each training candidate can tell you that you don’t already know about the overall document set. The faster algorithm approximates much of what the accurate algorithm does to select training documents much faster.

The end result is that if you have reviewers sitting idle and need to get them documents to review as quickly as possible, there is a fast algorithm that performs well. If you can accept waiting longer for the document selection, you can get even better recommendations for training documents.

What to do When Results Plateau

At the end of each round, Discovery will estimate the DFR the classifier supports, and will show you a graph of how DFR has changed from round-to-round.

Eventually the DFR improvements in each round will be small, and it will be time to decide whether or not you want to continue training.

The first question to ask is how much, if any, did the DFR improve over the previous round. Even though the DFR improvement looks small on the graph, it still may be a profitable improvement. For instance, a small change in DFR might mean reviewing 1000 fewer documents for the investment of 100 training documents.
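Whether a small DFR improvement is still profitable is simple arithmetic; the numbers below are illustrative only:

```python
def net_documents_saved(dfr_before, dfr_after, population, training_docs_reviewed):
    """Documents removed from the final review by a DFR improvement,
    minus the training documents it cost to get there."""
    return (dfr_before - dfr_after) * population - training_docs_reviewed

# A 1-point DFR drop over 100,000 documents, bought with 100 training reviews,
# still nets roughly 900 fewer documents to review.
saved = net_documents_saved(0.31, 0.30, 100_000, 100)
print(round(saved))   # 900
```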

Let’s say that the DFR improvement was not profitable. A particular round of training may have even made the DFR worse. What then?

It is possible to make significant gains in DFR after a round where DFR got worse. It is also possible for DFR to plateau for a few rounds before more profitable gains are made. At this point it can be helpful to think about the overall cost of the process. Here are some questions to ask:

  • How many documents have been reviewed so far for the control set, seed set, and training sets?

  • Given the current DFR, how many documents will need to be reviewed to provide the desired recall?

  • How does the total cost of review (documents reviewed so far plus the cost of reviewing to DFR) compare to your cost goals?

  • How much would additional review rounds add to that cost if the DFR does not improve significantly?

Some of the reasons you may decide to continue training even though the DFR scores seemed to have plateaued or are even getting worse include:

  • The current DFR isn’t good enough. Reviewing to DFR will cost more than is acceptable, and it is worth the cost of a few more training rounds to see if it can be improved.

  • While the current DFR is good, the cost of additional training rounds is low enough that it is worth seeing if additional gains can be made.

Jiggling the Process: Techniques for Breaking Through a Plateau

Looking for responsive documents can be similar to digging for gold. Once you find some gold, if you are lucky it is part of a rich vein that you can follow to find more and more gold.

For documents, finding a responsive document whose features are shared by many other responsive documents is like finding a rich vein of gold. This is the situation where active learning has the easiest time identifying documents that provide significant gains, finding the real gold in that vein and differentiating it from documents that are similar but not responsive.

Breaking through a plateau often means finding a new vein of responsive documents (gold) that aren’t related to the veins of responsive documents the classifier has already been trained with. Here are some ideas for how you can do that:

  • If you have been using the same active learning algorithm each round, try a different one. If you have been using the fast method, try the accurate method, or vice versa.

  • If you have tried both active learning methods, try using n-seeds for a training round.

  • Try a larger-than-normal training round (at least twice the usual size) using the fast active learning algorithm. Because of how the fast active learning algorithm works, this has the effect of casting a wider net for finding responsive documents.

  • If you think there must be some large veins of responsive documents that you haven’t found yet, and n-seeds didn’t find them, try a round where the training documents are selected at random.

  • If it fits your budget, consider turning your control set into a training set, and drawing and reviewing a new control set. Before drawing the new control set, you can use everything you have learned from the old control set, and about the document set so far, to come up with searches or otherwise identify documents for review. Because this happens before the new control set is drawn, you can use all of Discovery’s document set exploration tools to find interesting documents for review without any fear of harming the control set.

After trying some rounds using alternative algorithms for selecting training documents, you can go back to your preferred algorithm(s) and see if normal training returns to profitably improving the DFR.

Using Consistency Reports and Uncertainty Scores

To get good predictive coding results, you need to code the control set and training set documents consistently. To help you with this, a consistency report is generated for each of these sets.

Consistency reports are CSV files. While there are various ways to access some of the information they contain, one option is to download the report, load it into your favorite spreadsheet application, and examine it there.

Consistency reports have one row for each document in the report. For the control set and seed set reports, there is a row for each document in that set. For training round consistency reports, there is a row for each document that contributed to the training in that round. That means that training round consistency reports will also contain rows for seed set documents and for documents in previous training rounds.

First 3 entries for a consistency report sorted by “uncertainty”.

When you read a consistency report, the first column of interest is “matches”. If “matches” is yes, the code on that document and the prediction for that document agree. If “matches” is no, they disagree, and you may want to check that document to see whether it was coded properly. If you sort “matches” in ascending order, the “no” entries will be sorted to the top.

The next column you might be interested in is “uncertainty”. The higher the “uncertainty” value, the less certain the classifier is about its prediction for that document. Even if the classifier made a correct prediction for the document, you may want to recheck the codes assigned to the most uncertain documents.
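The same triage can be scripted instead of done in a spreadsheet. The sketch below is a minimal example using Python’s standard csv module; the “matches” and “uncertainty” columns follow the report described above, but the sample rows and document names are invented, and your report will contain more columns.

```python
import csv
import io

# Invented sample standing in for a downloaded consistency report.
sample_report = io.StringIO(
    "document,matches,uncertainty\n"
    "doc-001,no,0.91\n"
    "doc-002,yes,0.74\n"
    "doc-003,yes,0.12\n"
)
rows = list(csv.DictReader(sample_report))

# Documents where the code and the prediction disagree: recheck these first.
mismatches = [r for r in rows if r["matches"] == "no"]

# Documents the classifier is least certain about, worth rechecking
# even when the code and prediction match.
by_uncertainty = sorted(rows, key=lambda r: float(r["uncertainty"]), reverse=True)

print([r["document"] for r in mismatches])
print(by_uncertainty[0]["document"])
```

For a real report, replace the in-memory sample with `open("your_report.csv", newline="")`.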

In the example above, you can see the “term1” and “score1” columns. Normally there are eight term-score pairs; most were truncated above so the table would fit. The terms are the terms that most strongly influenced the prediction for a document, and each score is how much that term contributed to the raw score. The terms and scores give you some insight into why the classifier made the prediction that it did. This can help you decide whether the document itself is tricky and more examples are needed so the classifier can learn what makes that document different, or whether the classifier has learned something wrong and you need to find the mis-coded training documents containing those top terms.