De-duplication. It’s something we talk about a lot when processing data for review in e-Discovery. When we ask whether a client wants their data de-duped globally or by custodian, it’s not uncommon to get a blank stare and then, “what’s the difference?” It’s a small difference, really, but one that can carry big consequences, and it all depends on the review workflow.
The biggest challenge we see our clients facing is the attempt to review electronic documents the same way they would approach a paper-based review. Technology is good, and to toot our own horns for a quick second, we feel we’re working with the best out there; but it’s only efficient if an effective workflow is applied. In a perfect world, the review team would know what that workflow looks like from the very beginning. In reality, this is rarely the case, and it’s quite normal for teams to change the process several times over the course of a review. Something as small as de-duplication is one of the factors to consider when thinking about workflow. Will the review team be grouping and reviewing documents by custodian? Is there a keyword list, and will the team use those search results to concentrate their core review efforts? Will the team be reviewing everything in multiple tiers, doc-by-doc?
There are three ways we can approach this as a vendor. The first option, A), is to de-duplicate globally so that there are no duplicates whatsoever. Zero, zilch, nada. We see this approach used a lot when there is a large set of data and the review team will be relying on keywords, phrases, and concept analytics for their core review. If documents later need to be produced by custodian, a document can always be produced multiple times with different Bates numbers to accommodate the process, if necessary.
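Under the hood, global de-duplication boils down to hashing each document and keeping only the first copy of each hash, regardless of custodian. Here is a minimal Python sketch; the `(custodian, content)` pairs and MD5-on-body-text hashing are illustrative assumptions, since real processing tools typically hash a combination of metadata fields rather than just the text.

```python
import hashlib

def global_dedupe(documents):
    """Keep the first copy of each unique document across the whole collection."""
    seen = set()
    kept = []
    for custodian, content in documents:
        digest = hashlib.md5(content.encode()).hexdigest()
        if digest not in seen:  # first time we've seen this content anywhere
            seen.add(digest)
            kept.append((custodian, content))
    return kept

docs = [
    ("Executive A", "Q3 budget email"),
    ("Assistant A", "Q3 budget email"),  # duplicate content: removed
    ("Assistant A", "lunch plans"),
]
# Only one copy of the budget email survives, under whichever
# custodian happened to be processed first.
deduped = global_dedupe(docs)
```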
The second option, B), is to de-duplicate within each custodian’s records. This only works if we can actually determine which documents belong to each custodian; if the collection was done by a forensic expert, this should not be a concern. In this scenario, there may be duplicates across the database as a whole, but if the team is reviewing each custodian’s documents separately, and it’s important to know whether a custodian received or sent a specific email, they’ll find it. Some review platforms, such as kCura’s Relativity, offer options within the document viewer to flag when a record has duplicates and to ensure they are coded consistently, if desired.
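The only change from the global approach is that the de-dupe key includes the custodian, so the same document can survive once per custodian. A sketch under the same illustrative assumptions as before:

```python
import hashlib

def custodian_dedupe(documents):
    """Remove duplicates only within each custodian's own records."""
    seen = set()
    kept = []
    for custodian, content in documents:
        # Keying on (custodian, hash) means identical content held by
        # two different custodians is NOT treated as a duplicate.
        key = (custodian, hashlib.md5(content.encode()).hexdigest())
        if key not in seen:
            seen.add(key)
            kept.append((custodian, content))
    return kept

docs = [
    ("Executive A", "merger memo"),
    ("Assistant A", "merger memo"),  # kept: different custodian
    ("Assistant A", "merger memo"),  # removed: duplicate within Assistant A
]
deduped = custodian_dedupe(docs)
```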
The third option, C), is to de-duplicate globally but apply a rank of importance by custodian. There will still be no duplicates, but this allows the team to control which copy is considered the “master” document to de-duplicate against. For example, say Executive A is sent an email. Executive A’s assistant, Assistant A, gets a copy of every email sent to Executive A, plus their own emails. If we de-duplicated globally, the processing engine might remove either Executive A’s copy or Assistant A’s copy. If we rank the custodians so that Executive A’s records come before Assistant A’s, Executive A’s copy becomes the master, and the duplicate is removed from Assistant A’s set.
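Ranked de-duplication is just global de-duplication with the documents ordered by custodian priority first, so the surviving copy always belongs to the highest-ranked custodian. A sketch, again with illustrative names and hashing:

```python
import hashlib

def ranked_dedupe(documents, custodian_rank):
    """Global de-dupe, but the surviving 'master' copy always comes from
    the highest-priority custodian (lowest rank number)."""
    # Stable sort by custodian rank; within a custodian, original order holds.
    ordered = sorted(documents, key=lambda doc: custodian_rank[doc[0]])
    seen = set()
    kept = []
    for custodian, content in ordered:
        digest = hashlib.md5(content.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((custodian, content))
    return kept

rank = {"Executive A": 1, "Assistant A": 2}
docs = [
    ("Assistant A", "board deck"),  # arrives first, but ranked lower
    ("Executive A", "board deck"),
]
# Executive A's copy survives as the master; Assistant A's is de-duped out.
masters = ranked_dedupe(docs, rank)
```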
The other question we get asked is how much it is going to cost (it may be hard to believe, but sometimes this comes first). Like anything in litigation, it depends. There is usually a cost per gigabyte to do the de-duplication up front, and then a cost to process the remaining data for review. Since it’s impossible to know in advance exactly how much of the data will be duplicative, creating an estimate for this cost is challenging. The amount that is duplicative also depends on which of the three workflows above is chosen. Depending on the technology used by the vendor, pre-processing reports can be run on the data set to give you an idea of what those numbers will look like, so you can build a more accurate budget and make a better decision. These reports can be a valuable tool in communicating with your client and helping them understand e-discovery expectations and cost.
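The two-stage arithmetic is simple once you have a duplicate estimate from a pre-processing report. The rates and percentages below are made-up placeholders, not actual vendor pricing:

```python
def estimate_cost(total_gb, duplicate_fraction, dedupe_rate_per_gb, processing_rate_per_gb):
    """Rough two-stage estimate: de-dupe everything up front,
    then process only what survives de-duplication."""
    dedupe_cost = total_gb * dedupe_rate_per_gb
    remaining_gb = total_gb * (1 - duplicate_fraction)
    processing_cost = remaining_gb * processing_rate_per_gb
    return dedupe_cost + processing_cost

# Hypothetical numbers: a 100 GB collection, estimated 30% duplicative,
# at $15/GB to de-dupe and $100/GB to process the remainder for review:
cost = estimate_cost(100, 0.30, 15.0, 100.0)  # 1500 + 70 * 100 = 8500
```

Note how sensitive the total is to the duplicate fraction, which is exactly why the choice among workflows A, B, and C above matters for the budget, not just for the review.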
While these explanations seem straightforward enough, it’s also helpful to talk to your vendor and walk through the details and specifics of the data in question, as every scenario is different. Let us know if we can answer any questions or provide more information.