MasterFile's near-duplicate support finally makes near-duplicate processing affordable for small and solo practices. Near-duplicate documents and e-mail threads are organized within MasterFile using analysis results produced by Equivio. Near-duplicate processing is a key component of MasterFile's Load 'n Go E-Discovery Platform. MasterFile simply puts the smaller practice on an equal footing with the larger or international firm.
Research shows that 30-50% of documents are duplicates or near-duplicates. Detecting and grouping these in MasterFile using Equivio's near-duplicate technology generates immediate, concrete time savings.
Exporting documents and loading the results file from Equivio takes just a few mouse clicks. Two unique MasterFile views create the easiest-to-use near-duplicate review tool available.
The difference between near-duplicate and de-duplicate processing
De-duplication identifies electronic documents that are exact duplicates of each other. Content is not analyzed, but instead files are compared byte for byte. Near-duplicate processing, like Equivio's, moves beyond traditional de-duplication techniques to solve the following three much more complex problems:
- Identifying documents that have identical content but are in different formats. For example, a document created in Word, converted to PDF, and then faxed, scanned and OCRed as an image.
- Identifying near-duplicate documents, that is, documents which have similar but not identical content, such as documents that have been revised or have several draft versions.
- Reconstructing e-mail conversation threads from the e-mail history found at the bottom of e-mails, to reduce the number of e-mails to be read and reviewed.
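The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not Equivio's method, which is not described here; it assumes a simple stand-in technique (Jaccard similarity over three-word "shingles") purely to show why a revised draft fails an exact-duplicate check yet still scores as similar. The function names and sample sentences are hypothetical.

```python
import hashlib


def exact_duplicate(a: bytes, b: bytes) -> bool:
    """De-duplication style check: compares content digests, not meaning."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts: a draft and a revision differing by one word.
draft = "The parties agree to settle the claim for the sum of ten thousand dollars"
final = "The parties agree to settle the claim for the sum of twelve thousand dollars"

print(exact_duplicate(draft.encode(), final.encode()))  # False: the bytes differ
print(similarity(draft, final))                         # between 0 and 1
```

A one-word revision is enough to defeat the exact comparison, while the similarity score stays well above zero, which is the kind of signal a near-duplicate tool can use to group the two versions.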
To streamline review, the documents are analysed to cluster duplicate and near-duplicate documents into groups, flagging one document as the "pivot" document (circled in red) against which the remaining similar documents in the cluster are compared and rated for similarity. Staff simply review the pivot document of each cluster and if it is found relevant, they can then review the near-duplicates of the pivot. This avoids reading the near-duplicates of irrelevant documents and saves significant time.
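The pivot-first review described above can be sketched in a few lines of Python. This is only an illustration of the workflow, not MasterFile or Equivio code; the `Cluster` class, the `review_queue` function and the `is_relevant` callback are hypothetical names standing in for a reviewer's decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cluster:
    """One group of similar documents with its designated pivot."""
    pivot: str
    near_duplicates: List[str] = field(default_factory=list)


def review_queue(clusters: List[Cluster],
                 is_relevant: Callable[[str], bool]) -> List[str]:
    """Return the documents that must be read under pivot-first review."""
    to_read = []
    for cluster in clusters:
        # The pivot of every cluster is always read.
        to_read.append(cluster.pivot)
        # Its near-duplicates are read only if the pivot proved relevant.
        if is_relevant(cluster.pivot):
            to_read.extend(cluster.near_duplicates)
    return to_read


clusters = [
    Cluster("contract_final", ["contract_v1", "contract_v2"]),
    Cluster("lunch_menu", ["lunch_menu_a", "lunch_menu_b", "lunch_menu_c"]),
]
queue = review_queue(clusters, lambda doc: doc.startswith("contract"))
print(queue)  # the lunch menu near-duplicates are never opened
```

In this made-up example, four of the seven documents are read; the saving grows with the share of clusters whose pivot turns out to be irrelevant.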
When people reply to an e-mail, they usually leave the original message intact as history at the end of the e-mail. Equivio's e-mail threading analyses this history to reconstruct the different threads of an e-mail conversation tree. For each branch, the leaf node (circled in red), which includes the history of all the preceding e-mails in that branch, is identified and flagged as the "inclusive" e-mail. Consequently, it is sufficient to review just the inclusives, which drastically reduces the number of e-mails and the amount of repetitive history that need to be read or examined.
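The idea of an "inclusive" e-mail as the leaf of each branch can be shown with a simplified Python sketch. It assumes the reply relationships are already known as a mapping from each message to the message it answers; real threading has to infer these from the quoted history, and the sketch also assumes every reply kept that history intact. The function name and message ids are hypothetical.

```python
from typing import Dict, List, Optional


def inclusive_emails(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the ids of messages that no other message replies to.

    These are the leaves of the conversation tree, one per branch.
    """
    replied_to = {parent for parent in parent_of.values() if parent is not None}
    return sorted(msg for msg in parent_of if msg not in replied_to)


# Hypothetical conversation: m1 is the original message and it has two
# branches of replies, m1 -> m2 -> m3 and m1 -> m4 -> m5.
thread = {
    "m1": None,
    "m2": "m1",
    "m3": "m2",
    "m4": "m1",
    "m5": "m4",
}
print(inclusive_emails(thread))  # the two branch ends, m3 and m5
```

Here two of the five messages would need to be read, since each branch end carries the history of the messages before it.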
Near-duplicate processing is not performed within MasterFile. Instead, you simply:
- Export the documents from MasterFile.
- Load the Equivio results file back into the original MasterFile database the documents came from.
Once the Equivio results file has been loaded into MasterFile, two special views organize the documents into sets of near-duplicates or e-mail threads.
MasterFile near-duplicate tools -- Near-duplicate review has never been so efficient and easy
MasterFile provides two special views that automatically organize near-duplicate document clusters and e-mail threads into groups for instant access.
The "Everything as documents" view organizes groups of near-duplicate documents together. The pivot document is the first document listed in the group; exact duplicates appear next, followed by near-duplicates in order of decreasing similarity. This view does not group e-mail attachments with the e-mails they were attached to, but instead treats them as regular documents and groups them with other documents that may be duplicates or near-duplicates of the attachment.
The screen shot below shows 7 groups of near-duplicate documents (Equivio's EquiSets). The section for Equivio EquiSet D0000027 has been expanded to show its "Pivot" document, one duplicate document and 3 near-duplicates that are all equivalent to one another, as indicated by their common "Duplicate Subset" value of 27. As explained above, the "Pivot" document should be used as the baseline file against which the other documents in the EquiSet are compared and assessed for equality and similarity.
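The ordering within a group (pivot first, then exact duplicates, then near-duplicates by decreasing similarity) can be expressed as a sort key. The Python sketch below is only an illustration; the field names `name`, `is_pivot` and `similarity` (a percentage) are assumptions and do not reflect MasterFile's actual data model or the layout of the Equivio results file.

```python
from typing import Dict, List


def view_order(docs: List[Dict]) -> List[Dict]:
    """Sort one group: pivot, exact duplicates, then near-duplicates
    in order of decreasing similarity to the pivot."""
    def key(doc: Dict):
        if doc["is_pivot"]:
            rank = 0          # the pivot always comes first
        elif doc["similarity"] == 100:
            rank = 1          # exact duplicates of the pivot
        else:
            rank = 2          # near-duplicates
        # A negated similarity gives descending order within a rank;
        # the name is a tie-breaker so the order is deterministic.
        return (rank, -doc["similarity"], doc["name"])
    return sorted(docs, key=key)


# Hypothetical group of four documents.
group = [
    {"name": "draft_v2.doc", "is_pivot": False, "similarity": 87},
    {"name": "final.doc", "is_pivot": True, "similarity": 100},
    {"name": "draft_v1.doc", "is_pivot": False, "similarity": 72},
    {"name": "final_copy.doc", "is_pivot": False, "similarity": 100},
]
ordered_names = [doc["name"] for doc in view_order(group)]
print(ordered_names)
```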
The "Attachments with e-mails" view is the same as the "Everything as documents" view but groups attachments with the e-mails they belong to. In the screen shot below, the e-mail flagged as "inclusive" has its two attachments listed immediately below it.
Documents marked with the "Zzz" (sleeping document) icon have been excluded using the "Exclude/Un-exclude" button. This unique MasterFile feature reduces clutter by hiding these documents from all other views in which they would otherwise appear.
Within a document profile, under the "Near Duplicates" section, an "embedded-view" shows the near-duplicates of the document, as shown below. The view is live, so you can select any entry and double-click to open the document.
The Compare Tool, shown below, displays the differences between documents in your web browser. You can use the "Previous" and "Next" buttons to move through the differences.