Database Record Matching (DBRM)
Purpose
The goal of a PHI inspection engine is to accurately detect the specific pieces of information being sought. Matching certain formats or noting relevant surrounding text, as described above, can signal a higher likelihood of sensitive data, but a more accurate method is to look for matches to the actual data itself. That is what Database Record Matching does.
Database Record Matching (DBRM) creates mathematical hashes of the true sensitive data and uses those hashes to look for exactly identical data. It does so by hashing the target data when inspecting other sources such as an email, a file share, the cloud, or a web posting: anywhere that true data would be problematic if found.
Creating Fingerprints
The DBRM process begins with querying an internal database table known to contain complete and accurate records identifying personal information. Each identifier is usually a string of characters, such as an SSN, MRN, policy ID, account number, or member number.
This is typically a simple query performed against a data warehouse or reporting database. The only requirement is that the data obtained is known to be accurate (true). Once established, the process is automated to requery the database daily, or on another appropriate schedule, so that current values are always incorporated. In practice this is typically set up in less than an hour with someone normally responsible for report generation or business intelligence.
At this point the DBRM engine creates one-way hashes, often called “fingerprints”, of each individual field of protected data, and stores these fingerprints within the engine. For security, the procedure does not keep the original (readable) data; only the hashes are used. These fingerprints are then used to find instances of the exact same data if it exists in an inspected target file.
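The fingerprinting step can be sketched as follows. This is a minimal illustration, not the DBRM engine's actual algorithm, which the text does not specify: the choice of HMAC-SHA-256, the SECRET_KEY value, the function names, and the sample records are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical secret key. Using a keyed hash is an assumption for this sketch;
# the source only says that one-way hashes are created.
SECRET_KEY = b"example-key-change-me"

def fingerprint(value: str) -> str:
    """Return a one-way keyed hash (hex string) of a single field value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def build_fingerprints(records):
    """Hash every protected field; only the hashes are returned, never the readable values."""
    return {fingerprint(field) for record in records for field in record}

# Made-up records standing in for the result of the database query.
records = [("078051120", "MRN100234"), ("219099999", "MRN100235")]
fingerprints = build_fingerprints(records)
```

Only the set of hashes would be retained by the engine in this sketch; the readable records list would be discarded after each scheduled requery.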
Inspecting Data
At this point, the DBRM engine is ready to find sensitive data elements inside operational data. The inspected content might be an email, a web posting, a file in the cloud, a file on a network device, a file being copied to a USB drive, or anything else being inspected by the overall DLP solution.
Database Record Matching™ (DBRM), exclusive to Digital Guardian for Compliance, is an extremely accurate method to detect an actual policy ID in all inspected text.
The content to be inspected is serially run through the same DBRM hashing algorithm that was used to create the fingerprints of the actual data. When fingerprints (hashes) match, then that exact sensitive data element has been accurately identified.
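The inspection step might look like the sketch below. It assumes a plain SHA-256 hash and a simple alphanumeric tokenizer, neither of which is stated in the text; the names fingerprint, known, and inspect, and the sample values, are hypothetical.

```python
import hashlib
import re

def fingerprint(value: str) -> str:
    """Same hashing function that was used when the fingerprints were created."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Fingerprints built earlier from known true values (made-up examples here).
known = {fingerprint(v) for v in ["078051120", "MRN100234"]}

def inspect(text: str, fingerprints: set) -> list:
    """Return the tokens in the text whose hash matches a stored fingerprint."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if fingerprint(t) in fingerprints]

hits = inspect("Please update chart MRN100234 before Friday; ref 555000111.", known)
print(hits)  # ['MRN100234']
```

The 9-digit reference 555000111 fits an SSN-like pattern but is not reported, because its hash does not match any stored fingerprint.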
DBRM is thus able to determine which elements in the inspected record matched the actual sensitive data. In addition, multiple elements from the same actual record can be required for further confidence. This could include, for example, requiring that the corresponding patient's last name appears somewhere near a potential sensitive MRN discovered in the target data.
An Example with Social Security Numbers
An extreme example using SSNs may illustrate how DBRM differs from simple use of patterns, in reducing the potential for false positives:
• There are 1,000,000,000 possible 9-digit numbers.
• Of these, about 900,000,000 are in valid issuable ranges (as of changes made in 2011).
• If there are only 10,000 actual patient records in a hospital’s database, then the odds that merely finding a sequence of 9 digits actually identified a hospital patient would be unacceptably low. In other words, a high number of false positives will be generated by this method.
• With DBRM, on the other hand, the source data for creating the hashes will be the hospital’s actual 10,000 patient SSNs. Hence a matching record either relates to that patient or only coincidentally contains that 9-digit sequence in some other context.
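The figures in the list above can be turned into a rough estimate. The calculation assumes, purely for illustration, that 9-digit numbers appearing in inspected content are spread uniformly over the valid range; real traffic will not be uniform, so this is an order-of-magnitude figure only.

```python
valid_ssns = 900_000_000   # approximate count of issuable 9-digit values (from the text)
patient_records = 10_000   # the hospital's actual patient SSNs (from the text)

# Chance that an arbitrary valid-looking 9-digit number is one of the hospital's SSNs,
# under the uniform-distribution assumption stated above.
p_match = patient_records / valid_ssns
print(f"{p_match:.6%}")  # 0.001111%
```

Under that assumption, roughly 1 in 90,000 pattern matches would correspond to a real patient SSN, which is why pattern matching alone generates so many false positives.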
The DBRM system may eventually come upon 9-digit sequences that are coincidentally equal to a patient’s SSN but do not actually represent it within the record. The DBRM system can be tuned to reduce such false positives by requiring another true database element to be present. For instance, by requiring the corresponding person’s actual last name (hashed) to appear somewhere near the SSN, such false positives will be virtually eliminated.
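A proximity requirement of this kind could be sketched as follows. The 10-token window, the case-insensitive hashing, and the names ssn_to_name and confirmed_matches are assumptions for the example, as is the made-up record.

```python
import hashlib
import re

def fingerprint(value: str) -> str:
    """Hash a value case-insensitively (an assumption, so 'Whitfield' matches 'WHITFIELD')."""
    return hashlib.sha256(value.lower().encode("utf-8")).hexdigest()

# Made-up record: SSN fingerprint mapped to the fingerprint of the same person's last name.
ssn_to_name = {fingerprint("078051120"): fingerprint("Whitfield")}

def confirmed_matches(text: str, window: int = 10) -> list:
    """Report an SSN hit only if the matching last name is within `window` tokens of it."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    hashes = [fingerprint(t) for t in tokens]
    hits = []
    for i, h in enumerate(hashes):
        name_fp = ssn_to_name.get(h)
        if name_fp is None:
            continue  # token is not a known SSN
        nearby = hashes[max(0, i - window): i + window + 1]
        if name_fp in nearby:
            hits.append(tokens[i])
    return hits

print(confirmed_matches("Patient Whitfield, SSN 078051120, admitted today"))  # ['078051120']
print(confirmed_matches("Order number 078051120 shipped to warehouse"))       # []
```

The second call shows the coincidental case: the digits match a stored SSN fingerprint, but no corresponding last name is nearby, so nothing is reported.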
An Example with Medical Records
Medical record and medical insurance policy “numbers” don’t follow the same format across institutions, leading to an array of differing possible formats. Some are numeric and some are alphanumeric, and they can differ in length and other formatting aspects. Thus, pre-built pattern matching solutions can require significant tailoring to produce reliable results with such data. In contrast, DBRM, by its nature, provides an extremely accurate means to detect an actual ID in all inspected alphanumeric forms.
DBRM also matches short account numbers more reliably, whereas pattern matching suffers even worse false positive problems with short values than with longer ones. With DBRM, account numbers as short as 6 digits (and sometimes even 5 digits on lower volume inspection streams) may be reliably detected.
DBRM is appropriate for any unique database elements, regardless of format. This makes it an ideal method for identifying data protected by regulatory compliance requirements.
Pattern matching will always sacrifice real discovery of sensitive data while producing more false positives.
Resilience to Variations in Format
The inherent flexibility of DBRM allows several improvements worth mentioning here. For example, an MRN such as “257121234” may need to match “257-12-1234” in the inspected text. Similarly, any identifier may optionally be matched in whatever alternate formatting is required.
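One way to obtain this kind of format resilience is to normalize values before hashing. The sketch below assumes that separators (spaces, hyphens, dots) carry no meaning; the normalize rule and function names are hypothetical and are not a description of how DBRM does it.

```python
import hashlib
import re

def normalize(value: str) -> str:
    """Drop whitespace, hyphens, and dots so formatted variants hash identically."""
    return re.sub(r"[\s.\-]", "", value)

def fingerprint(value: str) -> str:
    """Hash the normalized form of a value."""
    return hashlib.sha256(normalize(value).encode("utf-8")).hexdigest()

same = fingerprint("257121234") == fingerprint("257-12-1234")
print(same)  # True
```

On the inspection side, candidate spans would need to be extracted with their separators included (rather than split on them) so that the same normalization can be applied before hashing.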
For international use, DBRM is able to handle registration and inspection of non-ASCII data, including double byte language characters. Further, non-ASCII data in the inspected text does not interfere with inspection of ASCII or non-ASCII elements.
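For non-ASCII values, a hashing routine has to settle on a single byte representation before hashing. A minimal sketch, assuming Unicode NFC normalization followed by UTF-8 encoding (the text does not say which normalization or encoding DBRM uses):

```python
import hashlib
import unicodedata

def fingerprint(value: str) -> str:
    """Normalize to NFC, encode as UTF-8, then hash."""
    canonical = unicodedata.normalize("NFC", value)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# The same name written with a precomposed "é" and with "e" plus a combining accent.
composed = fingerprint("Jos\u00e9")
decomposed = fingerprint("Jose\u0301")
print(composed == decomposed)  # True

# Multi-byte characters hash like any other string.
kanji = fingerprint("山田")
```

Without the normalization step, the two spellings of the same name would produce different byte sequences and therefore different fingerprints.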
The End Result
In customer implementations of DBRM, Code Green Networks has observed organizations experiencing low implementation time requirements, low ongoing time requirements, more actionable results, and lower risk of loss of sensitive data.