Overview

Responsible Data Discovery is built to give you a better, more accurate view into your organization's privacy risk. It enables privacy managers to have automated and up-to-date inventory reports and informed impact assessments. It powers the fulfillment of comprehensive access and deletion requests. And it shifts your risk reduction to be more proactive with better informed policies and controls around your data processing.

DataGrail’s Approach

Detect data in structured or semi-structured data systems

DataGrail helps you protect sensitive data from compromise by identifying and categorizing it in structured data sources (relational databases), schema-less systems (NoSQL stores), and a growing list of customizable third-party applications like Salesforce, Zendesk, and Braze.

Always up-to-date so you don’t have to be

To keep up with evolving compliance laws and data governance regulations, DataGrail classifies data from your company’s data sources and maps it to smart categories. That way, you can help auto-populate common privacy deliverables like DPIAs, RoPAs, subject requests, and more.

Grow the understanding of your tech stack

Apps you use every day hold personal data, from Salesforce and Zendesk to internal systems and databases. DataGrail employs a smart taxonomy to systemize consumer-sensitive data held in those systems so you can deliver on data protection and privacy requirements quickly.


Data Discovery For Third-Party Systems

We do not want to add to your privacy and security risk. As such, we’ve developed a novel approach that doesn’t require mirroring your data in our systems.

Instead, DataGrail leverages its existing integration network to scan metadata and to take a lightweight sample of data values. Working within the limits of each API, DataGrail makes a best effort to randomly sample data (typically no more than 20,000 records) so that the sample is representative. Scans run weekly during off-hours and honor rate limits to avoid disrupting day-to-day operations.
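To illustrate how a capped random sample can be drawn from a stream of records without holding every record in memory, here is a minimal Python sketch using reservoir sampling. This is not DataGrail's implementation: the function name `reservoir_sample`, the `seed` parameter, and the choice of reservoir sampling itself are assumptions made for illustration. Only the 20,000-record cap comes from the description above.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], cap: int = 20_000,
                     seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of at most `cap` records (Algorithm R).

    Hypothetical helper: records can be any iterable, for example pages
    fetched lazily from an API while respecting its rate limits.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    rng = random.Random(seed)
    sample: List[T] = []
    for index, record in enumerate(records):
        if index < cap:
            # Fill the reservoir with the first `cap` records.
            sample.append(record)
        else:
            # Replace an existing entry with probability cap / (index + 1).
            slot = rng.randint(0, index)
            if slot < cap:
                sample[slot] = record
    return sample
```

Because the sample is built in a single pass, the caller can slow down or pause the record iterator to honor rate limits without affecting the uniformity of the result.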

Once sampled, the data is pre-processed and fully anonymized (using a k-anonymity factor of more than 20). The anonymized data is then classified using our machine learning models.
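For readers unfamiliar with k-anonymity, the following Python sketch shows one simple way to enforce it: rows are grouped by their quasi-identifier values, and any group with fewer than k rows is suppressed. This is an illustrative assumption, not a description of DataGrail's pre-processing. The function name `enforce_k_anonymity`, the column names, and the suppression-only strategy are hypothetical; `k=21` simply reflects "more than 20".

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]


def enforce_k_anonymity(rows: List[Row], quasi_identifiers: Sequence[str],
                        k: int = 21) -> List[Row]:
    """Keep only rows whose quasi-identifier combination occurs at least k times.

    Hypothetical helper: real pipelines usually also generalize values
    (for example, bucketing ages) before suppressing small groups.
    """
    def group_key(row: Row) -> Tuple[Hashable, ...]:
        return tuple(row[column] for column in quasi_identifiers)

    counts = Counter(group_key(row) for row in rows)
    return [row for row in rows if counts[group_key(row)] >= k]


# Hypothetical sample: 25 rows share one combination, 3 rows share another.
rows = (
    [{"zip": "94107", "age_band": "30-39"} for _ in range(25)]
    + [{"zip": "10001", "age_band": "40-49"} for _ in range(3)]
)
kept = enforce_k_anonymity(rows, ["zip", "age_band"])
```

Here the three rows in the small group are dropped, so every remaining row is indistinguishable from at least 20 others on the chosen columns.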

You can view our list of Data Discovery Integrations here.


Data Discovery For Internal Systems & Databases

Data Discovery API Architecture and Workflow

Secure by Design

DataGrail does not want to add to your privacy and security risk. As such, we’ve designed data discovery to prevent direct connection to your data systems from outside your network. Stated in concrete terms, if DataGrail were to be compromised, our systems are designed so a bad actor would not be able to access your internal systems or network.

Minimize Risk

To protect your data, we’ve implemented an agent that is deployed within your private network and is unreachable from the public internet. This agent can be configured to connect securely to your data systems with read-only privileges, without sharing secrets or credentials with DataGrail. It retrieves schemas and other metadata, and scans and processes data. When processing is done, the agent shares metadata, classification features, and anonymized data with a DataGrail API service that classifies the data and updates reports accordingly.

DataGrail Agent - Scanner

Built atop the DataGrail Internal Systems Agent, the scanner is a containerized application that can run on your private or public cloud. If you’re hosted on Amazon Web Services (AWS), for example, we recommend running the agent on ECS Fargate (serverless).

Instructions detail how to enable connections to out-of-the-box database systems like PostgreSQL, MySQL, and Snowflake. New connectors can be added as needed, typically within two weeks.

Secrets and passwords are stored securely in your cloud's secret managers, which provide logging and auditability.

The agent is statically configured (no dynamic configuration is supported) before being executed as a task. These tasks can be scheduled using your preferred containerization platform. Execution can take anywhere from minutes to hours depending on the number of databases and tables that need to be scanned.

Responsible Data Discovery API

When metadata and anonymized sample data are retrieved, the agent securely posts the data over HTTPS to the RDD API, authenticated by a token provided by DataGrail. Here, the data is classified and associated with your internal systems reports, which in turn can inform records of processing activities (RoPAs), privacy impact assessments, and more. All in all, DataGrail is able to aggregate this information across all systems to give you a holistic view of your inherent privacy risk.
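As a rough illustration of a token-authenticated HTTPS post, the Python sketch below builds such a request with the standard library. Everything specific in it is an assumption: the URL, the `Bearer` authorization scheme, the payload fields, and the function name `build_rdd_request` are hypothetical placeholders, not the RDD API. The actual specification is available from DataGrail on request.

```python
import json
import urllib.request
from typing import Any, Dict


def build_rdd_request(url: str, token: str,
                      payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request authenticated by a token.

    Hypothetical helper: the header scheme and payload shape are assumptions.
    """
    if not url.startswith("https://"):
        raise ValueError("refusing to send scan results over a non-HTTPS URL")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values for illustration only.
request = build_rdd_request(
    "https://rdd.example.invalid/scans",
    token="token-issued-by-datagrail",
    payload={"system": "orders-db", "columns": [{"name": "email", "sampled": 500}]},
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
```

In a real deployment the token would be read from your secret manager at run time rather than written into code or configuration.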

The API is designed with flexibility in mind and the detailed specs can be provided upon request to allow proprietary clients to be built.

Supported Database Criteria

DataGrail is able to support the following types of systems:

  1. Operational SQL systems (e.g., PostgreSQL, MySQL, Microsoft SQL Server)
  2. NoSQL systems (e.g., DynamoDB, S3, MongoDB)
  3. SQL-based analytical systems (e.g., Redshift, BigQuery, Snowflake)

We recommend starting with canonical systems where data is collected. These tend to fall under the first two categories above and are typically considered critical to running customer-facing applications. By starting with these systems, you’ll gain immediate visibility into your inherent privacy risk without tackling the more time-consuming and resource-intensive task of scanning analytical systems, which are likely to have multiple variations of the same data, especially if you have well-established ETL pipelines.

Following up with analytical systems will ensure that all inferred or predicted personal data (e.g., if you are predicting gender from data captured in upstream canonical systems) is also captured and reported accurately.


Data Classification

Key to a better understanding of your privacy footprint are DataGrail’s proprietary classification models, which are able to map thousands of data elements to a few dozen categories (see personal data taxonomy).

Classification accuracy will vary, but DataGrail optimizes for recall (i.e., as few false negatives as possible) because the cost of failing to identify personal data, especially sensitive data, far outweighs the cost of misclassifying some data elements as containing personal data (false positives).
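The trade-off can be made concrete with the standard definitions of recall and precision. The Python sketch below uses hypothetical counts chosen only for illustration; they are not DataGrail benchmark results.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of actual personal data elements that the classifier found."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged data elements that really contain personal data."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


# Hypothetical review: 95 personal data elements found, 5 missed,
# and 30 elements flagged that turned out not to be personal data.
found, missed, over_flagged = 95, 5, 30
print(f"recall={recall(found, missed):.2f}")
print(f"precision={precision(found, over_flagged):.2f}")
```

With these counts recall is 0.95 and precision is 0.76: a reviewer has to dismiss some false positives, but very little personal data goes undetected.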

To achieve this, we take a two-phase approach. In phase one, we classify data in a small set of canonical data systems and review the results together with you. We recommend using data from sandboxes where possible. In phase two, once classification is fine-tuned, DataGrail classifies new data on an ongoing basis based on your configuration. We recommend configuring a monthly or quarterly cadence, depending on your capacity to review reports.


Up-to-date Reporting

Automated Data Category Updates

As new categories are detected, privacy managers are alerted automatically, so system owners no longer need to report them manually via surveys or questionnaires.

Automated Data Category Updates List

Similarly, system reports are updated automatically with new categories, and privacy managers can review them quickly.

Sensible Reviews

Reviews can be done at the category level, without having to paginate through thousands of data elements. Once approved, these reports will automatically inform RoPAs, privacy impact assessments and more.

Detailed Reporting

If you want a more detailed view, reports with all data elements are available for review in-app and via export.


Personal Data Taxonomy

Category Support

Categories bolded below have better support.

Contact Information

  • Email Address
  • Name (full, first or last)
  • Phone Number (landline, mobile or fax)
  • Address (postal, billing)
  • Username, Social Handle or Alias (addressable)

Employment & Business Information

  • Application Number
  • Beneficiaries
  • Benefits Information
  • Company Name
  • Criminal Convictions
  • Dietary Preferences
  • Emergency Contacts
  • Employee Identification Number
  • Employment Decision Record
  • Employment History or Status
  • Individual Quotas
  • Job Role, Title or Position
  • Payroll Information
  • Performance Information
  • Bio or Profile
  • Salary Information
  • Sponsorship Information
  • Travel Data

Education Information

  • Assessment or Score
  • Degree or Certification
  • Education Status or History
  • Graduation or Attainment Date
  • School or Accreditation Body

Government Identification

  • Driver’s License or Other State ID
  • Taxpayer Identification Number (TIN)
  • Immigration or Naturalization Number
  • Other Government Identifier (Military Identification, Known Traveler, Registro Geral (BR), My Number (JP) etc)
  • Passport Number
  • Professional License Number
  • National Insurance Number (SIN, SSC, SNAP, GHIC etc)
  • Social Security Number (SSN)
  • Vehicle or License Plate Number (VIN)

Demographics & Psychographics

  • Age, Birthday or Range
  • Audience Segment
  • Birthplace or Hometown
  • Citizenship or Naturalization
  • Education Level
  • Family and Lifestyle
  • Gender
  • Geography
  • Immigration Status
  • Income or Range
  • Interests, Favorites, Possessions
  • Marital Status
  • Military Status
  • Nationality
  • Political Opinions or Affiliation
  • Preferred Language
  • Presence of Children
  • Racial or Ethnic Origin
  • Religious or Philosophical Beliefs
  • Sex Life or Sexual Orientation
  • Trade Union Membership
  • Veteran Status

Online & Mobile Data

  • Ad Engagement (views, clicks etc)
  • App or Site Usage (visits, sessions, downloads etc)
  • Inferred or Derived Data
  • Browsing History
  • Consents, Opt-Outs and Preferences
  • Electronic Signature
  • Email Engagement (views, opens, clicks, clickthrough etc)
  • Personal Directory Information (calendar, address book, call/text log, files etc)
  • Communication Contents (mail, email, messages etc)
  • Social Profile
  • Requests, Posts, Comments, Reviews and Ratings
  • Search History

Online & Mobile Identifiers

  • Beacon ID
  • Browser or Device Profile (type, OS, language, resolution, apps etc)
  • Device ID (MAC, Apple ID, Android ID, Ad ID, serial etc)
  • Hashed Email or Phone
  • Household ID
  • IoT Device ID
  • IP Address
  • User ID
  • Website Visitor ID (cookies, pixels, strings, ad browser fingerprint)

Security & Diagnostics Data

  • Access and Change Logs
  • Crash and Event Logs
  • Credentials (usernames with passwords)
  • Network Logs
  • Security Logs
  • Activation, Recovery or Verification Information

Audiovisual & Sensor Data

  • Audio
  • Photos
  • Sensors
  • Video

Location Information

  • Coarse Location (geo, ZIP, radio tower, public beacon etc)
  • Precise Location (GPS, lat/long, personal beacon, location over time)

Biometric Data

  • Fingerprint
  • Facial Patterns
  • Iris Patterns
  • Voice Patterns
  • Handwriting

Genetic Data

  • DNA
  • Family Genomics
  • Ethnographics

Health & Medical Data

  • Fitness, Diet and Wellness
  • Heart Rate
  • Condition
  • Treatment
  • Medical History
  • Medical Record Number
  • Insurance, Claims and Billing Information
  • Prescription Information
  • Height and Weight

Payment & Financial Information

  • Account Balance
  • Bank or Financial Account
  • Bank or Financial Institution
  • Commercial Decision Record
  • Credit Score or History
  • Customer ID
  • CV2/CVV2/Visual Cryptogram
  • Know Your Customer (KYC) Information
  • Payment Information (credit, debit, pay service)
  • Payment Service Code
  • Personal PIN or Access Code
  • Purchase, Order or Transaction Details
  • Tax and Filing Information

Other

  • Custom personal data categories and elements directed by DataGrail Customers

 

Need help?
If you have any questions, please reach out to your dedicated CSM or contact us at support@datagrail.io.

Disclaimer: The information contained in this message does not constitute legal advice. We advise seeking professional counsel before acting on or interpreting any material.