Responsible Data Discovery - Agent Setup
Overview
Responsible Data Discovery is built to give you a more accurate view of your organization's privacy risk. It gives privacy managers automated, up-to-date inventory reports and better-informed impact assessments. It powers the fulfillment of comprehensive access and deletion requests. And it makes your risk-reduction initiatives more proactive, with better-informed policies and controls around your data processing.
To achieve this, DataGrail uses one or more containerized agents that you can run in your network to connect with your data sources, scan them, and then pre-process data for classification.
To get started, you’ll need your team to accomplish the following tasks (each of which will be described in more detail in subsequent sections):
- Source the Docker image provided by DataGrail
- Configure the agent
- Create and run the containerized service
1. Sourcing the Docker Image
The Docker image for the data discovery agent is hosted in the DataGrail ECR repository (ask about other registries), to which you will be granted access. Pull the image using its full URI with an explicit version specified. Ask your engagement manager or CSM for this information.
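As a sketch, pulling from ECR might look like the following; the account ID, region, repository name, and tag are placeholder assumptions that your DataGrail contact will replace with real values.

```shell
# Hypothetical registry values; your engagement manager or CSM will
# provide the actual account, repository, and version.
REGISTRY="123456789012.dkr.ecr.us-west-2.amazonaws.com"
IMAGE="${REGISTRY}/datagrail/rdd-agent:1.2.3"

# Pin an explicit version tag rather than "latest" so deployments
# are reproducible; verify one is present before pulling.
case "${IMAGE##*/}" in
  *:*) echo "version tag pinned" ;;
  *)   echo "missing version tag" >&2; exit 1 ;;
esac

# Then authenticate and pull (requires AWS credentials with ECR read access):
#   aws ecr get-login-password --region us-west-2 \
#     | docker login --username AWS --password-stdin "$REGISTRY"
#   docker pull "$IMAGE"
```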
2. Agent Configuration
To configure the agent, you will need to perform the following steps, detailed in subsequent sections.
- Create a new Agent in the DataGrail application
- Obtain a DataGrail API Key
- Configure the container template
- Configure access to your secrets vault
DataGrail Agent
The first step in enabling RDD is to create an agent within the DataGrail application. Each instance of an agent will enable you to add and configure systems you’d like to connect and scan.
- Via the left-hand menu, visit Integration Network > Agents
- Click Add New Agent
- Give the agent a name that is easy to remember and that can be associated with a private network, region, or type of system
- For Agent Type, select Data Discovery
- Click Add New Agent
DataGrail API Key
For an agent to securely post to the main DataGrail application, it will need an API key to authorize its requests. This API key should be stored in your vault. For more information on how to configure various vaults, you can reference DataGrail’s docs: https://github.com/datagrail/datagrail-agent-docs
- Within a newly created agent’s page, click Generate API Key
- Give the Key a convenient name
- Copy and store the API key in your secrets vault
Note that API keys cannot be viewed or copied once saved. If you do lose a key, you can always generate a new one. DataGrail also recommends rotating API keys regularly in accordance with your policies.
To configure the secret in your vault, use the following format:
- Secret type: "Other type of secret"
- Key/value pairs: token: <API key copied above>
- Secret name: datagrail-api-key
- Description: <description for the secret>
Note that the key must be token and its value the API key; the remaining fields apply only to some vaults and aid in identification of secrets.
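As an illustrative sketch, the payload could be built and sanity-checked like this before storing it; the key value and the AWS CLI invocation in the comments are assumptions, not DataGrail-provided commands.

```shell
# Placeholder API key; use the key you copied from the DataGrail app.
API_KEY="dg_example_api_key"

# The secret body must be a single key/value pair whose key is "token".
SECRET_JSON="{\"token\": \"${API_KEY}\"}"

# Sanity-check that the payload parses as JSON before storing it.
echo "$SECRET_JSON" | python3 -m json.tool > /dev/null && echo "payload OK"

# Store it in your vault, e.g. AWS Secrets Manager (illustrative sketch):
#   aws secretsmanager create-secret \
#     --name datagrail-api-key \
#     --description "DataGrail RDD agent API key" \
#     --secret-string "$SECRET_JSON"
```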
Container Template
With the agent created in the DataGrail app, it’s now time to configure the container that will run within your network.
DataGrail recommends running this with your preferred container orchestration platform, such as Kubernetes, or a serverless platform such as AWS ECS or Google Cloud Run. Terraform configuration files are available for some cloud platforms. Ask your DataGrail engagement manager for information.
Each container requires that an environment variable be configured as follows:
DATAGRAIL_AGENT_CONFIG='{
    "customer_domain": "yoursubdomain.datagrail.io",
    "datagrail_credentials_location": "<vault location of the DataGrail API key>",
    "platform": {
        "credentials_manager": {
            "provider": "<AWSSSMParameterStore|AWSSecretsManager|JSONFile|GCP|AzureKeyVault>",
            "options": {
                "<provider-specific option>": "<value; e.g. GCP requires project_id, AzureKeyVault requires secret_vault>"
            }
        }
    }
}'
It is strongly recommended that one API key be used per containerized service for security and debuggability purposes.
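As a minimal sketch, the template can be filled in and sanity-checked before deployment; the domain, secret location, and provider below are placeholder assumptions.

```shell
# Placeholder values: substitute your tenant domain, vault secret
# location, and credentials provider.
export DATAGRAIL_AGENT_CONFIG='{
  "customer_domain": "yoursubdomain.datagrail.io",
  "datagrail_credentials_location": "datagrail-api-key",
  "platform": {
    "credentials_manager": {
      "provider": "AWSSecretsManager",
      "options": {}
    }
  }
}'

# Fail fast on malformed JSON here rather than at container start.
echo "$DATAGRAIL_AGENT_CONFIG" | python3 -m json.tool > /dev/null \
  && echo "config OK"
```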
Below is a detailed explanation of the JSON fields above.
customer_domain
Your DataGrail-registered tenant domain, e.g., customer.datagrail.io. This is the hostname to which the agent will post results via the RDD API.
datagrail_credentials_location
The vault location of the DataGrail API key generated per the instructions above, used to authorize requests to obtain system configurations, post back scan results, and more.
platform
The secrets/credentials and cloud storage platforms used to deploy the agent. For the purposes of data discovery, only the credentials manager needs to be configured. The following platforms are supported: AWS Parameter Store, AWS Secrets Manager, GCP Secret Manager, JSON files via attached volumes, Azure Key Vault, and more. For the latest, see the GitHub docs: https://github.com/datagrail/datagrail-agent-docs.
If your vault service is currently not supported, reach out to DataGrail support.
credentials_manager
- provider: the name of the class providing credentials access, with the Credentials prefix removed; e.g., for the class CredentialsJSONFile, use JSONFile
- options: a hash/dictionary of options. Optional overall, but some providers have required fields; e.g., GCP requires "project_id"
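For instance, a credentials_manager entry for GCP Secret Manager might look like the fragment below; the project ID is a placeholder.

```json
"credentials_manager": {
    "provider": "GCP",
    "options": {
        "project_id": "my-gcp-project"
    }
}
```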
3. Running the Agent
After you’ve configured the agent, you’re now ready to run the service. Below are general guidelines to ensure that the service is able to run with proper roles and permissions. Your onboarding specialist will be supporting you with recommendations tailored to your needs.
Network Settings
DataGrail designed data discovery agents with flexibility and security in mind, allowing you to deploy agents where your data sources are hosted to minimize security and privacy risks.
Recommended Configuration:
- One or more agents per virtual private network or data center where data sources are readily reachable and no bridged or proxied connections are necessary
- Run in a private subnet with no ingress
- Egress rules can be set to only allow connections to the DataGrail API (ip: 52.36.177.91) and the source of the docker image
In short, each agent can be configured with egress to the desired data sources, the image repository, and the DataGrail API, and nothing else.
DataGrail has Terraform configuration files to help you get started. Please reach out to your engagement manager for more information.
Minimum System Requirements
Each agent should be configured with at least the following requirements:
- Cores: 4+
- Memory: 8GB+
- Disk storage: 20GB
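On AWS ECS Fargate, for example, these requirements could map to a task definition fragment like the sketch below; the values are illustrative assumptions (4 vCPU, 8GB memory, and Fargate's minimum ephemeral storage override of 21 GiB to cover the 20GB disk requirement).

```json
{
    "family": "datagrail-rdd-agent",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "4096",
    "memory": "8192",
    "ephemeralStorage": { "sizeInGiB": 21 }
}
```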
Note that each scan can take minutes or up to a few hours depending on the volume of your data sources, the number of data elements (columns), and more.
Please consult with DataGrail support for more tailored recommendations as needed.
Logging
Persisting logs for up to 30 days is strongly recommended for debugging purposes. You can use logs to identify configuration issues during setup. For other errors or exceptions, feel free to reach out to DataGrail support with detailed error messages and stacktraces.
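On ECS, for example, container logs can be shipped to CloudWatch with a log configuration like the fragment below; the group name, region, and prefix are placeholder assumptions. The 30-day retention is then set on the log group itself, e.g. with aws logs put-retention-policy --log-group-name /ecs/datagrail-rdd-agent --retention-in-days 30.

```json
"logConfiguration": {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/datagrail-rdd-agent",
        "awslogs-region": "us-west-2",
        "awslogs-stream-prefix": "rdd-agent"
    }
}
```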
Running the Service
Once you run the containerized service, you should see a green check mark in the DataGrail app confirming there’s a successful connection.
Disclaimer: The information contained in this message does not constitute legal advice. We would advise seeking professional counsel before acting on or interpreting any material.