Bring Your Own Integration

Overview

To scan your data systems, DataGrail offers out-of-the-box connectors with the RDD Agent for immediate use, alongside an OpenAPI specification and tooling that enable you to quickly build support for any additional or proprietary systems.

What Systems Are Supported

To optimize the onboarding experience for all our customers, DataGrail aims to support the most commonly used data systems out-of-the-box. To identify the most commonly used systems, DataGrail uses internal data as well as online rankings.

The current list of supported systems includes:

  • AWS Athena (Glue Databases)
  • BigQuery
  • Databricks Lakehouse
  • DynamoDB
  • MongoDB
  • DocumentDB
  • MySQL
  • MariaDB
  • PostgreSQL
  • Redshift
  • Snowflake
  • MS SQL Server
  • Azure SQL Database

Looking for a different integration?

Because DataGrail is continually adding integrations, reach out to your Account Manager for the most up-to-date list.

Customers can expect DataGrail to natively support roughly the top 50 most commonly used relational and NoSQL systems according to DB-Engines.

Connecting to Any System

While our Agent offers direct connectors to a wide range of popular databases and data platforms, we understand that your data ecosystem might include unique, proprietary systems or specialized data sources that are not natively supported out-of-the-box.

For these scenarios, DataGrail provides a robust and flexible solution: an OpenAPI (formerly Swagger) specification and accompanying tooling.

Instead of direct data store connectivity, you can leverage our OpenAPI spec to build a lightweight API service that acts as an intermediary between the RDD Agent and your specific data system.

Developing Your Integration Service

The OpenAPI specification defines the API contract our Agent expects when communicating with a data source. This specification outlines the required endpoints, request formats, response structures, and authentication mechanisms.

Your integration service will typically need to implement four key endpoints. Each of these endpoints is designed to be highly flexible and will be parameterized by an underlying data source ID, allowing a single service to manage connections to multiple data store instances.
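As a sketch of how one service can front multiple data stores, the snippet below looks up connection details by the data source ID supplied with each request. The registry name, its shape, and `resolve_data_source` are illustrative assumptions, not part of the DataGrail OpenAPI spec.

```python
# Minimal sketch: one service, many data stores, keyed by data source ID.
# The registry shape and all names here are illustrative assumptions;
# consult the DataGrail OpenAPI spec for the actual request contract.

DATA_SOURCES = {
    # data_source_id -> connection details for the underlying system
    "orders-db": {"kind": "sqlite", "dsn": ":memory:"},
    "users-db": {"kind": "sqlite", "dsn": ":memory:"},
}

def resolve_data_source(data_source_id: str) -> dict:
    """Map the data source ID sent by the RDD Agent to connection details."""
    try:
        return DATA_SOURCES[data_source_id]
    except KeyError:
        raise ValueError(f"unknown data source: {data_source_id}")
```

Each endpoint handler would call a resolver like this first, then run its query against the resolved system.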

/test_connection

Purpose: This endpoint allows our Agent to verify connectivity, authentication, and authorization with your underlying data source.

Your Service's Role: Your service will receive a request from the RDD Agent and, in response, issue a simple, non-destructive query or request to your proprietary data system. This confirms that the credentials (inaccessible to DataGrail) are valid and the data source is reachable.
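A minimal handler for this check might look like the following. Here sqlite3 stands in for the proprietary data system, and the response keys are illustrative assumptions rather than the spec's actual shape.

```python
import sqlite3

def handle_test_connection(dsn: str) -> dict:
    """Verify connectivity with a simple, non-destructive query.

    sqlite3 stands in for the proprietary data system; swap in your
    own client library. The response keys are illustrative.
    """
    try:
        conn = sqlite3.connect(dsn)
        conn.execute("SELECT 1")  # cheap read: proves auth and reachability
        conn.close()
        return {"status": "ok"}
    except sqlite3.Error as exc:
        return {"status": "error", "message": str(exc)}
```

The key design point is that the query must be cheap and read-only; it only needs to prove the credentials work and the system answers.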

/schema

Purpose: Our Agent needs to understand the structure of your data. This endpoint is crucial for data discovery and classification.

Your Service's Role: When called, your service will query or request essential metadata. This includes details like table or object names, schema names, column names, and their corresponding data types.
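A sketch of such a metadata handler is below, again with sqlite3's catalog standing in for your system's metadata API. The payload shape (`table`, `columns`, `name`, `type` keys) is an assumption for illustration, not the DataGrail contract.

```python
import sqlite3

def handle_schema(conn: sqlite3.Connection) -> list[dict]:
    """Return table, column, and data-type metadata for discovery.

    sqlite3 catalog queries stand in for your system's metadata API;
    the payload shape here is illustrative, not the DataGrail spec.
    """
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return [
        {
            "table": table,
            "columns": [
                {"name": col[1], "type": col[2]}  # PRAGMA row: (cid, name, type, ...)
                for col in conn.execute(f"PRAGMA table_info({table})")
            ],
        }
        for table in tables
    ]

# Demo setup: a stand-in table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
```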

/record_count

Purpose: Provides the RDD Agent with an understanding of the volume of data within a specific table or object.

Your Service's Role: This endpoint should return the record/object count for a requested table or object store. For very large datasets, providing an approximation is perfectly acceptable and often preferred for performance.
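A count handler could be as small as the sketch below. sqlite3 stands in for the real system, and the response keys are assumptions; for very large stores you would return an estimate from catalog statistics instead and flag it as approximate.

```python
import sqlite3

def handle_record_count(conn: sqlite3.Connection, table: str) -> dict:
    """Return the row count for one table.

    An exact COUNT(*) is fine for a small stand-in table; for very
    large stores, prefer an estimate from your engine's catalog
    statistics and set "approximate" accordingly. Validate the table
    name against your known schema before interpolating it into SQL.
    """
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return {"table": table, "count": count, "approximate": False}

# Demo setup: a stand-in table with a few rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(5)])
```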

/sample_data

Purpose: Offers a quick snapshot of the actual data, essential for more accurate preprocessing and classification of the underlying data.

Your Service's Role: Your service will be asked to return a statistically significant and random (where possible) set of records from a specified table or object. To balance utility with performance, this sample typically numbers no more than 20,000 records.
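A sampling handler might look like the sketch below. `ORDER BY RANDOM()` is acceptable for a small stand-in sqlite table; large production systems usually need a cheaper strategy (for example `TABLESAMPLE`, where the engine supports it). The function name and record shape are illustrative assumptions.

```python
import sqlite3

def handle_sample_data(conn: sqlite3.Connection, table: str,
                       limit: int = 20_000) -> list[dict]:
    """Return up to `limit` randomly ordered records from one table.

    ORDER BY RANDOM() is fine for a small stand-in table; large
    systems usually need a cheaper sampling strategy. Validate the
    table name against your known schema before interpolating it.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?", (limit,))
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Demo setup: a stand-in table with 100 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(100)])
```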

Network Configuration

Diagram: Data Discovery API Architecture and Workflow

The service can be deployed in its own containerized service alongside the RDD agent within your environment. It will need ingress from the RDD agent only, and egress to any relevant data systems.

Additionally, we recommend using a load balancer or proxy that accepts TLS connections. While not strictly required, since no public ingress or egress is involved, this adds a layer of security and follows best practice for internal communication.


Getting Started

To streamline the development process, we offer sample code, the ability to leverage OpenAPI spec code generators, and carefully-tuned GenAI prompts to guide your custom integration.

Depending on the system, and with the support of our technical staff, you can expect development and testing of a new system to take about 1-3 days.

Reach out to your Solutions Architect for more details and support.

 

Need help?
If you have any questions, please reach out to your dedicated Account Manager or contact us at support@datagrail.io.

Disclaimer: The information contained in this message does not constitute legal advice. We advise seeking professional counsel before acting on or interpreting any material.