Ingestion Format
Before DataGrail can match DROP registry deletion requests against your consumer data, you need to deliver your consumer identifiers to DataGrail. This article covers file format specifications, delivery via cloud storage, and validation behavior.
Before starting, ensure you have:
- Completed Identifier Configuration and saved your DROP list selections
- A cloud storage bucket (AWS S3, GCS, or Azure Blob Storage) that DataGrail can read from
Ingestion Mode
DataGrail recommends delivering pre-hashed identifiers — you apply the DROP standardization and hashing rules to your identifiers before sending, and DataGrail stores and matches the hashes directly. This ensures raw PII never leaves your environment. See Hashing Algorithm for the exact rules you need to implement.
DataGrail can also accept clear (unhashed) identifiers and handle standardization and hashing on your behalf. This option requires prior discussion before implementation — contact support@datagrail.io to explore this path. Both modes use the same file format and delivery method described below — the difference is whether identifier fields contain clear values or pre-computed hashes.
Delivery: Cloud Storage Import
DataGrail ingests identifier files from a cloud storage bucket you own. You upload files to your bucket; DataGrail reads them.
Supported providers: AWS S3, Google Cloud Storage (GCS), Azure Blob Storage.
Setup
- Create a bucket (or dedicate a prefix in an existing bucket) for DataGrail DROP data.
- Grant DataGrail read-only access to the prefix:
  - AWS S3: IAM role with s3:GetObject + s3:ListBucket scoped to the prefix
  - GCS: Workload Identity Federation (WIF) or service account with storage.objects.get + storage.objects.list
  - Azure Blob: RBAC role Storage Blob Data Reader scoped to the container/prefix
- Configure the bucket and prefix in DataGrail under DROP Compliance > Settings.
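For AWS S3, the read-only grant described above could be expressed as the policy document built below. This is a minimal sketch: the function name build_read_only_policy and the bucket and prefix values are placeholders, and the way the role is shared with DataGrail (trust policy, principal) is not covered here and should come from DataGrail.

```python
import json


def build_read_only_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document granting read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads are limited to keys under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # ListBucket is a bucket-level action, so it is narrowed
                # to the prefix with an s3:prefix condition.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


policy_json = json.dumps(
    build_read_only_policy("your-bucket", "datagrail-drop"), indent=2
)
```

The resulting JSON can be attached to the IAM role you create for this integration.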
File Layout
s3://your-bucket/datagrail-drop/
├── manifest.json          ← declares which files to ingest
└── data/
    ├── part-0001.ndjson.gz
    ├── part-0002.ndjson.gz
    └── ...
Recommended File Size
Target 256 MB – 1 GB compressed per file. This range optimizes for parallel reads, resumability, and memory efficiency.
| Corpus size (compressed) | Recommended split | Resulting files |
|---|---|---|
| < 1 GB | Single file | 1 |
| 1–10 GB | ~500 MB each | 2–20 |
| 10–100 GB | ~1 GB each | 10–100 |
| > 100 GB | ~1 GB each | 100+ |
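The sizing table can be turned into a small planning helper. The sketch below assumes binary units (1 GB = 1024 MB), which this article does not specify, and the name planned_file_count is illustrative only.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def planned_file_count(compressed_bytes: int) -> int:
    """Number of data files suggested by the sizing table for a corpus."""
    if compressed_bytes < 1 * GB:
        return 1                  # small corpus: a single file
    if compressed_bytes < 10 * GB:
        target = 500 * MB         # 1-10 GB: roughly 500 MB per file
    else:
        target = 1 * GB           # 10 GB and up: roughly 1 GB per file
    return math.ceil(compressed_bytes / target)
```

For example, a 5 GB compressed corpus yields planned_file_count(5 * GB) == 11, which falls inside the 2–20 range in the table.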
Manifest
A manifest.json at the root of your prefix declares what to ingest. DataGrail polls for new manifests every 15 minutes by default. Presence of a new manifest triggers ingestion.
{
  "schema_version": "1.0",
  "record_type": "consumer_identifier_manifest",
  "broker_registration_id": "your-br-id-from-datagrail",
  "emitted_at": "2026-05-30T14:22:03Z",
  "format": "ndjson",
  "compression": "gzip",
  "files": [
    {
      "path": "data/part-0001.ndjson.gz",
      "size_bytes": 536870912,
      "sha256": "a1b2c3d4...",
      "record_count": 4200000
    }
  ],
  "total_record_count": 4200000
}
Manifest fields:
| Field | Required | Description |
|---|---|---|
| schema_version | yes | "1.0" |
| record_type | yes | "consumer_identifier_manifest" |
| broker_registration_id | yes | Your DataGrail broker registration ID |
| emitted_at | yes | ISO 8601 UTC timestamp of when you produced this batch |
| format | yes | "ndjson" |
| compression | yes | "gzip" (required for files >10 MB) or "none" |
| files[].path | yes | Relative to manifest directory. Must not contain ../ |
| files[].size_bytes | yes | Exact byte size of compressed file. Mismatch aborts ingest. |
| files[].sha256 | yes | SHA-256 hex digest of compressed file. Mismatch aborts ingest. |
| files[].record_count | yes | Number of records in the file |
| total_record_count | yes | Sum of all files[].record_count values |
DataGrail tracks processed manifests by (broker_registration_id, manifest_sha256). Re-uploading the same manifest is a no-op. To re-ingest, produce a new manifest even if data files are unchanged.
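Because size_bytes and sha256 must match the uploaded file exactly, it is safer to compute the manifest from the files than to write it by hand. The following is a sketch under stated assumptions: the helper names describe_file and build_manifest are hypothetical, and prefix_dir is a local directory that mirrors your bucket prefix before upload.

```python
import gzip
import hashlib
import os
from datetime import datetime, timezone


def describe_file(prefix_dir: str, rel_path: str) -> dict:
    """Build one files[] entry for a gzipped NDJSON file."""
    full_path = os.path.join(prefix_dir, rel_path)
    digest = hashlib.sha256()
    with open(full_path, "rb") as fh:
        # Hash the compressed bytes, since the manifest describes the file as uploaded.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    with gzip.open(full_path, "rt", encoding="utf-8") as fh:
        # One record per non-empty line.
        record_count = sum(1 for line in fh if line.strip())
    return {
        "path": rel_path,
        "size_bytes": os.path.getsize(full_path),
        "sha256": digest.hexdigest(),
        "record_count": record_count,
    }


def build_manifest(prefix_dir: str, rel_paths: list[str], broker_registration_id: str) -> dict:
    """Assemble a manifest whose totals are derived from the listed files."""
    files = [describe_file(prefix_dir, p) for p in rel_paths]
    return {
        "schema_version": "1.0",
        "record_type": "consumer_identifier_manifest",
        "broker_registration_id": broker_registration_id,
        "emitted_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "format": "ndjson",
        "compression": "gzip",
        "files": files,
        "total_record_count": sum(f["record_count"] for f in files),
    }
```

Serialize the result with json.dump and upload it as manifest.json only after all data files are in place, so that DataGrail never polls a manifest that references a missing file.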
Incremental Updates
For ongoing updates (new consumers, changed identifiers):
- Export only new or changed records since the last manifest
- Upload new data files to your bucket
- Write a new manifest.json referencing only the new files
DataGrail upserts records by (remote_identifier, hash_type) — existing records are updated, new records are inserted.
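The export step above can be sketched as a filter against the previous manifest's emitted_at value. This assumes your source rows carry a last-modified timestamp; the updated_at field and the function names are hypothetical and will differ in your system.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-05-30T14:22:03Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def changed_since(rows: list[dict], last_emitted_at: str) -> list[dict]:
    """Keep rows whose (hypothetical) updated_at is later than the previous manifest."""
    cutoff = parse_utc(last_emitted_at)
    return [row for row in rows if parse_utc(row["updated_at"]) > cutoff]
```

Only the rows returned by changed_since need to be written to new data files and listed in the next manifest.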
Record Format — Clear Identifiers
NDJSON (recommended)
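One JSON record per line, gzipped. The example below is pretty-printed for readability; in the file, each record occupies a single line.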
{
  "schema_version": "1.0",
  "record_type": "consumer_identifier",
  "emitted_at": "2026-05-30T14:22:03Z",
  "data": {
    "remote_identifier": "0035f00000A1B2CAAZ",
    "remote_identifier_kind": "external_id",
    "emails": ["alice@example.com", "alice.work@example.com"],
    "phones": ["+15551234567"],
    "name": {"first": "Alice", "last": "Smith"},
    "dob": "1985-03-12",
    "zip": "94103",
    "vins": ["1HGBH41JXMN109186"],
    "maids": ["A1B2C3D4-E5F6-7890-ABCD-EF1234567890"],
    "ctvids": []
  }
}
Data Fields (Clear Mode)
The following fields are available in clear identifier records:
| Field | Required | Type | Constraints |
|---|---|---|---|
| remote_identifier | yes | string | 1–256 chars. Your stable consumer pointer (e.g., Salesforce ID, database UUID). Returned to you on deletion. |
| remote_identifier_kind | yes | enum | One of external_id, row_uuid, email, phone |
| emails | no | string[] | 0–100 elements, each ≤254 chars |
| phones | no | string[] | 0–100 elements, each ≤32 chars. Any format; DataGrail normalizes. |
| name | no | object | {"first": "...", "last": "..."}, both 1–100 chars |
| dob | no | string | YYYY-MM-DD preferred; MM/DD/YYYY also accepted |
| zip | no | string | Any format; DataGrail normalizes |
| vins | no | string[] | 0–100 elements, each exactly 17 chars |
| maids | no | string[] | 0–100 elements, each ≤64 chars |
| ctvids | no | string[] | 0–100 elements, each ≤256 chars |
Only populate fields relevant to your enabled DROP list types. For example, if you're registered for Email + NDZ lists only, populate emails, name, dob, and zip. Missing fields are fine — DataGrail only hashes what's present.
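A data file in this format can be produced with the standard library alone. The sketch below is illustrative: clear_record and write_ndjson_gz are hypothetical helper names, and the choice to drop empty fields follows the note above that missing fields are acceptable.

```python
import gzip
import json


def clear_record(remote_identifier, remote_identifier_kind, emitted_at, **identifiers):
    """Wrap identifier fields in the record envelope, omitting empty ones."""
    data = {
        "remote_identifier": remote_identifier,
        "remote_identifier_kind": remote_identifier_kind,
    }
    # Only fields with a value are included.
    data.update({name: value for name, value in identifiers.items() if value})
    return {
        "schema_version": "1.0",
        "record_type": "consumer_identifier",
        "emitted_at": emitted_at,
        "data": data,
    }


def write_ndjson_gz(records, path):
    """Write records as gzipped NDJSON, one compact JSON object per line."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as fh:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            fh.write(json.dumps(record, separators=(",", ":"), ensure_ascii=False))
            fh.write("\n")
            count += 1
    return count
```

The return value of write_ndjson_gz is the number of records written, which is the value the manifest expects in files[].record_count.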
Record Format — Pre-Hashed Identifiers
Same envelope, but the data payload contains hashes instead of clear values. Set "hashed": true to signal pre-hashed mode.
{
  "schema_version": "1.0",
  "record_type": "consumer_identifier",
  "emitted_at": "2026-05-30T14:22:03Z",
  "data": {
    "remote_identifier": "0035f00000A1B2CAAZ",
    "remote_identifier_kind": "external_id",
    "hashed": true,
    "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="],
    "phone_hashes": ["ptzVkgbv9DonwvPCHmXmJ2SEOaolSh37z3ZzY/Gmm+U="],
    "ndz_hashes": ["PQOfn1RffEKmqMmNAzDKKaoZCwxWbQZkQzPWmQo9REA="],
    "name_vin_hashes": ["rtnDuXIe63jXYQQXW5r07GJ7lSsrib8+46QuKFwkOmk="],
    "maid_hashes": [],
    "ctvid_hashes": []
  }
}
Data Fields (Pre-Hashed Mode)
The following fields are available in pre-hashed records:
| Field | Required | Type | Constraints |
|---|---|---|---|
| remote_identifier | yes | string | Same as clear mode |
| remote_identifier_kind | yes | enum | Same as clear mode |
| hashed | yes | boolean | Must be true |
| email_hashes | no | string[] | SHA-256 Base64 hashes of standardized emails |
| phone_hashes | no | string[] | SHA-256 Base64 hashes of standardized phones |
| ndz_hashes | no | string[] | Composite NDZ hashes (first+last+dob+zip) |
| name_vin_hashes | no | string[] | Composite NameVIN hashes (first+last+vin) |
| maid_hashes | no | string[] | SHA-256 Base64 hashes of standardized MAIDs |
| ctvid_hashes | no | string[] | SHA-256 Base64 hashes of standardized CTVIDs |
Each hash must be a valid Base64-encoded SHA-256 digest: 44 characters, the last of which is a single = padding character. See Hashing Algorithm for the exact standardization and hashing rules.
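A format check before upload can catch hex digests or truncated values that would otherwise be rejected. The sketch below covers only the encoding: it does not implement the standardization rules from Hashing Algorithm, and the UTF-8 encoding used in sha256_base64 is an assumption. Both function names are hypothetical.

```python
import base64
import hashlib
import re

# A 32-byte digest in padded Base64 is 43 alphabet characters plus one "=".
_HASH_SHAPE = re.compile(r"[A-Za-z0-9+/]{43}=")


def is_sha256_base64(value: str) -> bool:
    """True if value has the shape of a padded Base64 SHA-256 digest."""
    if not _HASH_SHAPE.fullmatch(value):
        return False
    return len(base64.b64decode(value)) == 32


def sha256_base64(standardized: str) -> str:
    """Hash an already-standardized identifier (UTF-8 encoding is an assumption)."""
    digest = hashlib.sha256(standardized.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")
```

Running is_sha256_base64 over every value in the *_hashes arrays before writing a file is a cheap way to stay under the rejection threshold described in the next section.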
Validation & Error Handling
The following table describes how DataGrail handles common validation conditions:
| Condition | Behavior |
|---|---|
| Required fields missing | Record rejected |
| Invalid hash format (pre-hashed mode) | Record rejected |
| Array exceeds 100 elements | Record rejected |
| String exceeds length limit | Record rejected |
| Rejection rate >1% of records in a file | File halted — fix and re-upload |
| Rejection rate ≤1% | File succeeds; rejected records reported in ingest results |
| Manifest checksum mismatch | Ingest aborted for that file |
| File missing from bucket | Ingest aborted; error surfaced in admin UI |
| Duplicate manifest | No-op — not double-processed |
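The record-level rules above can be approximated locally before upload. This is a pre-flight sketch for clear-mode records based on the constraints listed in this article; it is not DataGrail's validator, the function names are hypothetical, and the real checks may be stricter.

```python
REQUIRED = ("remote_identifier", "remote_identifier_kind")
KINDS = {"external_id", "row_uuid", "email", "phone"}
# Per-element length limits for the array fields in clear mode.
MAX_LEN = {"emails": 254, "phones": 32, "maids": 64, "ctvids": 256}


def record_errors(record: dict) -> list[str]:
    """Return reasons a clear-mode record would likely be rejected (empty if none)."""
    data = record.get("data", {})
    errors = [f"missing {name}" for name in REQUIRED if not data.get(name)]
    if len(data.get("remote_identifier", "")) > 256:
        errors.append("remote_identifier longer than 256 chars")
    kind = data.get("remote_identifier_kind")
    if kind and kind not in KINDS:
        errors.append(f"unknown remote_identifier_kind {kind!r}")
    for field, limit in MAX_LEN.items():
        values = data.get(field, [])
        if len(values) > 100:
            errors.append(f"{field} has more than 100 elements")
        if any(len(v) > limit for v in values):
            errors.append(f"{field} element longer than {limit} chars")
    vins = data.get("vins", [])
    if len(vins) > 100:
        errors.append("vins has more than 100 elements")
    if any(len(v) != 17 for v in vins):
        errors.append("vins element is not exactly 17 chars")
    return errors


def file_would_halt(rejected: int, total: int) -> bool:
    """True when more than 1% of the records in a file are rejected."""
    # Integer arithmetic avoids floating-point edge cases at exactly 1%.
    return rejected * 100 > total
```

A file with 100 rejected records out of 10,000 sits exactly at 1% and would still succeed under the table above, while 101 rejected records would halt it.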
Minimal Examples
Smallest valid record (clear, email-only):
{"schema_version": "1.0", "record_type": "consumer_identifier", "emitted_at": "2026-05-30T00:00:00Z", "data": {"remote_identifier": "cust-001", "remote_identifier_kind": "row_uuid", "emails": ["alice@example.com"]}}
Smallest valid record (pre-hashed, email-only):
{"schema_version": "1.0", "record_type": "consumer_identifier", "emitted_at": "2026-05-30T00:00:00Z", "data": {"remote_identifier": "cust-001", "remote_identifier_kind": "row_uuid", "hashed": true, "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="]}}
Smallest valid manifest:
{"schema_version": "1.0", "record_type": "consumer_identifier_manifest", "broker_registration_id": "br-001", "emitted_at": "2026-05-30T00:00:00Z", "format": "ndjson", "compression": "gzip", "files": [{"path": "data/part-0001.ndjson.gz", "size_bytes": 1048576, "sha256": "abc123...", "record_count": 10000}], "total_record_count": 10000}
Next Steps
- Hashing Algorithm — required reading if using pre-hashed ingestion
- Testing & Validation — run end-to-end test requests to verify your ingestion and matching configuration before go-live
Disclaimer: The information contained in this message does not constitute legal advice. We advise seeking professional counsel before acting on or interpreting any material.