Ingestion Format

Before DataGrail can match DROP registry deletion requests against your consumer data, you need to deliver your consumer identifiers to DataGrail. This article covers file format specifications, delivery via cloud storage, and validation behavior.

Prerequisites

Before starting, ensure you have:

  • Completed Identifier Configuration and saved your DROP list selections
  • A cloud storage bucket (AWS S3, GCS, or Azure Blob Storage) that DataGrail can read from

Ingestion Mode

DataGrail recommends delivering pre-hashed identifiers — you apply the DROP standardization and hashing rules to your identifiers before sending, and DataGrail stores and matches the hashes directly. This ensures raw PII never leaves your environment. See Hashing Algorithm for the exact rules you need to implement.

Clear identifier ingestion is available

DataGrail can also accept clear (unhashed) identifiers and handle standardization and hashing on your behalf. This option requires prior discussion before implementation — contact support@datagrail.io to explore this path. Both modes use the same file format and delivery method described below — the difference is whether identifier fields contain clear values or pre-computed hashes.


Delivery: Cloud Storage Import

DataGrail ingests identifier files from a cloud storage bucket you own. You upload files to your bucket; DataGrail reads them.

Supported providers: AWS S3, Google Cloud Storage (GCS), Azure Blob Storage.

Setup

  1. Create a bucket (or dedicate a prefix in an existing bucket) for DataGrail DROP data.
  2. Grant DataGrail read-only access to the prefix:
    • AWS S3: IAM role with s3:GetObject + s3:ListBucket scoped to the prefix
    • GCS: Workload Identity Federation (WIF) or service account with storage.objects.get + storage.objects.list
    • Azure Blob: RBAC role Storage Blob Data Reader scoped to the container/prefix
  3. Configure the bucket and prefix in DataGrail under DROP Compliance > Settings.
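For AWS S3, the read-only grant in step 2 could be expressed as a permissions policy along these lines. This is a minimal sketch, not DataGrail's official policy: the bucket name `your-bucket` and prefix `datagrail-drop` are placeholders, the helper `build_read_only_policy` is hypothetical, and the trust relationship that lets DataGrail assume the role is not shown because this article does not specify it.

```python
import json


def build_read_only_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM permissions policy limited to reading one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object reads are allowed only under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # s3:ListBucket targets the bucket itself, so the prefix
                # is enforced through the s3:prefix condition key.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


# Placeholder names; substitute your own bucket and prefix.
policy = build_read_only_policy("your-bucket", "datagrail-drop")
print(json.dumps(policy, indent=2))
```

Splitting the two actions into separate statements keeps the listing permission from exposing keys outside the DROP prefix.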

File Layout

s3://your-bucket/datagrail-drop/
├── manifest.json              ← declares which files to ingest
└── data/
    ├── part-0001.ndjson.gz
    ├── part-0002.ndjson.gz
    └── ...

Target 256 MB – 1 GB compressed per file. This range optimizes for parallel reads, resumability, and memory efficiency.
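The split guidance in the table that follows can be turned into a small planning helper. This sketch assumes decimal units (1 GB = 10^9 bytes) and treats the boundaries as "up to and including 10 GB uses ~500 MB parts"; the function name `plan_split` is hypothetical, and the part sizes are targets rather than hard limits.

```python
import math

MB = 10**6
GB = 10**9


def plan_split(corpus_bytes: int) -> tuple[int, int]:
    """Return (target_part_bytes, file_count) for a compressed corpus size."""
    if corpus_bytes < 1 * GB:
        # Small corpora are delivered as a single file.
        return corpus_bytes, 1
    # ~500 MB parts up to 10 GB, ~1 GB parts beyond that.
    target = 500 * MB if corpus_bytes <= 10 * GB else 1 * GB
    return target, math.ceil(corpus_bytes / target)


for size in (500 * MB, 5 * GB, 50 * GB):
    part_bytes, count = plan_split(size)
    print(f"{size / GB:.1f} GB -> {count} file(s) of ~{part_bytes / MB:.0f} MB")
```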

| Corpus size (compressed) | Recommended split | Resulting files |
| --- | --- | --- |
| < 1 GB | Single file | 1 |
| 1–10 GB | ~500 MB each | 2–20 |
| 10–100 GB | ~1 GB each | 10–100 |
| > 100 GB | ~1 GB each | 100+ |

Manifest

A manifest.json at the root of your prefix declares what to ingest. DataGrail polls for new manifests every 15 minutes by default. Presence of a new manifest triggers ingestion.

{
  "schema_version": "1.0",
  "record_type": "consumer_identifier_manifest",
  "broker_registration_id": "your-br-id-from-datagrail",
  "emitted_at": "2026-05-30T14:22:03Z",
  "format": "ndjson",
  "compression": "gzip",
  "files": [
    {
      "path": "data/part-0001.ndjson.gz",
      "size_bytes": 536870912,
      "sha256": "a1b2c3d4...",
      "record_count": 4200000
    }
  ],
  "total_record_count": 4200000
}

Manifest fields:

| Field | Required | Description |
| --- | --- | --- |
| schema_version | yes | "1.0" |
| record_type | yes | "consumer_identifier_manifest" |
| broker_registration_id | yes | Your DataGrail broker registration ID |
| emitted_at | yes | ISO 8601 UTC timestamp of when you produced this batch |
| format | yes | "ndjson" |
| compression | yes | "gzip" (required for files >10 MB) or "none" |
| files[].path | yes | Relative to manifest directory. Must not contain ../ |
| files[].size_bytes | yes | Exact byte size of compressed file. Mismatch aborts ingest. |
| files[].sha256 | yes | SHA-256 hex digest of compressed file. Mismatch aborts ingest. |
| files[].record_count | yes | Number of records in the file |
| total_record_count | yes | Sum of all files[].record_count values |
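Because a size or checksum mismatch aborts ingest, it is usually safer to compute the manifest from the files on disk than to write it by hand. The following is a sketch under stated assumptions: the helpers `describe_file` and `build_manifest` are hypothetical, record counts are supplied by the caller (for example, counted while writing each part), and the demo at the end uses a throwaway temporary file and the placeholder ID `br-001`.

```python
import gzip
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def describe_file(root: Path, rel_path: str, record_count: int) -> dict:
    """Build one files[] entry from the compressed file on disk."""
    full = root / rel_path
    digest = hashlib.sha256()
    with full.open("rb") as fh:
        # Stream in 1 MiB chunks so large parts never sit fully in memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "path": rel_path,  # relative to the manifest directory
        "size_bytes": full.stat().st_size,
        "sha256": digest.hexdigest(),
        "record_count": record_count,
    }


def build_manifest(root: Path, broker_registration_id: str, files: dict[str, int]) -> dict:
    """files maps each relative path to its record count."""
    entries = [describe_file(root, path, count) for path, count in files.items()]
    return {
        "schema_version": "1.0",
        "record_type": "consumer_identifier_manifest",
        "broker_registration_id": broker_registration_id,
        "emitted_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "format": "ndjson",
        "compression": "gzip",
        "files": entries,
        "total_record_count": sum(e["record_count"] for e in entries),
    }


# Demo with a throwaway one-record file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    with gzip.open(root / "data" / "part-0001.ndjson.gz", "wt", encoding="utf-8") as fh:
        fh.write('{"example": 1}\n')
    manifest = build_manifest(root, "br-001", {"data/part-0001.ndjson.gz": 1})

print(json.dumps(manifest, indent=2))
```

Uploading the data files first and the manifest last avoids a poll picking up a manifest whose files are not yet in the bucket.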

Idempotency

DataGrail tracks processed manifests by (broker_registration_id, manifest_sha256). Re-uploading the same manifest is a no-op. To re-ingest, produce a new manifest even if data files are unchanged.
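The practical consequence is that any change to the manifest content, such as a fresh emitted_at, yields a different digest and therefore a new ingest. The sketch below illustrates this under an assumption: it treats manifest_sha256 as the SHA-256 of the manifest's serialized bytes, which this article does not spell out, and `manifest_digest` is a hypothetical helper rather than DataGrail's implementation.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """SHA-256 hex digest of a deterministically serialized manifest.

    Assumption: this stands in for DataGrail's manifest_sha256.
    """
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = {
    "broker_registration_id": "br-001",
    "emitted_at": "2026-05-30T00:00:00Z",
    "files": [{"path": "data/part-0001.ndjson.gz", "record_count": 10000}],
}

first = manifest_digest(base)
same_again = manifest_digest(dict(base))  # identical content
re_ingest = manifest_digest({**base, "emitted_at": "2026-05-31T00:00:00Z"})

print(first == same_again)  # identical manifest: treated as a duplicate
print(first == re_ingest)   # new emitted_at: treated as a new manifest
```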

Incremental Updates

For ongoing updates (new consumers, changed identifiers):

  1. Export only new or changed records since the last manifest
  2. Upload new data files to your bucket
  3. Write a new manifest.json referencing only the new files

DataGrail upserts records by (remote_identifier, hash_type) — existing records are updated, new records are inserted.
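Step 1 above can be sketched as a filter on your own change-tracking column. This example assumes each source row carries an `updated_at` ISO 8601 UTC timestamp, which is a hypothetical field in your system and not part of the DataGrail record format, and that the previous manifest's emitted_at serves as the cutoff.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def changed_since(rows: list[dict], last_emitted_at: str) -> list[dict]:
    """Keep only rows modified after the previous manifest was emitted."""
    cutoff = parse_utc(last_emitted_at)
    return [row for row in rows if parse_utc(row["updated_at"]) > cutoff]


rows = [
    {"remote_identifier": "cust-001", "updated_at": "2026-05-29T10:00:00Z"},
    {"remote_identifier": "cust-002", "updated_at": "2026-05-30T15:00:00Z"},
]

delta = changed_since(rows, "2026-05-30T14:22:03Z")
print([row["remote_identifier"] for row in delta])
```

Because ingestion upserts, resending a record that did not actually change is harmless, so a slightly generous cutoff is safer than a tight one.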


Record Format — Clear Identifiers

One JSON record per line, gzipped.

{
  "schema_version": "1.0",
  "record_type": "consumer_identifier",
  "emitted_at": "2026-05-30T14:22:03Z",
  "data": {
    "remote_identifier": "0035f00000A1B2CAAZ",
    "remote_identifier_kind": "external_id",
    "emails": ["alice@example.com", "alice.work@example.com"],
    "phones": ["+15551234567"],
    "name": {"first": "Alice", "last": "Smith"},
    "dob": "1985-03-12",
    "zip": "94103",
    "vins": ["1HGBH41JXMN109186"],
    "maids": ["A1B2C3D4-E5F6-7890-ABCD-EF1234567890"],
    "ctvids": []
  }
}
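The example above is pretty-printed for readability; in a data file each record occupies exactly one line. A minimal sketch of writing gzipped NDJSON with the standard library follows. The helper name `write_ndjson_gz` and the sample records are illustrative assumptions; the returned count can feed files[].record_count in the manifest.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_ndjson_gz(path: Path, records: Iterable[dict]) -> int:
    """Write one compact JSON object per line, gzipped. Returns the record count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as fh:
        for record in records:
            # json.dumps never emits raw newlines, so each record stays on one line.
            fh.write(json.dumps(record, separators=(",", ":")) + "\n")
            count += 1
    return count


records = [
    {
        "schema_version": "1.0",
        "record_type": "consumer_identifier",
        "emitted_at": "2026-05-30T00:00:00Z",
        "data": {
            "remote_identifier": f"cust-00{i}",
            "remote_identifier_kind": "row_uuid",
            "emails": ["alice@example.com"],
        },
    }
    for i in (1, 2)
]

with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "part-0001.ndjson.gz"
    written = write_ndjson_gz(part, records)
    # Read the file back to confirm one record per line.
    with gzip.open(part, "rt", encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(written, len(lines))
```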

Data Fields (Clear Mode)

The following fields are available in clear identifier records:

| Field | Required | Type | Constraints |
| --- | --- | --- | --- |
| remote_identifier | yes | string | 1–256 chars. Your stable consumer pointer (e.g., Salesforce ID, database UUID). Returned to you on deletion. |
| remote_identifier_kind | yes | enum | One of: external_id, row_uuid, email, phone |
| emails | no | string[] | 0–100 elements, each ≤254 chars |
| phones | no | string[] | 0–100 elements, each ≤32 chars. Any format; DataGrail normalizes. |
| name | no | object | {"first": "...", "last": "..."}; both 1–100 chars |
| dob | no | string | YYYY-MM-DD preferred; MM/DD/YYYY also accepted |
| zip | no | string | Any format; DataGrail normalizes |
| vins | no | string[] | 0–100 elements, each exactly 17 chars |
| maids | no | string[] | 0–100 elements, each ≤64 chars |
| ctvids | no | string[] | 0–100 elements, each ≤256 chars |
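Checking these constraints before upload helps keep a file under the rejection threshold. The sketch below covers only the length and count limits listed in the table; it does not check dob or zip formats, and `validate_clear_data` is a hypothetical helper, not DataGrail's validator.

```python
KINDS = {"external_id", "row_uuid", "email", "phone"}
# Per-element length limits; vins must match the limit exactly.
MAX_ITEM_LEN = {"emails": 254, "phones": 32, "vins": 17, "maids": 64, "ctvids": 256}


def validate_clear_data(data: dict) -> list[str]:
    """Return a list of constraint violations for one record's data payload."""
    errors: list[str] = []

    rid = data.get("remote_identifier")
    if not isinstance(rid, str) or not 1 <= len(rid) <= 256:
        errors.append("remote_identifier must be a string of 1-256 chars")

    kind = data.get("remote_identifier_kind")
    if not isinstance(kind, str) or kind not in KINDS:
        errors.append("remote_identifier_kind is not an allowed value")

    for field, limit in MAX_ITEM_LEN.items():
        values = data.get(field, [])
        if not isinstance(values, list) or len(values) > 100:
            errors.append(f"{field} must be a list of at most 100 elements")
            continue
        exact = field == "vins"
        for value in values:
            bad_len = isinstance(value, str) and (
                len(value) != limit if exact else len(value) > limit
            )
            if not isinstance(value, str) or bad_len:
                errors.append(f"{field} contains an element with an invalid length")

    name = data.get("name")
    if name is not None:
        for part in ("first", "last"):
            text = name.get(part) if isinstance(name, dict) else None
            if not isinstance(text, str) or not 1 <= len(text) <= 100:
                errors.append(f"name.{part} must be a string of 1-100 chars")

    return errors


good = {
    "remote_identifier": "0035f00000A1B2CAAZ",
    "remote_identifier_kind": "external_id",
    "emails": ["alice@example.com"],
    "name": {"first": "Alice", "last": "Smith"},
    "vins": ["1HGBH41JXMN109186"],
}
bad = {"remote_identifier": "cust-001", "remote_identifier_kind": "row_uuid", "vins": ["SHORT"]}

print(validate_clear_data(good))
print(validate_clear_data(bad))
```

The error messages deliberately omit the offending values so that logs from this check do not end up containing PII.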

Which fields to populate

Only populate fields relevant to your enabled DROP list types. For example, if you're registered for Email + NDZ lists only, populate emails, name, dob, and zip. Missing fields are fine — DataGrail only hashes what's present.


Record Format — Pre-Hashed Identifiers

Same envelope, but the data payload contains hashes instead of clear values. Set "hashed": true to signal pre-hashed mode.

{
  "schema_version": "1.0",
  "record_type": "consumer_identifier",
  "emitted_at": "2026-05-30T14:22:03Z",
  "data": {
    "remote_identifier": "0035f00000A1B2CAAZ",
    "remote_identifier_kind": "external_id",
    "hashed": true,
    "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="],
    "phone_hashes": ["ptzVkgbv9DonwvPCHmXmJ2SEOaolSh37z3ZzY/Gmm+U="],
    "ndz_hashes": ["PQOfn1RffEKmqMmNAzDKKaoZCwxWbQZkQzPWmQo9REA="],
    "name_vin_hashes": ["rtnDuXIe63jXYQQXW5r07GJ7lSsrib8+46QuKFwkOmk="],
    "maid_hashes": [],
    "ctvid_hashes": []
  }
}

Data Fields (Pre-Hashed Mode)

The following fields are available in pre-hashed records:

| Field | Required | Type | Constraints |
| --- | --- | --- | --- |
| remote_identifier | yes | string | Same as clear mode |
| remote_identifier_kind | yes | enum | Same as clear mode |
| hashed | yes | boolean | Must be true |
| email_hashes | no | string[] | SHA-256 Base64 hashes of standardized emails |
| phone_hashes | no | string[] | SHA-256 Base64 hashes of standardized phones |
| ndz_hashes | no | string[] | Composite NDZ hashes (first+last+dob+zip) |
| name_vin_hashes | no | string[] | Composite NameVIN hashes (first+last+vin) |
| maid_hashes | no | string[] | SHA-256 Base64 hashes of standardized MAIDs |
| ctvid_hashes | no | string[] | SHA-256 Base64 hashes of standardized CTVIDs |

Each hash must be a valid Base64-encoded SHA-256 digest: 44 characters, including one trailing = padding character. See Hashing Algorithm for the exact standardization and hashing rules.
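The format requirement can be checked locally before upload. The sketch below validates only the shape of a hash (44 Base64 characters decoding to 32 bytes) and shows how a digest of that shape is produced from an already-standardized string. It does not implement the standardization rules, which live in Hashing Algorithm, so the output for a given input should not be assumed to match a real DROP hash. Both function names are hypothetical.

```python
import base64
import binascii
import hashlib


def is_valid_hash(value: str) -> bool:
    """True if value is padded Base64 that decodes to a 32-byte SHA-256 digest."""
    if not isinstance(value, str) or len(value) != 44:
        return False
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32


def sha256_base64(standardized: str) -> str:
    """Base64-encode the SHA-256 digest of an already-standardized value."""
    digest = hashlib.sha256(standardized.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


print(is_valid_hash("vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="))
print(is_valid_hash("abc123"))
print(is_valid_hash(sha256_base64("placeholder-standardized-value")))
```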


Validation & Error Handling

The following table describes how DataGrail handles common validation conditions:

| Condition | Behavior |
| --- | --- |
| Required fields missing | Record rejected |
| Invalid hash format (pre-hashed mode) | Record rejected |
| Array exceeds 100 elements | Record rejected |
| String exceeds length limit | Record rejected |
| Rejection rate >1% of records in a file | File halted; fix and re-upload |
| Rejection rate ≤1% | File succeeds; rejected records reported in ingest results |
| Manifest checksum mismatch | Ingest aborted for that file |
| File missing from bucket | Ingest aborted; error surfaced in admin UI |
| Duplicate manifest | No-op; not double-processed |
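The 1% threshold can be applied to the results of your own pre-upload validation to predict whether a file would be halted. This is a sketch of the rule as stated in the table, with a hypothetical function name; it uses integer arithmetic so that a file sitting at exactly 1% is not misclassified by floating-point rounding.

```python
def file_outcome(total_records: int, rejected_records: int) -> str:
    """Return 'halted' if more than 1% of records were rejected, else 'succeeded'."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= rejected_records <= total_records:
        raise ValueError("rejected_records must be between 0 and total_records")
    # rejected / total > 1%  is equivalent to  rejected * 100 > total
    return "halted" if rejected_records * 100 > total_records else "succeeded"


print(file_outcome(10_000, 100))  # exactly 1%
print(file_outcome(10_000, 101))  # just over 1%
```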

Minimal Examples

Smallest valid record (clear, email-only):

{"schema_version": "1.0", "record_type": "consumer_identifier", "emitted_at": "2026-05-30T00:00:00Z", "data": {"remote_identifier": "cust-001", "remote_identifier_kind": "row_uuid", "emails": ["alice@example.com"]}}

Smallest valid record (pre-hashed, email-only):

{"schema_version": "1.0", "record_type": "consumer_identifier", "emitted_at": "2026-05-30T00:00:00Z", "data": {"remote_identifier": "cust-001", "remote_identifier_kind": "row_uuid", "hashed": true, "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="]}}

Smallest valid manifest:

{"schema_version": "1.0", "record_type": "consumer_identifier_manifest", "broker_registration_id": "br-001", "emitted_at": "2026-05-30T00:00:00Z", "format": "ndjson", "compression": "gzip", "files": [{"path": "data/part-0001.ndjson.gz", "size_bytes": 1048576, "sha256": "abc123...", "record_count": 10000}], "total_record_count": 10000}

Next Steps

  • Hashing Algorithm — required reading if using pre-hashed ingestion
  • Testing & Validation — run end-to-end test requests to verify your ingestion and matching configuration before go-live

 

Need help?
If you have any questions, please reach out to your dedicated Account Manager or contact us at support@datagrail.io.

Disclaimer: The information contained in this message does not constitute legal advice. We advise seeking professional counsel before acting on or interpreting any material.