Ingestion Format

Before DataGrail can match DROP registry deletion requests against your consumer data, you need to deliver your consumer identifiers to DataGrail. This article covers file format specifications, delivery via cloud storage, and validation behavior.

Prerequisites

Before starting, ensure you have:

- Completed Identifier Configuration and saved your DROP list selections
- A cloud storage bucket (AWS S3, GCS, or Azure Blob Storage) that DataGrail can read from

Ingestion Mode

DataGrail recommends delivering pre-hashed identifiers — you apply the DROP standardization and hashing rules to your identifiers before sending, and DataGrail stores and matches the hashes directly. This ensures raw PII never leaves your environment. See Hashing Algorithm for the exact rules you need to implement.

Clear identifier ingestion is available

DataGrail can also accept clear (unhashed) identifiers and handle standardization and hashing on your behalf. This option must be discussed with DataGrail before implementation; contact support@datagrail.io to explore it. Both modes use the same file format and delivery method described below. The only difference is whether the identifier fields contain clear values or pre-computed hashes.

Delivery: Cloud Storage Import

DataGrail ingests identifier files from a cloud storage bucket you own. You upload files to your bucket; DataGrail reads them.

Supported providers: AWS S3, Google Cloud Storage (GCS), Azure Blob Storage.

Setup

Create a bucket (or dedicate a prefix in an existing bucket) for DataGrail DROP data.
Grant DataGrail read-only access to the prefix:
- AWS S3: IAM role with s3:GetObject + s3:ListBucket scoped to the prefix
- GCS: Workload Identity Federation (WIF) or service account with storage.objects.get + storage.objects.list
- Azure Blob: RBAC role Storage Blob Data Reader scoped to the container/prefix
Configure the bucket and prefix in DataGrail under DROP Compliance > Settings.

File Layout

s3://your-bucket/datagrail-drop/
├── manifest.json              ← declares which files to ingest
└── data/
    ├── part-0001.ndjson.gz
    ├── part-0002.ndjson.gz
    └── ...

Recommended File Size

Target 256 MB – 1 GB compressed per file. This range optimizes for parallel reads, resumability, and memory efficiency.

Corpus size (compressed)	Recommended split	Resulting files
< 1 GB	Single file	1
1–10 GB	~500 MB each	2–20
10–100 GB	~1 GB each	10–100
> 100 GB	~1 GB each	100+
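
As a rough illustration of the table above, the following sketch picks a per-file target and a file count from a compressed corpus size. The function name `plan_split` is hypothetical, and the sketch assumes binary units (1 GB = 1024³ bytes, with "~500 MB" taken as 512 MiB, which matches the `size_bytes` value in the manifest example); adjust these if you measure sizes differently.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_split(corpus_bytes: int) -> tuple[int, int]:
    """Return (target_bytes_per_file, file_count) for a compressed corpus.

    Hypothetical helper: the thresholds mirror the recommendation table,
    interpreted with binary units (an assumption, not a DataGrail rule).
    """
    if corpus_bytes <= 0:
        raise ValueError("corpus_bytes must be positive")
    if corpus_bytes < 1 * GIB:
        # Under 1 GB: keep everything in a single file.
        return corpus_bytes, 1
    # 1-10 GB: ~500 MB parts; larger corpora: ~1 GB parts.
    target = 512 * MIB if corpus_bytes <= 10 * GIB else 1 * GIB
    return target, math.ceil(corpus_bytes / target)


if __name__ == "__main__":
    target, count = plan_split(5 * GIB)
    print(f"{count} files of about {target // MIB} MiB each")
```

Because the target applies to compressed size, you would typically estimate your compression ratio on a sample first and split the uncompressed export accordingly.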

Manifest

A manifest.json at the root of your prefix declares what to ingest. DataGrail polls for new manifests every 15 minutes by default. Presence of a new manifest triggers ingestion.

{
  "schema_version": "1.0",
  "record_type": "consumer_identifier_manifest",
  "broker_registration_id": "your-br-id-from-datagrail",
  "emitted_at": "2026-05-30T14:22:03Z",
  "format": "ndjson",
  "compression": "gzip",
  "files": [
    {
      "path": "data/part-0001.ndjson.gz",
      "size_bytes": 536870912,
      "sha256": "a1b2c3d4...",
      "record_count": 4200000
    }
  ],
  "total_record_count": 4200000
}

Manifest fields:

Field	Required	Description
`schema_version`	yes	`"1.0"`
`record_type`	yes	`"consumer_identifier_manifest"`
`broker_registration_id`	yes	Your DataGrail broker registration ID
`emitted_at`	yes	ISO 8601 UTC — when you produced this batch
`format`	yes	`"ndjson"`
`compression`	yes	`"gzip"` (required for files >10 MB) or `"none"`
`files[].path`	yes	Relative to manifest directory. Must not contain `../`
`files[].size_bytes`	yes	Exact byte size of compressed file. Mismatch aborts ingest.
`files[].sha256`	yes	SHA-256 hex digest of compressed file. Mismatch aborts ingest.
`files[].record_count`	yes	Number of records in the file
`total_record_count`	yes	Sum of all `files[].record_count` values
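
Since a size or checksum mismatch aborts ingestion, it may be safest to compute these fields from the files on disk rather than by hand. The sketch below shows one way to do that with the Python standard library. The helper names `describe_file` and `build_manifest` are hypothetical, and the sketch assumes gzip-compressed NDJSON files stored under a local copy of your prefix, with relative paths written using forward slashes.

```python
import gzip
import hashlib
import json
import os
from datetime import datetime, timezone


def describe_file(prefix_dir: str, rel_path: str) -> dict:
    """Build one files[] entry for a gzipped NDJSON file (hypothetical helper)."""
    full_path = os.path.join(prefix_dir, rel_path)

    # SHA-256 of the compressed bytes, read in 1 MiB chunks.
    digest = hashlib.sha256()
    with open(full_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)

    # Count non-empty lines in the decompressed content.
    with gzip.open(full_path, "rt", encoding="utf-8") as fh:
        record_count = sum(1 for line in fh if line.strip())

    return {
        "path": rel_path,
        "size_bytes": os.path.getsize(full_path),
        "sha256": digest.hexdigest(),
        "record_count": record_count,
    }


def build_manifest(prefix_dir: str, rel_paths: list, broker_registration_id: str) -> dict:
    """Assemble a manifest dict using the fields from the table above."""
    files = [describe_file(prefix_dir, p) for p in rel_paths]
    return {
        "schema_version": "1.0",
        "record_type": "consumer_identifier_manifest",
        "broker_registration_id": broker_registration_id,
        "emitted_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "format": "ndjson",
        "compression": "gzip",
        "files": files,
        "total_record_count": sum(f["record_count"] for f in files),
    }
```

A typical use would be `json.dump(build_manifest(".", ["data/part-0001.ndjson.gz"], "your-br-id-from-datagrail"), open("manifest.json", "w"), indent=2)`, uploading the data files first and the manifest last so that the manifest never references files that are not yet in the bucket.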

Idempotency

DataGrail tracks processed manifests by (broker_registration_id, manifest_sha256). Re-uploading the same manifest is a no-op. To re-ingest, produce a new manifest even if data files are unchanged.

Incremental Updates

For ongoing updates (new consumers, changed identifiers):

Export only new or changed records since the last manifest
Upload new data files to your bucket
Write a new manifest.json referencing only the new files

DataGrail upserts records by (external_id, identifier type) — for example, if you send a record with external_id: "cust-001" and an updated email list, the existing email hashes for that consumer are replaced. New consumers are inserted.

Clear Identifiers Record Format

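One flat JSON record per line, gzipped. The example below is pretty-printed for readability; in the file, each record occupies a single line.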

{
  "external_id": "0035f00000A1B2CAAZ",
  "emails": ["alice@example.com", "alice.work@example.com"],
  "phones": ["+15551234567"],
  "first_name": "Alice",
  "last_name": "Smith",
  "dob": "1985-03-12",
  "zip": "94103",
  "vins": ["1HGBH41JXMN109186"],
  "maids": ["A1B2C3D4-E5F6-7890-ABCD-EF1234567890"],
  "ctvids": []
}

Clear Mode Data Fields

The following fields are available in clear identifier records:

Field	Required	Type	Constraints
`external_id`	yes	string	1–256 chars. Your stable consumer pointer (e.g., Salesforce ID, database UUID). Returned to you on deletion.
`emails`	no	string[]	0–20 elements, each ≤320 chars
`phones`	no	string[]	0–10 elements, each ≤20 chars. Any format — DataGrail normalizes.
`first_name`	no	string	≤200 chars
`last_name`	no	string	≤200 chars
`dob`	no	string	≤20 chars. `YYYY-MM-DD` preferred; `MM/DD/YYYY` also accepted
`zip`	no	string	≤20 chars. Any format — DataGrail normalizes
`vins`	no	string[]	0–10 elements, each exactly 17 chars
`maids`	no	string[]	0–10 elements, each ≤36 chars
`ctvids`	no	string[]	0–10 elements, each ≤64 chars

Composite list types need every component field

The NDZ and NameVIN list types are matched on a composite hash built from several fields. If any component is missing, DataGrail cannot compute the composite hash and that record will never match a DROP request for that list type — there is no partial match.

NDZ requires all of first_name, last_name, dob, and zip.
NameVIN requires all of first_name, last_name, and vins.

Single-field list types (Email, Phone, MAID, CTVID) are matched independently — send only the fields relevant to your enabled list types. But for a composite list type, you must send every component field, or you will not match on that list.
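
To catch incomplete composite records before upload, a check along the following lines may help. It is a minimal sketch: the function name `missing_components` is hypothetical, and it treats empty strings and empty arrays as missing, which is an assumption rather than documented DataGrail behavior.

```python
# Component fields for each composite list type, as listed above.
COMPOSITE_REQUIREMENTS = {
    "NDZ": ("first_name", "last_name", "dob", "zip"),
    "NameVIN": ("first_name", "last_name", "vins"),
}


def missing_components(record: dict) -> dict:
    """Map each composite list type to the component fields the record lacks.

    An empty list means the record carries every component for that type.
    Empty strings and empty arrays count as missing (an assumption).
    """
    return {
        list_type: [field for field in fields if not record.get(field)]
        for list_type, fields in COMPOSITE_REQUIREMENTS.items()
    }


if __name__ == "__main__":
    record = {
        "external_id": "cust-001",
        "first_name": "Alice",
        "last_name": "Smith",
        "dob": "1985-03-12",
        "vins": ["1HGBH41JXMN109186"],
    }
    # zip is absent, so this record cannot match on NDZ but can on NameVIN.
    print(missing_components(record))
```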

Pre-Hashed Identifiers Record Format

Like clear mode, each record is a flat JSON object. Set "hashed": true to signal pre-hashed mode, and send hash arrays instead of clear values.

{
  "external_id": "0035f00000A1B2CAAZ",
  "hashed": true,
  "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="],
  "phone_hashes": ["ptzVkgbv9DonwvPCHmXmJ2SEOaolSh37z3ZzY/Gmm+U="],
  "ndz_hashes": ["PQOfn1RffEKmqMmNAzDKKaoZCwxWbQZkQzPWmQo9REA="],
  "name_vin_hashes": ["rtnDuXIe63jXYQQXW5r07GJ7lSsrib8+46QuKFwkOmk="],
  "maid_hashes": [],
  "ctvid_hashes": []
}

Pre-Hashed Mode Data Fields

The following fields are available in pre-hashed records:

Field	Required	Type	Constraints
`external_id`	yes	string	Same as clear mode
`hashed`	yes	boolean	Must be `true`
`email_hashes`	no	string[]	SHA-256 Base64 hashes of standardized emails
`phone_hashes`	no	string[]	SHA-256 Base64 hashes of standardized phones
`ndz_hashes`	no	string[]	Composite NDZ hashes (first+last+dob+zip)
`name_vin_hashes`	no	string[]	Composite NameVIN hashes (first+last+vin)
`maid_hashes`	no	string[]	SHA-256 Base64 hashes of standardized MAIDs
`ctvid_hashes`	no	string[]	SHA-256 Base64 hashes of standardized CTVIDs

Each hash must be a valid Base64-encoded SHA-256 digest. A 32-byte digest encodes to exactly 44 characters in standard padded Base64, ending with a single = padding character. See Hashing Algorithm for the exact standardization and hashing rules.
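
The format requirement above can be checked locally before upload. The sketch below only verifies the shape of a value (44 Base64 characters that decode to 32 bytes); it cannot tell whether the input was standardized correctly before hashing. The function name `is_valid_hash` is hypothetical, and the check assumes the standard Base64 alphabet used in the examples in this article.

```python
import base64
import binascii
import hashlib


def is_valid_hash(value) -> bool:
    """Return True if value looks like a Base64-encoded SHA-256 digest."""
    if not isinstance(value, str) or len(value) != 44:
        return False
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    # A SHA-256 digest is always 32 bytes.
    return len(raw) == 32


if __name__ == "__main__":
    # Illustration only: real inputs must first be standardized
    # according to the Hashing Algorithm article.
    digest = hashlib.sha256(b"example input").digest()
    encoded = base64.b64encode(digest).decode("ascii")
    print(encoded, is_valid_hash(encoded))
```

A hex-encoded digest (64 characters) fails this check, which is a common mistake worth catching before a file reaches the rejection threshold.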

Validation & Error Handling

The following table describes how DataGrail handles common validation conditions:

Condition	Behavior
Required fields missing	Record rejected
Invalid hash format (pre-hashed mode)	Record rejected
Array exceeds its element limit	Record rejected (see per-field limits in the field table above)
String exceeds length limit	Record rejected
Rejection rate >1% of records in a file	File halted — fix and re-upload
Rejection rate ≤1%	File succeeds; rejected records reported in ingest results
Manifest checksum mismatch	Ingest aborted for that file
File missing from bucket	Ingest aborted; error surfaced in admin UI
Duplicate manifest	No-op — not double-processed
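
Because a file is halted once more than 1% of its records are rejected, it can be worth estimating the rejection rate locally before uploading. The following sketch applies the clear-mode limits from the field table to each NDJSON line. The names `record_problems` and `rejection_rate` are hypothetical, and the checks only approximate the conditions listed above; DataGrail's own validation remains authoritative.

```python
import json

# field -> (max elements, max chars per element), from the clear-mode table
ARRAY_LIMITS = {
    "emails": (20, 320),
    "phones": (10, 20),
    "vins": (10, 17),
    "maids": (10, 36),
    "ctvids": (10, 64),
}
STRING_LIMITS = {"first_name": 200, "last_name": 200, "dob": 20, "zip": 20}


def record_problems(record: dict) -> list:
    """Return human-readable problems found in one clear-mode record."""
    problems = []
    ext = record.get("external_id")
    if not isinstance(ext, str) or not 1 <= len(ext) <= 256:
        problems.append("external_id missing or not 1-256 chars")
    for field, (max_items, max_len) in ARRAY_LIMITS.items():
        values = record.get(field, [])
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            problems.append(f"{field} must be an array of strings")
            continue
        if len(values) > max_items:
            problems.append(f"{field} has more than {max_items} elements")
        if any(len(v) > max_len for v in values):
            problems.append(f"{field} element longer than {max_len} chars")
    if any(isinstance(v, str) and len(v) != 17 for v in record.get("vins") or []):
        problems.append("vins element is not exactly 17 chars")
    for field, max_len in STRING_LIMITS.items():
        value = record.get(field)
        if value is not None and (not isinstance(value, str) or len(value) > max_len):
            problems.append(f"{field} is not a string of at most {max_len} chars")
    return problems


def rejection_rate(lines) -> float:
    """Fraction of non-empty NDJSON lines that would fail the checks above."""
    total = rejected = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected += 1
            continue
        if not isinstance(record, dict) or record_problems(record):
            rejected += 1
    return rejected / total if total else 0.0
```

For a gzipped part file, you could pass `gzip.open(path, "rt", encoding="utf-8")` as `lines` and hold back the upload if the result exceeds `0.01`.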

Minimal Examples

Smallest valid record (clear, email-only):

{"external_id": "cust-001", "emails": ["alice@example.com"]}

Smallest valid record (pre-hashed, email-only):

{"external_id": "cust-001", "hashed": true, "email_hashes": ["vGM7y5n+hBXRSEAklhHDPCbysyNgYTmXdMcagGUOY8E="]}

Smallest valid manifest:

{"schema_version": "1.0", "record_type": "consumer_identifier_manifest", "broker_registration_id": "br-001", "emitted_at": "2026-05-30T00:00:00Z", "format": "ndjson", "compression": "gzip", "files": [{"path": "data/part-0001.ndjson.gz", "size_bytes": 1048576, "sha256": "abc123...", "record_count": 10000}], "total_record_count": 10000}

Next Steps

Hashing Algorithm — required reading if using pre-hashed ingestion
Testing & Validation — run end-to-end test requests to verify your ingestion and matching configuration before go-live

Need help?

If you have any questions, please reach out to your dedicated Account Manager or contact us at support@datagrail.io.

Disclaimer: The information contained in this message does not constitute legal advice. We advise seeking professional counsel before acting on or interpreting any material.
