Techclick
Zscaler · Batch 11 · Lesson 8 · L3 / DATA SECURITY

Data Protection — DLP, EDM/IDM & CASB

How ZIA stops sensitive data from leaving your org — dictionaries, exact-match hashes, indexed documents, and the inline-versus-API CASB models that protect SaaS in two completely different ways.

📅 23 May 2026 · ⏱ 14 min read · 🏷 10-question assessment included
🎯 By the end of this lesson, you'll be able to

Why this lesson matters

DLP is the control that makes a Zscaler deployment legally and contractually defensible. URL Filtering keeps users off bad sites. Threat Protection keeps malware out. But the question Legal and Compliance will eventually ask is: "if an employee tries to upload our customer database to a personal Gmail, do we know? Do we stop it?" That is DLP's job, and it's what goes into your SOC 2, ISO 27001, DPDP / GDPR and cyber-insurance evidence packs. Get it wrong and you don't just have a security gap, you have a contract violation.

CASB is the SaaS-era hole URL Filtering and Cloud App Control alone cannot close. URL Filtering sees onedrive.com; it cannot tell which OneDrive tenant the user logged into, what file they uploaded, or whether a share link from six months ago is now public on the internet. CASB Inline answers the first two. CASB Out-of-Band (SaaS Security API) answers the third. Together with DLP they form ZIA's data-protection triangle.

The shape of ZIA Data Protection

Three engines, three placements in the data path:

Inline catches data leaving now. Out-of-Band finds what slipped through last quarter and is sitting in SharePoint with a public link. Most tenants run all three, each tuned to a different risk class.

SVG #1: ZIA Data Protection planes (Inline DLP + CASB Inline + CASB OOB)
[Figure: two paths. Inline data path (in motion): User + Z-App → ZIA PSE (SSL Inspect) → Inline DLP Engine (Dict · EDM · IDM · OCR) → CASB Inline (tenant + activity aware) → SaaS / Web (Gmail · OneDrive · GH). Verdicts: Allow · Confirm · Block · Quarantine, with ICAP, Insights, SecOps notification and logging to NSS / SIEM. Out-of-band path (at rest, retroactive): CASB OOB Connector (SaaS Security API, OAuth signed by tenant admin) → Admin API → SaaS tenant data at rest (Box · O365 · GDrive · Salesforce · GitHub · ServiceNow) → DLP engine re-used (Dict · EDM · IDM) to scan files and shares → retroactive action (Revoke share · Quarantine · Change owner · Notify).]

Two planes, one DLP engine reused on both. Inline DLP + CASB Inline live in the proxy data path. CASB Out-of-Band lives outside the data path entirely, scanning SaaS via OAuth-signed admin APIs. The same dictionaries, EDM templates and IDM templates power both planes.

DLP Dictionaries — the matching primitives

A dictionary is a named pattern definition. DLP rules reference dictionaries. Dictionaries can contain word lists, phrases, regex patterns, or EDM/IDM template references. So you CAN match regex — but always via a dictionary, never directly in the rule. Build dictionaries once, reuse them across rules. ZIA ships three flavours.

Predefined dictionaries (40+ out of the box)

Curated by Zscaler, kept up to date with regulatory format changes. The high-value ones in production:

Custom dictionaries — regex + score + threshold

Three matching modes per dictionary:

| Mode | What it does | Good for |
| --- | --- | --- |
| Word | Case-insensitive exact word match (whole-token). "Confidential" matches; "Confidentially" does not. | Project codenames, classification labels (e.g. "INTERNAL ONLY") |
| Phrase | Multi-word ordered match. "Patient ID" matches when those two tokens appear in order with whitespace. | Standard form labels, header phrases |
| Regex | Full PCRE2 regex. Slowest; use only when word/phrase can't express the pattern. | Internal customer ID format (e.g. CUST-\d{8}) |

Every match contributes a score. A dictionary has a threshold — minimum cumulative score before it counts as "triggered". This lets you say: "trigger only when at least 4 different credit-card numbers appear in the same request, not on a single one". Threshold scoring is what turns DLP from a false-positive generator into a usable production control.
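The threshold idea can be sketched in a few lines of Python. This is an illustration only, not ZIA's implementation: the regex, the one-point-per-match scoring and the function name are assumptions made for the example.

```python
import re

# Hypothetical pattern: 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def dictionary_triggered(text: str, score_per_match: int = 1, threshold: int = 4) -> bool:
    """Return True when distinct matches add up to at least the threshold."""
    # Strip separators so "4111 1111 ..." and "4111-1111-..." count as one value.
    distinct = {re.sub(r"[ -]", "", m.group()) for m in CARD_RE.finditer(text)}
    return len(distinct) * score_per_match >= threshold
```

With a threshold of 4, a log line holding one test card stays quiet, while a request carrying four different numbers trips the dictionary.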

Composite dictionaries — AND / OR / proximity

A composite combines atomic dictionaries with logical operators and a proximity window (chars or words) — the biggest false-positive killer in production. Example: PCI-CC-Strict = CreditCardNumber AND (CardholderName OR ExpiryDate OR CVV) within 50 words. A log file with one test card no longer fires; a real cardholder record does.
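A rough Python sketch of the proximity logic follows. It is a simplification under stated assumptions: only an expiry date or a CVV label is treated as supporting evidence (cardholder-name detection is left out), the patterns are hypothetical, and the window is counted in whitespace-separated words.

```python
import re

# Hypothetical atomic patterns used only for this illustration.
CARD = re.compile(r"\d{16}")
EXPIRY = re.compile(r"(0[1-9]|1[0-2])/\d{2}")
CVV_LABELS = {"cvv", "cvc"}


def _clean(token: str) -> str:
    """Lowercase a token and drop surrounding punctuation."""
    return token.strip(".,;:()").lower()


def composite_match(text: str, window: int = 50) -> bool:
    """Card number AND (expiry OR CVV label) within `window` words of each other."""
    tokens = [_clean(t) for t in text.split()]
    cards = [i for i, t in enumerate(tokens) if CARD.fullmatch(t)]
    support = [i for i, t in enumerate(tokens) if EXPIRY.fullmatch(t) or t in CVV_LABELS]
    return any(abs(c - s) <= window for c in cards for s in support)
```

A lone card number in a debug log returns False; the same number next to an expiry date returns True.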

EDM — Exact Data Match (structured data)

EDM answers: "don't match any 16-digit number; match only one of our 1.2M customer card numbers". You export the sensitive table as CSV (card_number, customer_name, dob, email); the EDM tool salts and hashes each cell, then uploads the hash index. The PSE never sees plaintext, only hashes. At inspection time, ZIA hashes candidate tokens from the traffic and looks them up in that index.
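The hash-and-look-up idea can be sketched as below. The lesson does not state which hash construction the EDM tool uses, so HMAC-SHA256 here is an assumption, and all function names are hypothetical.

```python
import hashlib
import hmac


def cell_hash(value: str, salt: bytes) -> str:
    """Salted hash of one cell (HMAC-SHA256 is an assumption, not the vendor's scheme)."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(cells, salt: bytes) -> set:
    """Build the index that gets uploaded: hashes only, never plaintext."""
    return {cell_hash(c, salt) for c in cells}


def contains_protected_value(text: str, index: set, salt: bytes) -> bool:
    """Hash each candidate token from the traffic and look it up in the index."""
    return any(cell_hash(token, salt) in index for token in text.split())
```

The point of the design is that the index alone cannot be read back into card numbers, yet a token seen in traffic can still be checked against it.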

Primary + secondary field matching

EDM templates designate one field as primary (strong identifier — card number, SSN) and others as secondary. A rule typically demands "primary present AND ≥N secondary fields from the same row" — catches a full-record leak, not a single-field accidental paste. Pair EDM with dictionary rules at different thresholds for layered control.
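A minimal sketch of "primary present AND at least N secondary fields from the same row", assuming candidate field values have already been extracted from the content. The salt, the SHA-256 choice, the sample row and the helper names are all hypothetical.

```python
import hashlib


def h(value: str, salt: bytes = b"demo-salt") -> str:
    """Case-folded salted hash of one field value (illustrative scheme)."""
    return hashlib.sha256(salt + value.lower().encode("utf-8")).hexdigest()


def build_row_index(rows, primary: str) -> dict:
    """Map each hashed primary value to the hashed secondary values of the same row."""
    return {
        h(row[primary]): {h(v) for k, v in row.items() if k != primary}
        for row in rows
    }


def row_match(candidates, index: dict, min_secondary: int = 2) -> bool:
    """True when a primary hit is accompanied by enough secondaries from its own row."""
    hashed = {h(c) for c in candidates}
    for value in hashed:
        secondaries = index.get(value)
        if secondaries is not None and len(hashed & secondaries) >= min_secondary:
            return True
    return False
```

Because secondaries are looked up per row, a card number from one customer next to a name from another does not count as a full-record leak.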

Production constraints (the gotchas)

IDM — Indexed Document Match (full documents)

IDM is for unstructured leaks: board decks, M&A drafts, source-code ZIPs, legal contracts, design docs. Upload the protected documents; ZIA computes a rolling-hash fingerprint of overlapping shingles (small text windows). At inspection time, candidate content is compared against the fingerprint index with a partial-match threshold (30–80%). Higher (80%) for legal contracts where templated reuse is OK; lower (30%) for trade secrets where any meaningful overlap is alarming. IDM is the anchor for "the board deck PDF leaked to ChatGPT" — a few pasted paragraphs still fire the rule because their shingles hash to entries in the index.
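The shingle comparison can be approximated as follows. Two simplifications are assumed: a plain SHA-1 per window stands in for a true rolling hash, and the match ratio is taken as the share of the candidate's shingles found in the index (the lesson does not specify how ZIA computes the percentage). Shingle size and names are hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Hash every overlapping window of k words."""
    words = text.lower().split()
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - k + 1, 0))
    }


def overlap_ratio(candidate: str, index: set[str], k: int = 5) -> float:
    """Fraction of the candidate's shingles that appear in the protected index."""
    cand = shingles(candidate, k)
    return len(cand & index) / len(cand) if cand else 0.0


def idm_triggered(candidate: str, index: set[str], threshold: float = 0.30) -> bool:
    """Fire when the overlap reaches the configured partial-match threshold."""
    return overlap_ratio(candidate, index) >= threshold
```

A pasted excerpt scores high because its windows are contiguous runs from the protected document, even though the document as a whole never left.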

🔤EDM tokenizer + case-folding

EDM tokenizer: splits on whitespace and punctuation. Case Folding (a checkbox in the EDM template config) controls whether matching is case-sensitive: if it is unchecked, 'John Smith' and 'john smith' do NOT match. This is the most common cause of EDM false negatives in production.

EDM pipeline: source CSV → per-column normaliser (strip/lowercase/ASCII-fold) → tokenizer (whitespace + punctuation) → optional case-fold → salt+hash → upload to PSE hash index.
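The pipeline stages can be sketched in Python to show why the case-fold setting matters. This is an approximation under assumptions: the normaliser, the tokenizer regex and the SHA-256 hashing are illustrative stand-ins, not the EDM tool's actual behaviour.

```python
import hashlib
import re
import unicodedata


def normalise(cell: str, case_fold: bool = True) -> str:
    """Strip, ASCII-fold accents, and optionally case-fold one cell."""
    text = unicodedata.normalize("NFKD", cell.strip())
    text = text.encode("ascii", "ignore").decode("ascii")  # drops combining accents
    return text.casefold() if case_fold else text


def tokenise(text: str) -> list:
    """Split on whitespace and punctuation, discarding empty tokens."""
    return [t for t in re.split(r"[\s\W_]+", text) if t]


def hash_tokens(cell: str, salt: bytes, case_fold: bool = True) -> list:
    """Run normalise -> tokenise -> salt+hash for one cell."""
    return [
        hashlib.sha256(salt + token.encode("utf-8")).hexdigest()
        for token in tokenise(normalise(cell, case_fold))
    ]
```

With `case_fold=False`, 'John Smith' and 'john smith' produce different hashes, which is exactly the false negative described above.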

OCR — what gets scanned, what doesn't

OCR scope: runs on image attachments (JPG, PNG, TIFF) and on PDFs that contain images. It supports a defined set of languages. The default file-size cap is 10 MB. OCR adds 200–800 ms per scanned image, so scope it to high-risk destinations only or you'll get Webex/Teams perf tickets the same day.

SVG #2: Inline DLP flow for a single Gmail attachment upload
[Figure: five steps. 1. User attaches customers.csv to Gmail compose. 2. Z-Tunnel → PSE: SSL Inspect terminates TLS and exposes the multipart/form-data body. 3. DLP engine inspects the body ("here is the data you asked for") and the attachment customers.csv; score: PCI-CC ✓, EDM Cust-DB ✓. 4. Threshold crossed: Action = BLOCK, Reason = PCI-CC-Strict. 5. User sees the banner "Upload blocked by corporate DLP policy". In parallel, SecOps Insights + NSS log: user=alice@corp · dst=mail.google.com · matched=PCI-CC-Strict · redacted_preview="4111-XXXX-XXXX-1234 · Alice Wang · 04/27". The match preview is redacted in the log so the log itself isn't a secondary data leak.]

A single attachment upload exercises the entire engine: SSL Inspection terminates the TLS, the multipart body is unpacked, the CSV is parsed and scanned against both the PCI dictionary and the EDM customer-DB hash index, the threshold trips, the action fires, the user is notified, and the Insights log captures a redacted preview so SecOps can investigate without re-exposing the data.

DLP rule configuration — GUI walkthrough

The path in the Admin Portal:

ZIA · Inline DLP rule creation
Policy → Data Loss Prevention → DLP Policy → + Add Rule

  Order:           20
  Rule Name:       Block-PCI-to-Non-Corp-Webmail
  Status:          Enabled
  DLP Engines:     PCI-CC-Strict  (composite: CC + name/expiry within 50 words)
                   + EDM-Customer-DB  (require primary + ≥2 secondary fields)
  Min Match Count: 1  (engine threshold counts, not raw matches)
  File Types:      All  (DLP scans body AND attachments)
  URL Categories:  Webmail
  Cloud Apps:      Gmail (personal), Yahoo Mail, Outlook.com personal
  Users / Groups:  All EXCEPT Group=Customer-Support-Approved-Senders
  Locations:       All
  Action:          Block
  Notification:    User notification template "PCI-block-banner-v3"
  Auditor:         secops-dlp@corp.com  (gets per-incident email)
  ICAP:            Forward redacted preview to Forensics-ICAP
  Severity:        High

The same approach applied to source code: a gentler alert-only rule for the sanctioned destination, paired with a Block rule for the public one:

ZIA · Inline DLP rule — Alert + Allow for sanctioned destinations
Policy → Data Loss Prevention → DLP Policy → + Add Rule

  Order:           10  (HIGHER priority — fires before the Block rule)
  Rule Name:       Allow-SourceCode-to-GH-Enterprise-Alert-Only
  DLP Engines:     SourceCode-Composite  (any of: Java + Python + C + Go)
  URL Categories:  (none)
  Cloud Apps:      GitHub Enterprise (tenant-aware via CASB Inline)
  Action:          Allow
  Notification:    None (silent — engineering workflow)
  Auditor:         dev-dlp@corp.com (weekly digest, not per-incident)

Then a paired Block rule at Order 30:
  Cloud Apps:      GitHub.com (public)
  Action:          Block
  Notification:    "Push source code to public GitHub blocked — use GH Enterprise"

The pattern: same content, different destinations, different actions. Without CASB Inline's tenant awareness, ZIA would only see "github.com" — both rules collapse to one and engineering is either silently leaking or completely blocked.
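First-match evaluation by Order can be modelled in a few lines. This is a toy model, not ZIA's policy engine: the `Rule` fields mirror the walkthrough above, the name of the Order 30 rule is invented for the example, and the default action is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    order: int
    name: str
    engine: str
    cloud_app: str
    action: str


RULES = [
    # Hypothetical name for the Order 30 Block rule.
    Rule(30, "Block-SourceCode-to-Public-GH", "SourceCode-Composite", "GitHub.com (public)", "Block"),
    Rule(10, "Allow-SourceCode-to-GH-Enterprise-Alert-Only", "SourceCode-Composite", "GitHub Enterprise", "Allow"),
]


def evaluate(engines_hit: set, cloud_app: str, rules=RULES, default: str = "Allow") -> str:
    """Walk rules in ascending Order and return the action of the first match."""
    for rule in sorted(rules, key=lambda r: r.order):
        if rule.engine in engines_hit and rule.cloud_app == cloud_app:
            return rule.action
    return default


# Without tenant awareness both destinations look like "github.com",
# so the Order 10 rule wins for every push.
COLLAPSED = [replace(rule, cloud_app="github.com") for rule in RULES]
```

Evaluating the collapsed rule set returns Allow for any source-code push, which is the "silently leaking" outcome the paragraph above warns about.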

🔌ICAP integration

ICAP: the receiver itself is configured tenant-wide under Administration → DLP Incident Receiver rather than defined inside each rule; a rule only references it by name. It sends DLP incidents to an external incident management platform.

🎚DLP Severity Levels

ZIA DLP supports 5 severity levels: Info / Low / Medium / High / Critical. Map them deliberately to your SIEM noise budget — Critical and High page on-call, Medium goes to a daily digest, Low to a weekly review, Info is silent telemetry. Without severity mapping, all DLP looks the same to SecOps.
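On the SIEM or SOAR side, the mapping above might be captured as a simple lookup. The route names are placeholders for whatever your tooling calls these queues; nothing here is a ZIA setting.

```python
# Hypothetical route names; substitute the queues your SOC tooling actually uses.
SEVERITY_ROUTES = {
    "Critical": "page-oncall",
    "High": "page-oncall",
    "Medium": "daily-digest",
    "Low": "weekly-review",
    "Info": "telemetry-only",
}


def route_incident(severity: str) -> str:
    """Return the route for a DLP severity, rejecting unknown values loudly."""
    try:
        return SEVERITY_ROUTES[severity.capitalize()]
    except KeyError:
        raise ValueError(f"unknown DLP severity: {severity!r}") from None
```

Failing on unknown values is deliberate: a typo in a severity field should surface as an error rather than silently fall into the quietest queue.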

📏DLP file-size cap (the silent gap)

DLP file-size cap: Default DLP_inspect_max_bytes = 16 MB. Files larger than this skip inspection entirely. Surface this in design discussions — large CAD files, design assets, video clips all bypass DLP by default. Consider raising the cap for high-risk groups (legal, design, M&A) or layering File Type Control to block specific large-file types outbound.

🔒Encrypted / password-protected files

Encrypted/password-protected files: DLP cannot inspect. Combine with File Type Control: block password-protected ZIP/7z outbound, or quarantine for review. Otherwise this is the single easiest DLP bypass in the wild.
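For ZIP archives, the encrypted state is visible in the archive metadata even though the content is not. The sketch below uses Python's standard `zipfile` module to check bit 0 of each member's general-purpose flags; it only illustrates the detection idea for your own tooling, is not how ZIA's File Type Control works internally, and the verdict strings are hypothetical.

```python
import io
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the general-purpose flags marks an encrypted member


def zip_has_encrypted_members(source) -> bool:
    """True when any member of the ZIP (path or file object) is flagged as encrypted."""
    with zipfile.ZipFile(source) as archive:
        return any(info.flag_bits & ENCRYPTED_FLAG for info in archive.infolist())


def outbound_verdict(source) -> str:
    """Illustrative policy: content that cannot be inspected is not waved through."""
    return "block-or-quarantine" if zip_has_encrypted_members(source) else "inspect"


def demo_archive() -> io.BytesIO:
    """Build a small unencrypted archive in memory for experimentation."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("notes.txt", "hello")
    buf.seek(0)
    return buf
```

Other container formats such as 7z need their own parsers; the standard library does not read them.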

CASB Inline vs Out-of-Band — when to use which

| Dimension | CASB Inline | CASB Out-of-Band (SaaS Security API) |
| --- | --- | --- |
| Where it sits | In the proxy data path (same as ZIA) | Outside the data path; OAuth-signed admin API to the SaaS provider |
| Detection latency | Real-time (tens of ms) | Scheduled scan (minutes to hours) plus event-driven webhooks where supported |
| What it can do | Block upload / download / share in motion · enforce tenant restrictions (only corp Microsoft 365) · redact a message in-flight | Find files already at rest · revoke public share links · change file owner · quarantine to admin-only folder · notify uploader's manager · scan for malware in stored files |
| What it cannot do | See data that already exists in the SaaS (didn't pass through ZIA) · catch user accessing SaaS from an unmanaged device that bypasses the tunnel | Block a leak in real time; it only finds it after the fact |
| Coverage gap when user is on personal device off-tunnel | Blind | Still works: connector talks to SaaS directly, not the user |
| Typical SaaS | Microsoft 365, Google Workspace, Box, Dropbox, Slack, ServiceNow, Salesforce, GitHub Enterprise (any SaaS reachable through the tunnel) | Microsoft 365, Google Workspace, Box, Dropbox, Salesforce, ServiceNow, GitHub, Workday, Slack (subset supports OAuth admin scope) |
| Auth dependency | Z-Tunnel + ZIA identity | OAuth token signed by tenant admin (must be refreshed before expiry; top failure mode) |
| Best use case | "Block this file from being uploaded to personal OneDrive right now." | "Find every file in our O365 tenant with a public share link AND containing PCI data, revoke share, notify owner." |

Verify: confirm Data Protection is actually working

After enabling DLP + CASB, validate on a controlled test account before relaxing:

Common Mistakes — DLP and CASB
💡Pro Tips

Real-world scenario — Gmail outbound DLP with inline redaction

A user pastes a 16-digit card number into a Gmail compose window. ZIA CASB Inline sees the POST to Gmail, runs the body through the Credit Card dictionary, and rewrites the affected digits to XXXX before the request leaves ZIA. The user sees the redacted version in their Sent folder. The same flow works for other outbound webmail (Outlook Web, Yahoo Mail). For Slack and most SaaS chat, redaction is NOT supported; the action must be Block instead.

Rules already in place

  1. Composite dictionary PCI-CC-Strict = CreditCardNumber AND (CardholderName OR ExpiryDate OR CVV) within 50 words. Threshold = 1 composite match.
  2. EDM template Customer-DB-v2026-05-20 — 1.2M rows, primary = card_number, secondary = name / dob / email. Suppresses a duplicate dictionary fire when the same record is an EDM hit.
  3. CASB Inline rule: Cloud App = Gmail / Outlook Web / Yahoo Mail (webmail family), Action = "Redact and replace with notification".

What the user sees

User hits Send in the Gmail compose window. CASB Inline intercepts the POST to mail.google.com, inspects the form body, trips PCI-CC-Strict, and rewrites the affected digits in-flight: card number digits → XXXX-XXXX-XXXX-XXXX, name → [REDACTED — PII], with a banner appended. Gmail receives the redacted version — that's what gets sent and what appears in the user's Sent folder. The user's UI shows a notification: "DLP redacted PCI data — ask the recipient to use the secure-pay link instead."
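The digit rewrite can be illustrated with a regex plus a Luhn check, so that only plausible card numbers are masked. This is a sketch of the idea, not ZIA's redaction engine; the pattern and function names are assumptions.

```python
import re

# Hypothetical pattern: 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
MASK = "XXXX-XXXX-XXXX-XXXX"


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(body: str) -> str:
    """Replace every Luhn-valid 16-digit sequence in the body with the mask."""
    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        return MASK if luhn_valid(digits) else match.group()

    return CARD_RE.sub(_mask, body)
```

Numbers that fail the checksum, such as order IDs that happen to be 16 digits long, pass through unchanged.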

What SecOps sees

  1. Insights → DLP → today: "Redacted" entry within 5s. User=engineer@corp, App=Gmail, Engine=PCI-CC-Strict, Action=Redact, Severity=High.
  2. Incident detail: redacted preview (4111-XXXX-XXXX-1234 · J*** B**** · 04/27), URL path, recipient address (hashed).
  3. NSS feed: parallel event in the SIEM, queryable; pre-built "DLP severity=High by hour" widget updates.
  4. Auditor email: secops-dlp@corp gets a structured email in ~30s with redacted preview + Jump-to-Insights link.
  5. SOC ticket (low-priority): confirm the user used the correct workaround; log it for the quarterly compliance report. No customer data was exposed — the point of the exercise.

What CASB OOB confirms later

The next hourly OOB scan on the Gmail (Google Workspace) connector re-confirms the stored Sent message body is the redacted version, not the original. PCI scan: zero hits in that mailbox for the day. The compliance evidence pack writes itself. That's "DLP done right" — silent to the recipient, helpful to the user, defensible to the auditor, no PII in the alert log.

Important caveat: Inline body-modification (redaction) is currently GA only for webmail-family SaaS (Gmail / Outlook Web / Yahoo Mail) and a small set of HTTP-form-style apps. For Slack and most SaaS chat platforms ZIA's CASB Inline does NOT modify the message body — the only supported action is Block (the message is rejected before reaching the SaaS, and the user is notified). Plan rules accordingly; don't promise "redaction" for a SaaS where only Block is supported.


📌 Quick reference (memorise — this is the data-protection arc)

QUICK LAB · ~15 MIN

Build + test a DLP rule end-to-end:

  1. Create a Credit Card dictionary using the built-in pattern. Set confidence threshold to Medium.
  2. Create a DLP rule: outbound + Webmail destination + Credit Card dict + Action = Block + Notify User.
  3. From a test laptop, attempt to paste a Luhn-valid 16-digit card into Gmail compose — verify block page.
  4. Now upload a CSV with 100 cards as a file attachment — DLP should match. If it doesn't, check the 16 MB file-size cap.
  5. Check CASB → Last Scan age for your O365 tenant — re-consent if > 80 days.
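For step 4, a test CSV of synthetic card numbers can be generated rather than copied from anywhere real. The sketch below builds Luhn-valid 16-digit numbers from a seeded random generator; the prefix, seed and column name are arbitrary choices for the lab, and the output is test data only.

```python
import csv
import io
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the final digit that makes `partial` + digit pass the Luhn check."""
    total = 0
    # The rightmost digit of `partial` sits next to the check digit, so it is doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def synthetic_cards(count: int = 100, prefix: str = "411111", seed: int = 0) -> list:
    """Return `count` distinct synthetic 16-digit numbers that pass Luhn."""
    rng = random.Random(seed)
    cards = set()
    while len(cards) < count:
        body = prefix + "".join(str(rng.randrange(10)) for _ in range(15 - len(prefix)))
        cards.add(body + luhn_check_digit(body))
    return sorted(cards)


def to_csv(cards) -> str:
    """Render the numbers as a one-column CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["card_number"])
    writer.writerows([card] for card in cards)
    return buf.getvalue()
```

Writing `to_csv(synthetic_cards())` to a file gives a 100-row attachment for the upload test.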

📝 Check your understanding

10 scenario questions — interview + ZDTA exam depth. Pick one answer per question. You need 70% (7 of 10) to mark this lesson complete on your profile.

Q1

Your CISO asks: "if a user uploads a file to a personal OneDrive account from a corporate laptop, which ZIA control can tell personal OneDrive apart from corporate OneDrive and block the personal one?"

Correct: (b). Tenant awareness is the defining feature of CASB Inline. URL Filtering sees only the domain (onedrive.com — same for both). File Type Control filters by MIME. SSL Inspection bypass would actually remove the ability to see inside the request. Tenant restriction in CASB Inline lets you say "allow only tenant=corp-MS-tenant-id, block all other Microsoft 365 logins on this device".
Q2

A regulator asks for evidence of every file in your Microsoft 365 tenant that contains PCI data and has a public share link, plus proof you revoked the share. Which ZIA capability gives you this?

Correct: (c). "Data at rest" + "share link metadata" + "retroactive revoke" all point to the SaaS Security API (OOB CASB). Inline DLP only sees data crossing the tunnel right now — files uploaded last year never passed through it, so it cannot enumerate them. URL Filtering and File Type Control don't operate on resting SaaS objects.
Q3

Your SOC complains DLP is firing 50,000 alerts/day on a "Credit Card Number" dictionary rule, drowning real incidents. Most fire on developer log files. What's the right fix?

Correct: (b). Composite dictionaries with proximity are the textbook false-positive fix. Add EDM if you need "only OUR customers" matching. (a) creates a compliance gap. (c) is too broad — dev environments do touch real data sometimes. (d) doesn't address the dictionary pattern at all.
Q4

You want DLP to match only YOUR 1.2 million customer card numbers, not random 16-digit numbers. Which engine?

Correct: (a). EDM is the structured-row-match engine — exactly the "match only OUR data" use case. The hash-on-upload approach means the PSE never sees plaintext customer data. (b) matches any Luhn-valid 16-digit number, including test cards. (c) regex still matches the pattern, not the values. (d) IDM is for full documents (M&A drafts, board decks), not row-based data.
Q5

A user pastes three paragraphs of last quarter's confidential board deck into ChatGPT. The board deck itself was never uploaded — just an excerpt. Which engine has any chance of catching this?

Correct: (b). IDM's shingle / rolling-hash model is purpose-built for partial-document leaks — a paragraph or two is enough if the threshold is set low (e.g. 30% for trade secrets). EDM is for structured rows, not free text. PCI is the wrong content class. File Type Control can't read semantics. Lower IDM thresholds for high-secrecy docs; raise them for legal/templated content.
Q6

Your CASB OOB connector for Microsoft 365 silently stopped scanning three weeks ago. Compliance only noticed when an external audit asked for last month's scan report. Root cause?

Correct: (d). OAuth token expiry is the #1 silent failure mode of OOB CASB. The fix is two-part: re-authorise immediately and add monitoring on connector health so it never silently dies again. (a) would have impacted Inline DLP not just OOB. (c)/(b) only affect the inline path; OOB talks to SaaS directly and is unaffected by tunnel state.
Q7

You uploaded the customer database to EDM six weeks ago. New customers have onboarded daily since. A new customer's card number is pasted into Gmail and the rule does NOT fire. Why?

Correct: (c). EDM is a snapshot — fresh customer data needs a fresh upload. Operationalise this with a cron-driven pipeline + an alert on staleness. (a)/(b)/(d) are possible in other scenarios but the symptom "new customer specifically slips through" is the classic EDM-staleness signature.
Q8

You want to allow source-code pushes from your engineering team to GitHub Enterprise (corporate) but block them to public github.com. Which configuration is correct?

Correct: (a). The same content / different destination / different action pattern. CASB Inline's tenant awareness is what lets ZIA tell GH Enterprise apart from public GH on the same parent domain. Order matters because ZIA uses first-match. (b) is hostile. (c) is too broad. (d) creates a giant exfil hole.
Q9

A new DLP rule is going live next week to block PCI uploads to all personal webmail. You want minimum disruption. What's the recommended rollout sequence?

Correct: (d). The "Confirm before Block" pattern is the textbook safe rollout — it gives SecOps the false-positive data, gives users the training, and gives compliance the audit trail. (a) generates Monday-morning chaos and a flood of tickets. (c)/(b) don't address the underlying tuning need.
Q10

Insights shows a "Blocked" DLP event. The match preview in the log displays the full credit card number in plaintext. Compliance is concerned. Correct posture?

Correct: (c). A DLP alert that leaks the very data it caught is a textbook compliance failure (PCI DSS specifically calls out logging-of-PAN). Always configure rules to log redacted previews, and verify across the entire forwarding chain — ZIA Insights, NSS feed, SIEM, ticketing. (a)/(b)/(d) all miss the structural issue.

What's next — Lesson 9

Lesson 9 switches tracks completely, from ZIA (internet-bound traffic) to ZPA (private app access). Same Z-App, totally different architecture, totally different problem space.