Why this lesson matters
DLP is the control that makes Zscaler legally and contractually defensible. URL Filtering keeps users off bad sites. Threat Protection keeps malware out. But the question Legal and Compliance will eventually ask is: "if an employee tries to upload our customer database to a personal Gmail, do we know? Do we stop it?" That is DLP's job — and it's what goes into your SOC-2, ISO-27001, DPDP / GDPR and cyber-insurance evidence packs. Get it wrong and you don't just have a security gap, you have a contract violation.
CASB is the SaaS-era hole URL Filtering and Cloud App Control alone cannot close. URL Filtering sees onedrive.com; it cannot tell which OneDrive tenant the user logged into, what file they uploaded, or whether a share link from six months ago is now public on the internet. CASB Inline answers the first two. CASB Out-of-Band (SaaS Security API) answers the third. Together with DLP they form ZIA's data-protection triangle.
The shape of ZIA Data Protection
Three engines, three placements in the data path:
- Inline DLP: sits in the ZIA proxy. Inspects body + attachments in motion. Actions: Allow, Confirm (user-justification banner before allow), Block, plus per-incident notify + ICAP forward to external forensics. Latency: tens of ms per inspected request.
- CASB Inline: same proxy path, with SaaS-tenancy awareness. Knows corporate vs personal Microsoft / Google / Box accounts on the same domain. Inspects the JSON / multipart body of SaaS API calls. Can block upload to personal OneDrive while allowing the same file to corporate OneDrive on the same session.
- CASB Out-of-Band / SaaS Security API: outside the data path. Connects to the SaaS provider over its admin API (OAuth, signed by the tenant admin). Scans data at rest (every file already in Box / O365 / GDrive / Salesforce / GitHub / ServiceNow) for sensitive content, public share links, externally-shared folders, malware. Acts after the fact: revoke share, change owner, quarantine, notify manager.
Inline catches data leaving now. Out-of-Band finds what slipped through last quarter and is sitting in SharePoint with a public link. Most tenants run all three, each tuned to a different risk class.
Two planes, one DLP engine reused on both. Inline DLP + CASB Inline live in the proxy data path. CASB Out-of-Band lives outside the data path entirely, scanning SaaS via OAuth-signed admin APIs. The same dictionaries, EDM templates and IDM templates power both planes.
DLP Dictionaries — the matching primitives
A dictionary is a named pattern definition. DLP rules reference dictionaries. Dictionaries can contain word lists, phrases, regex patterns, or EDM/IDM template references. So you CAN match regex — but always via a dictionary, never directly in the rule. Build dictionaries once, reuse them across rules. ZIA ships three flavours.
Predefined dictionaries (40+ out of the box)
Curated by Zscaler, kept up to date with regulatory format changes. The high-value ones in production:
- PCI — Credit Card Number (Visa, MasterCard, Amex, Discover, JCB; each with its own checksum). Default uses Luhn validation, so a random 16-digit string fails the check.
- PII (per region) — US SSN, US Driver's License, UK NI Number, India Aadhaar (with Verhoeff check), India PAN, EU IBAN, ABA Routing Number, Brazil CPF.
- PHI — Medical Record Numbers, ICD-10 / ICD-11 codes, NPI (US National Provider Identifier).
- Source Code — language-aware: C/C++, Java, Python, Go, JavaScript, SQL DDL. Detects keyword density, not just file extension.
- OFAC / Sanctions — name + alias matching against the US Treasury sanctions list.
- AWS / Azure / GCP keys: recognise the signed prefix structure of cloud access keys (e.g. AKIA for AWS).
- Crypto wallet addresses: BTC, ETH, etc.
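The Luhn validation mentioned for the credit-card dictionary can be illustrated in a few lines. This is a sketch of the public Luhn algorithm in Python, not Zscaler's implementation; the function name `luhn_valid` and the 12-digit minimum length are assumptions made for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep ASCII digits only, so separators such as spaces or dashes are ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 12:  # assumed lower bound for a card-like number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A well-known test number such as 4111111111111111 passes, while changing its last digit makes the checksum fail, which is why most random 16-digit strings do not trigger the dictionary.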
Custom dictionaries — regex + score + threshold
Three matching modes per dictionary:
| Mode | What it does | Good for |
|---|---|---|
| Word | Case-insensitive exact word match (whole-token). "Confidential" matches; "Confidentially" does not. | Project codenames, classification labels (e.g. "INTERNAL ONLY") |
| Phrase | Multi-word ordered match. "Patient ID" matches when those two tokens appear in order with whitespace. | Standard form labels, header phrases |
| Regex | Full PCRE2 regex. Slowest; use only when word/phrase can't express the pattern. | Internal customer ID format (e.g. CUST-\d{8}) |
Every match contributes a score. A dictionary has a threshold — minimum cumulative score before it counts as "triggered". This lets you say: "trigger only when at least 4 different credit-card numbers appear in the same request, not on a single one". Threshold scoring is what turns DLP from a false-positive generator into a usable production control.
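The score-and-threshold idea can be sketched as follows. This is a simplified Python model, assuming one fixed score per distinct match; the `Dictionary` class, its field names, and the 16-digit pattern are hypothetical and do not mirror ZIA's internal data model.

```python
import re
from dataclasses import dataclass


@dataclass
class Dictionary:
    name: str
    pattern: str          # regex describing a single match
    score_per_match: int  # score contributed by each distinct match
    threshold: int        # minimum cumulative score to count as triggered

    def triggered(self, text: str) -> bool:
        # Count distinct matches so a value repeated many times scores once.
        distinct = set(re.findall(self.pattern, text))
        return len(distinct) * self.score_per_match >= self.threshold
```

With `score_per_match=1` and `threshold=4`, a request containing a single card-like number stays below the threshold, while four different numbers in the same request trigger the dictionary.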
Composite dictionaries — AND / OR / proximity
A composite combines atomic dictionaries with logical operators and a proximity window (chars or words) — the biggest false-positive killer in production. Example: PCI-CC-Strict = CreditCardNumber AND (CardholderName OR ExpiryDate OR CVV) within 50 words. A log file with one test card no longer fires; a real cardholder record does.
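A proximity-window composite of this kind can be approximated in Python as shown here. It is only a sketch under simplifying assumptions: the card pattern is a bare 16-digit token, the context dictionary is three hypothetical keywords, and distance is measured in whitespace-separated words.

```python
import re

CARD = re.compile(r"\d{16}")
# Hypothetical stand-ins for the CardholderName / ExpiryDate / CVV dictionaries.
CONTEXT = re.compile(r"cardholder|expiry|cvv", re.IGNORECASE)


def composite_hit(text: str, window: int = 50) -> bool:
    """True if a card-like token has a context keyword within `window` words."""
    tokens = [t.strip(".,;:") for t in text.split()]
    card_positions = [i for i, t in enumerate(tokens) if CARD.fullmatch(t)]
    context_positions = [i for i, t in enumerate(tokens) if CONTEXT.fullmatch(t)]
    return any(
        abs(c - k) <= window
        for c in card_positions
        for k in context_positions
    )
```

A log line that contains only a test card number returns False, while text that places a card number next to a cardholder or expiry label returns True.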
EDM — Exact Data Match (structured data)
EDM answers: "don't match any 16-digit number; match only one of our 1.2M customer card numbers". You export the sensitive table (CSV: card_number, customer_name, dob, email); the EDM tool salts + hashes each cell and uploads the hash index. The Public Service Edge (PSE) never sees plaintext, only hashes. At inspection time, ZIA hashes candidate tokens and looks them up in that index.
Primary + secondary field matching
EDM templates designate one field as primary (strong identifier — card number, SSN) and others as secondary. A rule typically demands "primary present AND ≥N secondary fields from the same row" — catches a full-record leak, not a single-field accidental paste. Pair EDM with dictionary rules at different thresholds for layered control.
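The primary + secondary rule can be sketched as a lookup against a salted hash index. The Python below is an illustration only: the salt, the SHA-256 choice, and the helper names (`build_index`, `edm_match`) are assumptions, and the real EDM tool's hashing scheme is not described in this lesson.

```python
import hashlib

SALT = b"demo-salt"  # hypothetical; a real deployment keeps its salt secret


def hash_cell(value: str) -> str:
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()


def build_index(rows, primary):
    """Map each hashed primary value to the hashed secondary cells of its row."""
    index = {}
    for row in rows:
        secondaries = {hash_cell(v) for k, v in row.items() if k != primary}
        index[hash_cell(row[primary])] = secondaries
    return index


def edm_match(tokens, index, min_secondary=2):
    """True if a primary hit is accompanied by enough same-row secondary hits."""
    hashed = {hash_cell(t) for t in tokens}
    for candidate in hashed:
        if candidate in index and len(hashed & index[candidate]) >= min_secondary:
            return True
    return False
```

A lone card number matches the primary field but fails the secondary requirement, so only content carrying several fields from the same customer row counts as a hit.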
Production constraints (the gotchas)
- Source CSV size limit — multi-million-row sources may need chunking + per-cell normalisation. Plan for the export pipeline, not just the upload.
- Refresh cadence — the hash index is a snapshot. New customers added after the last upload are invisible until the next refresh. Production tenants rebuild + re-upload weekly, cron-driven from the system of record.
- Normalisation: if the source stores 4111-1111-1111-1111 but the user pastes 4111111111111111, the hashes differ unless both sides strip identically. Use per-column normalisers (strip spaces, lowercase, ASCII-fold) at upload time.
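The normalisation problem is easy to demonstrate. The sketch below, in Python, assumes two hypothetical per-column normalisers and a SHA-256 hash with a demo salt; the column names follow the CSV example above, and none of this reflects the actual EDM tool's internals.

```python
import hashlib
import unicodedata


def normalise_card(cell: str) -> str:
    # Drop separators: keep ASCII digits only.
    return "".join(ch for ch in cell if ch in "0123456789")


def normalise_text(cell: str) -> str:
    # ASCII-fold, lowercase, and collapse runs of whitespace.
    folded = unicodedata.normalize("NFKD", cell).encode("ascii", "ignore").decode("ascii")
    return " ".join(folded.lower().split())


# Hypothetical mapping of column name to normaliser.
NORMALISERS = {"card_number": normalise_card, "customer_name": normalise_text}


def cell_hash(column: str, cell: str, salt: bytes = b"demo-salt") -> str:
    normalised = NORMALISERS[column](cell)
    return hashlib.sha256(salt + normalised.encode("ascii")).hexdigest()
```

Because the same normaliser runs on the stored value and on the pasted value, the dashed and undashed forms of a card number produce the same hash.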
IDM — Indexed Document Match (full documents)
IDM is for unstructured leaks: board decks, M&A drafts, source-code ZIPs, legal contracts, design docs. Upload the protected documents; ZIA computes a rolling-hash fingerprint of overlapping shingles (small text windows). At inspection time, candidate content is compared against the fingerprint index with a partial-match threshold (30–80%). Higher (80%) for legal contracts where templated reuse is OK; lower (30%) for trade secrets where any meaningful overlap is alarming. IDM is the anchor for "the board deck PDF leaked to ChatGPT" — a few pasted paragraphs still fire the rule because their shingles hash to entries in the index.
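Shingle-based partial matching can be sketched as follows. This Python example is a simplification: it hashes each five-word shingle with SHA-1 instead of using a true rolling hash, and it measures overlap as the share of the candidate's shingles found in the index. Both choices, along with the function names, are assumptions for illustration.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Overlapping k-word windows of the lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def fingerprint(text: str, k: int = 5) -> set:
    return {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in shingles(text, k)}


def overlap(candidate: str, index: set, k: int = 5) -> float:
    """Fraction of the candidate's shingle hashes that appear in the index."""
    candidate_hashes = fingerprint(candidate, k)
    if not candidate_hashes:
        return 0.0
    return len(candidate_hashes & index) / len(candidate_hashes)
```

A rule would then compare the overlap value against the configured partial-match threshold. A paragraph pasted from a protected document scores high even though the rest of the document is absent.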
EDM tokenizer: Splits on whitespace + punctuation. Case Folding (checkbox in EDM template config) controls case-sensitive matching — if unchecked, 'John Smith' and 'john smith' do NOT match. Most common EDM false-negative cause in production.
EDM pipeline: source CSV → per-column normaliser (strip/lowercase/ASCII-fold) → tokenizer (whitespace + punctuation) → optional case-fold → salt+hash → upload to PSE hash index.
OCR — what gets scanned, what doesn't
OCR scope: Runs on image attachments (JPG / PNG / TIFF) and on PDFs with embedded images. Supports a defined set of languages. Default file-size cap: 10 MB. Adds 200–800 ms per scanned image, so scope it to high-risk destinations only or you'll get Webex/Teams perf tickets the same day.
A single attachment upload exercises the entire engine: SSL Inspection terminates the TLS, the multipart body is unpacked, the CSV is parsed and scanned against both the PCI dictionary and the EDM customer-DB hash index, the threshold trips, the action fires, the user is notified, and the Insights log captures a redacted preview so SecOps can investigate without re-exposing the data.
DLP rule configuration — GUI walkthrough
The path in the Admin Portal:
Policy → Data Loss Prevention → DLP Policy → + Add Rule
Order: 20
Rule Name: Block-PCI-to-Non-Corp-Webmail
Status: Enabled
DLP Engines: PCI-CC-Strict (composite: CC + name/expiry within 50 words)
+ EDM-Customer-DB (require primary + ≥2 secondary fields)
Min Match Count: 1 (engine threshold counts, not raw matches)
File Types: All (DLP scans body AND attachments)
URL Categories: Webmail
Cloud Apps: Gmail (personal), Yahoo Mail, Outlook.com personal
Users / Groups: All EXCEPT Group=Customer-Support-Approved-Senders
Locations: All
Action: Block
Notification: User notification template "PCI-block-banner-v3"
Auditor: secops-dlp@corp.com (gets per-incident email)
ICAP: Forward redacted preview to Forensics-ICAP
Severity: High
A paired gentler rule for the same content to corporate destinations:
Policy → Data Loss Prevention → DLP Policy → + Add Rule
Order: 10 (HIGHER priority: fires before the Block rule)
Rule Name: Allow-SourceCode-to-GH-Enterprise-Alert-Only
DLP Engines: SourceCode-Composite (any of: Java + Python + C + Go)
URL Categories: (none)
Cloud Apps: GitHub Enterprise (tenant-aware via CASB Inline)
Action: Allow
Notification: None (silent: engineering workflow)
Auditor: dev-dlp@corp.com (weekly digest, not per-incident)
Then a paired Block rule at Order 30:
Cloud Apps: GitHub.com (public)
Action: Block
Notification: "Push source code to public GitHub blocked: use GH Enterprise"
The pattern: same content, different destinations, different actions. Without CASB Inline's tenant awareness, ZIA would only see "github.com" — both rules collapse to one and engineering is either silently leaking or completely blocked.
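The effect of rule order can be modelled with a small first-match evaluator. This Python sketch is a toy model, not the ZIA policy engine: the `Rule` fields, the exact-match comparison on engine and cloud app, and the default Allow when nothing matches are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    order: int
    name: str
    engine: str
    cloud_app: str
    action: str


def evaluate(rules, engine_hit: str, cloud_app: str):
    """Return (action, rule name) of the lowest-order rule that matches."""
    for rule in sorted(rules, key=lambda r: r.order):
        if rule.engine == engine_hit and rule.cloud_app == cloud_app:
            return rule.action, rule.name
    return "Allow", "default"  # assumed fallback when no rule matches


RULES = [
    Rule(30, "Block-SourceCode-to-Public-GitHub", "SourceCode-Composite", "GitHub.com (public)", "Block"),
    Rule(10, "Allow-SourceCode-to-GH-Enterprise-Alert-Only", "SourceCode-Composite", "GitHub Enterprise", "Allow"),
]
```

Calling `evaluate(RULES, "SourceCode-Composite", "GitHub Enterprise")` resolves to the Allow rule, while the same engine hit against the public GitHub app resolves to Block. If both destinations were reported as the same cloud app, only one of the two rules could ever fire.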
ICAP: Configured tenant-wide under Administration → DLP Incident Receiver, not per-rule. Sends DLP incidents to an external incident management platform.
ZIA DLP supports 5 severity levels: Info / Low / Medium / High / Critical. Map them deliberately to your SIEM noise budget — Critical and High page on-call, Medium goes to a daily digest, Low to a weekly review, Info is silent telemetry. Without severity mapping, all DLP looks the same to SecOps.
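The severity mapping described above can be expressed as a small routing table on the SIEM side. The Python below is a sketch only: the destination names, the `severity` event field, and the fallback for unknown values are hypothetical and would depend on your SIEM's schema.

```python
# Hypothetical routing destinations keyed by DLP severity.
ROUTING = {
    "Critical": "page-oncall",
    "High": "page-oncall",
    "Medium": "daily-digest",
    "Low": "weekly-review",
    "Info": "telemetry-only",
}


def route(event: dict) -> str:
    """Pick a destination for a DLP event based on its severity field."""
    # Unknown or missing severities go to the daily digest so they are still seen.
    return ROUTING.get(event.get("severity"), "daily-digest")
```

Keeping the mapping in one table makes it easy to review against the on-call noise budget whenever a new rule is added.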
DLP file-size cap: Default DLP_inspect_max_bytes = 16 MB. Files larger than this skip inspection entirely. Surface this in design discussions — large CAD files, design assets, video clips all bypass DLP by default. Consider raising the cap for high-risk groups (legal, design, M&A) or layering File Type Control to block specific large-file types outbound.
Encrypted/password-protected files: DLP cannot inspect. Combine with File Type Control: block password-protected ZIP/7z outbound, or quarantine for review. Otherwise this is the single easiest DLP bypass in the wild.
CASB Inline vs Out-of-Band — when to use which
| Dimension | CASB Inline | CASB Out-of-Band (SaaS Security API) |
|---|---|---|
| Where it sits | In the proxy data path (same as ZIA) | Outside the data path; OAuth-signed admin API to the SaaS provider |
| Detection latency | Real-time (tens of ms) | Scheduled scan (minutes to hours) plus event-driven webhooks where supported |
| What it can do | Block upload / download / share in motion · enforce tenant restrictions (only corp Microsoft 365) · redact a message in-flight | Find files already at rest · revoke public share links · change file owner · quarantine to admin-only folder · notify uploader's manager · scan for malware in stored files |
| What it cannot do | See data that already exists in the SaaS (didn't pass through ZIA) · catch user accessing SaaS from an unmanaged device that bypasses the tunnel | Block a leak in real time — only finds it after the fact |
| Coverage gap when user is on personal device off-tunnel | Blind | Still works — connector talks to SaaS directly, not the user |
| Typical SaaS | Microsoft 365, Google Workspace, Box, Dropbox, Slack, ServiceNow, Salesforce, GitHub Enterprise (any SaaS reachable through the tunnel) | Microsoft 365, Google Workspace, Box, Dropbox, Salesforce, ServiceNow, GitHub, Workday, Slack (subset supports OAuth admin scope) |
| Auth dependency | Z-Tunnel + ZIA identity | OAuth token signed by tenant admin (must be refreshed before expiry — top failure mode) |
| Best use case | "Block this file from being uploaded to personal OneDrive right now." | "Find every file in our O365 tenant with a public share link AND containing PCI data, revoke share, notify owner." |
After enabling DLP + CASB, validate on a controlled test account before relaxing:
- Insights → DLP dashboard. Trigger a deliberate test (paste 4 Luhn-valid test cards into a Gmail compose window from a corp laptop). Confirm a "Blocked" entry appears within 30s with the rule name and a redacted match preview.
- Insights → Web → filter destination=mail.google.com. Confirm the request shows "DLP=hit", policy name, score, and action.
- NSS / SIEM — verify the DLP event also arrived in your SIEM with the same redacted preview. If only ZIA sees it, your forwarding pipeline is broken (run a sample NSS feed query).
- CASB API connector status — Admin → SaaS Security API → Connectors. Each connector should show "Connected · last scan <1h ago · OAuth expires in >30d". A red status on a connector means a blind spot on that SaaS.
- EDM index health — Admin → DLP → EDM Templates → check "Last upload" and "Row count". If row count jumped down or "Last upload" is more than 7 days old on a weekly cadence, the export pipeline broke.
Common pitfalls
- Single-dictionary CC rule = false-positive flood. A bare "Credit Card Number" dictionary fires on every Luhn-valid 16-digit string — including dev test cards in log files. Wrap CC detection in a composite with proximity (CC + name OR expiry OR CVV within 50 words). Day-1 naïve PCI DLP = 50,000 alerts and SecOps stops looking at the queue.
- EDM hash file not refreshed. New customers walk out unmatched. Automate weekly export → hash → upload; alert if the job hasn't completed in 9 days.
- CASB OAuth token silently expired. The connector goes red on the status page, OOB scanning stops, and no one notices until a forensic request reveals six months of un-scanned SharePoint. OAuth refresh-token lifetime is provider-specific (Microsoft Graph: 90-day sliding, refreshed on use; Box: 60 days; Google Workspace: non-expiring for service accounts), so don't assume 90 days everywhere. Monitor the 'Last Successful Scan' age field and re-consent before the tenant-specific expiry.
- Watermarking turned on for everything. Visible per-user PDF watermarks are great for sensitive previews but kill performance and confuse users when applied to every document. Scope to board decks, legal contracts, M&A drafts — not "all PDFs".
- Source-code block rule with no engineering exception. Universal block on source-code uploads to public destinations breaks the 3 AM open-source release pipeline. Add Group=Engineering + Destination=approved-repos exception above the universal block.
- OCR not enabled for image attachments. Screenshot of a credit card in a Slack DM bypasses text-only DLP. Turn on OCR for inline DLP (scope to high-risk destinations to manage latency).
- CASB Inline without tenant restrictions. Without "Allow only Microsoft tenant ID X, block all other Microsoft 365 logins", a user can sign into personal OneDrive on the same browser as corporate OneDrive and CASB Inline only sees "OneDrive". Tenant restriction is the most valuable single CASB Inline setting; configure per SaaS, per tenant.
Best practices
- Always run DLP in "Confirm" mode for two weeks before "Block". Confirm allows the action but pops a user-justification banner ("type a reason"). You get the false-positive list before users get angry, you get end-user training data, and you get a clean audit trail showing the org informed users before enforcing.
- Pair Inline DLP with CASB OOB for the same SaaS. Inline catches new uploads; OOB sweeps existing data. On day 1 of a new CASB connector, the OOB scan usually finds thousands of pre-existing public share links with sensitive content — that's your first quarter of cleanup work.
- Use Severity correctly. ZIA DLP supports 5 levels — Info / Low / Medium / High / Critical per rule. Map them deliberately to your SIEM noise budget — Critical and High page on-call, Medium goes to a daily digest, Low to a weekly review, Info is silent telemetry. Without severity mapping, all DLP looks the same to SecOps.
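The staleness checks in the list above (EDM upload older than 9 days, connector last scan older than 1 hour, OAuth expiring within 30 days) lend themselves to a scheduled monitor. The Python below is a sketch that assumes you have already pulled the timestamps from the portal or your SIEM; the dictionary keys and function names are hypothetical, and no Zscaler API is called.

```python
from datetime import datetime, timedelta, timezone


def edm_index_stale(last_upload: datetime, now: datetime, max_age_days: int = 9) -> bool:
    """True if the EDM hash index has not been refreshed within the alert window."""
    return now - last_upload > timedelta(days=max_age_days)


def connector_problems(connector: dict, now: datetime) -> list:
    """List health problems for one SaaS Security API connector record."""
    problems = []
    if now - connector["last_scan"] > timedelta(hours=1):
        problems.append("scan overdue")
    if connector["oauth_expires"] - now < timedelta(days=30):
        problems.append("oauth expiring soon")
    return problems
```

Running both checks on a schedule and alerting on any non-empty result turns a silent failure into a ticket.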
Real-world scenario — Gmail outbound DLP with inline redaction
A user pastes a 16-digit card number into a Gmail compose window. ZIA CASB Inline sees the POST to Gmail, runs the body through the Credit Card dictionary, and rewrites the affected digits to XXXX before the request leaves ZIA. The user sees the redacted version in their Sent folder. The same flow works for other outbound webmail (Outlook Web, Yahoo Mail). For Slack and most SaaS chat, redaction is NOT supported; the action must be Block instead.
Rules already in place
- Composite dictionary PCI-CC-Strict = CreditCardNumber AND (CardholderName OR ExpiryDate OR CVV) within 50 words. Threshold = 1 composite match.
- EDM template Customer-DB-v2026-05-20: 1.2M rows, primary = card_number, secondary = name / dob / email. Suppresses a duplicate dictionary fire when the same record is an EDM hit.
- CASB Inline rule: Cloud App = Gmail / Outlook Web / Yahoo Mail (webmail family), Action = "Redact and replace with notification".
What the user sees
User hits Send in the Gmail compose window. CASB Inline intercepts the POST to mail.google.com, inspects the form body, trips PCI-CC-Strict, and rewrites the affected digits in-flight: card number digits → XXXX-XXXX-XXXX-XXXX, name → [REDACTED — PII], with a banner appended. Gmail receives the redacted version — that's what gets sent and what appears in the user's Sent folder. The user's UI shows a notification: "DLP redacted PCI data — ask the recipient to use the secure-pay link instead."
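The in-flight rewrite of card digits can be illustrated with a regex substitution guarded by a Luhn check. This Python sketch covers only the card-number part of the redaction and says nothing about how ZIA implements it; the pattern, the mask string, and the function names are assumptions.

```python
import re

# 16 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")
MASK = "XXXX-XXXX-XXXX-XXXX"


def luhn_valid(number: str) -> bool:
    digits = [int(ch) for ch in number if ch in "0123456789"]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit = digit * 2 - 9 if digit * 2 > 9 else digit * 2
        total += digit
    return total % 10 == 0


def redact_cards(body: str):
    """Return (redacted body, number of card numbers masked)."""
    hits = 0

    def mask(match):
        nonlocal hits
        if not luhn_valid(match.group()):
            return match.group()  # leave non-card digit runs untouched
        hits += 1
        return MASK

    return CARD_RE.sub(mask, body), hits
```

The hit count is what a rule would compare against its threshold, and the redacted body is what would be forwarded to the mail provider.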
What SecOps sees
- Insights → DLP → today: "Redacted" entry within 5s. User=engineer@corp, App=Gmail, Engine=PCI-CC-Strict, Action=Redact, Severity=High.
- Incident detail: redacted preview (4111-XXXX-XXXX-1234 · J*** B**** · 04/27), URL path, recipient address (hashed).
- NSS feed: parallel event in the SIEM, queryable; pre-built "DLP severity=High by hour" widget updates.
- Auditor email: secops-dlp@corp gets a structured email in ~30s with redacted preview + Jump-to-Insights link.
- SOC ticket (low-priority): confirm the user used the correct workaround; log it for the quarterly compliance report. No customer data was exposed — the point of the exercise.
What CASB OOB confirms later
The next hourly OOB scan on the Gmail (Google Workspace) connector re-confirms the stored Sent message body is the redacted version, not the original. PCI scan: zero hits in that mailbox for the day. The compliance evidence pack writes itself. That's "DLP done right" — silent to the recipient, helpful to the user, defensible to the auditor, no PII in the alert log.
Important caveat: Inline body-modification (redaction) is currently GA only for webmail-family SaaS (Gmail / Outlook Web / Yahoo Mail) and a small set of HTTP-form-style apps. For Slack and most SaaS chat platforms ZIA's CASB Inline does NOT modify the message body — the only supported action is Block (the message is rejected before reaching the SaaS, and the user is notified). Plan rules accordingly; don't promise "redaction" for a SaaS where only Block is supported.
📌 Quick reference (memorise — this is the data-protection arc)
- Three engines, three placements. Inline DLP and CASB Inline live in the proxy data path; CASB Out-of-Band lives outside it (OAuth admin API to SaaS).
- Inline catches now; OOB catches what slipped through. Run both for the same SaaS — Inline blocks new leaks, OOB sweeps the back-catalog.
- Three dictionary types. Predefined (40+, regulator-current), Custom (word / phrase / regex with score and threshold), Composite (AND/OR + proximity window — the false-positive killer).
- EDM = hash-upload of structured exact data (your actual customer rows). Primary + secondary field model. Plan the refresh pipeline; weekly cadence with alert-on-staleness.
- IDM = rolling-hash fingerprints of full documents (board decks, M&A, legal). Partial-match threshold — set high for templated content, low for trade secrets.
- Rule order matters. Same first-match logic as other ZIA policies — Allow / sanctioned-destination exceptions ABOVE generic Block.
- CASB Inline tenant restriction is the most valuable single setting — without it, you cannot tell corporate OneDrive from personal OneDrive.
- CASB OOB OAuth tokens expire silently. Monitor "Last Scan" age in your SIEM — the connector goes red, scanning stops, no one notices.
- Run Confirm before Block for at least two weeks per new rule — false-positive triage plus end-user training plus audit trail.
- Verify path. Insights → DLP for triggers · Insights → Web with DLP filter for per-request view · Connector status page for OOB health · NSS feed in SIEM for the structured evidence pack.
Build + test a DLP rule end-to-end:
- Create a Credit Card dictionary using the built-in pattern. Set confidence threshold to Medium.
- Create a DLP rule: outbound + Webmail destination + Credit Card dict + Action = Block + Notify User.
- From a test laptop, attempt to paste a Luhn-valid 16-digit card into Gmail compose — verify block page.
- Now upload a CSV with 100 cards as a file attachment — DLP should match. If it doesn't, check the 16 MB file-size cap.
- Check CASB → Last Scan age for your O365 tenant — re-consent if > 80 days.
What's next — Lesson 9
Module 9 switches tracks completely — from ZIA (internet-bound traffic) to ZPA (private app access). Same Z-App, totally different architecture, totally different problem space.