After enabling a new DLP rule, users report "Outlook Web won't load attachments". You want to confirm whether the new DLP rule is the cause. First place to check?

Correct: (a). Insights Web is the live, in-tenant log for HTTPS via ZIA, and every entry shows which rule matched. (b) is wrong — Outlook is public SaaS, not ZPA. (c) tells you latency, not rule match. (d) eventually shows it but Insights is faster for real-time debugging.

A user can't reach jira.corp.internal via ZPA. The error in Z-App says "no connection". You want to know if the right App Segment / Connector path was even selected for this user. Which tool?

Correct: (b). Trace User is the ZPA equivalent of traceroute — replays the access attempt and shows you the matched policy + selected Connector. (a) is for public web. (c) would only help if Jira loaded slowly. (d) has the data but isn't an interactive trace.

Your team rolled out SSL Inspection and now Microsoft Authenticator stops delivering push notifications. Browser-based M365 sign-in works fine. Insights → Web for the user shows "status = TLS-Fail" for the Authenticator API endpoints. Root cause?

Correct: (b). The browser works because OS cert store has the Zscaler Root, but Authenticator pins independently and ignores the OS store. The TLS-Fail pattern in Insights — only for that app — is the giveaway. Fix is always a Bypass rule for pinned apps. (Slack dropped strict pinning in 2022; modern pinners are Authenticator, iCloud, banking apps, WhatsApp desktop, and Webex.) (a) would break all HTTPS. (c) would block all traffic. (d) doesn't match the TLS layer.

After a "small" PAC file edit, a subset of users start showing no Zscaler activity at all in Insights — their traffic appears to be bypassing Zscaler entirely. Z-App-tunneled users are unaffected. What likely happened and how do you fix it without a 1-hour outage?

Correct: (b). PAC parse errors often cause the entire file to fall through to DIRECT — silent failure mode. The Z-App tunneled users are unaffected because they don't use PAC. Always version-control PACs and use the validator. Rollback first, then re-apply changes incrementally. (a)/(c)/(d) don't address the root cause and (c) isn't even an option you have.

Branch site reports "Z-Tunnel red on every laptop since 2 AM. All users lost ZIA". You've already verified the Z-Tunnel client and Root CA are healthy. Which Insights tab and which likely root cause do you check first?

Correct: (b). Tunnel-down symptoms belong in the Tunnel tab, not Web. The 2 AM maintenance window + simultaneous failure across the whole site strongly suggests the branch firewall closed outbound 443 to Zscaler IPs. Auth-token expiry is the other common cause but tends to be staggered, not simultaneous. (a)/(c)/(d) are the wrong layer.

Your SIEM ingests NSS feeds but no dashboards or alerts have been built on top. Auditor asks: "Show me all DLP triggers for PII data, by user, in the last 12 months." What's the actual problem and the fix?

Correct: (b). NSS-without-parsers is a common waste — raw logs land but no value extracted. The fix is to deploy vendor-provided SIEM content packs that normalize fields and ship pre-built searches. (a) is overkill. (c) is wrong — Insights ages out within months (SKU-dependent, often 30–90 days on Standard). (d) makes the audit problem worse.

You fixed a ZPA Access Policy that was blocking a contractor from reaching the internal HR app. You activated the change. The user still says "not working". Best next step?

Correct: (b). Always verify a ZPA fix with Trace User. The new rule may not be matching because of rule order, a posture failure, or an unhealthy Connector. Activating a change is not proof it works. (a) is unprofessional. (c) is premature. (d) is unrelated.

Logs, ZDX and The 5 Production Troubleshooting Scenarios

Q: A user reports "Salesforce is slow for the whole APAC team". You want to know whether the slowness is in the Zscaler path, the SaaS provider's path, or the user's local ISP. Which Zscaler telemetry source answers that in one screen?

Correct: (c). ZDX is the only source that measures hop-by-hop user-perceived latency from endpoint through Zscaler to the SaaS endpoint. Insights tells you if the request was allowed/blocked but not where on the path it slowed. Trace User is for ZPA private apps. NSS retains the data but doesn't show hop latency natively.

Q: Your SOC wants to retain 18 months of Zscaler Web logs for compliance and join them with EDR + email gateway events. What's the right architecture?

Correct: (b). NSS is purpose-built for long-term streaming + SIEM correlation. Insights is in-tenant only and ages out. CSV exports lose schema fidelity. Screenshots are not auditable. NSS feeds are continuous and fault-tolerant, and the SIEM-side Zscaler app or content pack can normalize them to a common schema such as Splunk CIM.

Q: You deploy ZDX and assume it's "monitoring everything". A month later a user complains Workday is slow and ZDX has no data. What did you miss?

Correct: (b). ZDX is opt-in per application probe. You must enumerate the SaaS apps your users care about during ZDX onboarding. The classic mistake is deploying ZDX and never configuring application probes beyond the defaults. Always inventory your top-20 user-facing SaaS and configure probes for each. (a)/(c) are wrong. (d) is dismissive and inaccurate.

Infographic: concept-to-practice path

Start with the mental model, then move into the workflow, evidence, and practice questions.

Infographic: evidence ladder

Use this ladder when the question asks for troubleshooting, rollout, or proof.

Infographic: healthy vs broken thinking

This comparison turns the article into an interview and troubleshooting checklist.

Infographic: mini runbook

Convert the learning into a practical story you can explain to a manager or interviewer.

Why this lesson matters

A Zscaler deployment without observability is invisible until users complain. By that point, your CIO is on the call, the sales team is locked out of Salesforce, and you're 20 minutes into a fire drill with nothing but a Slack message that says "the internet is broken". The L3 engineer's job is to find the answer before the ticket lands — and when it does land, to walk from symptom to root cause in under 15 minutes. That requires knowing exactly which of the four Zscaler telemetry sources to open first, and which tab inside that source to pivot on. Get that right and you look like a wizard. Get it wrong and you waste an hour staring at Web logs while the actual problem was a dead Connector in ZPA Diagnostics.

This lesson is the muscle memory layer. We'll map all four log surfaces (ZIA Insights, ZPA Diagnostics, NSS-to-SIEM, ZDX), then walk the five production scenarios you will absolutely see in your first three months on the job. Memorise the hunt paths — they repeat.

The four log sources you need

Zscaler ships four distinct telemetry surfaces. Each answers a different question. Mixing them up is the #1 cause of L3 engineers chasing the wrong layer for an hour.

Source	Where it lives	What it answers	Retention
ZIA Insights	ZIA Admin Portal → Analytics → Insights	"What did the user request via ZIA, what rule matched, what was the verdict?"	SKU-dependent: 30–90d Standard, up to ~6mo Business/Transformation
ZPA Diagnostics	ZPA Admin Portal → Diagnostics	"Did this user reach this private app? Which Access Policy + Connector + Server Group was hit?"	14 days rolling
NSS (log streaming)	NSS VM in your DC, NSS for SIEM (cloud-to-cloud), or NSS Cloud → TCP/TLS to Splunk / Sentinel / QRadar / Elastic	"Show me 18 months of Web + Firewall + Tunnel + ZPA + DNS logs joined with EDR + email + identity."	As long as your SIEM keeps it
ZDX	ZDX Admin Portal — separate from ZIA/ZPA admin	"What's the user-perceived performance — hop by hop — for this SaaS app?"	~30 days rolling

The rule of thumb: Insights for "was it allowed/blocked", Diagnostics for "did ZPA path-match correctly", NSS for "what happened weeks ago plus correlation with non-Zscaler signals", ZDX for "is the user-experience actually slow". If you can recite that in your sleep, half this lesson is done.
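The rule of thumb can be written down as a small lookup table. This is only a sketch: the question labels (`allowed_or_blocked` and so on) are made-up keys for illustration, not Zscaler terminology.

```python
# Hypothetical question labels; the mapping restates the rule of thumb above.
SOURCE_FOR_QUESTION = {
    "allowed_or_blocked": "ZIA Insights",
    "zpa_path_match": "ZPA Diagnostics",
    "historical_correlation": "NSS",
    "user_experience_slow": "ZDX",
}


def pick_source(question: str) -> str:
    """Return the telemetry source to open first for a given question type."""
    try:
        return SOURCE_FOR_QUESTION[question]
    except KeyError:
        raise ValueError(f"unknown question type: {question!r}") from None


print(pick_source("user_experience_slow"))
```

Failing loudly on an unknown label is deliberate: guessing a source is exactly the mistake the rule of thumb is meant to prevent.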

The four telemetry paths — one user, four data streams

One user generates four distinct telemetry streams. Knowing which one to open per symptom is the single biggest L3 productivity multiplier.

Quick check · Pick the right source

A user reports "Salesforce is slow for the whole APAC team" and you need to know whether the slowness is in the Zscaler path, the SaaS provider's path, or the user's local ISP. Which telemetry source answers that in one screen?

a) ZIA Insights → Web tab — it shows the verdict for each request.
b) ZPA Diagnostics → Trace User.
c) ZDX — it measures user-perceived performance hop-by-hop from endpoint through Zscaler to the SaaS app.
d) NSS Web feed in Splunk.

Correct: c. ZDX is the only source that measures hop-by-hop user-perceived latency end to end. Insights tells you allowed/blocked but not where the path slowed; Trace User is for ZPA private apps; NSS retains the data but doesn't show hop latency natively.

ZIA Insights — deep dive

Insights is the in-tenant log explorer. It opens fast, refreshes every few seconds, and is where 80% of your day-to-day debugging happens. It has four tabs, each scoped to a different ZIA service:

Web — every HTTP/HTTPS request. URL Filtering matches, Cloud App Control, DLP triggers, Sandbox verdicts, ATP detections, Malware scans — all joined onto one timeline per user.
Firewall — Cloud Firewall rule hits, non-web TCP/UDP ports allowed or denied via Z-Tunnel 2.0. This is where you see "did the SSH client actually egress?"
DNS — DNS Control queries, whether resolved via recursive Zscaler or forwarded to your AD DNS, blocked categories.
Tunnel — Z-Tunnel up/down events, auth events (SAML/Kerberos), source IP / location attribution.

The filtering rhythm is the same in every tab and you should be able to type it without thinking: User → Time range → Action → Destination → Status. That's the canonical 5-filter sequence. Add Department or Location only when scoping a regional issue. Use the timeline view to spot the moment the symptom started.

Saved search you should build day-1

Tab:        Web
Time:       Last 1h
Action:     Block, TLS-Fail, Caution
Department: All
Group by:   URL Category, then Rule

(Save as: "All blocks last hour — first triage")

Insights also offers Saved Searches. Build five of these the day you get tenant access — top-5 blocks last 24h, TLS-failures last 1h, DLP triggers last 7d, sandbox-quarantined files last 24h, Tunnel-disconnect events last 1h. They become your dashboard.
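To practise the 5-filter rhythm outside the portal, the same sequence can be applied to exported log rows. This is a minimal sketch under stated assumptions: the record layout (`user`, `time`, `action`, `destination`, `status`) and the sample values such as `policy-block` are invented for illustration and are not the real Insights export schema.

```python
from datetime import datetime, timedelta


def five_filter(records, user, since, actions, destination, statuses):
    """Apply User -> Time range -> Action -> Destination -> Status, in that order."""
    out = [r for r in records if r["user"] == user]
    out = [r for r in out if r["time"] >= since]
    out = [r for r in out if r["action"] in actions]
    out = [r for r in out if r["destination"] == destination]
    out = [r for r in out if r["status"] in statuses]
    return out


now = datetime(2024, 1, 1, 12, 0)
# Made-up sample rows; only the first one should survive all five filters.
records = [
    {"user": "alice", "time": now - timedelta(minutes=10), "action": "Block",
     "destination": "outlook.office.com", "status": "policy-block"},
    {"user": "alice", "time": now - timedelta(hours=3), "action": "Block",
     "destination": "outlook.office.com", "status": "policy-block"},
    {"user": "alice", "time": now - timedelta(minutes=5), "action": "Allow",
     "destination": "outlook.office.com", "status": "ok"},
    {"user": "bob", "time": now - timedelta(minutes=5), "action": "Block",
     "destination": "outlook.office.com", "status": "policy-block"},
]

hits = five_filter(records, "alice", now - timedelta(hours=1),
                   {"Block"}, "outlook.office.com", {"policy-block"})
print(len(hits))
```

Keeping one filter per line mirrors the habit the lesson recommends: narrow by user first, then time, and only then by verdict.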

Quick check · Which Insights tab

A branch reports "Z-Tunnel red on every laptop since 2 AM — all users lost ZIA." The Z-Tunnel client and Root CA are already verified healthy. Which Insights tab do you open first, and what's the likely root cause?

a) Web tab — a certificate problem.
b) Tunnel tab — outbound 443 blocked at the branch firewall after the 2 AM maintenance window (or an expired auth token).
c) DNS tab — the recursive resolver is down.
d) ZPA Diagnostics → Trace User.

Correct: b. Tunnel-down symptoms belong in the Tunnel tab, not Web. A 2 AM maintenance window plus simultaneous, site-wide failure points to the branch firewall closing outbound 443 to Zscaler IPs. Auth-token expiry is the other common cause but tends to be staggered, not simultaneous.

ZPA Diagnostics — deep dive

ZPA Diagnostics is structurally different from ZIA Insights — ZPA brokers private app access, so the question isn't "was this URL allowed" but "did the right policy + Connector + Server combination resolve". Three tools matter:

Trace User — pick a user + an Application Segment. ZPA replays the last access attempt and shows you exactly which Access Policy matched, which Server Group + Connector Group was chosen, which Connector handled it, and the result. This is the L3 equivalent of traceroute for ZPA.
Connector Health — per-region availability graph. Shows you which Connectors in which Connector Group are UP, what their latency back to the ZPA Cloud is, and CPU/memory. A red Connector here is almost always the answer when an entire site loses ZPA access.
App Segment Browsing tracer — type a hostname (e.g. jira.corp.internal) and ZPA tells you which App Segment + Segment Group + Server Group it would route to. Critical when wildcard segments overlap and the wrong one is matching.

Rule of thumb: if the user says "the app didn't load", open Trace User before anything else. It tells you in one screen whether the failure was policy (no match), Posture (failed device check), Connector (unhealthy), or Server (origin down).
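The four failure layers that Trace User separates can be sketched as a simple classifier. The dictionary keys below (`matched_policy`, `posture_ok`, `connector_healthy`, `server_reachable`) are hypothetical names for what you read off the Trace User screen; ZPA does not necessarily expose the result in this shape.

```python
def classify_trace(trace: dict) -> str:
    """Map a (hypothetical) Trace User result to one of the four failure layers."""
    if trace.get("matched_policy") is None:
        return "policy"      # no Access Policy matched
    if not trace.get("posture_ok", True):
        return "posture"     # device failed a posture check
    if not trace.get("connector_healthy", True):
        return "connector"   # selected Connector is unhealthy
    if not trace.get("server_reachable", True):
        return "server"      # origin server is down
    return "ok"


print(classify_trace({"matched_policy": "Allow-HR", "connector_healthy": False}))
```

The checks run in path order (policy, then posture, then Connector, then server), so the first failing layer is the one reported.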

NSS — streaming logs to your SIEM

Insights and Diagnostics are great for live debugging, but they're in-tenant and they don't keep data forever. The moment your SOC needs to correlate Zscaler events with EDR, email gateway, identity (Okta / Entra), or build a custom retention dashboard — you need NSS (Nanolog Streaming Service). NSS continuously streams ZIA and ZPA logs over TCP/TLS to your SIEM collector.

NSS comes in three deployment forms:

Form	Where it runs	When to pick it
NSS VM	Customer DC or cloud (OVA / AMI)	You have an on-prem SIEM and want network locality. Most common.
NSS for SIEM (cloud-to-cloud)	Direct from Zscaler cloud to a hosted SIEM API (Splunk Cloud, Sentinel, QRadar SaaS, Elastic Cloud)	No VM to manage — cleanest path when your SIEM is SaaS too.
NSS Cloud	Zscaler-hosted NSS that delivers via syslog-over-TLS to your collector IP	You want zero customer-side infrastructure but still control the destination.

You configure one NSS feed per log type — Web, Firewall, Tunnel, ZPA, DNS — pick the fields you want, choose JSON / LEEF / CEF format, point at your collector. The feed is continuous, low-latency (seconds), and you can replay missed data within a buffer window.

The reason NSS is non-negotiable for any tenant above ~500 users: Insights only goes back a few months at most, but compliance, incident response, and long-running threat hunts need 12–24 months of data. NSS pipes that retention into a system you control.

Insights retention caveat: the retention limit that makes NSS necessary is the Insights one, and it varies by SKU. Standard ZIA Web Insights often keeps 30–90 days; Business/Transformation tiers approach 6 months. Verify your tenant's retention under Administration → NSS → Insights Retention.

Modern alternative to NSS: Zscaler's Log Streaming Service (LSS for ZPA) and REST API polling now feed most new SIEM integrations directly (Splunk Cloud, Sentinel, Chronicle). Classic NSS VMs are still the gold standard for high-EPS / on-prem SIEM (QRadar, ArcSight), but plan modern integrations API-first.

NSS Sizing

Rough planning: ~150-300 EPS per 1000 users at peak business hours. NSS VM baseline: 4 vCPU / 8 GB RAM / 100 GB disk. The NSS VM buffers ~10 minutes of logs in memory; if downstream SIEM ack lag exceeds the buffer, NSS drops logs and may panic-restart. Monitor: nss-stats CLI for buffer depth, SIEM-ack lag, and EPS in/out.
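The sizing numbers above turn into quick arithmetic. The sketch below uses only the planning figures from this section (150 to 300 EPS per 1000 users, a buffer of roughly 10 minutes); the 5000-user tenant is an invented example.

```python
def eps_range(users: int, low_per_1000: int = 150, high_per_1000: int = 300):
    """Estimate peak events per second for a tenant of the given size."""
    return users * low_per_1000 / 1000, users * high_per_1000 / 1000


def buffered_events(eps: float, buffer_minutes: int = 10) -> int:
    """Events the NSS buffer must hold if the SIEM stops acknowledging."""
    return int(eps * buffer_minutes * 60)


low, high = eps_range(5000)          # 750.0 to 1500.0 EPS
print(low, high, buffered_events(high))  # 900000 events at the high estimate
```

If the SIEM is likely to stall for longer than the buffer window, the second number is the backlog that would be lost, which is why the SIEM-ack lag is worth monitoring.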

Quick check · Long-term retention

Your SOC wants to retain 18 months of Zscaler Web logs for compliance and join them with EDR + email-gateway events. What's the right architecture?

a) Extend ZIA Insights retention by upgrading the SKU.
b) Configure NSS (VM, for-SIEM, or Cloud) to continuously stream Web + Firewall + Tunnel + ZPA + DNS feeds into the SIEM, where correlation happens.
c) Export Insights to CSV every week and email it.
d) Screenshot the dashboard daily.

Correct: b. NSS is purpose-built for long-term streaming plus SIEM correlation — continuous and low-latency, with CIM normalization supplied by the SIEM-side Zscaler app or content pack. Insights is in-tenant only and ages out (often 30–90 days on Standard); CSV exports lose schema fidelity; screenshots aren't auditable.

ZDX — proactive user-experience monitoring

Insights and Diagnostics tell you what happened on the Zscaler side. ZDX tells you what the user actually experienced — including the parts that have nothing to do with Zscaler. It runs as a lightweight agent inside Z-App (or as a standalone install) and continuously probes the path from endpoint to SaaS.

ZDX scores across 5 dimensions: (1) Page Load Time for SaaS app probes; (2) CloudPath hop-by-hop latency to the app; (3) Network last-mile / Wi-Fi / DNS health; (4) Device CPU/memory/battery health; (5) Application probe response (synthetic or real-user).

Application probes (synthetic) — run on a schedule from the Z-App agent even when the user is idle. Web probes / Real-user monitoring — passive, capture timings only when the user actually visits the app. ZDTA exam discriminator: synthetic catches outages when no one is working (early-morning M365 issues), RUM catches user-experience problems mid-day.

ZDX rolls the result into a single 0–100 ZDX Score per user per app. Anything above 80 = good, 50–80 = degraded, below 50 = poor. The top operational use case is: "user complains Teams is slow — ZDX shows the hop where latency spikes — was it the Zscaler edge, ISP middle-mile, or customer ISP egress?" You answer in 30 seconds instead of an hour of guessing.
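The score bands are easy to misremember at the boundaries, so here they are as a function. It follows the wording above literally: strictly above 80 is good, 50 through 80 is degraded, below 50 is poor. The function name is an illustration, not part of any ZDX API.

```python
def zdx_band(score: float) -> str:
    """Classify a 0-100 ZDX Score using the bands described in this lesson."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 80:
        return "good"
    if score >= 50:
        return "degraded"
    return "poor"


for s in (92, 80, 50, 41):
    print(s, zdx_band(s))
```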

When NOT to trust the ZDX score

SSL Inspection breaks the probe itself — ZDX probes to pinned apps (M365 with cert pinning quirks) can fail at the TLS step. Score drops to 50, but app actually works. Check by probing from a Z-App that's bypassed for the target FQDN.
Captive portals — hotel/airport Wi-Fi intercept ZDX synthetic probes. False low scores for the whole region until users authenticate.
Probe configured but app not in inventory — score stays 'no data' even when users are actively complaining.
Cross-check: if ZDX says 92 but the user says 'slow', check Web Insights for that user's actual transactions before defending the dashboard.

ZDX probe flow — symptom to root-cause hop

ZDX turns "users say Teams is slow" into "Hop C between Singapore edge and SaaS provider added 480 ms — here's the trace". 30 seconds instead of an hour.

▶ Walk a ZDX score-drill — "Salesforce broken for APAC"

Friday 5 PM: APAC can't update opportunities. The steps below drill from the ZDX score drop to the bad hop. Without an application probe configured for Salesforce, step ② would show no data and the drill would stall there.

① Insights Web: Filter destination=*.salesforce.com + region APAC, last 15 min. Traffic is flowing (mixed 200 OK and some TLS-Fail) — so it's partial degradation, not a blanket block.

② ZDX score: Pivot to ZDX → Salesforce → APAC. The ZDX Score has dropped from a baseline of 92 to 41 in the last 20 minutes — confirmed, user-perceived degradation.

③ Hop-by-hop: Open CloudPath. Hops 1–3 (endpoint → Z-Tunnel → Singapore POP) are ~12 ms baseline. Hop 4 (Singapore POP → Salesforce edge) jumped from ~18 ms to ~480 ms with packet loss — the bad hop.

④ Root cause: Check the Zscaler trust-portal status page: the Singapore POP is on a partial maintenance window with reduced peering capacity. Vendor's path, not your config.

⑤ Workaround + verify: Push a temporary forwarding-profile / PAC override routing APAC Salesforce via the Tokyo POP for 2 hours. Five minutes later the ZDX Score recovers to 88 and the hop view shows a healthy Tokyo → Salesforce path.

Quick check · Trust the score?

You deploy ZDX and assume it's "monitoring everything." A month later a user complains Workday is slow and ZDX has no data for it. What did you miss?

a) ZDX is broken.
b) ZDX only probes the SaaS apps you explicitly configure — Workday and your other top apps must be added as application probes during onboarding, or no data is collected for them.
c) Workday doesn't support ZDX.
d) The user is wrong.

Correct: b. ZDX is opt-in per application probe. The out-of-the-box defaults don't cover every SaaS your org uses, so you must inventory your top user-facing apps and configure a probe for each — otherwise the score stays "no data" even while users complain.

🔑 Lock in the key terms

Trace User: The ZPA Diagnostics traceroute equivalent. Pick a user + App Segment and it replays the access attempt, showing the matched Access Policy, Server Group, Connector Group, Connector and result in one screen.

NSS: Nanolog Streaming Service. Continuously streams ZIA + ZPA logs over TCP/TLS (JSON / LEEF / CEF) to Splunk / Sentinel / QRadar / Elastic for long-term retention and cross-source correlation.

ZDX Score: A single 0–100 score per user per app. Above 80 = good, 50–80 = degraded, below 50 = poor. Drill into the hop-by-hop view to find the bad hop in 30 seconds.

Synthetic vs RUM: Synthetic application probes run on a schedule even when the user is idle (catch early-morning outages). RUM / web probes capture timings only when the user actually visits the app (catch mid-day user-experience problems).

The 5 most common production troubleshooting scenarios

You will see these five patterns repeatedly. Memorise the hunt path for each — they're worth more than any abstract theory.

(a) "SaaS app is broken" — e.g. Outlook on the Web won't load

Symptom: users in a department can't load outlook.office.com. Page hangs or partial-render. Reproducible.

Hunt path: Insights → Web tab → filter destination=outlook.office.com + time-range last 1h + user-region affected. Look at Action column. If status is Block, the rule column shows the offending URL Filter / Cloud App / DLP rule. If status is TLS-Fail, you've got an SSL Inspection break (probably pinning — see scenario b). If status is Allow but bytes are 0, you're chasing a backend issue not a Zscaler one — pivot to ZDX next.

Fix: the rule column tells you the answer. Common cases — someone added a new DLP rule that classifies Outlook downloads as PII, a Cloud App Control rule blocked file-attachment uploads, a URL Filter rule mis-categorized outlook.office.com as Webmail (blocked) instead of Productivity (allowed). Edit the rule, narrow the match, activate, re-test.
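The three branches of this hunt path can be sketched as a triage function over a single log entry. The entry fields (`status`, `rule`, `bytes`) are assumed names for the columns you read in Insights, and the rule name in the example is invented.

```python
def triage_web_entry(entry: dict) -> str:
    """Suggest the next step for one Web log entry, following scenario (a)."""
    status = entry["status"]
    if status == "Block":
        # The rule column names the offending policy.
        return f"policy block: inspect rule {entry.get('rule', 'unknown')!r}"
    if status == "TLS-Fail":
        return "SSL Inspection break: suspect certificate pinning"
    if status == "Allow" and entry.get("bytes", 0) == 0:
        return "allowed but no data: suspect backend, pivot to ZDX"
    return "no Zscaler-side fault visible in this entry"


print(triage_web_entry({"status": "Block", "rule": "DLP-PII-Outlook"}))
```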

(b) "One app silently fails after SSL Inspection rollout"

Symptom: ERR_CERT_AUTHORITY_INVALID in browser, OR the native app just refuses to connect with no error. Most common after enabling SSL Inspection or rolling a new pinned-app version.

Hunt path: Insights → Web → filter destination= + status=TLS-Fail. If you see lots of TLS-Fail entries, the app is pinning its certificate (it doesn't trust the Zscaler-issued one). Cross-check with tcpdump on the client during the connection attempt — you'll see an immediate TLS abort from the app side.

Fix: add the app's API domains to the SSL Inspection Bypass rule (Order 5 — above the generic Inspect rule). The classics that pin (current as of 2026): Microsoft Authenticator, Apple iCloud sync, banking apps, WhatsApp desktop, Webex. (Slack desktop dropped strict pinning in 2022 — modern Slack works through MITM with the Zscaler root cert installed; if Slack breaks today it's usually a root-cert distribution issue, not pinning.) Maintain a "Pinned Apps — Always Bypass" list in your runbook and apply on day-1 of every new tenant.
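One way to build the bypass candidate list is to look for destinations where every logged attempt ended in TLS-Fail, which is the pinning signature described above. This is a sketch over exported entries; the field names and the `example.test` hostnames are placeholders, and the threshold of three failures is an arbitrary assumption.

```python
from collections import Counter


def pinning_suspects(entries, min_failures=3):
    """Return destinations that failed TLS on every attempt, at least min_failures times."""
    totals = Counter(e["destination"] for e in entries)
    fails = Counter(e["destination"] for e in entries if e["status"] == "TLS-Fail")
    return sorted(d for d, n in fails.items()
                  if n >= min_failures and n == totals[d])


entries = (
    [{"destination": "auth-push.example.test", "status": "TLS-Fail"}] * 3
    + [{"destination": "portal.example.test", "status": "200 OK"},
       {"destination": "portal.example.test", "status": "TLS-Fail"}]
)
print(pinning_suspects(entries))
```

A destination with mixed results, like the second host here, is left out on purpose: an occasional TLS failure is more likely a network issue than pinning.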

(c) "ZPA app is unreachable"

Symptom: Z-App shows "no connection" or app spinner forever. Affects all users of one app, or all users at one site.

Hunt path: ZPA Diagnostics → Trace User → pick affected user, pick affected app. Look at the matched Access Policy + Server Group + Connector. Three things go wrong here, in this order of frequency:

Connector unhealthy — Diagnostics → Connector Health → red dot. The Connector VM crashed, lost outbound 443 to ZPA Cloud, or ran out of memory. Bounce or scale.
App Segment misconfigured — the wildcard in the App Segment doesn't actually match the hostname the user is hitting. Use the App Segment Browsing tracer to verify which segment a hostname resolves to.
Posture failed — user's device suddenly fails a Posture profile (disk encryption disabled, AV out of date, certificate expired). Posture failures show in Diagnostics with the specific posture rule that failed.

Fix: depends on which of the three. Always re-run Trace User after the fix to confirm it now matches the right path.
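Overlapping wildcard segments are easier to reason about if you list every segment a hostname could match. The sketch below uses shell-style globbing from Python's `fnmatch` as a stand-in; ZPA's real matching and precedence rules may differ, and the segment names are invented.

```python
from fnmatch import fnmatchcase


def matching_segments(hostname: str, segments: dict) -> list:
    """Return the names of all segments with a pattern that matches hostname."""
    host = hostname.lower()
    return [name for name, patterns in segments.items()
            if any(fnmatchcase(host, p.lower()) for p in patterns)]


segments = {
    "Jira": ["jira.corp.internal"],
    "Corp wildcard": ["*.corp.internal"],
    "HR": ["hr.corp.internal"],
}
print(matching_segments("jira.corp.internal", segments))
```

More than one name in the result is the cue to check which segment actually won in the App Segment Browsing tracer.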

(d) "PAC file typo locks out a subset of users"

Symptom: users who forward via PAC (not Z-App tunnel) get random failures, or no Zscaler at all, or all their traffic bypasses Zscaler. Z-App users are fine. Often happens an hour after a "harmless" PAC edit.

Hunt path: pull the currently-deployed PAC. Run it through a PAC validator (Zscaler ships one in the Admin Portal under Administration → Hosted PAC Files → Validate). One missing semicolon or a wrong domain literal in shExpMatch() can cascade into "the entire PAC returns DIRECT for everything" — silently bypassing Zscaler.

Fix: deploy a known-good PAC, then re-add the new domain logic one block at a time. Version-control your PAC in Git. Always validate before pushing. The number of incidents traced to "someone edited the live PAC in the textbox" is staggering — treat the PAC like production code.
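A cheap pre-commit check can catch the grossest PAC breakage before the file reaches the portal validator. The sketch below is deliberately naive: it only confirms that `FindProxyForURL` is defined, that brackets are balanced by count, and that a `return` exists. It does not parse JavaScript and is not a substitute for the validator. The proxy hostname in the sample is a placeholder.

```python
import re


def pac_sanity_issues(pac_text: str) -> list[str]:
    """Return a list of obvious structural problems in a PAC file."""
    issues = []
    if not re.search(r"function\s+FindProxyForURL\s*\(", pac_text):
        issues.append("FindProxyForURL is not defined")
    # Count-based check only; brackets inside string literals will confuse it.
    for open_ch, close_ch in ("()", "{}"):
        if pac_text.count(open_ch) != pac_text.count(close_ch):
            issues.append(f"unbalanced {open_ch}{close_ch}")
    if "return" not in pac_text:
        issues.append("no return statement")
    return issues


good = ('function FindProxyForURL(url, host) { '
        'if (shExpMatch(host, "*.corp.internal")) { return "DIRECT"; } '
        'return "PROXY proxy.example.test:80"; }')
bad = good.replace("))", ")")  # simulate a dropped parenthesis

print(pac_sanity_issues(good))
print(pac_sanity_issues(bad))
```

Wired into CI, a non-empty result would fail the build so the broken file never gets deployed.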

(e) "Z-Tunnel is down — Z-App icon red"

Symptom: Z-App icon is red on the user's tray. No traffic to ZIA. User sees raw internet (or nothing, if firewall blocks direct).

Hunt path: open Z-App on the client → View Logs / Tunnel status. Then ZIA Insights → Tunnel tab → filter by that user. Common causes:

Outbound 443 blocked by user's home / hotel router — Z-Tunnel needs egress to the Zscaler edge on 443. Captive portals are a frequent culprit. Switch network and re-test.
Auth token expired — Z-App's SAML token rolled over and Z-App needs the user to re-authenticate. Sign out, sign in.
Trusted Network detection — Z-App detects the user is on the corporate LAN and intentionally drops Z-Tunnel (Disable on Trusted Network policy). Check forwarding profile.
Z-App version stale — older Z-App versions get deprecated. Push the current version via MDM.

Fix: walk that list top-down. Most outage tickets resolve at step 1 (network) or step 2 (re-auth).

✓Verify — your observability is actually wired up

Don't trust that the telemetry is healthy just because nobody's complained yet. Run these checks weekly:

Insights saved-searches — confirm your top-5 saved searches (blocks last 1h, TLS-failures, DLP triggers, sandbox quarantines, tunnel disconnects) all return data and refresh in real-time.
NSS pipeline health — in your SIEM, search the last 15 minutes of zscaler_web; if the count is 0, the feed is broken. Set a SIEM alert on "no Zscaler events for 5 minutes".
ZDX threshold alert routing — temporarily drop a probe threshold (e.g. set Salesforce to alert below 90) and verify the email / Slack / PagerDuty hook actually fires. Then set it back to the real threshold (~70).
ZPA Trace User dry-run — pick any user + any production app every week and run Trace User. Confirm the policy match path is still what you expect (catches drift from rule edits).
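The "no Zscaler events for 5 minutes" alert from the NSS check boils down to a staleness test on the newest event timestamp. This is a sketch of the logic only; how you fetch the latest timestamp depends on your SIEM, and the function name is made up.

```python
from datetime import datetime, timedelta


def feed_is_stale(last_event: datetime, now: datetime,
                  max_silence: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when the feed has been silent for longer than max_silence."""
    return now - last_event > max_silence


now = datetime(2024, 1, 1, 12, 0)
print(feed_is_stale(now - timedelta(minutes=2), now))
print(feed_is_stale(now - timedelta(minutes=9), now))
```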

⚠Common Mistakes — observability done wrong

Looking only at Insights → Web when the issue is Tunnel auth — different tab, different data set. If the symptom is "no traffic at all", check Tunnel tab first.
Assuming long-term retention exists when the data is in-tenant only — Insights data ages out within months (often 30–90 days on Standard SKUs). If NSS was never configured, that 9-month-old incident has zero forensic evidence.
ZDX deployed but no probes for the SaaS apps users actually use — out of the box ZDX probes a few defaults. Add Salesforce / Workday / ServiceNow / Zoom / Teams / your custom internal apps explicitly.
Confusing ZPA Diagnostics with Insights → Web — they show different layers. Diagnostics is private-app policy match path. Insights is public/SaaS request log. Use the right one.
Engineer fixes a rule but doesn't re-run Trace User to confirm — the rule edit looked right but maybe rule order changed and a higher-priority rule still wins. Always re-trace after a fix.
PAC file changes not version-controlled — no rollback when the typo locks everyone out. Commit every PAC change to Git, tag it with the change ticket, deploy via CI.
SIEM ingesting NSS feed but no parsers / dashboards — logs sit in a bucket unused. Wire CIM-compliant Splunk apps or Sentinel content packs from day one.

💡Pro Tips

Build your runbook from saved searches. Every recurring ticket pattern (Slack down, Salesforce slow APAC, OneDrive sync) becomes a saved Insights search you can pull in one click. Six months in, you have a 50-search library and your MTTR drops to minutes.
Pair ZDX alerts with PagerDuty / Slack rather than email. Email gets ignored. A Slack hook in #network-noc with the ZDX score graph inline gets eyes within 30 seconds and shortens incident lead time enormously.
Treat NSS feeds as your audit trail, not optional. When auditors ask "show me a year of egress traffic for user X", you need NSS-to-SIEM. Configure it on day 1, not after the audit.

Real-world scenario — Friday 5 PM: "Salesforce broken for APAC"

Sales lead pings the on-call channel — "Salesforce is broken for everyone in APAC. No one can update opportunities. End of quarter, fix now." You have 30 minutes before the global all-hands. Walk it:

Insights → Web → filter destination=*.salesforce.com + user-region=APAC + last 15 min. Traffic is flowing — not a blanket block. Good.
Look at Status column. Mixed bag — many 200 OK but also a chunk of TLS-Fail. So partial degradation, not full block.
Pivot to ZDX → Salesforce app → APAC region view. The ZDX Score for APAC has dropped from 92 (last week's baseline) to 41 in the last 20 minutes. Confirmed degradation, confirmed user-perceived.
Open the hop-by-hop view. Hops 1–3 (endpoint → Z-Tunnel → ZIA Singapore POP) all show baseline latency ~12 ms. Hop 4 (Singapore POP → Salesforce edge) has jumped from ~18 ms to ~480 ms with packet loss.
Check Zscaler's trust portal status page. Confirmed — Singapore POP is on a partial maintenance window with reduced peering capacity. Vendor's fault, not your config.
Workaround: in Z-App's forwarding profile, push a temporary PAC override that routes APAC Salesforce traffic via the Tokyo POP for the next 2 hours. Activate.
Verify: 5 minutes later, ZDX Score recovers to 88. Hop view shows healthy Tokyo→Salesforce path. Sales lead confirms users back to normal.
Post-incident: file a ticket with Zscaler asking why the Singapore maintenance wasn't in the customer-facing change calendar. Update the runbook with "Tokyo failover PAC override — APAC Salesforce" so the next on-call engineer doesn't reinvent it.

Eight steps. ~22 minutes from ticket to recovery. The path is reproducible because each step asks one specific question of one specific telemetry source. That's the L3 discipline this lesson is building.
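The hop-by-hop step of this walk-through is a comparison of current latency against baseline per hop. A minimal sketch, using the approximate figures from the scenario; the tuple layout and hop labels are assumptions for illustration, not a ZDX export format.

```python
def worst_hop(hops):
    """Return the (name, baseline_ms, current_ms) tuple with the largest latency increase."""
    return max(hops, key=lambda h: h[2] - h[1])


# Approximate values from the scenario: hops 1-3 at baseline, hop 4 degraded.
hops = [
    ("hop 1", 12, 12),
    ("hop 2", 12, 12),
    ("hop 3", 12, 12),
    ("hop 4", 18, 480),
]
name, baseline, current = worst_hop(hops)
print(name, current - baseline)
```

Ranking by increase over baseline, rather than by absolute latency, keeps a hop that is always slow from hiding the one that just got worse.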

📌 Quick reference (memorise — comes up in every ZDTA scenario question)

4 log sources: ZIA Insights · ZPA Diagnostics · NSS streaming · ZDX.
Insights = in-tenant, real-time, GUI, SKU-dependent retention (30d–6mo). Daily debugging lives here.
ZIA Insights 4 tabs: Web · Firewall · DNS · Tunnel. Pick the tab that matches the symptom.
Filter rhythm: User → Time → Action → Destination → Status. Same in every tab.
ZPA Diagnostics → Trace User is the traceroute equivalent for private apps.
NSS = TCP/TLS feed to Splunk / Sentinel / QRadar / Elastic. Long-term retention + cross-source correlation.
ZDX = 0–100 score per user per app. Below 50 = poor. Hop-by-hop view finds the bad hop in 30 seconds.
Top-5 scenarios memorised: SaaS broken (Insights Web), SSL pinning (Insights Web TLS-Fail), ZPA unreachable (Trace User), PAC typo (validate + version), Z-Tunnel down (Insights Tunnel + Z-App logs).
Always re-run Trace User after a ZPA fix — confirm the new path actually matched.
PAC = production code. Git it, tag it, validate it, deploy it via CI.

▶ QUICK LAB · ~15 MIN

Hunt a user-reported slowness ticket:

User reports "SAP is slow at 3pm IST". Open ZDX → Users → search the user.
Note the ZDX score over the last 24h. If above 75 most of the time but dropped at 3pm → confirm a real event.
Drill into CloudPath at 3pm — find which hop introduced latency. Was it the last-mile, the ISP, or Zscaler?
Open Web Insights for that user at 3pm — find actual transaction timings to SAP.
If ZDX says 92 but user says slow → check synthetic vs RUM — synthetic may be fine, RUM telling a different story.

What's next?

Module 14 is the finisher — your ZDTA exam blueprint walkthrough, the 25 most-asked Zscaler interview questions with model answers, and a 4-week study schedule to clear the cert.

Lesson 14 — ZDTA Cert + Interview Prep → Practice on exam.techclick.in
