Why this lesson matters
A Zscaler deployment without observability is invisible until users complain. By that point, your CIO is on the call, the sales team is locked out of Salesforce, and you're 20 minutes into a fire drill with nothing but a Slack message that says "the internet is broken". The L3 engineer's job is to find the answer before the ticket lands and, when it does land, to walk from symptom to root cause in under 15 minutes. That requires knowing exactly which of the four Zscaler telemetry sources to open first, and which tab inside that source to pivot on. Get that right and you look like a wizard. Get it wrong and you waste an hour staring at Web logs while the actual problem was a dead Connector in ZPA Diagnostics.
This lesson is the muscle memory layer. We'll map all four log surfaces (ZIA Insights, ZPA Diagnostics, NSS-to-SIEM, ZDX), then walk the five production scenarios you will absolutely see in your first three months on the job. Memorise the hunt paths: they repeat.
The four log sources you need
Zscaler ships four distinct telemetry surfaces. Each answers a different question. Mixing them up is the #1 cause of L3 engineers chasing the wrong layer for an hour.
| Source | Where it lives | What it answers | Retention |
|---|---|---|---|
| ZIA Insights | ZIA Admin Portal → Analytics → Insights | "What did the user request via ZIA, what rule matched, what was the verdict?" | SKU-dependent: 30-90d Standard, up to ~6mo Business/Transformation |
| ZPA Diagnostics | ZPA Admin Portal → Diagnostics | "Did this user reach this private app? Which Access Policy + Connector + Server Group was hit?" | 14 days rolling |
| NSS (log streaming) | NSS VM in your DC, NSS for SIEM (cloud-to-cloud), or NSS Cloud → TCP/TLS to Splunk / Sentinel / QRadar / Elastic | "Show me 18 months of Web + Firewall + Tunnel + ZPA + DNS logs joined with EDR + email + identity." | As long as your SIEM keeps it |
| ZDX | ZDX Admin Portal (separate from ZIA/ZPA admin) | "What's the user-perceived performance, hop by hop, for this SaaS app?" | ~30 days rolling |
The rule of thumb: Insights for "was it allowed/blocked", Diagnostics for "did ZPA path-match correctly", NSS for "what happened weeks ago plus correlation with non-Zscaler signals", ZDX for "is the user-experience actually slow". If you can recite that in your sleep, half this lesson is done.
One user generates four distinct telemetry streams. Knowing which one to open per symptom is the single biggest L3 productivity multiplier.
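The rule of thumb can be written down as a small lookup, which is handy as the first lines of a runbook script. This is only a sketch: the symptom labels are hypothetical names chosen for illustration, not Zscaler terminology.

```python
from enum import Enum


class Source(Enum):
    ZIA_INSIGHTS = "ZIA Insights"
    ZPA_DIAGNOSTICS = "ZPA Diagnostics"
    NSS_SIEM = "NSS-to-SIEM"
    ZDX = "ZDX"


# Hypothetical symptom labels; the mapping mirrors the rule of thumb above.
FIRST_STOP = {
    "allowed_or_blocked": Source.ZIA_INSIGHTS,
    "private_app_path": Source.ZPA_DIAGNOSTICS,
    "historical_or_correlation": Source.NSS_SIEM,
    "user_experience_slow": Source.ZDX,
}


def first_source(symptom: str) -> Source:
    """Return the telemetry source to open first for a symptom label."""
    try:
        return FIRST_STOP[symptom]
    except KeyError:
        raise ValueError(f"unknown symptom label: {symptom!r}") from None
```

Calling `first_source("private_app_path")` returns `Source.ZPA_DIAGNOSTICS`; an unknown label raises instead of silently guessing a layer.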
ZIA Insights: deep dive
Insights is the in-tenant log explorer. It opens fast, refreshes every few seconds, and is where 80% of your day-to-day debugging happens. It has four tabs, each scoped to a different ZIA service:
- Web: every HTTP/HTTPS request. URL Filtering matches, Cloud App Control, DLP triggers, Sandbox verdicts, ATP detections, Malware scans, all joined onto one timeline per user.
- Firewall: Cloud Firewall rule hits, non-web TCP/UDP ports allowed or denied via Z-Tunnel 2.0. This is where you see "did the SSH client actually egress?"
- DNS: DNS Control queries, whether resolved via recursive Zscaler or forwarded to your AD DNS, blocked categories.
- Tunnel: Z-Tunnel up/down events, auth events (SAML/Kerberos), source IP / location attribution.
The filtering rhythm is the same in every tab and you should be able to type it without thinking: User → Time range → Action → Destination → Status. That's the canonical 5-filter sequence. Add Department or Location only when scoping a regional issue. Use the timeline view to spot the moment the symptom started.
A first-triage saved search, for example:
- Tab: Web
- Time: Last 1h
- Action: Block, TLS-Fail, Caution
- Department: All
- Group by: URL Category, then Rule
- Save as: "All blocks last hour - first triage"
Insights also offers Saved Searches. Build five of these the day you get tenant access: top-5 blocks last 24h, TLS-failures last 1h, DLP triggers last 7d, sandbox-quarantined files last 24h, Tunnel-disconnect events last 1h. They become your dashboard.
ZPA Diagnostics: deep dive
ZPA Diagnostics is structurally different from ZIA Insights: ZPA brokers private app access, so the question isn't "was this URL allowed" but "did the right policy + Connector + Server combination resolve". Three tools matter:
- Trace User: pick a user + an Application Segment. ZPA replays the last access attempt and shows you exactly which Access Policy matched, which Server Group + Connector Group was chosen, which Connector handled it, and the result. This is the L3 equivalent of `traceroute` for ZPA.
- Connector Health: per-region availability graph. Shows you which Connectors in which Connector Group are UP, what their latency back to the ZPA Cloud is, and CPU/memory. A red Connector here is almost always the answer when an entire site loses ZPA access.
- App Segment Browsing tracer: type a hostname (e.g. `jira.corp.internal`) and ZPA tells you which App Segment + Segment Group + Server Group it would route to. Critical when wildcard segments overlap and the wrong one is matching.
Rule of thumb: if the user says "the app didn't load", open Trace User before anything else. It tells you in one screen whether the failure was policy (no match), Posture (failed device check), Connector (unhealthy), or Server (origin down).
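The wildcard-overlap problem mentioned for the App Segment Browsing tracer can be spotted offline before you open the portal. The sketch below only lists every segment whose patterns match a hostname; it does not reproduce how ZPA itself chooses between them. The segment names and domain patterns are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical segment names and domain patterns, for illustration only.
SEGMENTS = {
    "corp-wildcard": ["*.corp.internal"],
    "jira-only": ["jira.corp.internal"],
    "lab": ["*.lab.internal"],
}


def matching_segments(hostname: str, segments: dict[str, list[str]]) -> list[str]:
    """List every segment whose patterns match the hostname (case-insensitive)."""
    host = hostname.lower()
    return [
        name
        for name, patterns in segments.items()
        if any(fnmatch(host, pattern.lower()) for pattern in patterns)
    ]
```

If `matching_segments("jira.corp.internal", SEGMENTS)` returns more than one name, that hostname is a candidate for the "wrong segment is matching" failure and is worth confirming in the tracer.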
NSS: streaming logs to your SIEM
Insights and Diagnostics are great for live debugging, but they're in-tenant and they don't keep data forever. The moment your SOC needs to correlate Zscaler events with EDR, email gateway, identity (Okta / Entra), or build a custom retention dashboard, you need NSS (Nanolog Streaming Service). NSS continuously streams ZIA and ZPA logs over TCP/TLS to your SIEM collector.
NSS comes in three deployment forms:
| Form | Where it runs | When to pick it |
|---|---|---|
| NSS VM | Customer DC or cloud (OVA / AMI) | You have an on-prem SIEM and want network locality. Most common. |
| NSS for SIEM (cloud-to-cloud) | Direct from Zscaler cloud to a hosted SIEM API (Splunk Cloud, Sentinel, QRadar SaaS, Elastic Cloud) | No VM to manage; the cleanest path when your SIEM is SaaS too. |
| NSS Cloud | Zscaler-hosted NSS that delivers via syslog-over-TLS to your collector IP | You want zero customer-side infrastructure but still control the destination. |
You configure one NSS feed per log type (Web, Firewall, Tunnel, ZPA, DNS): pick the fields you want, choose JSON / LEEF / CEF format, and point it at your collector. The feed is continuous, low-latency (seconds), and you can replay missed data within a buffer window.
The reason NSS is non-negotiable for any tenant above ~500 users: Insights only goes back a few months at most, but compliance, incident response, and long-running threat hunts need 12-24 months of data. NSS pipes that retention into a system you control.
Insights retention caveat: in-tenant retention varies by SKU. Standard ZIA Web Insights often keeps 30-90 days; Business/Transformation tiers approach 6 months. Verify your tenant's retention under Administration → NSS → Insights Retention.
Modern alternative to NSS: Zscaler's Log Streaming Service (LSS for ZPA) and REST API polling now feed most new SIEM integrations directly (Splunk Cloud, Sentinel, Chronicle). Classic NSS VMs are still the gold standard for high-EPS / on-prem SIEM (QRadar, ArcSight), but plan modern integrations API-first.
NSS Sizing
Rough planning: ~150-300 EPS per 1000 users at peak business hours. NSS VM baseline: 4 vCPU / 8 GB RAM / 100 GB disk. The NSS VM buffers ~10 minutes of logs in memory; if downstream SIEM ack lag exceeds the buffer, NSS drops logs and may panic-restart. Monitor: nss-stats CLI for buffer depth, SIEM-ack lag, and EPS in/out.
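To make the sizing arithmetic concrete, here is a minimal sketch that applies the rough planning figures above (150-300 EPS per 1000 users, ~10-minute buffer). The 5000-user tenant in the usage note is a made-up example, and the function names are illustrative.

```python
def peak_eps(
    users: int,
    eps_per_1000_low: float = 150,
    eps_per_1000_high: float = 300,
) -> tuple[float, float]:
    """Estimate the peak events-per-second range for a tenant of this size."""
    thousands = users / 1000
    return thousands * eps_per_1000_low, thousands * eps_per_1000_high


def buffered_events(eps: float, buffer_minutes: float = 10) -> float:
    """Events the buffer must hold if the SIEM stops acknowledging for the window."""
    return eps * buffer_minutes * 60
```

For a 5000-user tenant, `peak_eps(5000)` gives 750 to 1500 EPS, and `buffered_events(1500)` gives 900,000 events in a 10-minute stall. If your SIEM's worst-case ack lag is longer than that window, expect dropped logs.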
ZDX: proactive user-experience monitoring
Insights and Diagnostics tell you what happened on the Zscaler side. ZDX tells you what the user actually experienced, including the parts that have nothing to do with Zscaler. It runs as a lightweight agent inside Z-App (or as a standalone install) and continuously probes the path from endpoint to SaaS.
ZDX scores across 5 dimensions: (1) Page Load Time for SaaS app probes; (2) CloudPath hop-by-hop latency to the app; (3) Network last-mile / Wi-Fi / DNS health; (4) Device CPU/memory/battery health; (5) Application probe response (synthetic or real-user).
ZDX collects data in two ways:
- Application probes (synthetic): run on a schedule from the Z-App agent even when the user is idle.
- Web probes / real-user monitoring (RUM): passive; they capture timings only when the user actually visits the app.
ZDTA exam discriminator: synthetic catches outages when no one is working (early-morning M365 issues); RUM catches user-experience problems mid-day.
ZDX rolls the result into a single 0-100 ZDX Score per user per app. Anything above 80 = good, 50-80 = degraded, below 50 = poor. The top operational use case is: "user complains Teams is slow → ZDX shows the hop where latency spikes → was it the Zscaler edge, ISP middle-mile, or customer ISP egress?" You answer in 30 seconds instead of an hour of guessing.
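If you export scores for reporting, the bands above translate into a few lines of code. This sketch uses the thresholds exactly as stated in this lesson; treat them as an assumption and check them against what your own tenant displays.

```python
def zdx_band(score: float) -> str:
    """Classify a 0-100 ZDX Score using the bands given in this lesson."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 80:
        return "good"
    if score >= 50:
        return "degraded"
    return "poor"
```

With these bands, a score of exactly 80 counts as degraded and exactly 50 also counts as degraded, since the lesson gives "above 80" and "below 50" as the outer bands.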
When NOT to trust the ZDX score
- SSL Inspection breaks the probe itself: ZDX probes to pinned apps (M365 with cert pinning quirks) can fail at the TLS step. The score drops to 50, but the app actually works. Check by probing from a Z-App that's bypassed for the target FQDN.
- Captive portals: hotel/airport Wi-Fi intercepts ZDX synthetic probes. False low scores for the whole region until users authenticate.
- Probe configured but app not in inventory: the score stays 'no data' even when users are actively complaining.
- Cross-check: if ZDX says 92 but the user says 'slow', check Web Insights for that user's actual transactions before defending the dashboard.
ZDX turns "users say Teams is slow" into "Hop C between Singapore edge and SaaS provider added 480 ms; here's the trace". 30 seconds instead of an hour.
The 5 most common production troubleshooting scenarios
You will see these five patterns repeatedly. Memorise the hunt path for each: they're worth more than any abstract theory.
(a) "SaaS app is broken", e.g. Outlook on the Web won't load
Symptom: users in a department can't load outlook.office.com. Page hangs or partial-render. Reproducible.
Hunt path: Insights → Web tab → filter destination=outlook.office.com + time-range last 1h + affected user region. Look at the Action column. If the action is Block, the Rule column shows the offending URL Filter / Cloud App / DLP rule. If it is TLS-Fail, you've got an SSL Inspection break (probably pinning; see scenario b). If it is Allow but bytes are 0, you're chasing a backend issue, not a Zscaler one; pivot to ZDX next.
Fix: the rule column tells you the answer. Common cases: someone added a new DLP rule that classifies Outlook downloads as PII, a Cloud App Control rule blocked file-attachment uploads, or a URL Filter rule mis-categorized outlook.office.com as Webmail (blocked) instead of Productivity (allowed). Edit the rule, narrow the match, activate, re-test.
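The three-way decision in this hunt path can be sketched as a function over exported log rows. The field names below are hypothetical and would need mapping to whatever your Insights export or NSS feed actually calls them.

```python
from dataclasses import dataclass


@dataclass
class WebLogRow:
    # Hypothetical field names, not the real export schema.
    action: str          # e.g. "Block", "TLS-Fail", "Allow"
    rule: str            # name of the matched rule, empty if none
    bytes_received: int


def next_step(row: WebLogRow) -> str:
    """Map one Web log row to the next move in the scenario (a) hunt path."""
    if row.action == "Block":
        return f"inspect rule: {row.rule}"
    if row.action == "TLS-Fail":
        return "suspect SSL Inspection or pinning (scenario b)"
    if row.action == "Allow" and row.bytes_received == 0:
        return "likely backend issue: pivot to ZDX"
    return "no Zscaler-side fault visible in this row"
```

Running this over the last hour of rows for the affected destination and counting the results gives a quick picture of whether you are dealing with a rule, a TLS break, or something outside Zscaler.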
(b) "One app silently fails after SSL Inspection rollout"
Symptom: ERR_CERT_AUTHORITY_INVALID in browser, OR the native app just refuses to connect with no error. Most common after enabling SSL Inspection or rolling a new pinned-app version.
Hunt path: Insights → Web → filter destination=<app domain> + status=TLS-Fail. If you see lots of TLS-Fail entries, the app is pinning its certificate (it doesn't trust the Zscaler-issued one). Cross-check with tcpdump on the client during the connection attempt: you'll see an immediate TLS abort from the app side.
Fix: add the app's API domains to the SSL Inspection Bypass rule (Order 5, above the generic Inspect rule). The classics that pin (current as of 2026): Microsoft Authenticator, Apple iCloud sync, banking apps, WhatsApp desktop, Webex. (Slack desktop dropped strict pinning in 2022; modern Slack works through MITM with the Zscaler root cert installed. If Slack breaks today it's usually a root-cert distribution issue, not pinning.) Maintain a "Pinned Apps: Always Bypass" list in your runbook and apply it on day 1 of every new tenant.
(c) "ZPA app is unreachable"
Symptom: Z-App shows "no connection" or app spinner forever. Affects all users of one app, or all users at one site.
Hunt path: ZPA Diagnostics → Trace User → pick affected user, pick affected app. Look at the matched Access Policy + Server Group + Connector. Three things go wrong here, in this order of frequency:
- Connector unhealthy: Diagnostics → Connector Health → red dot. The Connector VM crashed, lost outbound 443 to ZPA Cloud, or ran out of memory. Bounce or scale.
- App Segment misconfigured: the wildcard in the App Segment doesn't actually match the hostname the user is hitting. Use the App Segment Browsing tracer to verify which segment a hostname resolves to.
- Posture failed: the user's device suddenly fails a Posture profile (disk encryption disabled, AV out of date, certificate expired). Posture failures show in Diagnostics with the specific posture rule that failed.
Fix: depends on which of the three. Always re-run Trace User after the fix to confirm it now matches the right path.
(d) "PAC file typo locks out a subset of users"
Symptom: users who forward via PAC (not Z-App tunnel) get random failures, or no Zscaler at all, or all their traffic bypasses Zscaler. Z-App users are fine. Often happens an hour after a "harmless" PAC edit.
Hunt path: pull the currently-deployed PAC. Run it through a PAC validator (Zscaler ships one in the Admin Portal under Administration → Hosted PAC Files → Validate). One missing semicolon or a wrong domain literal in shExpMatch() can cascade into "the entire PAC returns DIRECT for everything", silently bypassing Zscaler.
Fix: deploy a known-good PAC, then re-add the new domain logic one block at a time. Version-control your PAC in Git. Always validate before pushing. The number of incidents traced to "someone edited the live PAC in the textbox" is staggering; treat the PAC like production code.
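If the PAC lives in Git, a cheap structural check can run before every commit. The sketch below is an assumption-laden illustration, not a replacement for the portal validator: it only confirms that `FindProxyForURL` is defined and that brackets balance once strings and comments are stripped. It will not catch a missing semicolon or a wrong domain literal.

```python
import re

# Matches string literals and comments so brackets inside them are ignored.
_STRIP = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string
    r"|'(?:\\.|[^'\\])*'"     # single-quoted string
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)


def pac_problems(pac_text: str) -> list[str]:
    """Return a list of structural problems found in a PAC file (empty if none)."""
    problems = []
    if not re.search(r"\bfunction\s+FindProxyForURL\s*\(", pac_text):
        problems.append("missing FindProxyForURL function")
    code = _STRIP.sub("", pac_text)
    pairs = {")": "(", "}": "{", "]": "["}
    stack = []
    for ch in code:
        if ch in "({[":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                problems.append(f"unbalanced {ch!r}")
                break
    else:
        # Loop finished without a mismatch; anything left open is unclosed.
        if stack:
            problems.append(f"unclosed {stack[-1]!r}")
    return problems
```

Wire it into a pre-commit hook or CI step that fails when the returned list is non-empty, and still run the portal validator before activating.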
(e) "Z-Tunnel is down: Z-App icon red"
Symptom: Z-App icon is red on the user's tray. No traffic to ZIA. User sees raw internet (or nothing, if firewall blocks direct).
Hunt path: open Z-App on the client → View Logs / Tunnel status. Then ZIA Insights → Tunnel tab → filter by that user. Common causes:
- Outbound 443 blocked by user's home / hotel router: Z-Tunnel needs egress to the Zscaler edge on 443. Captive portals are a frequent culprit. Switch network and re-test.
- Auth token expired: Z-App's SAML token rolled over and Z-App needs the user to re-authenticate. Sign out, sign in.
- Trusted Network detection: Z-App detects the user is on the corporate LAN and intentionally drops Z-Tunnel (Disable on Trusted Network policy). Check the forwarding profile.
- Z-App version stale: older Z-App versions get deprecated. Push the current version via MDM.
Fix: walk that list top-down. Most outage tickets resolve at step 1 (network) or step 2 (re-auth).
Don't trust that the telemetry is healthy just because nobody's complained yet. Run these checks weekly:
- Insights saved searches: confirm your top-5 saved searches (blocks last 1h, TLS-failures, DLP triggers, sandbox quarantines, tunnel disconnects) all return data and refresh in real time.
- NSS pipeline health: in your SIEM, run `last 15 minutes of zscaler_web`; if the count is 0 the feed is broken. Set a SIEM alert on "no Zscaler events for 5 minutes".
- ZDX threshold alert routing: temporarily raise a probe's alert threshold (e.g. set Salesforce to alert below 90) and verify the email / Slack / PagerDuty hook actually fires. Then set it back to the real threshold (~70).
- ZPA Trace User dry-run: pick any user + any production app every week and run Trace User. Confirm the policy match path is still what you expect (catches drift from rule edits).
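The "no Zscaler events for 5 minutes" alert reduces to a comparison between the newest event timestamp and the current time. A minimal sketch, assuming your SIEM query can hand back the latest event time for the feed (how you fetch it depends on the SIEM and is not shown):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def feed_is_stale(
    last_event: Optional[datetime],
    now: datetime,
    max_gap: timedelta = timedelta(minutes=5),
) -> bool:
    """True when the feed has produced no events within max_gap (or never)."""
    if last_event is None:
        return True
    return now - last_event > max_gap
```

Pass timezone-aware timestamps for both arguments (for example `datetime.now(timezone.utc)` for `now`) so that a collector in a different time zone does not trigger false alerts.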
Common mistakes
- Looking only at Insights → Web when the issue is Tunnel auth: different tab, different data set. If the symptom is "no traffic at all", check the Tunnel tab first.
- Assuming long-term retention exists when the data is in-tenant only: Insights data ages out within months (often 30-90 days on Standard SKUs). If NSS was never configured, that 9-month-old incident has zero forensic evidence.
- ZDX deployed but no probes for the SaaS apps users actually use: out of the box ZDX probes a few defaults. Add Salesforce / Workday / ServiceNow / Zoom / Teams / your custom internal apps explicitly.
- Confusing ZPA Diagnostics with Insights → Web: they show different layers. Diagnostics is the private-app policy match path. Insights is the public/SaaS request log. Use the right one.
- Engineer fixes a rule but doesn't re-run Trace User to confirm: the rule edit looked right but maybe rule order changed and a higher-priority rule still wins. Always re-trace after a fix.
- PAC file changes not version-controlled: no rollback when the typo locks everyone out. Commit every PAC change to Git, tag it with the change ticket, deploy via CI.
- SIEM ingesting the NSS feed but no parsers / dashboards: logs sit in a bucket unused. Wire CIM-compliant Splunk apps or Sentinel content packs from day one.
Operational tips
- Build your runbook from saved searches. Every recurring ticket pattern (Slack down, Salesforce slow APAC, OneDrive sync) becomes a saved Insights search you can pull in one click. Six months in, you have a 50-search library and your MTTR drops to minutes.
- Pair ZDX alerts with PagerDuty / Slack rather than email. Email gets ignored. A Slack hook in `#network-noc` with the ZDX score graph inline gets eyes within 30 seconds and shortens incident lead time enormously.
- Treat NSS feeds as your audit trail, not optional. When auditors ask "show me a year of egress traffic for user X", you need NSS-to-SIEM. Configure it on day 1, not after the audit.
Real-world scenario: Friday 5 PM, "Salesforce broken for APAC"
Sales lead pings the on-call channel: "Salesforce is broken for everyone in APAC. No one can update opportunities. End of quarter, fix now." You have 30 minutes before the global all-hands. Walk it:
- Insights → Web → filter `destination=*.salesforce.com` + `user-region=APAC` + last 15 min. Traffic is flowing, so it's not a blanket block. Good.
- Look at the Status column. Mixed bag: many 200 OK but also a chunk of TLS-Fail. So partial degradation, not a full block.
- Pivot to ZDX → Salesforce app → APAC region view. The ZDX Score for APAC has dropped from 92 (last week's baseline) to 41 in the last 20 minutes. Confirmed degradation, confirmed user-perceived.
- Open the hop-by-hop view. Hops 1-3 (endpoint → Z-Tunnel → ZIA Singapore POP) all show baseline latency ~12 ms. Hop 4 (Singapore POP → Salesforce edge) has jumped from ~18 ms to ~480 ms with packet loss.
- Check Zscaler's trust portal status page. Confirmed: Singapore POP is on a partial maintenance window with reduced peering capacity. Vendor's fault, not your config.
- Workaround: in Z-App's forwarding profile, push a temporary PAC override that routes APAC Salesforce traffic via the Tokyo POP for the next 2 hours. Activate.
- Verify: 5 minutes later, ZDX Score recovers to 88. Hop view shows a healthy Tokyo → Salesforce path. Sales lead confirms users are back to normal.
- Post-incident: file a ticket with Zscaler asking why the Singapore maintenance wasn't in the customer-facing change calendar. Update the runbook with "Tokyo failover PAC override: APAC Salesforce" so the next on-call engineer doesn't reinvent it.
Eight steps. ~22 minutes from ticket to recovery. The path is reproducible because each step asks one specific question of one specific telemetry source. That's the L3 discipline this lesson is building.
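The hop-by-hop step is a baseline-versus-current comparison, which is easy to automate if you export the path data. A rough sketch, with hop labels and per-hop figures simplified from the numbers in the scenario (the export format is an assumption):

```python
def worst_regression(
    baseline_ms: dict[str, float],
    current_ms: dict[str, float],
) -> tuple[str, float]:
    """Return the hop with the largest latency increase over baseline, and the increase."""
    deltas = {
        hop: current_ms[hop] - baseline_ms[hop]
        for hop in baseline_ms
        if hop in current_ms
    }
    if not deltas:
        raise ValueError("no hops in common between baseline and current")
    hop = max(deltas, key=deltas.get)
    return hop, deltas[hop]
```

Fed the scenario's figures (hops 1-3 steady at about 12 ms, hop 4 moving from 18 ms to 480 ms), it points at hop 4 with an increase of 462 ms, which is the same conclusion the hop view gave by eye.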
Quick reference (memorise: comes up in every ZDTA scenario question)
- 4 log sources: ZIA Insights · ZPA Diagnostics · NSS streaming · ZDX.
- Insights = in-tenant, real-time, GUI, SKU-dependent retention (30d-6mo). Daily debugging lives here.
- ZIA Insights 4 tabs: Web · Firewall · DNS · Tunnel. Pick the tab that matches the symptom.
- Filter rhythm: User → Time → Action → Destination → Status. Same in every tab.
- ZPA Diagnostics → Trace User is the `traceroute` equivalent for private apps.
- NSS = TCP/TLS feed to Splunk / Sentinel / QRadar / Elastic. Long-term retention + cross-source correlation.
- ZDX = 0-100 score per user per app. Below 50 = poor. Hop-by-hop view finds the bad hop in 30 seconds.
- Top-5 scenarios memorised: SaaS broken (Insights Web), SSL pinning (Insights Web TLS-Fail), ZPA unreachable (Trace User), PAC typo (validate + version), Z-Tunnel down (Insights Tunnel + Z-App logs).
- Always re-run Trace User after a ZPA fix: confirm the new path actually matched.
- PAC = production code. Git it, tag it, validate it, deploy it via CI.
Hunt a user-reported slowness ticket:
- User reports "SAP is slow at 3pm IST". Open ZDX → Users → search for the user.
- Note the ZDX score over the last 24h. If it was above 75 most of the time but dropped at 3pm, that confirms a real event.
- Drill into CloudPath at 3pm and find which hop introduced latency. Was it the last-mile, the ISP, or Zscaler?
- Open Web Insights for that user at 3pm and find the actual transaction timings to SAP.
- If ZDX says 92 but the user says slow, compare synthetic vs RUM: synthetic may be fine while RUM tells a different story.
What's next?
Module 14 is the finisher: your ZDTA exam blueprint walkthrough, the 25 most-asked Zscaler interview questions with model answers, and a 4-week study schedule to clear the cert.