Zscaler · Batch 11 · Lesson 13 · L3 / TROUBLESHOOTING

Logs, ZDX & The 5 Production Troubleshooting Scenarios You'll See Most

Insights, Diagnostics, NSS and ZDX: the four telemetry sources that turn 'users are complaining' into a 10-minute fix. Plus the five real production patterns L3 engineers diagnose every week.

📅 23 May 2026 · ⏱ 14 min read · 🏷 10-question assessment included
🎯 By the end of this lesson, you'll be able to

Why this lesson matters

A Zscaler deployment without observability is invisible until users complain. By that point, your CIO is on the call, the sales team is locked out of Salesforce, and you're 20 minutes into a fire drill with nothing but a Slack message that says "the internet is broken". The L3 engineer's job is to find the answer before the ticket lands and, when it does land, to walk from symptom to root cause in under 15 minutes. That requires knowing exactly which of the four Zscaler telemetry sources to open first, and which tab inside that source to pivot on. Get that right and you look like a wizard. Get it wrong and you waste an hour staring at Web logs while the actual problem was a dead Connector in ZPA Diagnostics.

This lesson is the muscle memory layer. We'll map all four log surfaces (ZIA Insights, ZPA Diagnostics, NSS-to-SIEM, ZDX), then walk the five production scenarios you will absolutely see in your first three months on the job. Memorise the hunt paths; they repeat.

The four log sources you need

Zscaler ships four distinct telemetry surfaces. Each answers a different question. Mixing them up is the #1 cause of L3 engineers chasing the wrong layer for an hour.

| Source | Where it lives | What it answers | Retention |
|---|---|---|---|
| ZIA Insights | ZIA Admin Portal → Analytics → Insights | "What did the user request via ZIA, what rule matched, what was the verdict?" | SKU-dependent: 30-90d Standard, up to ~6mo Business/Transformation |
| ZPA Diagnostics | ZPA Admin Portal → Diagnostics | "Did this user reach this private app? Which Access Policy + Connector + Server Group was hit?" | 14 days rolling |
| NSS (log streaming) | NSS VM in your DC, NSS for SIEM (cloud-to-cloud), or NSS Cloud → TCP/TLS to Splunk / Sentinel / QRadar / Elastic | "Show me 18 months of Web + Firewall + Tunnel + ZPA + DNS logs joined with EDR + email + identity." | As long as your SIEM keeps it |
| ZDX | ZDX Admin Portal (separate from ZIA/ZPA admin) | "What's the user-perceived performance, hop by hop, for this SaaS app?" | ~30 days rolling |

The rule of thumb: Insights for "was it allowed/blocked", Diagnostics for "did ZPA path-match correctly", NSS for "what happened weeks ago plus correlation with non-Zscaler signals", ZDX for "is the user-experience actually slow". If you can recite that in your sleep, half this lesson is done.
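The rule of thumb above can be written down as a small lookup table. This is only an illustrative sketch: the symptom keys and the `first_source` helper are hypothetical names chosen for this example, not part of any Zscaler API.

```python
# Hypothetical triage lookup: symptom category -> telemetry source to open first.
# The categories paraphrase the rule of thumb in this lesson.
FIRST_SOURCE = {
    "allowed_or_blocked": "ZIA Insights",
    "zpa_path_match": "ZPA Diagnostics",
    "historical_or_correlation": "NSS / SIEM",
    "user_experience_slow": "ZDX",
}


def first_source(symptom: str) -> str:
    """Return the telemetry source to open first for a symptom category."""
    try:
        return FIRST_SOURCE[symptom]
    except KeyError:
        raise ValueError(f"unknown symptom category: {symptom!r}") from None


print(first_source("user_experience_slow"))
```

Keeping the mapping as data rather than a chain of if-statements makes it easy to paste into a runbook and extend per team.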

[Figure: The four telemetry paths: one user, four data streams. A user endpoint with Z-App sends traffic to the ZIA Cloud (Web · FW · DNS · Tunnel) and the ZPA Cloud (private app brokering), producing four parallel telemetry streams: ZIA Insights (in-tenant, real-time GUI), ZPA Diagnostics (Trace User · Connector Health), NSS → SIEM (Splunk · Sentinel · QRadar · Elastic), and ZDX Score + hop-by-hop user-experience telemetry from the endpoint.]

One user generates four distinct telemetry streams. Knowing which one to open per symptom is the single biggest L3 productivity multiplier.

ZIA Insights: deep dive

Insights is the in-tenant log explorer. It opens fast, refreshes every few seconds, and is where 80% of your day-to-day debugging happens. It has four tabs, each scoped to a different ZIA service: Web, Firewall, DNS, and Tunnel.

The filtering rhythm is the same in every tab and you should be able to type it without thinking: User → Time range → Action → Destination → Status. That's the canonical 5-filter sequence. Add Department or Location only when scoping a regional issue. Use the timeline view to spot the moment the symptom started.

Saved search you should build day-1
Tab:        Web
Time:       Last 1h
Action:     Block, TLS-Fail, Caution
Department: All
Group by:   URL Category, then Rule

(Save as: "All blocks last hour - first triage")

Insights also offers Saved Searches. Build five of these the day you get tenant access: top-5 blocks last 24h, TLS-failures last 1h, DLP triggers last 7d, sandbox-quarantined files last 24h, Tunnel-disconnect events last 1h. They become your dashboard.
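To make the logic of the "first triage" saved search concrete, here is a minimal sketch that applies the same filter and grouping to log records you have already exported. The record keys (`time`, `action`, `url_category`, `rule`) and the rule names in the sample are assumptions made up for this example; a real export will use whatever field names you selected.

```python
from collections import Counter
from datetime import datetime, timedelta

# Actions the saved search in this lesson filters on.
TRIAGE_ACTIONS = {"Block", "TLS-Fail", "Caution"}


def first_triage(records, now, window=timedelta(hours=1)):
    """Count triage-worthy records per (URL category, rule) inside the window."""
    counts = Counter()
    for rec in records:
        in_window = now - window <= rec["time"] <= now
        if in_window and rec["action"] in TRIAGE_ACTIONS:
            counts[(rec["url_category"], rec["rule"])] += 1
    return counts.most_common()  # busiest (category, rule) pairs first


# Hypothetical sample data, for illustration only.
now = datetime(2026, 5, 23, 12, 0)
sample = [
    {"time": now - timedelta(minutes=5), "action": "Block", "url_category": "Webmail", "rule": "url-filter-12"},
    {"time": now - timedelta(minutes=9), "action": "Block", "url_category": "Webmail", "rule": "url-filter-12"},
    {"time": now - timedelta(minutes=20), "action": "TLS-Fail", "url_category": "Finance", "rule": "ssl-inspect-default"},
    {"time": now - timedelta(minutes=30), "action": "Allow", "url_category": "Productivity", "rule": "allow-m365"},
    {"time": now - timedelta(hours=3), "action": "Block", "url_category": "Webmail", "rule": "url-filter-12"},
]

for (category, rule), hits in first_triage(sample, now):
    print(f"{hits:>3}  {category:<12} {rule}")
```

The Allow record and the three-hour-old Block are dropped, which is exactly what the Action and Time filters do in the portal.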

ZPA Diagnostics: deep dive

ZPA Diagnostics is structurally different from ZIA Insights. ZPA brokers private app access, so the question isn't "was this URL allowed" but "did the right policy + Connector + Server combination resolve". Three tools matter:

Rule of thumb: if the user says "the app didn't load", open Trace User before anything else. It tells you in one screen whether the failure was policy (no match), Posture (failed device check), Connector (unhealthy), or Server (origin down).

NSS: streaming logs to your SIEM

Insights and Diagnostics are great for live debugging, but they're in-tenant and they don't keep data forever. The moment your SOC needs to correlate Zscaler events with EDR, email gateway, or identity (Okta / Entra), or build a custom retention dashboard, you need NSS (Nanolog Streaming Service). NSS continuously streams ZIA and ZPA logs over TCP/TLS to your SIEM collector.

NSS comes in three deployment forms:

| Form | Where it runs | When to pick it |
|---|---|---|
| NSS VM | Customer DC or cloud (OVA / AMI) | You have an on-prem SIEM and want network locality. Most common. |
| NSS for SIEM (cloud-to-cloud) | Direct from Zscaler cloud to a hosted SIEM API (Splunk Cloud, Sentinel, QRadar SaaS, Elastic Cloud) | No VM to manage; cleanest path when your SIEM is SaaS too. |
| NSS Cloud | Zscaler-hosted NSS that delivers via syslog-over-TLS to your collector IP | You want zero customer-side infrastructure but still control the destination. |

You configure one NSS feed per log type (Web, Firewall, Tunnel, ZPA, DNS), pick the fields you want, choose JSON / LEEF / CEF format, and point it at your collector. The feed is continuous, low-latency (seconds), and you can replay missed data within a buffer window.

The reason NSS is non-negotiable for any tenant above ~500 users: Insights only goes back a few months at most, but compliance, incident response, and long-running threat hunts need 12-24 months of data. NSS pipes that retention into a system you control.

Insights retention caveat: NSS itself does not retain logs (your SIEM does), but in-tenant Insights retention varies by SKU. Standard ZIA Web Insights often keeps 30-90 days; Business/Transformation tiers approach 6 months. Verify your tenant's retention under Administration → NSS → Insights Retention.

Modern alternative to NSS: Zscaler's Log Streaming Service (LSS for ZPA) and REST API polling now feed most new SIEM integrations directly (Splunk Cloud, Sentinel, Chronicle). Classic NSS VMs are still the gold standard for high-EPS / on-prem SIEM (QRadar, ArcSight), but plan modern integrations API-first.

NSS Sizing

Rough planning: ~150-300 EPS per 1000 users at peak business hours. NSS VM baseline: 4 vCPU / 8 GB RAM / 100 GB disk. The NSS VM buffers ~10 minutes of logs in memory; if downstream SIEM ack lag exceeds the buffer, NSS drops logs and may panic-restart. Monitor: nss-stats CLI for buffer depth, SIEM-ack lag, and EPS in/out.
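The planning numbers above turn into a quick back-of-the-envelope calculation. The sketch below simply encodes this lesson's rough figures (150-300 EPS per 1000 users, ~10 minutes of buffer); the function names are hypothetical, and the figures are planning estimates, not vendor guarantees.

```python
def peak_eps_range(users, low_per_1000=150, high_per_1000=300):
    """Estimate the peak events-per-second range for a given user count."""
    return users * low_per_1000 / 1000, users * high_per_1000 / 1000


def buffered_events(eps, buffer_minutes=10):
    """Events the buffer must hold if the SIEM stalls for the whole window."""
    return eps * buffer_minutes * 60


def buffer_at_risk(ack_lag_seconds, buffer_minutes=10):
    """True when SIEM ack lag exceeds the assumed buffer window."""
    return ack_lag_seconds > buffer_minutes * 60


low, high = peak_eps_range(5000)
print(f"5000 users: {low:.0f}-{high:.0f} EPS at peak")
print(f"worst-case buffered events: {buffered_events(high):.0f}")
print("at risk with 12 min ack lag:", buffer_at_risk(12 * 60))
```

For 5000 users this gives 750-1500 EPS at peak, so a 10-minute SIEM stall at the high end means roughly 900,000 events waiting in the buffer.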

ZDX: proactive user-experience monitoring

Insights and Diagnostics tell you what happened on the Zscaler side. ZDX tells you what the user actually experienced, including the parts that have nothing to do with Zscaler. It runs as a lightweight agent inside Z-App (or as a standalone install) and continuously probes the path from endpoint to SaaS.

ZDX scores across 5 dimensions: (1) Page Load Time for SaaS app probes; (2) CloudPath hop-by-hop latency to the app; (3) Network last-mile / Wi-Fi / DNS health; (4) Device CPU/memory/battery health; (5) Application probe response (synthetic or real-user).

Application probes (synthetic) run on a schedule from the Z-App agent even when the user is idle. Web probes / real-user monitoring (RUM) are passive and capture timings only when the user actually visits the app. ZDTA exam discriminator: synthetic catches outages when no one is working (early-morning M365 issues), RUM catches user-experience problems mid-day.

ZDX rolls the result into a single 0-100 ZDX Score per user per app. Anything above 80 = good, 50-80 = degraded, below 50 = poor. The top operational use case is: "user complains Teams is slow; ZDX shows the hop where latency spikes. Was it the Zscaler edge, ISP middle-mile, or customer ISP egress?" You answer in 30 seconds instead of an hour of guessing.
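The score bands quoted in this lesson can be expressed as a tiny classifier, which is handy when you post-process exported scores. A minimal sketch, assuming the lesson's cut-offs (above 80, 50-80, below 50); `zdx_band` is a hypothetical helper, and your tenant's own banding may differ.

```python
def zdx_band(score: float) -> str:
    """Classify a 0-100 score using the bands quoted in this lesson."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score > 80:
        return "good"
    if score >= 50:
        return "degraded"
    return "poor"


for s in (92, 80, 41):
    print(s, zdx_band(s))
```

Note that a score of exactly 80 lands in "degraded", because the lesson defines good as strictly above 80.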

When NOT to trust the ZDX score

[Figure: ZDX probe flow, from symptom to root-cause hop. Sequence: (1) user opens app; (2) ZDX agent probes; (3) each hop is measured: Hop A Endpoint → Z-Tunnel, Hop B ZIA/ZPA Cloud, Hop C Edge → SaaS (bad hop, 480 ms); (4) ZDX Score is calculated; (5) ZDX Score drops to 41 and a threshold alert fires, so the L3 engineer is notified before the user logs a ticket; (6) the engineer opens the hop view and sees that Hop C is the bad one: root cause = Zscaler POP-to-SaaS path; fix = reroute or wait for vendor.]

ZDX turns "users say Teams is slow" into "Hop C between Singapore edge and SaaS provider added 480 ms; here's the trace". 30 seconds instead of an hour.
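The "find the bad hop" step is just a comparison of current per-hop latency against a baseline. The sketch below shows that comparison on made-up numbers: the hop labels and the baseline values are assumptions for illustration (only the 480 ms figure comes from the diagram), and `worst_hop` is a hypothetical helper, not a ZDX API.

```python
def worst_hop(baseline_ms, current_ms):
    """Return (hop, delta_ms) for the hop whose latency rose most over baseline.

    Both arguments map hop name -> latency in ms and must share the same keys.
    """
    deltas = {hop: current_ms[hop] - baseline_ms[hop] for hop in baseline_ms}
    hop = max(deltas, key=deltas.get)
    return hop, deltas[hop]


# Hypothetical readings for illustration.
baseline = {"A: endpoint -> Z-Tunnel": 12, "B: ZIA/ZPA cloud": 12, "C: edge -> SaaS": 18}
current = {"A: endpoint -> Z-Tunnel": 13, "B: ZIA/ZPA cloud": 12, "C: edge -> SaaS": 480}

hop, delta = worst_hop(baseline, current)
print(f"bad hop: {hop} (+{delta} ms over baseline)")
```

Comparing against a baseline rather than picking the largest absolute latency matters: a long-haul hop can be slow by nature without being the hop that changed.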


The 5 most common production troubleshooting scenarios

You will see these five patterns repeatedly. Memorise the hunt path for each; they're worth more than any abstract theory.

(a) "SaaS app is broken", e.g. Outlook on the Web won't load

Symptom: users in a department can't load outlook.office.com. Page hangs or partial-render. Reproducible.

Hunt path: Insights → Web tab → filter destination=outlook.office.com + time-range last 1h + user-region affected. Look at Action column. If status is Block, the rule column shows the offending URL Filter / Cloud App / DLP rule. If status is TLS-Fail, you've got an SSL Inspection break (probably pinning; see scenario b). If status is Allow but bytes are 0, you're chasing a backend issue not a Zscaler one, so pivot to ZDX next.
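The three branches of this hunt path form a small decision table. Here is a sketch of it as code; the function name and the wording of the returned hints are hypothetical, and the final fallback branch is an addition for completeness rather than something this lesson prescribes.

```python
def next_step(action: str, bytes_received: int) -> str:
    """Map one Insights Web log row to the next triage step for scenario (a)."""
    if action == "Block":
        return "read the rule column: a URL Filter / Cloud App / DLP rule matched"
    if action == "TLS-Fail":
        return "suspect an SSL Inspection break, probably pinning (scenario b)"
    if action == "Allow" and bytes_received == 0:
        return "likely backend issue, not Zscaler: pivot to ZDX"
    # Fallback added for this sketch: nothing in the row points at ZIA.
    return "no ZIA verdict problem in this row: widen the filter"


print(next_step("Allow", 0))
```

Writing the branches out this way makes the order explicit: check the verdict first, and only look at byte counts once you know the request was allowed.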

Fix: the rule column tells you the answer. Common cases: someone added a new DLP rule that classifies Outlook downloads as PII, a Cloud App Control rule blocked file-attachment uploads, or a URL Filter rule mis-categorized outlook.office.com as Webmail (blocked) instead of Productivity (allowed). Edit the rule, narrow the match, activate, re-test.

(b) "One app silently fails after SSL Inspection rollout"

Symptom: ERR_CERT_AUTHORITY_INVALID in browser, OR the native app just refuses to connect with no error. Most common after enabling SSL Inspection or rolling a new pinned-app version.

Hunt path: Insights → Web → filter destination= + status=TLS-Fail. If you see lots of TLS-Fail entries, the app is pinning its certificate (it doesn't trust the Zscaler-issued one). Cross-check with tcpdump on the client during the connection attempt; you'll see an immediate TLS abort from the app side.

Fix: add the app's API domains to the SSL Inspection Bypass rule (Order 5, above the generic Inspect rule). The classics that pin (current as of 2026): Microsoft Authenticator, Apple iCloud sync, banking apps, WhatsApp desktop, Webex. (Slack desktop dropped strict pinning in 2022; modern Slack works through MITM with the Zscaler root cert installed. If Slack breaks today it's usually a root-cert distribution issue, not pinning.) Maintain a "Pinned Apps - Always Bypass" list in your runbook and apply on day-1 of every new tenant.

(c) "ZPA app is unreachable"

Symptom: Z-App shows "no connection" or app spinner forever. Affects all users of one app, or all users at one site.

Hunt path: ZPA Diagnostics → Trace User → pick affected user, pick affected app. Look at the matched Access Policy + Server Group + Connector. Three things go wrong here, in this order of frequency:

  1. Connector unhealthy: Diagnostics → Connector Health → red dot. The Connector VM crashed, lost outbound 443 to ZPA Cloud, or ran out of memory. Bounce or scale.
  2. App Segment misconfigured: the wildcard in the App Segment doesn't actually match the hostname the user is hitting. Use the App Segment Browsing tracer to verify which segment a hostname resolves to.
  3. Posture failed: user's device suddenly fails a Posture profile (disk encryption disabled, AV out of date, certificate expired). Posture failures show in Diagnostics with the specific posture rule that failed.

Fix: depends on which of the three. Always re-run Trace User after the fix to confirm it now matches the right path.

(d) "PAC file typo locks out a subset of users"

Symptom: users who forward via PAC (not Z-App tunnel) get random failures, or no Zscaler at all, or all their traffic bypasses Zscaler. Z-App users are fine. Often happens an hour after a "harmless" PAC edit.

Hunt path: pull the currently-deployed PAC. Run it through a PAC validator (Zscaler ships one in the Admin Portal under Administration → Hosted PAC Files → Validate). One missing semicolon or a wrong domain literal in shExpMatch() can cascade into "the entire PAC returns DIRECT for everything", silently bypassing Zscaler.

Fix: deploy a known-good PAC, then re-add the new domain logic one block at a time. Version-control your PAC in Git. Always validate before pushing. The number of incidents traced to "someone edited the live PAC in the textbox" is staggering, so treat the PAC like production code.
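If the PAC lives in Git, a cheap pre-commit check can catch the most obvious breakage before the file is ever pasted into the portal. The sketch below is a deliberately naive lint written for this example, not a replacement for the portal validator: it does not parse JavaScript, it only checks that `FindProxyForURL` exists, that brackets balance, and that not every literal return value is DIRECT. The `lint_pac` name and the sample hostnames are hypothetical.

```python
import re

PAIRS = {")": "(", "}": "{", "]": "["}


def lint_pac(source: str) -> list[str]:
    """Return a list of problems found by a few naive static checks."""
    problems = []
    if not re.search(r"function\s+FindProxyForURL\s*\(", source):
        problems.append("FindProxyForURL is not defined")

    # Blank out string literals so brackets inside quotes are not counted.
    stripped = re.sub(r'"[^"\n]*"|\'[^\'\n]*\'', '""', source)
    stack = []
    for ch in stripped:
        if ch in "({[":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                problems.append(f"unbalanced {ch!r}")
                break
    else:  # loop finished without finding a mismatched closer
        if stack:
            problems.append(f"unclosed {stack[-1]!r}")

    returns = re.findall(r'return\s+"([^"]*)"', source)
    if returns and all(r.strip().upper() == "DIRECT" for r in returns):
        problems.append("every return value is DIRECT (traffic would bypass the proxy)")
    return problems


# Hypothetical PAC with a missing ')' after the shExpMatch call.
broken = '''function FindProxyForURL(url, host) {
  if (shExpMatch(host, "*.example.com") return "DIRECT";
}'''
print(lint_pac(broken))
```

Comments that contain apostrophes or brackets will confuse a check this simple, so treat a clean result as "not obviously broken" and still run the real validator before pushing.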

(e) "Z-Tunnel is down: Z-App icon red"

Symptom: Z-App icon is red on the user's tray. No traffic to ZIA. User sees raw internet (or nothing, if firewall blocks direct).

Hunt path: open Z-App on the client → View Logs / Tunnel status. Then ZIA Insights → Tunnel tab → filter by that user. Common causes:

Fix: walk that list top-down. Most outage tickets resolve at step 1 (network) or step 2 (re-auth).

✓ Verify: your observability is actually wired up

Don't trust that the telemetry is healthy just because nobody's complained yet. Run these checks weekly:

⚠ Common Mistakes: observability done wrong
💡 Pro Tips

Real-world scenario: Friday 5 PM, "Salesforce broken for APAC"

Sales lead pings the on-call channel: "Salesforce is broken for everyone in APAC. No one can update opportunities. End of quarter, fix now." You have 30 minutes before the global all-hands. Walk it:

  1. Insights → Web → filter destination=*.salesforce.com + user-region=APAC + last 15 min. Traffic is flowing, so it's not a blanket block. Good.
  2. Look at Status column. Mixed bag: many 200 OK but also a chunk of TLS-Fail. So partial degradation, not full block.
  3. Pivot to ZDX → Salesforce app → APAC region view. The ZDX Score for APAC has dropped from 92 (last week's baseline) to 41 in the last 20 minutes. Confirmed degradation, confirmed user-perceived.
  4. Open the hop-by-hop view. Hops 1-3 (endpoint → Z-Tunnel → ZIA Singapore POP) all show baseline latency ~12 ms. Hop 4 (Singapore POP → Salesforce edge) has jumped from ~18 ms to ~480 ms with packet loss.
  5. Check Zscaler's trust portal status page. Confirmed: Singapore POP is on a partial maintenance window with reduced peering capacity. Vendor's fault, not your config.
  6. Workaround: in Z-App's forwarding profile, push a temporary PAC override that routes APAC Salesforce traffic via the Tokyo POP for the next 2 hours. Activate.
  7. Verify: 5 minutes later, ZDX Score recovers to 88. Hop view shows healthy Tokyo→Salesforce path. Sales lead confirms users back to normal.
  8. Post-incident: file a ticket with Zscaler asking why the Singapore maintenance wasn't in the customer-facing change calendar. Update the runbook with "Tokyo failover PAC override - APAC Salesforce" so the next on-call engineer doesn't reinvent it.

Eight steps. ~22 minutes from ticket to recovery. The path is reproducible because each step asks one specific question of one specific telemetry source. That's the L3 discipline this lesson is building.


📌 Quick reference (memorise; comes up in every ZDTA scenario question)

▶ QUICK LAB · ~15 MIN

Hunt a user-reported slowness ticket:

  1. User reports "SAP is slow at 3pm IST". Open ZDX → Users → search the user.
  2. Note the ZDX score over the last 24h. If above 75 most of the time but dropped at 3pm → confirm a real event.
  3. Drill into CloudPath at 3pm and find which hop introduced latency. Was it the last-mile, the ISP, or Zscaler?
  4. Open Web Insights for that user at 3pm and find actual transaction timings to SAP.
  5. If ZDX says 92 but user says slow → check synthetic vs RUM: synthetic may be fine while RUM tells a different story.
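Step 2 of the lab ("above 75 most of the time but dropped at 3pm") is a simple check you can run on exported hourly scores. A minimal sketch, assuming a hypothetical `{hour: score}` mapping and reading "most of the time" as more than half of the other hours; neither the data shape nor the helper name comes from ZDX itself.

```python
def is_real_event(scores_by_hour, incident_hour, healthy=75):
    """True if the user was mostly healthy but at or below the bar at incident_hour."""
    others = [s for h, s in scores_by_hour.items() if h != incident_hour]
    if not others:
        return False  # nothing to compare against
    mostly_healthy = sum(s > healthy for s in others) > len(others) / 2
    return mostly_healthy and scores_by_hour[incident_hour] <= healthy


# Hypothetical hourly scores for one user, 09:00-17:00, with a dip at 15:00.
scores = {hour: 90 for hour in range(9, 18)}
scores[15] = 48
print(is_real_event(scores, incident_hour=15))
```

If the user's score is low all day, the function returns False, which points you at a chronic device or last-mile problem rather than a 3pm event.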

📝 Check your understanding

10 scenario questions at interview + ZDTA exam depth. Pick one answer per question. You need 70% (7 of 10) to mark this lesson complete on your profile.

Q1

A user reports "Salesforce is slow for the whole APAC team". You want to know whether the slowness is in the Zscaler path, the SaaS provider's path, or the user's local ISP. Which Zscaler telemetry source answers that in one screen?

Correct: (c). ZDX is the only source that measures hop-by-hop user-perceived latency from endpoint through Zscaler to the SaaS endpoint. Insights tells you if the request was allowed/blocked but not where on the path it slowed. Trace User is for ZPA private apps. NSS retains the data but doesn't show hop latency natively.
Q2

After enabling a new DLP rule, users report "Outlook Web won't load attachments". You want to confirm whether the new DLP rule is the cause. First place to check?

Correct: (a). Insights Web is the live, in-tenant log for HTTPS via ZIA, and every entry shows which rule matched. (b) is wrong: Outlook is public SaaS, not ZPA. (c) tells you latency, not rule match. (d) eventually shows it but Insights is faster for real-time debugging.
Q3

Your SOC wants to retain 18 months of Zscaler Web logs for compliance and join them with EDR + email gateway events. What's the right architecture?

Correct: (b). NSS is purpose-built for long-term streaming + SIEM correlation. Insights is in-tenant only and ages out. CSV exports lose schema fidelity. Screenshots are not auditable. NSS feeds are continuous, fault-tolerant, and CIM-compliant.
Q4

A user can't reach jira.corp.internal via ZPA. The error in Z-App says "no connection". You want to know if the right App Segment / Connector path was even selected for this user. Which tool?

Correct: (b). Trace User is the ZPA equivalent of traceroute: it replays the access attempt and shows you the matched policy + selected Connector. (a) is for public web. (c) would only help if Jira loaded slowly. (d) has the data but isn't an interactive trace.
Q5

Your team rolled out SSL Inspection and now Microsoft Authenticator stops delivering push notifications. Browser-based M365 sign-in works fine. Insights → Web for the user shows "status = TLS-Fail" for the Authenticator API endpoints. Root cause?

Correct: (b). The browser works because OS cert store has the Zscaler Root, but Authenticator pins independently and ignores the OS store. The TLS-Fail pattern in Insights, only for that app, is the giveaway. Fix is always a Bypass rule for pinned apps. (Slack dropped strict pinning in 2022; modern pinners are Authenticator, iCloud, banking apps, WhatsApp desktop, and Webex.) (a) would break all HTTPS. (c) would block all traffic. (d) doesn't match the TLS layer.
Q6

After a "small" PAC file edit, a subset of users start showing no Zscaler activity at all in Insights; their traffic appears to be bypassing Zscaler entirely. Z-App-tunneled users are unaffected. What likely happened and how do you fix it without a 1-hour outage?

Correct: (b). PAC parse errors often cause the entire file to fall through to DIRECT, a silent failure mode. The Z-App tunneled users are unaffected because they don't use PAC. Always version-control PACs and use the validator. Rollback first, then re-apply changes incrementally. (a)/(c)/(d) don't address the root cause and (c) isn't even an option you have.
Q7

Branch site reports "Z-Tunnel red on every laptop since 2 AM. All users lost ZIA". You've already verified the Z-Tunnel client and Root CA are healthy. Which Insights tab and which likely root cause do you check first?

Correct: (b). Tunnel-down symptoms belong in the Tunnel tab, not Web. The 2 AM maintenance window + simultaneous failure across the whole site strongly suggests the branch firewall closed outbound 443 to Zscaler IPs. Auth-token expiry is the other common cause but tends to be staggered, not simultaneous. (a)/(c)/(d) are the wrong layer.
Q8

Your SIEM ingests NSS feeds but no dashboards or alerts have been built on top. Auditor asks: "Show me all DLP triggers for PII data, by user, in the last 12 months." What's the actual problem and the fix?

Correct: (b). NSS-without-parsers is a common waste: raw logs land but no value is extracted. The fix is to deploy vendor-provided SIEM content packs that normalize fields and ship pre-built searches. (a) is overkill. (c) is wrong: Insights ages out within months (SKU-dependent, often 30-90 days on Standard). (d) makes the audit problem worse.
Q9

You fixed a ZPA Access Policy that was blocking a contractor from reaching the internal HR app. You activated the change. The user still says "not working". Best next step?

Correct: (b). Always verify a ZPA fix with Trace User. The new rule may not be matching due to rule order, posture failure, or Connector unhealthy. Activating a change is not proof it works. (a) is unprofessional. (c) is premature. (d) is unrelated.
Q10

You deploy ZDX and assume it's "monitoring everything". A month later a user complains Workday is slow and ZDX has no data. What did you miss?

Correct: (b). ZDX is opt-in per application probe. You must enumerate the SaaS apps your users care about during ZDX onboarding. The classic mistake is deploying ZDX and never configuring application probes beyond the defaults. Always inventory your top-20 user-facing SaaS and configure probes for each. (a)/(c) are wrong. (d) is dismissive and inaccurate.

What's next?

Module 14 is the finisher: your ZDTA exam blueprint walkthrough, the 25 most-asked Zscaler interview questions with model answers, and a 4-week study schedule to clear the cert.