Palo Alto · Operations · Real TAC Cases · Interactive · L2 / L3

Operational Failures — When the Firewall Breaks at 3 AM

Four real TAC-case patterns every PA admin meets eventually: HA flap, mystery commit failure, dataplane CPU spike, silently-dropped logs. Pick one, watch the diagnostic tree fire stage by stage, learn the exact CLI to settle it in under 5 minutes.

📅 2026-05-25 · ⏱ 12 min · 3 failure-tree demos · 🏷 10-Q assessment + AI Tutor inline

Pick a failure pattern — jump straight to it

1

HA Heartbeat Flap

Pair flips active/passive every 2 hours. Heartbeat link is the suspect — but which one?

2

Commit Fails

"Unknown error". Missing object. 30-min commits. What the message really means.

3

Dataplane CPU

DP at 95%, packets dropping. Find the elephant flow in 4 commands.

4

Log Drop to SIEM

SIEM says zero events, PA says success. The silent failure that wrecks incident response.

Before you reach for the keyboard — name the failure first

Production outages on a PA are rarely "the firewall died". 90% of the time you're chasing one of four patterns. Naming the pattern correctly in the first 60 seconds cuts MTTR by an order of magnitude. The framework: Symptom → Counter → CLI → Fix. Every failure below follows it.

🔁
HA Flap

Pair alternates active/passive. Counter to watch: ha_path_monitor_failure. CLI: show high-availability all.

💢
Commit Fail

Validation error or runtime "unknown error". Job log: show jobs id <N>. Plus revert candidate and rebuild incrementally.

🔥
DP CPU Spike

Latency, drops, packet-buffer warnings. CLI: show running resource-monitor second + show running resource-monitor ingress-backlogs.

📭
Log Drop

SIEM gap, no errors in GUI. Counter: log_traffic_loss_queue_full. CLI: debug log-receiver statistics.
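
The four cards above can be captured as a small lookup table for a runbook. This is a minimal sketch: the dictionary keys and the `first_steps` helper are illustrative names, while the counters and commands are the ones listed on the cards.

```python
# Triage table built from the four cards above.
# Keys and the helper name are illustrative, not part of PAN-OS.
TRIAGE = {
    "ha_flap": {
        "symptom": "Pair alternates active/passive",
        "counter": "ha_path_monitor_failure",
        "cli": ["show high-availability all"],
    },
    "commit_fail": {
        "symptom": 'Validation error or runtime "unknown error"',
        "counter": None,  # the card names a job log, not a counter
        "cli": ["show jobs id <N>"],
    },
    "dp_cpu_spike": {
        "symptom": "Latency, drops, packet-buffer warnings",
        "counter": None,
        "cli": [
            "show running resource-monitor second",
            "show running resource-monitor ingress-backlogs",
        ],
    },
    "log_drop": {
        "symptom": "SIEM gap, no errors in GUI",
        "counter": "log_traffic_loss_queue_full",
        "cli": ["debug log-receiver statistics"],
    },
}


def first_steps(pattern: str) -> list[str]:
    """Return the CLI commands to run first for a named failure pattern."""
    try:
        return TRIAGE[pattern]["cli"]
    except KeyError:
        raise ValueError(f"unknown pattern: {pattern!r}") from None


if __name__ == "__main__":
    for name in TRIAGE:
        print(name, "->", "; ".join(first_steps(name)))
```

Naming the pattern first and looking up the command second mirrors the Symptom → Counter → CLI → Fix order.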

① HA Heartbeat Flap — pair flips every two hours

Rahul at TCS gets paged: PA-5250 active firewall flipped to passive at 02:14, back to active at 04:18, and again at 06:21. Sessions resync, customers complain about momentary disconnects. The HA1 link itself shows "up". What's happening?

HA1 carries heartbeats, hellos and config sync (TCP/28769 and TCP/28260 in clear text, TCP/28 when encrypted). HA2 carries session sync. HA1 saturation (a noisy management VLAN, a duplex mismatch, or a buffer overflow on an intermediate switch) delays heartbeats beyond the configured heartbeat and hello timers, even though the link technically stays "up". The peer declares the active dead and takes over, and the original active recovers a second later. Result: an oscillation every time the buffer fills.
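
To see why a link that stays "up" can still trigger a failover, here is a toy model of the peer's decision. It is a sketch under stated assumptions: the deadline and the consecutive-miss threshold are parameters chosen for illustration, not PAN-OS defaults, and the function name is hypothetical.

```python
def peer_declares_dead(delays_ms, deadline_ms=1000, max_consecutive_misses=3):
    """Return the index of the heartbeat at which the peer would give up,
    or None if it never does.

    delays_ms: delay of each successive heartbeat, in milliseconds.
    A heartbeat counts as missed when its delay exceeds deadline_ms.
    All numbers here are illustrative assumptions, not PAN-OS defaults.
    """
    misses = 0
    for i, delay in enumerate(delays_ms):
        if delay > deadline_ms:
            misses += 1
            if misses >= max_consecutive_misses:
                return i
        else:
            misses = 0  # an on-time heartbeat resets the run
    return None


if __name__ == "__main__":
    quiet_night = [2, 3, 2, 4, 3, 2]
    backup_window = [2, 3, 1500, 1800, 2100, 4, 3]  # switch buffer full
    print(peer_declares_dead(quiet_night))    # None
    print(peer_declares_dead(backup_window))  # 4
```

The link never goes down in either list; only the delay changes, which is why interface status alone cannot explain the flap.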

▶ Failure tree — HA flap diagnosis ladder

Each stage is the next CLI command you'd run.

① SYMPTOM Pair flipped 3 times in 4 hours. System log: HA mode changed: active → passive
② STATE show high-availability all → confirm last state-change timestamp + reason
③ COUNTER show counter global filter category ha | match ha_ → look for ha_heartbeat_loss, ha_path_monitor_failure
④ LINK show interface ha1 → confirm errors / discards / duplex. With direct-cabled HA1, roughly 1 case in 1,000 turns out to be a dying SFP.
⑤ ROOT CAUSE Intermediate switch buffer overflows during nightly backups → delayed heartbeats → false failover.
⑥ FIX Direct-cable HA1 between peers (eliminate the switch); enable HA1 Backup on the management interface so a single link issue doesn't trigger failover.
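
A regular gap between failovers points at a scheduled cause such as a backup job. The sketch below measures the gaps between the state-change times from Rahul's scenario (02:14, 04:18, 06:21). The function names and the 10-minute tolerance are assumptions made for illustration.

```python
from datetime import datetime


def flap_intervals(times, fmt="%H:%M"):
    """Minutes between consecutive HA state changes (same-day times)."""
    parsed = [datetime.strptime(t, fmt) for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


def looks_periodic(intervals, tolerance_min=10):
    """True when every gap is within tolerance_min of the average gap."""
    if len(intervals) < 2:
        return False
    mean = sum(intervals) / len(intervals)
    return all(abs(i - mean) <= tolerance_min for i in intervals)


if __name__ == "__main__":
    gaps = flap_intervals(["02:14", "04:18", "06:21"])
    print(gaps)                  # [124, 123]
    print(looks_periodic(gaps))  # True
```

In practice the timestamps would come from the system log entries for HA state changes; here they are typed in by hand.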
The single HA design fix nobody bothers with

Always configure an HA1 Backup link — typically the management interface — so a single heartbeat-path failure doesn't trigger failover. Palo Alto's design assumes you'll do this. Many shops skip it because "we have direct cables, what could go wrong". Then a single SFP dies at 2 AM and the whole pair flaps. HA1 Backup = 5 minutes of config, saves your night.

Quick check · Q1 of 10

Sneha at Infosys sees the HA pair flipping active/passive every 2 hours during nightly backup windows. HA1 link reports "up", no errors on the interface. Which CLI gives the fastest answer about why heartbeats are being lost?

Correct: c. show counter global filter category ha tells you which failure trigger fired — heartbeat-loss (HA1 saturation), path-monitor-failure (a monitored destination became unreachable), or link-monitor-failure (a watched interface went down). Without that counter pivot, you're guessing. debug software trace is heavy and rarely needed; running it without TAC guidance can mask the issue. Resource-monitor is for dataplane, not control-plane HA.

② Commit Failures — "unknown error" and other crimes

Priya at HCL hits Commit at 4 PM Friday. The job spins for 28 minutes, then fails with: "Commit failed: unknown error". The candidate config has 14 changes she can't undo individually. Now what?

Commit failures come in three flavours. Knowing which flavour you have decides whether to revert, edit, or call TAC:

Validation Error

Missing or renamed object reference, ghost reference in a disabled rule. Job log names the exact rule. Easy fix — edit and re-commit.

Lock Conflict

Another admin holds the config lock, or a Panorama push is in flight. show config-locks. Wait or break the lock with admin auth.

Resource Exhaustion

Management plane out of RAM mid-commit. Job hangs, eventually "unknown error". Reboot resolves; a RAM upgrade or PAN-OS hotfix is the real fix.

Bloat (30-min commits)

Config grew over time — 12K rules, 50K addresses. Mgmt-CPU pegged for 20-30 min. Fix: cleanup pass + Panorama "Share Unused Objects = OFF" if applicable.

CLI — diagnose a failed commit
show jobs id <job-id>
show jobs all
show config-locks
debug swm status
show system resources follow
Expected output (job log with validation error)
Job ID    Type    Status     Result    Description
4231      Commit  FIN        FAIL      validation Error: address-group 'web-servers'
                                       references missing address 'web-04' in rule
                                       'allow-web-prod' line 47
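
When the job log is long, pulling the object and rule names out programmatically saves scrolling. A minimal sketch, assuming the message is worded like the sample above; real job-log wording varies by PAN-OS release, so the regex and the `parse_validation_error` helper are illustrative only.

```python
import re

# Pattern for the sample message shown above; real job-log wording varies
# by PAN-OS release, so treat this regex as an assumption.
PATTERN = re.compile(
    r"(?P<kind>[\w-]+) '(?P<container>[^']+)' references missing "
    r"(?P<missing_kind>[\w-]+) '(?P<missing>[^']+)' in rule '(?P<rule>[^']+)'"
)


def parse_validation_error(text):
    """Extract the container, missing object and rule from a job-log message."""
    flat = " ".join(text.split())  # undo the line wrapping in the job log
    m = PATTERN.search(flat)
    return m.groupdict() if m else None


if __name__ == "__main__":
    sample = """validation Error: address-group 'web-servers'
        references missing address 'web-04' in rule
        'allow-web-prod' line 47"""
    print(parse_validation_error(sample))
```

A result of `None` means the message did not match this shape, which is itself a hint that you are looking at a different flavour of failure.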
The "ghost reference" trap

You deleted an address object three weeks ago — but it was still referenced inside a disabled rule, a PBF rule, or a Zone Protection profile. The validator catches it during the next unrelated commit. Job log: "reference to invalid or missing object". Fix: GUI search for the object name (Objects → Find Usage), edit/remove the orphan reference, re-commit. Don't just delete the disabled rule blindly — confirm what it was protecting first.
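
Searching an exported config for the orphaned name can be scripted. The sketch below walks an XML tree and prints the path of every element whose text equals the object name. The XML fragment is a simplified, hypothetical stand-in; a real PAN-OS export is far larger and its element layout depends on the release.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical config fragment; a real PAN-OS export is much
# larger and its exact element layout depends on the release.
SAMPLE = """
<config>
  <rulebase>
    <security><rules>
      <entry name="allow-web-prod">
        <disabled>yes</disabled>
        <destination><member>web-04</member></destination>
      </entry>
      <entry name="allow-dns">
        <destination><member>dns-01</member></destination>
      </entry>
    </rules></security>
  </rulebase>
</config>
"""


def find_references(xml_text, object_name):
    """Return slash-separated paths of every element whose text is object_name."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(node, path):
        label = node.tag
        if node.get("name"):
            label += f"[{node.get('name')}]"
        here = f"{path}/{label}"
        if (node.text or "").strip() == object_name:
            hits.append(here)
        for child in node:
            walk(child, here)

    walk(root, "")
    return hits


if __name__ == "__main__":
    for hit in find_references(SAMPLE, "web-04"):
        print(hit)
```

The printed path includes the rule name, so a hit inside a disabled rule is visible at a glance.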

Quick check · Q2 of 10

Karthik runs commit. Job spins 8 minutes and fails: "reference to invalid or missing object". He searches Objects tab — the missing object isn't visible anywhere. What's the most likely cause?

Correct: a. Ghost references live in disabled rules, PBF rules, Zone Protection profiles, NAT rules, and Decryption profiles. The Objects tab only shows top-level objects; references hide in rule bodies. Use Find Usage or grep the running config XML for the object name to locate the orphan reference.
When the commit itself is slow — three optimisations

1. On Panorama, disable Share Unused Address and Service Objects with Devices (Panorama → Setup → Management). This stops 50K unused objects flooding every device's commit.
2. Use Partial Commit (in Panorama and PAN-OS) to commit only your own admin scope. This cuts commit time when many admins share a box.
3. If management CPU is stuck near 100% during every commit, the device is undersized: schedule a memory upgrade or migrate to the next hardware tier. PA-220 and older PA-3000 series are notorious here.

Quick check · Q3 of 10

Commits on the firewall now take 25-30 minutes consistently. The config has grown from 800 to 9,400 security rules over two years. Three admins commit per day. Which set of fixes addresses the root cause, not just the symptoms?

Correct: d. Commit time scales with config size + mgmt-CPU + RAM. Each fix in the bundle addresses one of those: cleanup shrinks config, Partial Commit reduces the scope per admin, Panorama setting stops object bloat propagating, hardware upgrade buys headroom. Reboots, disabling logging, and off-hour scheduling all dodge the root cause.

③ Dataplane CPU Spike — find the elephant in 4 commands

Aditya at Wipro sees dashboard alerts: Dataplane CPU 96% sustained. Latency on production-bound flows jumps from 8ms to 180ms. Drops on the WAN side. The dataplane is choking — but on what? An IPS overload? A massive DDoS? A single misbehaving flow?

▶ Failure tree — DP CPU diagnosis ladder

Same Symptom → Counter → CLI → Fix framework. Each stage narrows the cause.

① SYMPTOM Dashboard: DP CPU 96%, packet-drop rate climbing, latency 20×.
② TREND show running resource-monitor second → was it always high, or just last 60s? Spike vs sustained changes everything.
③ BACKLOG show running resource-monitor ingress-backlogs → sessions with the highest packet/byte share. Spot the elephant flow.
④ BUFFERS show running resource-monitor → check Packet Buffer + Packet Descriptor. Near 100% = packet-buffer exhaustion, often a DoS or misconfigured offload.
⑤ COUNTERS show counter global delta yes | match drop → identifies the drop reason: flow_rcv_dot1q_tag_err, flow_no_session, resource_packet_buffer_full.
⑥ FIX Block the elephant flow with a deny rule. Long-term: split sessions across NAT rules, enable Zone Protection (Packet Buffer Protection — DoS), confirm hardware sizing vs throughput SLA.
Trend before backlog. Backlog before counters. Never start with packet-capture for a CPU issue — it makes the problem worse.
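
The "spike vs sustained" call in stage ② can be made explicit. A minimal sketch, assuming you have copied the per-second CPU percentages into a list; the 90% threshold and the 80% "sustained" fraction are illustrative choices, not values taken from PAN-OS.

```python
def classify_trend(cpu_samples, high=90, sustained_fraction=0.8):
    """Label per-second DP CPU samples as 'normal', 'spike' or 'sustained'.

    high and sustained_fraction are illustrative thresholds, not values
    taken from PAN-OS.
    """
    if not cpu_samples:
        raise ValueError("need at least one sample")
    hot = sum(1 for s in cpu_samples if s >= high)
    if hot == 0:
        return "normal"
    if hot / len(cpu_samples) >= sustained_fraction:
        return "sustained"
    return "spike"


if __name__ == "__main__":
    print(classify_trend([35] * 60))              # normal
    print(classify_trend([40] * 55 + [97] * 5))   # spike
    print(classify_trend([96] * 60))              # sustained
```

A "spike" result suggests waiting and re-sampling before changing policy, while "sustained" justifies moving on to the backlog stage.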
CLI — the 4-command DP CPU pivot
show running resource-monitor second
show running resource-monitor ingress-backlogs
show running resource-monitor
show counter global delta yes | match drop
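
Once the drop counters are on screen, ranking them by delta shows which drop reason dominates. The sketch below assumes the output is laid out in whitespace-separated columns (name, value, rate, severity, category, aspect, description). The sample rows are made up for illustration, reusing counter names mentioned in the failure tree.

```python
# Made-up sample rows; the column layout is an assumption about the CLI output.
SAMPLE = """\
name                        value   rate severity category aspect   description
--------------------------------------------------------------------------------
flow_rcv_dot1q_tag_err         12      0 drop     flow     parse    sample row
flow_no_session             48211    160 drop     flow     session  sample row
resource_packet_buffer_full  9034     30 drop     resource buffer   sample row
"""


def top_drop_counters(output, limit=3):
    """Rank counters by delta value, highest first."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 6)
        # Skip headers, separators and anything without a numeric value.
        if len(parts) < 6 or not parts[1].isdigit():
            continue
        rows.append((parts[0], int(parts[1]), parts[3]))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for name, value, severity in top_drop_counters(SAMPLE):
        print(f"{value:>8}  {severity:<5} {name}")
```

Run the command twice a few seconds apart before parsing, since the first delta covers everything since the previous read.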
Packet Buffer Protection — flip this once, save yourself later

Device → Setup → Session → Session Settings → Packet Buffer Protection enables it globally; then enable it per zone under Network → Zones. When buffer utilisation crosses the alert and activate thresholds, the firewall starts discarding packets from the noisiest sessions first instead of dropping random packets. That is a big win during DoS or runaway flows. Older PAN-OS releases ship with it off, so check your version and flip it on.

Quick check · Q4 of 10

Dataplane CPU is at 92%, packet drops climbing. Which CLI gives the fastest "who is the elephant flow?" answer?

Correct: b. ingress-backlogs is purpose-built for "who is hogging the dataplane right now". show session all dumps every session (millions, on a busy box) — not actionable. Packet captures make DP CPU worse. show system info is metadata, not telemetry.

④ Log Forwarding Drop — when the SIEM goes quiet

Sneha at Infosys gets a Sev-2 ticket: "No traffic logs from PA-5250 in Splunk for the last 4 hours." She checks Monitor → Traffic → logs scroll fine on the firewall. System logs say "syslog forwarding enabled". The PA looks healthy. So where are the logs going?

This is the most dangerous failure of the four — because nothing visible breaks. The firewall keeps blocking and allowing traffic. Users notice nothing. The SOC has a gap. Detection misses an attack window.

Two failure modes dominate:

(1) Queue full. Logs generate faster than the management plane can ship them out. The forwarding queue fills, new logs are silently discarded. Counter: log_traffic_loss_queue_full.

(2) Connection silently dead. The SIEM closes the syslog TCP socket, but the firewall still believes the connection is up and keeps writing into it. The stale session ages out hours later, and every log sent in that window is lost.

CLI — verify log forwarding is actually working
debug log-receiver statistics
show counter global filter aspect log | match drop
show log-collector all
show logging-status
ping host <syslog-server>
Expected output (a problem)
Log incoming rate ............ 2,400 logs/sec
Log forwarding rate .......... 1,820 logs/sec   <-- shipping slower than generation
log_traffic_loss_queue_full .. 178,400          <-- losses confirmed
Syslog enqueue count ......... 9,224 (queue ceiling near)
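
The numbers in the sample output above tell a simple story once you subtract them. A worked sketch using those figures; the helper names are hypothetical, and the estimate assumes the deficit stayed constant while logs were being lost.

```python
def forwarding_deficit(incoming_per_sec, forwarded_per_sec):
    """Logs per second that cannot be shipped once the queue is full."""
    return max(0, incoming_per_sec - forwarded_per_sec)


def seconds_of_loss(lost_logs, deficit_per_sec):
    """Rough time spent dropping, assuming a constant deficit."""
    if deficit_per_sec <= 0:
        raise ValueError("no deficit, so no steady-state loss")
    return lost_logs / deficit_per_sec


if __name__ == "__main__":
    deficit = forwarding_deficit(2400, 1820)
    print(deficit)                                   # 580
    print(round(seconds_of_loss(178_400, deficit)))  # 308
```

At 580 logs/sec short, roughly a quarter of all generated logs never leave the firewall, and the 178,400 lost logs correspond to about five minutes of steady loss.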
The "PAN-255253" silent-syslog bug pattern

On some PAN-OS releases, the syslog forwarding TCP socket can fail to re-establish after the SIEM's TCP keepalive timeout, while the firewall still reports the profile as healthy. Symptom: scheduled SIEM gap, no errors in System log. Workaround: change the Log Forwarding profile to UDP (lossier but resilient) or schedule a daily profile-cycle via API. Long-term fix: PAN-OS upgrade per release notes — search the release notes for "log forwarding" before upgrading.

SOC-side check that catches this in 2 minutes

Don't rely only on PA-side telemetry. Add a SIEM-side rule: "if log-source 'PA-prod-firewall' has zero events for 5 minutes during business hours, alert." This catches log-drop bugs the PA-side counters miss (e.g. when the firewall thinks logs are being sent but the receiver isn't accepting). Two-sided monitoring beats one-sided every time.
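
The SIEM-side rule is a timestamp comparison at heart. Most SIEMs express it in their own query language; the Python below is only a sketch of the logic, with a hypothetical `is_silent` helper and example values for the gap and business hours.

```python
from datetime import datetime, timedelta


def is_silent(last_event, now, max_gap=timedelta(minutes=5),
              business_hours=(9, 18)):
    """True when a source has sent nothing for max_gap during business hours.

    business_hours is a (start_hour, end_hour) pair in local time; both it
    and max_gap are example values to tune per environment.
    """
    start, end = business_hours
    if not (start <= now.hour < end):
        return False  # outside the window the rule stays quiet
    return now - last_event >= max_gap


if __name__ == "__main__":
    now = datetime(2026, 5, 25, 10, 0)
    print(is_silent(datetime(2026, 5, 25, 9, 54), now))  # True
    print(is_silent(datetime(2026, 5, 25, 9, 58), now))  # False
```

Restricting the check to business hours avoids paging on legitimately quiet periods; a high-volume firewall may justify running it around the clock.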


📝 Wrap-up — six more

You've answered 4 inline. Six left. 70% (7 of 10) marks the lesson complete. Tap Submit all answers at the end.

Q5 · Apply

During HA1 maintenance, you need to suspend the secondary firewall so it doesn't briefly take over. Where do you click?

Correct: c. Suspend (which moves the peer from functional to suspended) is the safe way to isolate one peer for HA1 work, because it prevents split-brain. Reboots interrupt traffic unnecessarily; disabling the HA1 interface directly often triggers exactly the failover you're trying to avoid.
Q6 · Analyze

After a content update, dataplane CPU jumps from 35% to 88% sustained, even though traffic volume hasn't changed. show running resource-monitor ingress-backlogs shows no single elephant flow. What's the most likely cause?

Correct: a. Content updates ship new threat/AV signatures every few hours. Occasionally a release contains expensive regex patterns that inflate dataplane work. The classic fingerprint is "DP CPU jumped after the content version changed, traffic profile didn't". Revert to the previous content version (Device → Dynamic Updates → Revert) and open a TAC case with the show counter global deltas attached.
Q7 · Analyze

A commit fails with "Validation Error — duplicate rule name 'allow-web' in pre-rulebase". The rule exists exactly once in the device's local rulebase. What's actually wrong?

Correct: d. Panorama-managed devices have pre-rulebase (Panorama) + local rulebase + post-rulebase (Panorama). A device-local rule with the same name as a Panorama pre-rule triggers the duplicate-name validation. Rename the local rule, or move it into a device-group override to inherit Panorama's version. The next lesson covers the Panorama hierarchy in depth.
Q8 · Analyze

debug log-receiver statistics shows incoming rate = 3,200 logs/sec, forwarding rate = 2,400 logs/sec, queue near ceiling, log_traffic_loss_queue_full incrementing. SIEM-side: gaps every few minutes. What's the fastest mitigation?

Correct: b. Loss is rate-based — forwarding can't keep up with generation. Two-sided fix: shrink what you forward (filter), add a parallel destination (log balancing). Long-term, add a Log Collector or scale your SIEM tier. Reboot empties the queue but doesn't fix the underlying rate mismatch — you'll be back in the same place in 30 minutes.
Q9 · Evaluate

An HA pair runs in Active/Passive. The customer wants zero false failovers but maximum failure detection. Which combination is right?

Correct: c. Best-practice HA hygiene: direct HA1 + Backup HA1 (resilience), default timers (don't go sub-second without good reason), Link Monitoring on data plane uplinks (catches NIC failure), Path Monitoring on upstream IPs (catches upstream network issues), Preempt OFF on Active/Passive (prevents flap-back after the original active recovers). Lowering heartbeat below default + saturated HA1 = flap factory.
Q10 · Evaluate

A junior admin proposes: "to fix the 25-minute commits, let's just delete all disabled rules and unused address objects automatically before every commit using an API script." Is this a sound plan?

Correct: a. Production firewall configs accumulate disabled rules as paused troubleshooting state, scheduled re-enablement (e.g. "enable this on 2026-06-01"), audit evidence, or rollback artifacts. Mass automated deletion is a known incident generator. Run cleanup as a documented manual pass; keep a tagged config snapshot before each cleanup commit; communicate change windows. Speed should not be bought with risk.

📚 Sources

  1. Palo Alto Knowledge Base — High-Availability HA Links Status. knowledgebase.paloaltonetworks.com
  2. Palo Alto LIVECommunity — Heartbeat Backup showing down on both HA peers (thread 314431) & Panorama HA1 connection daily flips because of buffer space (thread 594495).
  3. Palo Alto Docs — Troubleshoot Commit Failures (Panorama 10.2). docs.paloaltonetworks.com
  4. Palo Alto Knowledge Base — Commit Fails with Reference to Invalid or Missing Object. knowledgebase.paloaltonetworks.com
  5. Palo Alto LIVECommunity / PANCast — Episode 4: Why Is My Dataplane CPU So High? & Support FAQ: How to Handle High Data Plane CPU Issues.
  6. Palo Alto Knowledge Base — Log Forwarding to Syslog Delayed Troubleshooting & How to troubleshoot and verify log forwarding issues.
  7. Palo Alto Networks — PCNSE Exam Blueprint (Manage and Operate / Troubleshooting domains). paloaltonetworks.com

What's next?

Operational symptoms covered — now zoom out to the multi-device control plane. Panorama is where templates, device-groups, and pre/post rules either save your life or ruin it.