Before you reach for the keyboard — name the failure first
Production outages on a PA are rarely "the firewall died". 90% of the time you're chasing one of four patterns. Naming the pattern correctly in the first 60 seconds cuts MTTR by an order of magnitude. The framework: Symptom → Counter → CLI → Fix. Every failure below follows it.
① HA flap: pair alternates active/passive. Counter to watch: ha_path_monitor_failure. CLI: show high-availability all.
② Commit failure: validation error or runtime "unknown error". Job log: show jobs id <N>. Fix: revert the candidate config and rebuild incrementally.
③ Dataplane CPU spike: latency, drops, packet-buffer warnings. CLI: show running resource-monitor second + show running resource-monitor ingress-backlogs.
④ Log forwarding drop: SIEM gap, no errors in GUI. Counter: log_traffic_loss_queue_full. CLI: debug log-receiver statistics.
① HA Heartbeat Flap — pair flips every two hours
Rahul at TCS gets paged: PA-5250 active firewall flipped to passive at 02:14, back to active at 04:18, and again at 06:21. Sessions resync, customers complain about momentary disconnects. The HA1 link itself shows "up". What's happening?
HA1 carries hellos, heartbeats, and config sync. On a clear-text control link it uses ICMP plus TCP/28769 and TCP/28260 (TCP/28 when HA1 encryption is enabled). HA2 carries session sync. HA1 saturation (a noisy management VLAN, a duplex mismatch, or a buffer overflow on an intermediate switch) delays heartbeats beyond the configured heartbeat and hello timers, even though the link technically stays "up". The peer declares the active dead and takes over, and the original active recovers a second later. Result: an oscillation every time the buffer fills.
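A quick way to confirm that a flap is periodic (and therefore probably tied to a scheduled load such as a backup window) is to measure the gaps between state changes. The sketch below uses the three timestamps from Rahul's page; the function names and the 10-minute tolerance are illustrative assumptions, not anything PAN-OS provides.

```python
from datetime import datetime

def flap_intervals(timestamps, fmt="%H:%M"):
    """Return whole minutes between consecutive HA state changes."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]

def looks_periodic(intervals, tolerance_min=10):
    """True when every interval is within tolerance_min of the mean interval.

    tolerance_min is an arbitrary illustrative value.
    """
    if len(intervals) < 2:
        return False
    mean = sum(intervals) / len(intervals)
    return all(abs(i - mean) <= tolerance_min for i in intervals)

# State-change times from the scenario (all on the same night).
changes = ["02:14", "04:18", "06:21"]
intervals = flap_intervals(changes)
print(intervals, looks_periodic(intervals))
```

A regular interval points toward something scheduled on the HA1 path; irregular intervals are more consistent with a marginal cable or optic.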
▶ Failure tree — HA flap diagnosis ladder
Each stage is the next CLI command you'd run.
HA mode changed: active → passive
show high-availability all → confirm last state-change timestamp + reason
show counter global filter category ha | match ha_ → look for ha_heartbeat_loss, ha_path_monitor_failure
show interface ha1 → confirm errors, discards, and duplex. Even with a direct-cabled HA1, a dying SFP is an occasional cause (roughly 1 case in 1,000).
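The counter step of the ladder can be sketched as a small lookup: given counter values you have already read off the CLI, report which failure trigger fired. This is a minimal sketch under assumptions: the input is a plain dict you build yourself, `ha_link_monitor_failure` is a hypothetical counter name added for illustration, and the sample values are invented.

```python
# Map of HA failure counters to the likely cause described in this lesson.
# "ha_link_monitor_failure" is a hypothetical name used for illustration only.
HA_CAUSES = {
    "ha_heartbeat_loss": "heartbeats lost: suspect HA1 saturation or a failing HA1 path",
    "ha_path_monitor_failure": "path monitoring: a monitored destination became unreachable",
    "ha_link_monitor_failure": "link monitoring: a watched interface went down",
}

def triage_ha_counters(counters):
    """Return (counter, cause) pairs for non-zero HA counters, largest value first."""
    hits = [(name, value) for name, value in counters.items()
            if name in HA_CAUSES and value > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, HA_CAUSES[name]) for name, _ in hits]

# Invented sample values, not real device output.
sample = {"ha_heartbeat_loss": 37, "ha_path_monitor_failure": 0, "flow_no_session": 5}
for name, cause in triage_ha_counters(sample):
    print(f"{name}: {cause}")
```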
Always configure an HA1 Backup link — typically the management interface — so a single heartbeat-path failure doesn't trigger failover. Palo Alto's design assumes you'll do this. Many shops skip it because "we have direct cables, what could go wrong". Then a single SFP dies at 2 AM and the whole pair flaps. HA1 Backup = 5 minutes of config, saves your night.
Sneha at Infosys sees the HA pair flipping active/passive every 2 hours during nightly backup windows. HA1 link reports "up", no errors on the interface. Which CLI gives the fastest answer about why heartbeats are being lost?
show counter global filter category ha tells you which failure trigger fired: heartbeat-loss (HA1 saturation), path-monitor-failure (a monitored destination became unreachable), or link-monitor-failure (a watched interface went down). Without that counter pivot, you're guessing. debug software trace is heavy and rarely needed; running it without TAC guidance can mask the issue. Resource-monitor is for dataplane, not control-plane HA.
② Commit Failures — "unknown error" and other crimes
Priya at HCL hits Commit at 4 PM Friday. The job spins for 28 minutes, then fails with: "Commit failed: unknown error". The candidate config has 14 changes she can't undo individually. Now what?
Commit failures come in four flavours. Knowing which flavour you have decides whether to revert, edit, or call TAC:
Validation error. Missing or renamed object reference, or a ghost reference in a disabled rule. The job log names the exact rule. Easy fix: edit and re-commit.
Config lock. Another admin holds the config lock, or a Panorama push is in flight. Check with show config-lock. Wait, or break the lock with admin auth.
Management-plane memory exhaustion. The management plane runs out of RAM mid-commit. The job hangs and eventually fails with "unknown error". A reboot clears it; a RAM upgrade or PAN-OS hotfix is the real fix.
Oversized config. The config grew over time to 12K rules and 50K addresses. Mgmt-CPU stays pegged for 20-30 min. Fix: cleanup pass + Panorama "Share Unused Objects = OFF" if applicable.
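The flavours above can be turned into a rough first-pass classifier. This is only a sketch: the function name, its inputs, and the thresholds (less than 5% free management memory, 20 minutes or more of commit time) are assumptions chosen to mirror the descriptions above, not PAN-OS behaviour.

```python
def classify_commit_failure(job_text, lock_held=False,
                            mgmt_mem_free_pct=None, commit_minutes=None):
    """Return (flavour, suggested_action) from evidence you gathered by hand.

    All thresholds here are illustrative assumptions.
    """
    text = job_text.lower()
    if "invalid or missing object" in text or "validation" in text:
        return "validation", "edit the reference named in the job log and re-commit"
    if lock_held:
        return "lock", "wait for the lock holder or have an admin release the lock"
    if ("unknown error" in text and mgmt_mem_free_pct is not None
            and mgmt_mem_free_pct < 5):
        return "memory", "reboot to recover, then plan a hotfix or hardware change"
    if commit_minutes is not None and commit_minutes >= 20:
        return "scale", "clean up unused rules and objects; review Panorama sharing"
    return "unclear", "collect the job log and escalate to TAC"

# Priya's case: 28 minutes, "unknown error", no memory reading taken yet.
print(classify_commit_failure("Commit failed: unknown error", commit_minutes=28))
```

The order of the checks matters: a validation message is the most specific evidence, so it wins over the slower-to-confirm memory and scale explanations.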
show jobs id <job-id>
show jobs all
show config-lock
debug swm status
show system resources follow
Job ID Type Status Result Description
4231 Commit FIN FAIL validation Error: address-group 'web-servers'
references missing address 'web-04' in rule
'allow-web-prod' line 47
You deleted an address object three weeks ago — but it was still referenced inside a disabled rule, a PBF rule, or a Zone Protection profile. The validator catches it during the next unrelated commit. Job log: "reference to invalid or missing object". Fix: GUI search for the object name (Objects → Find Usage), edit/remove the orphan reference, re-commit. Don't just delete the disabled rule blindly — confirm what it was protecting first.
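When a job log names the orphan reference, pulling the object and rule names out programmatically saves a manual search. The sketch below parses the sample job output shown earlier in this section; the regular expression is written against that sample wording only, and real job logs may be phrased differently, so treat the pattern as an assumption to adapt.

```python
import re

# Pattern fitted to this lesson's sample message; real logs may differ.
PATTERN = re.compile(
    r"(?P<container>[\w-]+) '(?P<name>[^']+)'\s+references missing "
    r"(?P<kind>[\w-]+) '(?P<missing>[^']+)'\s+in rule\s+'(?P<rule>[^']+)'"
)

def parse_missing_reference(job_log):
    """Return a dict describing the orphan reference, or None if not found."""
    flat = " ".join(job_log.split())  # collapse wrapped lines into one
    match = PATTERN.search(flat)
    return match.groupdict() if match else None

sample = """Error: address-group 'web-servers'
references missing address 'web-04' in rule
'allow-web-prod' line 47"""

print(parse_missing_reference(sample))
```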
Karthik runs commit. Job spins 8 minutes and fails: "reference to invalid or missing object". He searches Objects tab — the missing object isn't visible anywhere. What's the most likely cause?
1. On Panorama, disable Share Unused Address and Service Objects with Devices (Panorama → Setup → Management). Stops 50K unused objects flooding every device's commit.
2. Use Partial Commit (in Panorama and PAN-OS): commit only your admin scope. Cuts commit time when many admins share a box.
3. If management CPU is stuck near 100% during every commit, the device is undersized. Schedule a memory upgrade or migrate to the next hardware tier. PA-220 and older PA-3000 series are notorious here.
Commits on the firewall now take 25-30 minutes consistently. The config has grown from 800 to 9,400 security rules over two years. Three admins commit per day. Which set of fixes addresses the root cause, not just the symptoms?
③ Dataplane CPU Spike — find the elephant in 4 commands
Aditya at Wipro sees dashboard alerts: Dataplane CPU 96% sustained. Latency on production-bound flows jumps from 8ms to 180ms. Drops on the WAN side. The dataplane is choking — but on what? An IPS overload? A massive DDoS? A single misbehaving flow?
▶ Failure tree — DP CPU diagnosis ladder
Same Symptom → Counter → CLI → Fix framework. Each stage narrows the cause.
show running resource-monitor second → was it always high, or just last 60s? Spike vs sustained changes everything.
show running resource-monitor ingress-backlogs → sessions with the highest packet/byte share. Spot the elephant flow.
show running resource-monitor → check Packet Buffer + Packet Descriptor. Near 100% = packet-buffer exhaustion, often a DoS or misconfigured offload.
show counter global delta yes | match drop → identifies the drop reason: flow_rcv_dot1q_tag_err, flow_no_session, resource_packet_buffer_full.
show running resource-monitor second
show running resource-monitor ingress-backlogs
show running resource-monitor | match "packet buffer"
show counter global delta yes | match drop
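The first rung of the ladder asks whether the CPU was always high or only high in the last few seconds. Assuming you have copied the per-second dataplane CPU percentages into a list, that distinction can be sketched as follows; the 90% threshold and the 80% "sustained" fraction are illustrative choices, not PAN-OS defaults.

```python
def classify_cpu(samples, threshold=90, sustained_fraction=0.8):
    """Classify per-second dataplane CPU samples (percent, oldest first).

    threshold and sustained_fraction are illustrative assumptions.
    """
    if not samples:
        return "no data"
    hot = sum(1 for s in samples if s >= threshold)
    if hot == 0:
        return "normal"
    if hot / len(samples) >= sustained_fraction:
        return "sustained"
    return "spike"

# Invented 60-second windows for illustration.
print(classify_cpu([96] * 60))             # high for the whole window
print(classify_cpu([20] * 55 + [97] * 5))  # high only in the last 5 seconds
```

A "spike" result suggests looking for a single event or flow; "sustained" suggests a capacity or policy problem that will not clear on its own.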
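Packet Buffer Protection: enable it globally under Device → Setup → Session → Session Settings, then enable it on each zone under Network → Zones. When buffer utilization crosses the alert and activate thresholds, the firewall starts discarding the noisiest sessions first instead of dropping random packets. That keeps well-behaved traffic flowing during a DoS or a runaway flow. Older PAN-OS releases shipped with it disabled, so verify the setting on your version and turn it on if needed.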
Dataplane CPU is at 92%, packet drops climbing. Which CLI gives the fastest "who is the elephant flow?" answer?
ingress-backlogs is purpose-built for "who is hogging the dataplane right now". show session all dumps every session (millions, on a busy box), which is not actionable. Packet captures make DP CPU worse. show system info is metadata, not telemetry.
④ Log Forwarding Drop — when the SIEM goes quiet
Sneha at Infosys gets a Sev-2 ticket: "No traffic logs from PA-5250 in Splunk for the last 4 hours." She checks Monitor → Traffic → logs scroll fine on the firewall. System logs say "syslog forwarding enabled". The PA looks healthy. So where are the logs going?
This is the most dangerous failure of the four — because nothing visible breaks. The firewall keeps blocking and allowing traffic. Users notice nothing. The SOC has a gap. Detection misses an attack window.
Two failure modes dominate:
(1) Queue full. Logs generate faster than the management plane can ship them out. The forwarding queue fills, new logs are silently discarded. Counter: log_traffic_loss_queue_full.
(2) Connection silently dead. The SIEM closes the syslog TCP socket, but the firewall still believes it is connected and does not notice until the session ages out hours later. Logs sent in that window are lost.
debug log-receiver statistics
show counter global filter aspect log | match drop
show log-collector all
show logging-status
ping host <syslog-server>
Log incoming rate ............ 2,400 logs/sec
Log forwarding rate .......... 1,820 logs/sec   <-- shipping slower than generation
log_traffic_loss_queue_full .. 178,400          <-- losses confirmed
Syslog enqueue count ......... 9,224 (queue ceiling near)
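The numbers in that sample output can be turned into a quick estimate of how bad the backlog is. The sketch below computes the shortfall between generation and shipping, then the overload time that would account for the loss counter. It assumes both rates stay constant and ignores whatever the queue absorbed before filling, so read the result as a rough lower bound.

```python
def deficit_rate(incoming_per_sec, forwarding_per_sec):
    """Logs per second generated but not shipped (0 if forwarding keeps up)."""
    return max(incoming_per_sec - forwarding_per_sec, 0)

def seconds_of_overload(lost_logs, deficit_per_sec):
    """Seconds of sustained overload needed to lose lost_logs at this deficit."""
    if deficit_per_sec <= 0:
        return 0.0
    return lost_logs / deficit_per_sec

# Values from the sample output above.
deficit = deficit_rate(2400, 1820)
duration = seconds_of_overload(178_400, deficit)
print(f"deficit: {deficit} logs/sec, overload time: {duration / 60:.1f} min")
```

At these rates the firewall falls behind by 580 logs every second, so the loss counter corresponds to roughly five minutes of full-queue operation.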
On some PAN-OS releases, the syslog forwarding TCP socket can fail to re-establish after the SIEM's TCP keepalive timeout, while the firewall still reports the profile as healthy. Symptom: a recurring SIEM gap with no errors in the System log. Workaround: change the Log Forwarding profile to UDP (lossier but resilient) or schedule a daily profile-cycle via API. Long-term fix: upgrade PAN-OS to a release that addresses it. Search the release notes for "log forwarding" before upgrading.
Don't rely only on PA-side telemetry. Add a SIEM-side rule: "if log-source 'PA-prod-firewall' has zero events for 5 minutes during business hours, alert." This catches log-drop bugs the PA-side counters miss (e.g. when the firewall thinks logs are being sent but the receiver isn't accepting). Two-sided monitoring beats one-sided every time.
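The SIEM-side rule described above reduces to a small check that any scheduler or SIEM scripting hook could run. This is a minimal sketch, assuming you can obtain the timestamp of the last event received from the firewall; the function name and the Monday-to-Friday 09:00-18:00 business-hours window are assumptions to adjust for your environment.

```python
from datetime import datetime, timedelta

def should_alert(last_event, now, max_silence=timedelta(minutes=5),
                 business_hours=(9, 18)):
    """True if the log source has been silent too long during business hours.

    business_hours is (start_hour, end_hour) on weekdays; an assumed default.
    """
    start, end = business_hours
    in_hours = now.weekday() < 5 and start <= now.hour < end
    return in_hours and (now - last_event) > max_silence

now = datetime(2024, 1, 8, 10, 0)  # a Monday morning
print(should_alert(now - timedelta(minutes=6), now))  # silent for 6 minutes
print(should_alert(now - timedelta(minutes=2), now))  # recent event
```

Because the check runs on the receiving side, it fires even when the firewall believes its forwarding connection is healthy.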
📚 Sources
- Palo Alto Knowledge Base — High-Availability HA Links Status. knowledgebase.paloaltonetworks.com
- Palo Alto LIVECommunity — Heartbeat Backup showing down on both HA peers (thread 314431) & Panorama HA1 connection daily flips because of buffer space (thread 594495).
- Palo Alto Docs — Troubleshoot Commit Failures (Panorama 10.2). docs.paloaltonetworks.com
- Palo Alto Knowledge Base — Commit Fails with Reference to Invalid or Missing Object. knowledgebase.paloaltonetworks.com
- Palo Alto LIVECommunity / PANCast — Episode 4: Why Is My Dataplane CPU So High? & Support FAQ: How to Handle High Data Plane CPU Issues.
- Palo Alto Knowledge Base — Log Forwarding to Syslog Delayed Troubleshooting & How to troubleshoot and verify log forwarding issues.
- Palo Alto Networks — PCNSE Exam Blueprint (Manage and Operate / Troubleshooting domains). paloaltonetworks.com
What's next?
Operational symptoms covered — now zoom out to the multi-device control plane. Panorama is where templates, device-groups, and pre/post rules either save your life or ruin it.