The interview question that trips up L2 candidates
Interview: "Member 1 was Active. Failover triggered. Now Member 2 is Active but users complain sessions dropped. What went wrong?"
Wrong answers: "Sync interface", "CCP timeout" (a component named with no mechanism). Right answer: "One of three things. (a) The sync interface wasn't keeping up with sync traffic, so state didn't propagate before failover. (b) The dropped connections belonged to Non-Synced services: some services are explicitly excluded from state sync in the service object's Advanced settings ('Synchronize connections on cluster'). (c) ARP didn't update at the upstream router: the gratuitous ARP from Member 2 was lost or ignored, so the router kept sending to the old MAC until its ARP cache entry timed out. Diagnose with cphaprob syncstat, check which services are marked non-synced, and inspect the upstream router's ARP table."
💡 The captain-and-co-pilot analogy
Two pilots fly the same plane. The captain (Active member) handles all the radio + controls. The co-pilot (Standby) watches every action and notes it in a shared logbook (sync interface). If the captain has a heart attack mid-flight, the co-pilot takes over instantly — because they were tracking every decision. If the logbook (sync interface) was missing pages, some of the captain's flight plan is lost. CCP is the radio between them announcing "I'm still alive" every 100 ms. Miss 3 announcements → co-pilot assumes captain is dead, takes over.
① HA mode (Active/Standby) — the default
Two members. One is Active, handling all traffic. Other is Standby, idle but synced. Failure detection: CCP (UDP/8116) heartbeats every 100 ms. Miss 3 in a row → Standby promotes itself, sends gratuitous ARP for the VIP, takes over. Existing TCP sessions survive (state already synced).
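The miss-counter logic described above can be sketched in a few lines. This is a toy model, not Check Point's implementation: the function name `takeover_time_ms` and the slot-based timeline are assumptions made for illustration, and only the 100 ms interval and the 3-miss threshold come from the text.

```python
from typing import Iterable, Optional

HEARTBEAT_INTERVAL_MS = 100  # interval quoted in the text
MISSED_LIMIT = 3             # consecutive misses before Standby takes over


def takeover_time_ms(received: Iterable[bool]) -> Optional[int]:
    """Return the time (ms) at which Standby promotes itself, or None.

    Each element of `received` says whether the heartbeat expected in that
    100 ms slot arrived. Slot i ends at (i + 1) * 100 ms.
    """
    missed = 0
    for slot, arrived in enumerate(received):
        if arrived:
            missed = 0          # any heartbeat resets the counter
        else:
            missed += 1
        if missed == MISSED_LIMIT:
            return (slot + 1) * HEARTBEAT_INTERVAL_MS
    return None


if __name__ == "__main__":
    # Two good heartbeats, then the Active member goes silent.
    print(takeover_time_ms([True, True, False, False, False]))
```

In this model the last good heartbeat lands at 200 ms and the takeover at 500 ms, which matches the 300 ms detection window quoted later in the text. A single late heartbeat resets the counter, which is why one dropped packet does not cause a failover.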
② Load Sharing modes
Three LS modes — both members process traffic in parallel:
- LS Multicast — VIP maps to a multicast MAC. Both members receive every frame, each member decides which 50% to process based on a hash. Requires switch to allow multicast MAC on the L2 interface.
- LS Unicast (Pivot) — one member ("Pivot") receives all traffic and forwards 50% to the other over sync. Simpler L2 (no multicast MAC), but the Pivot becomes the bottleneck.
- LS Unicast (Hashing) — newer, both members advertise unicast MACs and the switch hash-distributes via LAG.
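The "each member decides which 50% to process based on a hash" idea can be illustrated with a small sketch. The real ClusterXL decision function is not described in this article, so the SHA-256 hash, the function name `owner`, and the choice of fields are all assumptions; the point is only that every member computes the same answer independently, for both directions of a connection.

```python
import hashlib


def owner(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
          proto: str, members: int = 2) -> int:
    """Return the index of the member that should process this connection.

    Hypothetical stand-in for a load-sharing decision function. Endpoints are
    sorted so that client->server and server->client packets hash identically.
    """
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{first[0]}:{first[1]}|{second[0]}:{second[1]}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % members


if __name__ == "__main__":
    fwd = owner("10.0.0.5", 40000, "203.0.113.9", 443, "tcp")
    rev = owner("203.0.113.9", 443, "10.0.0.5", 40000, "tcp")
    print(fwd, rev)  # same member index for both directions
```

Because the result depends only on packet fields, no per-packet coordination between members is needed. The split approaches 50/50 only across many connections; a handful of heavy flows can still land on one member.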
4 things every interview asks about
CCP: Cluster Control Protocol. UDP/8116. Heartbeat every 100 ms. Miss 3 (300 ms total) → failover. Carries member ID, state, priority. Hardening: CCP is spoofable, so keep it on a dedicated VLAN/interface.
Magic MAC: the MAC scheme for the VIP. HA: 00:1C:7F:01:<id>:<ifidx>. LS Multicast: 01:<same suffix> (multicast bit set). LS Unicast: real NIC MAC. The "magic" lets the cluster advertise predictable MACs.
State sync: connection table + NAT table + VPN SAs are replicated over the sync interface. Some services are excluded for performance (usually DNS and ICMP). The exclusion is set per service object (Advanced settings, "Synchronize connections on cluster"), not in a file; $FWDIR/conf/discntd.if lists disconnected interfaces that ClusterXL does not monitor.
Pnotes: process notifications. Each daemon (cphad, fwd, vpnd, etc.) reports health. If any daemon fails its pnote check → member goes Down → failover triggered. cphaprob list shows all pnotes.
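The pnote rule above ("any failed check takes the member Down") is a simple aggregation, sketched below. The `Pnote` class, the `member_state` function, and the sample device names are hypothetical and only illustrate the rule; they do not mirror the actual cphaprob data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pnote:
    name: str   # e.g. a daemon or device being monitored (illustrative)
    ok: bool    # True if the last health report was fine


def member_state(pnotes: List[Pnote]) -> Tuple[str, List[str]]:
    """Return ("DOWN", failed_names) if any pnote failed, else ("OK", [])."""
    failed = [p.name for p in pnotes if not p.ok]
    return ("DOWN", failed) if failed else ("OK", [])


if __name__ == "__main__":
    checks = [Pnote("fwd", True), Pnote("cphad", True), Pnote("interfaces", False)]
    print(member_state(checks))  # one failing pnote is enough to go DOWN
```

Returning the list of failed names alongside the state reflects how the diagnosis works in practice: the state tells you a failover happened, the failing pnote tells you why.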
▶ Watch a failover happen in 350 ms
Active member's external NIC link goes down. Standby promotes itself. ARP updates. Sessions survive.
cphaprob list on Member-1 logs "interface ext-NIC is DOWN".
cphaprob stat on Member-2 shows ACTIVE.
③ CCP + sync interface — the cluster's nervous system
Sync interface = a dedicated physical (or VLAN) interface between members. Carries:
- CCP heartbeats — UDP/8116, every 100 ms.
- State sync — connections table, NAT table, VPN SAs. Constant trickle (high on busy gateways).
- Delta sync — periodic "what changed" updates to keep tables aligned without retransmitting everything.
Best practice: dedicated NIC, gigabit or 10G, no other traffic. On busy gateways (1M+ sessions) the sync interface gets ~200 Mbps steady-state. Skimp here = silent failover bugs.
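A quick way to sanity-check the sizing advice is to compare the steady-state sync rate against the link speed. The sketch below uses the ~200 Mbps figure from the text; the function names and the 50% headroom threshold are assumptions for illustration, not a Check Point recommendation.

```python
def sync_utilization(sync_mbps: float, link_mbps: float) -> float:
    """Fraction of the sync link consumed by steady-state sync traffic."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    return sync_mbps / link_mbps


def has_headroom(sync_mbps: float, link_mbps: float, max_util: float = 0.5) -> bool:
    """True if steady-state sync stays under max_util (assumed 50% budget for bursts)."""
    return sync_utilization(sync_mbps, link_mbps) <= max_util


if __name__ == "__main__":
    for link in (100, 1000, 10000):  # 100M, 1G, 10G
        util = sync_utilization(200, link)
        print(f"{link} Mbps link: {util:.0%} used, headroom={has_headroom(200, link)}")
```

At 200 Mbps, a 1G link runs at 20% and a 10G link at 2%, while a 100 Mbps link is oversubscribed twice over. Sharing the link with user traffic eats into the same budget, which is the reason for the "no other traffic" rule.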
Sneha's HA cluster failed over correctly, but existing TCP sessions all dropped. cphaprob stat shows both members healthy. Most likely cause?
④ cphaprob — the cluster diagnostic CLI
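cphaprob stat        # current state (Active/Standby/Down) on this member
cphaprob list        # all health checks (interfaces, processes, pnotes) + reasons
cphaprob syncstat    # sync stats: bytes, delays, drops
cphaprob -a if       # all monitored interfaces + state
cphaprob -d          # admin DOWN this member (manual failover trigger)
cphaprob -d normal   # admin UP back to normal
Note: the two -d lines are shorthand. On a real gateway, cphaprob -d registers a named pnote device and needs further arguments (device name, -s state, -t timeout, register); the usual wrappers for a manual failover are clusterXL_admin down and clusterXL_admin up.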
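When troubleshooting, it helps to capture all the read-only outputs in one pass. The sketch below wraps only the four read-only commands listed above; the function names and the injectable `run` parameter are assumptions for illustration, and the script makes no attempt to parse the output, since the exact format varies by version.

```python
import shlex
import subprocess
from typing import Callable, Dict, List

# Read-only commands taken from the list above; the -d (admin down) forms are
# deliberately excluded so this script can never trigger a failover.
READ_ONLY_COMMANDS = [
    "cphaprob stat",
    "cphaprob list",
    "cphaprob syncstat",
    "cphaprob -a if",
]


def run_on_gateway(argv: List[str]) -> str:
    """Run one command locally and return its stdout (assumes cphaprob is in PATH)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    return result.stdout


def collect_diagnostics(run: Callable[[List[str]], str] = run_on_gateway) -> Dict[str, str]:
    """Return {command: raw output} for every read-only command."""
    return {cmd: run(shlex.split(cmd)) for cmd in READ_ONLY_COMMANDS}


if __name__ == "__main__":
    for command, output in collect_diagnostics().items():
        print(f"===== {command} =====")
        print(output)
```

Passing a custom `run` function makes it possible to try the collector off-box, or to swap in an SSH-based runner, without changing the command list.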
Karthik wants to do controlled maintenance on Member-1. What's the right way to force failover without rebooting?
The 5 mistakes that cost candidates the cluster question
Sync on a shared or undersized link. Saturated sync = silent failover bugs. Use a dedicated 1G/10G NIC, no VLAN sharing with user traffic.
LS Multicast on a switch that blocks multicast MACs. Frames disappear. Either fix the switch config (allow CGMP / IGMP snooping) or switch to LS Unicast Hashing.
Forgetting Non-Synced services. DNS / ICMP are often excluded from sync for performance. On failover, those flows reset. Acceptable for most apps; not for IoT telemetry.
Ignoring the upstream ARP cache. The gratuitous ARP arrives, but the router ignores it and keeps the old entry until its cache timeout expires. Reduce the upstream ARP timeout to ~60 sec.
Treating a 3-member cluster like an HA pair. 3-member clusters exist (Pivot mode), but configuration is more complex than for HA pairs. Plan licenses + sync bandwidth carefully.
📝 Check your understanding — 10 questions, 70% to pass
Q1–Q2 above already count. Below are Q3 to Q10.
What protocol + port does CCP use, and how often does it heartbeat by default?
Rahul needs to do a firmware upgrade on Member-1 during business hours with zero user impact. Which sequence?
The cluster fails over correctly, but external users see a 30-second outage. Both members are healthy after failover. Most likely cause?
LS Multicast was working perfectly. The network team replaces a Cisco switch with a Juniper one. Both members go to "Active/Active" state but traffic is broken. Why?
Sneha's gateway carries 1.2M concurrent connections with high churn (many short flows). What sync interface spec does she need?
Aditya's cluster flaps Active→Standby→Active every 5 minutes. cphaprob list shows pnote "interfaces" failing intermittently. What's the diagnostic?
For a new DC build with 50k users and Cisco Catalyst 9300 switches, which ClusterXL mode + sizing is right?
Post-CVE-2024-24919, what cluster hygiene matters most?
Next up — Check Point vs Palo Alto vs Fortinet
You can now design a cluster. Next: the vendor design comparison that helps you justify the choice.