Most engineers think…
Most people picture SRX HA as 'an active/standby pair where the backup box just sits there'. That mental model fails both in the interview chair and when you're troubleshooting at 2 am.
SRX chassis clustering is a distributed single logical device: node0 and node1 share one configuration, one synchronised session table and one set of policies. The control link carries heartbeats and synchronises configuration and control-plane state; the fabric links synchronise session state and forward data-plane traffic from the standby node to the active node. Interfaces are bundled into redundancy groups and presented to the outside world as reth interfaces, so the network never sees a topology change during a failover. Understanding the split between RG0 (routing engine control) and RG1+ (data-plane groups), and how reth inherits its active node from its redundancy group, is what separates a confident answer from a guess.
① Cluster basics — node0, node1, control link and fabric links
An SRX chassis cluster joins exactly two SRX devices. One is assigned node0 and the other node1 via a bootstrap command (set chassis cluster cluster-id <id> node <0|1> reboot). After reboot both nodes load the same Junos configuration and behave as a single logical firewall to the rest of the network.
Two types of inter-node links connect them. The control link (typically a dedicated management-grade port, historically fxp1 on branch models) carries heartbeat pulses, configuration synchronisation and control-plane state. If heartbeats stop arriving on the control link, the cluster treats it as a failure and starts failover arbitration. The fabric links (ge/xe interfaces bound as fab0 on node0 and fab1 on node1) carry the data plane: they synchronise session state between the nodes, and when the standby node receives a packet for an active session, it forwards it across the fabric to the active node rather than dropping it.
Because both nodes share one configuration, you manage the pair from a single point. A show chassis cluster status tells you which node holds each redundancy group right now.
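As a rough sketch of the bring-up described above, the commands below form the cluster and bind the fabric links. The cluster-id value and the member interface names are assumptions for illustration; the slot number that node1's interfaces take after clustering is platform-specific, so check it for your model before copying anything.

```
# Operational mode, run once on each device (cluster-id 1 is an assumed value)
# On the device that will become node0:
set chassis cluster cluster-id 1 node 0 reboot
# On the device that will become node1:
set chassis cluster cluster-id 1 node 1 reboot

# Configuration mode, after both nodes are back up
# ge-0/0/2 and ge-5/0/2 are placeholder fabric ports (node0 and node1)
set interfaces fab0 fabric-options member-interfaces ge-0/0/2
set interfaces fab1 fabric-options member-interfaces ge-5/0/2

# Operational mode: confirm both nodes and both links are up
show chassis cluster status
show chassis cluster interfaces
```

Because the configuration is shared, the fab0 and fab1 statements are entered once on the primary node and synchronised to the other node on commit.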
In an interview, always separate the control link (heartbeat + sync) from the fabric links (data-plane forwarding). Mixing them up is the most common mistake — and interviewers test it directly.
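One way to keep the two links apart in your head is to look at their counters separately. The two operational commands below are a minimal sketch of that check; the exact counters shown vary by platform and release.

```
# Control link: heartbeat packets sent and received
show chassis cluster control-plane statistics

# Fabric link: probes plus the data-plane state objects
# synchronised between the nodes
show chassis cluster data-plane statistics
```

If heartbeat counters stop incrementing, look at the control link; if the data-plane counters stall, look at the fabric.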
What does the fabric link carry in an SRX chassis cluster?
② Redundancy groups — RG0 controls the RE, RG1+ control data
A redundancy group (RG) is the unit of failover. Every resource — interfaces, session tables — belongs to a redundancy group that is primary on exactly one node at a time. RG0 is special: it controls routing-engine (control-plane) primacy. The node that holds RG0 primary is the one whose routing protocols, management plane and jsrpd process are authoritative. You cannot enable preempt on RG0 — if you want to move RG0, you do a manual failover.
RG1 through RG127 are data-plane groups. Each one has a priority value configured per node (higher wins). When priorities tie, the lower node-id wins. Preempt, when enabled on a data-plane RG, lets the higher-priority node take back primary automatically after recovery — with an optional delay timer (introduced in Junos 17.4R1) to prevent flapping.
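A minimal sketch of the priority and preempt settings described above, assuming node0 should normally be primary for both groups. The priority values 100 and 1 are arbitrary illustrative choices.

```
# RG0: control plane. Priorities only, no preempt.
set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 1

# RG1: data plane. Higher priority wins; preempt lets node0 take primary back.
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 1
set chassis cluster redundancy-group 1 preempt
```

Without the preempt statement, RG1 stays on whichever node currently holds it, even after the higher-priority node recovers.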
Interface monitoring
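Each RG can monitor physical interfaces, each with an assigned weight (0–255). Every RG starts with a threshold of 255; when a monitored interface goes down, its weight is subtracted from that threshold, and when the threshold reaches zero the RG fails over to the other node. A single interface with weight 255 therefore triggers failover on its own, while two interfaces with weight 128 must both fail. Juniper advises not to apply interface monitoring to RG0, because interface flaps would then trigger control-plane switchovers, a disruptive event.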
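The weight arithmetic is easier to see in a configuration. The sketch below assumes the RG threshold of 255 and uses placeholder interface names: one critical uplink that should trigger failover on its own, and two less critical links that must both fail.

```
# Critical uplink: 255 - 255 = 0, so one failure moves RG1
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/3 weight 255

# Two secondary links: 255 - 128 = 127 after one failure (no failover),
# below zero after both fail (failover)
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/4 weight 128
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/5 weight 128
```

Note that every statement here sits under redundancy-group 1, not redundancy-group 0.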
Control link: Carries heartbeat pulses, configuration sync and control-plane state between node0 and node1. Loss triggers a failover arbitration.
Fabric link: Data-plane path between nodes. Synchronises session state, and packets received on the standby node are forwarded across the fabric to the active node for processing.
Redundancy group: The unit of failover, primary on one node at a time. RG0 owns the routing engine; RG1–127 own data-plane groups and reth interfaces.
reth interface: A logical pseudo-interface that spans both nodes, inheriting active/standby state from its RG. The network sees one stable IP and MAC regardless of which node is active.
Interface monitoring on RG0 means an interface flap triggers a control-plane switchover — which is far more disruptive than a data-plane failover. Use interface monitoring only on RG1+ data-plane groups.
Which redundancy group controls routing-engine (control-plane) primacy in an SRX cluster?
③ reth interfaces — one logical interface across both nodes
A reth (redundant ethernet) interface is a pseudo-interface that binds at least one physical child interface from each node. You configure the reth at the logical level (IP, zone, family) once, and the cluster decides which node's physical child is the active forwarding path based on the redundancy group the reth belongs to.
When the redundancy group fails over, the reth's active child switches from node0's physical port to node1's physical port, but the IP address, MAC address and zone membership stay identical. To the upstream switch or router the next hop does not change: no new ARP entry, no topology update, no route recalculation. The newly active node sends gratuitous ARPs so that the adjacent switch relearns the reth MAC on the new port. That is the key interview line: reth makes the failover invisible to the network.
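The IP and MAC stay the same, but the adjacent switch still has to learn that the MAC now sits behind a different port. The cluster handles this by sending gratuitous ARPs from the newly primary node, and the count is tunable per redundancy group. The value 4 below is shown only for illustration.

```
# Number of gratuitous ARPs the new primary sends for RG1's reth interfaces
set chassis cluster redundancy-group 1 gratuitous-arp-count 4
```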
A reth interface inherits the active/standby state from its redundancy group. You bind a reth to an RG with set interfaces reth<X> redundant-ether-options redundancy-group <Y>; the reth number and the RG number are independent. You can also set redundant-ether-options minimum-links 1 to keep the reth up as long as one child is alive. Multiple reth interfaces can belong to the same RG, and one RG can be primary on a different node from another RG, so you can load-balance data-plane groups across the two nodes.
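Put together, a single reth might be built roughly as follows. The child interface names, the address 192.0.2.1/24 and the zone name trust are all assumptions for illustration; only the statement hierarchy is the point.

```
# Tell the cluster how many reth interfaces to create
set chassis cluster reth-count 2

# One child from each node (placeholder names)
set interfaces ge-0/0/3 gigether-options redundant-parent reth0
set interfaces ge-5/0/3 gigether-options redundant-parent reth0

# Bind the reth to RG1 and configure it once, at the logical level
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 unit 0 family inet address 192.0.2.1/24
set security zones security-zone trust interfaces reth0.0
```

The physical children carry no IP or zone configuration of their own; everything logical lives on reth0.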
A reth interface fails over from node0 to node1. What does the upstream router need to do?
④ Failover triggers, timing and hardening with dual control links
A redundancy group failover happens for three reasons: heartbeats are lost (the surviving node concludes its peer has failed and takes primary; if only the control link is down while the fabric is still up, the secondary node is disabled instead, which prevents split-brain), interface monitoring drives the RG threshold to zero (configured weights on the RG), or a manual failover is triggered (request chassis cluster failover redundancy-group <rg-number> node <0|1>). Session sync over the fabric means established TCP/UDP sessions survive a data-plane RG failover with minimal interruption.
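A manual failover of RG1 to node1 could be run roughly like this from operational mode. The RG and node numbers are example values.

```
# Check who is primary before touching anything
show chassis cluster status

# Move RG1 to node1
request chassis cluster failover redundancy-group 1 node 1

# Confirm RG1 primacy moved
show chassis cluster status

# Clear the manual-failover state so normal priorities apply again
request chassis cluster failover reset redundancy-group 1
```

The reset step is easy to forget: until it is run, the manual failover flag stays set on that redundancy group.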
Dual control links
On high-end SRX platforms (SRX5600, SRX5800) you can configure dual control links. The jsrpd process sends and receives heartbeats on both links simultaneously. If one control link fails, the other keeps the cluster alive — preventing a spurious split-brain failover caused by a cable or port fault. This removes the control link as a single point of failure and is strongly recommended for carrier or data-centre deployments.
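As a loose sketch only, dual control links on these platforms are defined by naming a control port on each node for each link. The FPC and port numbers below are placeholders that assume an SRX5800-style slot layout; the right values depend on where the cards sit in your chassis, and the platform has hardware prerequisites for the second link that are not covered here.

```
# First control link (placeholder slots: node0 FPC 1, node1 FPC 13)
set chassis cluster control-ports fpc 1 port 0
set chassis cluster control-ports fpc 13 port 0

# Second control link (placeholder slots: node0 FPC 2, node1 FPC 14)
set chassis cluster control-ports fpc 2 port 1
set chassis cluster control-ports fpc 14 port 1

# Operational mode: heartbeat counters for the control links
show chassis cluster control-plane statistics
```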
Preempt delay (Junos 17.4R1+) adds a configurable wait (1–21 600 seconds) before a recovering high-priority node reclaims primary, giving convergence time to settle before another RG move. Always test failover in a maintenance window: show chassis cluster status before, trigger, then verify RG primacy and traffic flow after.
Rohan at a Mumbai financial services firm faces this
After a scheduled maintenance window, the SRX cluster keeps oscillating — RG1 flips between node0 and node1 every few minutes, causing brief traffic interruptions.
Preempt is enabled on RG1 but no preempt delay is set; node0 (higher priority) keeps recovering and immediately reclaiming primary before convergence stabilises.
show chassis cluster status shows rapid RG1 primary changes; show log messages reveals repeated jsrpd preempt events triggered by node0 coming back online.
show chassis cluster status ▸ show log messages ▸ chassis cluster configuration ▸ redundancy-group 1
Add a preempt delay of 60–120 seconds: set chassis cluster redundancy-group 1 preempt delay 90. This gives node0 time to fully converge before it reclaims RG1 primary.
Commit, trigger a test failover, confirm RG1 waits the configured delay before moving back to node0; traffic interruptions drop to a single brief event per failover.
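Rohan's change and check could look roughly like the sequence below. The 90-second delay comes from the fix above; the rest is a sketch of one way to run the test, not a prescribed procedure.

```
# Configuration mode
set chassis cluster redundancy-group 1 preempt delay 90
commit

# Operational mode: force RG1 to node1, then release the manual failover
request chassis cluster failover redundancy-group 1 node 1
request chassis cluster failover reset redundancy-group 1

# Watch RG1 stay on node1 during the delay, then return to node0
show chassis cluster status
show chassis cluster information
```

Repeating show chassis cluster status a few times across the delay window is the simplest way to confirm that RG1 waits before moving back.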
Run 'show chassis cluster status' before and after a manual failover. Confirm RG primacy moved, traffic resumed, and 'show security flow session' shows sessions re-synced. Never declare HA working without a live test.
An SRX cluster keeps flip-flopping — RG1 moves back and forth between nodes every few minutes. Which feature would you enable to dampen this?
🧠 In your own words
Type one line: what is the difference between a control link and a fabric link in an SRX chassis cluster? Then compare with the expert version.
📖 Glossary
- Chassis cluster: An HA configuration pairing two SRX devices (node0 and node1) to act as a single logical firewall with shared configuration, session tables and policies.
- Control link: The dedicated inter-node link carrying cluster heartbeat, configuration synchronisation and control-plane state.
- Fabric link: The data-plane inter-node link that synchronises session state and forwards traffic from the standby node to the active node when a packet arrives on the wrong node.
- Redundancy group (RG): The unit of failover, primary on exactly one node at a time. RG0 owns control-plane primacy; RG1–127 own data-plane resources and reth interfaces.
- reth interface: A redundant ethernet pseudo-interface spanning both nodes, presenting one stable IP and MAC to the network. Active/standby state is inherited from its redundancy group.
- Preempt: A redundancy-group setting that allows the higher-priority node to automatically reclaim primary after recovery. Not available on RG0.
- Dual control links: Two physical control links between nodes (SRX5600/SRX5800 only); jsrpd heartbeats run on both, so one cable failure cannot cause split-brain.
- Interface monitoring: A per-RG mechanism that assigns weights to physical interfaces; each failed interface's weight is subtracted from the RG threshold of 255, and when the threshold reaches zero the RG fails over. Recommended for RG1+ only, not RG0.
- jsrpd: The Juniper Services Redundancy Protocol process, which manages heartbeats, session sync and redundancy group state across cluster nodes.
- Split-brain: A failure mode where both cluster nodes believe the other is dead and both attempt to hold primary for the same RG, potentially causing traffic black holes.
📚 Sources
- Juniper Networks — Chassis Cluster Security Devices: Overview and Configuration. juniper.net/documentation/us/en/software/junos/chassis-cluster-security-devices/
- Juniper Networks — Chassis Cluster Redundancy Groups. juniper.net/documentation/en_US/junos/topics/topic-map/security-chassis-cluster-redundancy-groups.html
- Juniper Networks — Chassis Cluster Redundant Ethernet Interfaces (reth). juniper.net/documentation/us/en/software/junos/chassis-cluster-security-devices/topics/topic-map/security-chassis-cluster-redundant-ethernet-interfaces.html
- Juniper Networks — Configuring Cluster Failover Parameters (preempt, delay timer, interface monitoring). juniper.net/documentation/us/en/software/junos/chassis-cluster-security-devices/topics/topic-map/security-chassis-cluster-failover-parameters.html
- Juniper Networks — Chassis Cluster Dual Control Links (SRX5600/SRX5800). juniper.net/documentation/us/en/software/junos/chassis-cluster-security-devices/topics/topic-map/security-chassis-cluster-dual-control-links.html
- Juniper Networks — Troubleshoot Redundancy Group Failover in a Chassis Cluster. juniper.net/documentation/us/en/software/junos/chassis-cluster-security-devices/topics/task/troubleshoot-srx-chassis-cluster-redundancy-group-not-failing-over.html
What's next?
Got clustering down? Next, master SRX security zones, policies and address books — the building blocks you configure on top of your HA pair.