Most engineers think an HA pair means DR is covered: if one appliance dies the other takes over, so backups and break-glass can wait for next year's budget.
Wrong — HA protects against a dead box, nothing else. A botched upgrade, a deleted Smart Rule storm, ransomware or an admin mistake replicates to the secondary just as faithfully as good data. And if the whole vault is gone, every admin password is rotated, vaulted and unknown to humans — only an offline break-glass escrow gets you in. HA, backups and DR answer three different failures; a tier-0 vault needs all three.
① Deployment shapes — vault pairs, cold spares, brokers and Atlas
Aditya at HCL has just been handed the deployment decision: the new Password Safe estate will hold every domain admin, every root password, every firewall enable secret in the company. The uncomfortable maths: the vault is now the single most attractive failure point in the building. The deployment shape you pick decides what happens the night it dies — so let's walk the shapes BeyondTrust actually ships, not the marketing diagram.
For Password Safe on-prem, the unit is the U-Series appliance. Resilience comes in two flavours. An HA pair is two appliances: the primary serves everything, internal database replication copies data across, and a heartbeat from the primary tells the secondary when to take over. A cold spare is the budget cousin: a standby appliance you restore from backup — recovery measured in hours and your data loss equals the age of the last backup. Honest numbers, lower bill.
Third shape: Password Safe Cloud (your tenant at hcl.ps.beyondtrustcloud.com). BeyondTrust runs and patches the vault; you run Resource Brokers inside your network, grouped into resource zones. Each broker installs nine services (rotation, discovery, directory auth, session monitoring and friends) and makes outbound-443-only connections to the cloud — no inbound firewall holes. Cloud moves the vault's uptime onto BeyondTrust's shoulders, but your zone is still yours: docs recommend 2+ brokers per zone, each with 16 GB RAM minimum and a 64 GB session-cache disk.
PRA (the B Series appliance, physical or virtual, usually in the DMZ) follows the same logic: a failover pair of appliances for resilience, configured under /login > Management > Failover. When the estate goes global, PRA scales out with Atlas clustering: one primary node is the configuration point, and traffic nodes in each region carry session load. TCP 443 must be open bidirectionally between all cluster appliances, and each node can publish a separate Public Address (for users) and Internal Address (for appliance-to-appliance sync). The Atlas capacity tier is real scale: 300–3,000 users and 50,000–250,000 endpoints.
Sizing is the part interviews love because it is money. Sessions: a Jumpoint host with 8–12 cores and 32–64 GB RAM handles roughly 20–25 concurrent RDP sessions (or ~200 SSH/Telnet) — recording RDP with your own tools costs about 4 cores per 5 sessions. Recordings: session logs and recordings stay on the PRA appliance for up to 90 days, then you export via the Integration Client (recordings as .flv, logs as .xml) to SQL Server or a file share. Undersize the broker's 64 GB session-cache disk and session monitoring dies first — a classic week-one surprise.
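The sizing arithmetic above can be turned into a quick planning sketch. It uses only the figures quoted in this lesson (20 RDP sessions per Jumpoint host as the conservative end of the range, roughly 200 SSH/Telnet sessions, 4 cores per 5 self-recorded RDP sessions). The function names are hypothetical, and the idea that RDP and SSH load add up linearly on one host is an assumption made for illustration, not vendor guidance.

```python
import math

# Figures quoted in the lesson; treat them as planning estimates, not guarantees.
RDP_SESSIONS_PER_JUMPOINT = 20   # conservative end of the 20-25 range
SSH_SESSIONS_PER_JUMPOINT = 200  # approximate SSH/Telnet figure
CORES_PER_5_RECORDED_RDP = 4     # recording RDP with your own tools


def jumpoints_needed(rdp_sessions: int, ssh_sessions: int = 0) -> int:
    """Rough Jumpoint host count for a concurrent-session target.

    Assumption: RDP and SSH load share a host linearly.
    """
    rdp_load = rdp_sessions / RDP_SESSIONS_PER_JUMPOINT
    ssh_load = ssh_sessions / SSH_SESSIONS_PER_JUMPOINT
    return max(1, math.ceil(rdp_load + ssh_load))


def recording_cores(recorded_rdp_sessions: int) -> int:
    """Extra cores for self-recorded RDP, counted in blocks of five sessions."""
    return math.ceil(recorded_rdp_sessions / 5) * CORES_PER_5_RECORDED_RDP


print(jumpoints_needed(rdp_sessions=90, ssh_sessions=100))  # 5 hosts
print(recording_cores(recorded_rdp_sessions=90))            # 72 cores
```

Rounding up at every step is deliberate: an undersized host fails during the busiest hour, which is exactly when the sizing question gets asked again.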
Four shapes you will quote in design meetings
These four words decide your 2 AM experience.
HA pair: Two U-Series appliances, DB replication, heartbeat. Box dies, twin takes over in minutes. So: covers hardware death, not bad data.
Cold spare: Standby appliance restored from backup. RTO in hours, RPO = last backup. So: the honest budget option; write the numbers down.
Resource Broker: Brokers dial OUT on 443 to PS Cloud. 2+ per zone or local rotation and proxy stop. So: cloud moves the vault, not your duty.
Atlas cluster: Primary node owns config; traffic nodes serve regions; 443 both ways. So: global PRA scale with one control point.
Aditya at HCL moves Password Safe to the cloud (hcl.ps.beyondtrustcloud.com). Which piece still runs inside HCL's datacenter?
Pause & Predict
Predict: if BeyondTrust runs the vault in the cloud, what still breaks when YOUR datacenter loses power tonight? Type your guess.
② HA mechanics — heartbeat, replication and the failover you rehearse
The U-Series HA model is active/passive. The primary serves users, agents and APIs; the secondary replicates databases and otherwise stays quiet. The trigger is beautifully simple: the primary sends a heartbeat, and when that heartbeat stops arriving, the secondary takes over. Silence is the signal — no human presses a failover button at 2 AM.
Now the gotcha that fails real drills: HA replicates only the databases of features that were enabled when HA was configured. Enable Password Safe (or Secrets Safe) six months after pairing, and its database quietly never joins replication — the HA dashboard still says Healthy, because the pair itself is healthy. The rule to tattoo: enable features first, pair last — or re-establish HA after turning anything new on.
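The "enable first, pair last" rule is easy to audit once the dates are written down. The sketch below is a minimal illustration with a hand-maintained inventory; the feature names and dates are hypothetical, and nothing here queries the appliance. If HA was re-established at some point, use the date of the most recent pairing.

```python
from datetime import date
from typing import Dict, List


def unreplicated_features(paired_on: date, enabled_on: Dict[str, date]) -> List[str]:
    """Features enabled after the HA pairing date.

    Per the lesson, these never joined replication unless HA was re-established.
    """
    return sorted(name for name, day in enabled_on.items() if day > paired_on)


# Hypothetical inventory, for illustration only.
inventory = {
    "BeyondInsight": date(2025, 1, 10),
    "Password Safe": date(2025, 7, 1),
    "Secrets Safe": date(2026, 5, 1),
}
print(unreplicated_features(date(2025, 1, 15), inventory))
# ['Password Safe', 'Secrets Safe']
```

An empty list is the only acceptable result before a drill; anything else means re-establishing the pair first.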
▶ One failover, second by second
Power off the primary in a drill window and watch what is automatic and what is not.
Meera at ICICI faces this
Quarterly DR drill: the team powers off the primary U-Series appliance. The secondary takes over, BeyondInsight loads — but the Password Safe menu is empty. No managed systems, no managed accounts. The HA dashboard said Healthy all year.
Password Safe was licensed and enabled months AFTER the HA pair was configured. U-Series HA replicates only the databases of features enabled at pairing time — the Password Safe database never joined replication, and nothing alarms about it.
On the U-Series appliance management software, compare the enabled-features list against the date HA was configured; confirm the Password Safe database exists on the primary but is absent on the secondary.
U-Series Appliance Management > High Availability (paired features) · BeyondInsight > Configuration
Re-establish the HA pairing now that all features are enabled, let the initial replication complete fully, then schedule a fresh drill window.
Power off the primary again inside the window: the secondary must show managed systems and accounts, one test checkout must succeed, and a lab credential must rotate cleanly.
Announce the window → power off (do not gracefully migrate) the primary → confirm takeover → run smoke tests: POST Auth/SignAppIn returns 200, GET Configuration/Version answers, one test-account checkout succeeds, one RDP session rides the proxy on 4489, one lab credential rotates → fail back → write the actual timings next to the RTO you promised management. A failover you have never rehearsed is a diagram, not a capability.
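The first two smoke tests in that sequence can be scripted so the drill produces evidence rather than memories. This is a sketch under stated assumptions: the two endpoint paths and the `PS-Auth key=...; runas=...;` header shape are taken from this lesson's own sample request, while the function names, the injectable `send` transport and the result dictionary are hypothetical. Checkout, proxy and rotation checks are left manual here.

```python
import http.cookiejar
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status, body text)
Transport = Callable[[str, str, Dict[str, str]], Response]


def ps_auth_header(api_key: str, run_as: str) -> str:
    """Build the Authorization value in the shape the lesson's request shows."""
    return f"PS-Auth key={api_key}; runas={run_as};"


def make_urllib_transport(timeout: float = 10.0) -> Transport:
    """Real transport; the cookie jar keeps the session that SignAppIn opens."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
    )

    def send(method: str, url: str, headers: Dict[str, str]) -> Response:
        body = b"" if method == "POST" else None
        request = urllib.request.Request(url, data=body, method=method, headers=headers)
        try:
            with opener.open(request, timeout=timeout) as reply:
                return reply.status, reply.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            return err.code, ""  # 4xx/5xx become a status, not an exception

    return send


def smoke_test(base_url: str, api_key: str, run_as: str,
               send: Transport) -> Dict[str, Optional[str]]:
    """Run the two API checks from the drill against the new active node."""
    headers = {"Authorization": ps_auth_header(api_key, run_as)}
    sign_status, _ = send("POST", f"{base_url}/Auth/SignAppIn", headers)
    if sign_status != 200:
        return {"sign_in": "FAIL", "version": None}
    ver_status, ver_body = send("GET", f"{base_url}/Configuration/Version", headers)
    version = json.loads(ver_body).get("Version") if ver_status == 200 else None
    return {"sign_in": "OK", "version": version}
```

In a drill you would call `smoke_test(base_url, key, run_as, make_urllib_transport())` and paste the result next to the measured takeover time. Passing the transport in as a parameter keeps the check testable in a lab without touching a real vault.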
POST https://ps.hcl-lab.in/BeyondTrust/api/public/v3/Auth/SignAppIn
Authorization: PS-Auth key=c0ffee9a...e1; runas=HCL\svc-drcheck;
GET https://ps.hcl-lab.in/BeyondTrust/api/public/v3/Configuration/Version
HTTP/1.1 200 OK <- SignAppIn: session established on the NEW active node
HTTP/1.1 200 OK
{ "Version": "25.2.0.1234" }
-> vault is answering after takeover; now run one checkout + one rotation test
A U-Series HA pair replicates…
Pause & Predict
Predict: you enabled Secrets Safe last month; the HA pair was configured last year. The primary dies tonight. What exactly is missing tomorrow? Type your guess.
③ Backups & upgrades — copies you can restore, windows you survive
Two backup species, two jobs. A config backup captures settings, policies and wiring — small, fast, take it before every change window. A full backup carries the databases: managed systems, accounts, the credential store, audit history, recordings metadata — large, scheduled, and the thing you rebuild a vault from. The U-Series Business Continuity guidance and every grown-up DR standard agree on placement: backups do not live on the appliance. Encrypted copies go to separate storage, ideally a second site — because whatever kills the appliance (fire, ransomware, a disk controller with opinions) kills anything stored on it in the same instant.
Symptom: after an appliance failure, the team goes to restore and discovers every backup lived on the appliance's own disk — same blast radius, all gone. Second symptom, quieter: backups exist off-box but nobody has ever restored one, and the first restore attempt happens during the outage. Fix: off-appliance encrypted storage at a second site, plus a calendar entry that restores one backup into a lab every quarter. A backup you never restored is a rumour.
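Both symptoms above are checkable before the outage. The sketch below is a minimal hygiene check over a hand-maintained backup record; the `BackupRecord` fields, the 24-hour freshness limit and the 92-day restore-test interval are assumptions chosen for illustration, not values from BeyondTrust guidance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupRecord:
    taken_at: datetime
    location: str                        # "appliance" or an off-box label
    encrypted: bool
    last_restore_test: Optional[datetime]


def backup_findings(backup: BackupRecord, now: datetime,
                    max_age: timedelta = timedelta(hours=24),
                    restore_every: timedelta = timedelta(days=92)) -> List[str]:
    """List what would hurt on disaster day; an empty list means no finding."""
    findings = []
    if backup.location == "appliance":
        findings.append("stored on the appliance: same blast radius")
    if not backup.encrypted:
        findings.append("not encrypted")
    if now - backup.taken_at > max_age:
        findings.append("older than the agreed backup interval")
    tested = backup.last_restore_test
    if tested is None or now - tested > restore_every:
        findings.append("no restore test within the last quarter")
    return findings


now = datetime(2026, 6, 10, 9, 0)
record = BackupRecord(datetime(2026, 6, 10, 1, 0), "appliance", True, None)
for finding in backup_findings(record, now):
    print(finding)
```

The sample record is fresh and encrypted yet still produces two findings, which is the point of the paragraph above: recency alone proves nothing about recoverability.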
Upgrades are the other planned emergency. The field-tested order for an on-prem U-Series estate: (1) suspend HA failover first — otherwise a mid-upgrade reboot looks exactly like a dead primary and the pair flips underneath you; (2) update SUPI (the update installer itself) before the appliance software; (3) upgrade the appliance/BeyondInsight + Password Safe; (4) resume HA and verify replication; (5) then the outer ring — Resource Brokers, agents and Jump Clients (PRA pushes client auto-upgrades in bandwidth-throttled waves), and desktop consoles last. Servers before agents, agents before consoles.
Version policy is part of the plan, not trivia: direct upgrades to BI/PS 25.2 are supported from 23.2 or later — older estates hop through an intermediate version first — and the platform wants SQL Server 2016 SP2+. Watch deprecations in the release notes too (mTLS is being phased out; client certificates are being retired as an API auth method). And remember the patching split from the CVE lesson: cloud tenants were auto-patched on 2024-12-16 during the CVE-2024-12356 emergency, while on-prem owners applied the fix themselves via the /appliance interface. Self-hosted means you own the patch SLA.
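The direct-upgrade rule can be encoded as a pre-flight check in the change plan. This sketch assumes only what the lesson states (direct upgrades to 25.2 from 23.2 or later); the function names are hypothetical, and it deliberately does not name an intermediate version, because that must come from the release notes.

```python
from typing import Tuple

MIN_DIRECT_SOURCE = (23, 2)  # lowest version the lesson cites for a direct jump
TARGET = (25, 2)


def parse_version(text: str) -> Tuple[int, int]:
    """Reduce a dotted version such as '25.2.0.1234' to (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def upgrade_path_to_25_2(current: str) -> str:
    """Classify the upgrade path from the installed BI/PS version to 25.2."""
    version = parse_version(current)
    if version >= TARGET:
        return "already at or beyond 25.2"
    if version >= MIN_DIRECT_SOURCE:
        return "direct"
    return "intermediate hop required"


print(upgrade_path_to_25_2("22.4"))    # intermediate hop required
print(upgrade_path_to_25_2("23.2.1"))  # direct
```

Comparing (major, minor) tuples avoids the classic string-comparison bug where "23.10" sorts before "23.2".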
Rahul at Infosys faces this
The morning after the PRA appliance upgrade (24.1.4 → 24.2.3), 40 of 100 Jump Clients show Active [Offline]. On the affected machines the old client is uninstalled and the new one never appeared.
The EDR (SentinelOne, in the original field report) blocked the upgrade's stop-service → uninstall → install → start sequence mid-flight, leaving those endpoints with no client at all.
Windows Event Viewer on an affected endpoint shows the service stop and then nothing; the EDR console shows blocked installer executions timestamped to the upgrade wave.
/login > Status (Jump Client list, sort by last-seen) · endpoint Event Viewer + EDR console
Whitelist the new installer hash in the EDR BEFORE the next upgrade; redeploy the offline installer from the appliance Update tab to the 40 orphans. One admin reported 100% clean upgrades after also logging off all console users before starting.
All 100 clients report online on the new version; duplicate entries cleaned by sorting on last-seen; the next upgrade gets a 5-endpoint pilot ring before the full wave.
Replacing the appliance SSL certificate looks harmless until hundreds of Jump Clients drop offline: docs say allow 24–48 hours for clients to pick up a changed certificate. Never combine a cert swap with a hostname change or an upgrade in the same window — when three things change at once, you cannot tell which one broke the fleet.
Test-NetConnection ps-secondary.hcl-lab.in -Port 4489   # RDP proxy
Test-NetConnection ps-secondary.hcl-lab.in -Port 4422   # SSH proxy
ComputerName     : ps-secondary.hcl-lab.in
RemoteAddress    : 10.20.5.12
RemotePort       : 4489
TcpTestSucceeded : True
(repeat shows 4422 True: both session-proxy listeners answering on the new active node)
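Where the drill runs from a Linux jump host or a CI job rather than a Windows console, the same listener check can be sketched with the Python standard library. The port numbers 4489 and 4422 come from this lesson; the function names are hypothetical. Note that this only proves a TCP listener answers, not that a proxied session would authenticate.

```python
import socket
from typing import Dict, Iterable


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name resolution failed
        return False


def check_session_proxies(host: str,
                          ports: Iterable[int] = (4489, 4422)) -> Dict[int, bool]:
    """Probe the RDP (4489) and SSH (4422) session-proxy listeners."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Calling `check_session_proxies("ps-secondary.hcl-lab.in")` after takeover returns one boolean per port, which is easy to log alongside the drill timings.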
Sneha at Wipro opens a 4-hour window to upgrade an on-prem U-Series HA pair to 25.2. Her first TWO moves?
Pause & Predict
Predict: a critical RS/PRA CVE drops tonight. Whose estate is patched by morning — the cloud tenant's or the on-prem team's — and why? Type your guess.
④ DR thinking — the vault is tier-0, plan for the day it is gone
Here is the trap your own success builds: once PAM is rolled out properly, no human knows any privileged password — they are vaulted, rotated, injected. Brilliant on a normal Tuesday. Catastrophic logic on disaster day: if PAM is down, nobody can log in to fix anything — including the systems PAM runs on. AD recovery needs credentials that live in the vault; the vault may need AD to authenticate its admins. That circular dependency is why the vault is tier-0 infrastructure, same shelf as your domain controllers.
The answer is the oldest control in banking, done digitally: break-glass. Keep a tiny set of emergency accounts (a domain admin, a hypervisor root, the vault's own local admin) whose credentials live in an offline escrow that does not depend on the vault. The classic form is sealed envelopes in a physical safe — the modern form is the same ceremony, digital: an encrypted file or offline password store whose unlock is split between two custodians, exactly like an SBI bank locker needing your key and the manager's key. Three non-negotiables: alarmed (any use pages the SOC — the fire-alarm glass box rings when smashed), rotated after every use (the envelope is single-use), and tested on a schedule — a sealed envelope with last year's password inside is theatre.
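The "two keys" idea can be made concrete with a textbook 2-of-2 XOR split: each custodian holds one share, and neither share alone reveals anything about the secret. This is one illustrative way to realise split unlock, not a BeyondTrust feature and not the method this lesson prescribes; the function names are hypothetical, and a real escrow still needs custody, alarming and rotation procedures around it.

```python
import secrets
from typing import Tuple


def split_secret(secret: bytes) -> Tuple[bytes, bytes]:
    """Split a secret into two shares; both are needed to recover it."""
    share_a = secrets.token_bytes(len(secret))                # random pad
    share_b = bytes(x ^ y for x, y in zip(secret, share_a))   # secret XOR pad
    return share_a, share_b


def join_shares(share_a: bytes, share_b: bytes) -> bytes:
    """Recombine the two custodians' shares into the original secret."""
    if len(share_a) != len(share_b):
        raise ValueError("shares must have the same length")
    return bytes(x ^ y for x, y in zip(share_a, share_b))


# Hypothetical passphrase for the offline escrow file.
passphrase = b"example-escrow-passphrase"
custodian_1, custodian_2 = split_secret(passphrase)
print(join_shares(custodian_1, custodian_2) == passphrase)  # True
```

The trade-off is the same as the bank locker: losing either share loses the secret, so each custodian needs a named deputy and the split must be redone after every use.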
Symptom: vault restored from last night's backup, dashboards green — then rotation jobs and checkouts start failing with wrong-password errors. Cause: every account that rotated AFTER the backup point now has a different real password than the vault remembers. That gap is your RPO, made concrete. Fix: enable Check Password / scheduled password tests to detect mismatches, then reconcile — re-rotate via the functional account, or Reset on Mismatch where configured. Budget reconciliation time into the RTO you promise.
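Finding the accounts that fell into that gap is a simple client-side filter. The sketch assumes the managed-accounts listing has already been fetched and that each entry carries `AccountName` and `LastChangeDate` fields in ISO 8601 form, as in this lesson's sample response; the function names and the third sample account are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def stale_after_restore(accounts: List[Dict[str, str]],
                        backup_taken_at: datetime) -> List[str]:
    """Accounts that rotated after the backup, so the restored vault is stale."""
    return [
        account["AccountName"]
        for account in accounts
        if parse_utc(account["LastChangeDate"]) > backup_taken_at
    ]


backup_time = datetime(2026, 6, 10, 1, 0, tzinfo=timezone.utc)
listing = [
    {"AccountName": "adm-db01", "LastChangeDate": "2026-06-10T02:14:00Z"},
    {"AccountName": "root-web04", "LastChangeDate": "2026-06-10T03:02:00Z"},
    {"AccountName": "svc-old", "LastChangeDate": "2026-06-09T22:00:00Z"},  # hypothetical
]
print(stale_after_restore(listing, backup_time))
# ['adm-db01', 'root-web04']
```

The length of that list, multiplied by the time one reconciliation takes, is the number to add to the RTO before anyone signs it.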
GET https://ps.hcl-lab.in/BeyondTrust/api/public/v3/ManagedAccounts # filter client-side: LastChangeDate newer than the restored backup's timestamp
HTTP/1.1 200 OK
[ { "AccountName": "adm-db01", "LastChangeDate": "2026-06-10T02:14:00Z" },
{ "AccountName": "root-web04", "LastChangeDate": "2026-06-10T03:02:00Z" } ]
-> both rotated AFTER last night's 01:00 backup: vault now holds stale secrets
-> queue these for password test + re-rotation before declaring recovery done
Now the management conversation, because DR is a business decision wearing a technical costume. RTO: how many hours can the company run with no checkouts, no brokered sessions, no rotations? Frozen change windows, stalled vendor access, engineers locked out: that is money per hour. RPO: how many hours of rotations and audit trail can you lose, knowing each lost rotation is a reconciliation job? The shapes map cleanly: HA pair buys minutes of RTO and near-zero RPO; cold spare buys hours and last-backup; backup-only means a day or more. Present those three rows with prices and let management pick, then get the chosen numbers signed. An RTO that lives only in your head is yours to be blamed for; one signed in writing is a budget.
2 AM: the vault is fully down and a domain controller needs an urgent fix. What gets Karthik in?
Pause & Predict
Predict: you restore the vault from last night's 01:00 backup. This morning 60 accounts rotated on schedule before the crash. What is silently broken, and what is the fix? Type your guess.
One closing sentence that lands: 'I treat the vault as tier-0 — HA pair for box failure, off-appliance restore-tested backups for data failure, offline dual-control break-glass for vault failure, and RTO/RPO numbers that management signed, not numbers I assumed.' That is the whole lesson in 35 seconds.
🧠 In your own words
Your CISO asks: if the Password Safe appliance room floods tonight, exactly how do admins log in tomorrow morning? Answer in three steps.
📖 Glossary
- U-Series appliance: BeyondTrust's hardened appliance (physical or virtual) running BeyondInsight + Password Safe on-prem.
- HA pair: Two appliances in active/passive with database replication; the secondary takes over on heartbeat loss.
- Heartbeat: The periodic signal from the primary appliance; its absence is what triggers failover.
- Cold spare: A standby appliance restored from backup when the primary dies, with hours of RTO and last-backup RPO.
- Resource Broker: On-prem worker for Password Safe Cloud (auth, discovery, rotation, session proxy) that dials out on 443 only.
- Resource zone: A group of brokers serving one network segment; Default zone is built-in; 2+ brokers per zone recommended.
- Atlas cluster: PRA scale-out in which one primary node owns configuration, traffic nodes carry regional sessions, and 443 is open both ways.
- SUPI: The U-Series software update installer package; update it first, before the appliance software.
- /appliance: The B Series appliance/OS management interface where on-prem patches are applied, separate from /login.
- RTO: Recovery Time Objective, how long the service may stay down before the business hurts.
- RPO: Recovery Point Objective, how much data (rotations, audit trail) you can afford to lose.
- Break-glass account: Emergency credential kept OUTSIDE the vault: dual-controlled, alarmed on use, rotated after every use.
📚 Sources
- BeyondTrust U-Series Deployment & Failover Guide — appliance roles, HA pair, replication + heartbeat. docs.beyondtrust.com/bips/docs/u-series-deployment-and-failover-guide
- BeyondTrust U-Series Business Continuity Guide — backup, restore and continuity planning. docs.beyondtrust.com/bips/docs/u-series-business-continuity
- BeyondTrust PRA Atlas Cluster Guide — primary + traffic nodes, bidirectional 443, public vs internal addresses. docs.beyondtrust.com/pra/docs/atlas
- Password Safe Cloud Resource Broker Installation — zones, the 9 broker services, outbound-443 model, sizing. docs.beyondtrust.com/bips/docs/ps-cloud-resource-broker-install
- BeyondInsight and Password Safe 25.2 Release Notes — direct upgrade from 23.2+, SQL Server 2016 SP2+, deprecations. docs.beyondtrust.com/bips/changelog/beyondinsight-and-password-safe-25-2-release-notes
- BeyondTrust advisory BT24-10 (CVE-2024-12356) + CISA KEV — cloud auto-patched 2024-12-16; on-prem patches via /appliance. beyondtrust.com/trust-center/security-advisories/bt24-10
- BeyondTrust PRA SSL Certificates guide — allow 24–48 h for Jump Clients to pick up a changed certificate. docs.beyondtrust.com/pra/docs/on-prem-ssl-certificates
- PeerSpot — BeyondTrust Password Safe reviews: lengthy upgrades, suspend-HA-before-upgrade field practice. peerspot.com/products/beyondtrust-password-safe-pros-and-cons
- BeyondTrust Beekeepers community — Jump Clients offline after appliance upgrade (EDR blocked reinstall). beekeepers.beyondtrust.com/general-51/jump-clients-offline-5503
What's next?
The vault survived the disaster drill — now for the everyday fires: rotations failing, sessions dropping, brokers sulking. Next lesson is the troubleshooting playbook — symptom to root cause to fix, the way interviews want it.