
BeyondTrust Troubleshooting Playbook: Rotation Failures, Offline Jump Clients & Session Errors

When BeyondTrust breaks at 2 AM, guessing is the enemy. Every failure — a rotation that died, a session that will not start, a Jump Client that vanished — lives in one of a few layers, and the blast radius tells you which one. Symptom in, layer found, fix out: this is the triage hub for the whole series.

📅 2026-06-10 · ⏱ 14 min · 3 live demos · 4 infographics · 🏷 10-Q assessment + AI Tutor inline

🎯 By the end you will be able to


Pick where you want to start

1. The triage ladder: rotation, discovery & release failures, layer by layer.
2. Session failures: will not start, no recording, laggy; find the side.
3. PRA & Jump issues: offline clients, Jumpoints, Web Jump, SAML logins.
4. Quick hits + escalate: EPM/PMUL fixes & the support evidence pack.

🧠 Warm-up — 3 questions, no score

Just notice which ones make you pause. We answer all three inside the lesson.

1. Every account on one platform stops rotating at the same time. The smartest first suspect?

Answered in The triage ladder.

2. Your SSH session through Password Safe refuses to connect on port 22. Most likely reason?

Answered in Session failures.

3. A Jump Client shows offline but the laptop is on and browsing fine. Which direction does the client connect?

Answered in PRA & Jump issues.

Most engineers think…

Most engineers think troubleshooting BeyondTrust means restarting services, re-onboarding accounts and reinstalling agents until the error goes away — and blaming "the network" when it doesn't.

Wrong — and it makes outages longer. Every BeyondTrust failure lives in a small set of layers: the functional account, the directory, the proxy ports, the outbound 443 path, a certificate, or an agent service. The blast radius (one victim vs many) names the layer, the log confirms it, and one targeted change fixes it. Restarts and reinstalls are the LAST rung of the ladder — a blind Jump Client redeploy even creates duplicate entries and orphans recordings.

① The triage mindset + the Password Safe ladder

It's 23:42 IST. Sneha, the L2 PAM admin at ICICI's infra team, gets the call: "password rotations are failing." Her weakest move would be opening the first red row and clicking Rotate Now repeatedly. Her strongest move costs ten seconds: count the victims. One account failing is an account problem. Every account on one platform failing is a functional account problem, because that one shared worker performs every change on the platform. Everything failing everywhere is an appliance or agent problem. The blast radius names the layer — like a building electrician who checks the main MCB when the whole house goes dark, instead of testing every bulb one by one.

That is the whole triage mindset, and it carries across every product in this series: ask what changed (an upgrade? a cert swap? a GPO push?), ask which layer the blast radius points at, read that layer's log first, and change one variable at a time. BeyondTrust gives you a precise log for each layer — the skill is knowing which one to open.

Figure 1 — Blast radius → layer
Three symptom cards on the left map to three layers on the right; each layer names the log to read first.
ONE account fails (e.g. identityValue null on rotate) → Account layer: is the SID resolvable, and does the account have rights on the target? Read first: the per-account error in the failed-changes queue.
ALL accounts on ONE platform fail (e.g. 212 Linux accounts red at 23:30 UTC) → Functional-account layer: the shared worker credential. Read first: the Test Functional Account result and the AD lockout state.
EVERYTHING fails everywhere (rotations, sessions and scans all sick) → Appliance & agents layer: services, queue, disk. Read first: the Password Change Agent log and appliance health.
Key insight: count the victims before touching anything. One victim = its own problem. Many victims sharing ONE thing = that shared thing is the problem. The functional account is the most common shared thing in Password Safe, so test it before touching 212 accounts one by one.
Match the left card to the right layer before touching anything. The amber row is the highest-value test in all of Password Safe: one functional-account check explains (or eliminates) hundreds of red accounts at once.
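
The victim count can be turned into a first guess mechanically. The sketch below is illustrative only: the function name, the input shapes and the strict "every account on the platform failed" rule are assumptions made for this example, not anything Password Safe exposes.

```python
from collections import Counter

def layer_for_blast_radius(failed, totals):
    """Return the layer to inspect first.

    failed: list of (account, platform) pairs whose rotation failed.
    totals: dict mapping platform -> number of managed accounts on it.
    Both shapes are hypothetical; build them from whatever report you have.
    """
    if not failed:
        return "none"
    failed_per_platform = Counter(platform for _account, platform in failed)
    # A platform counts as fully down when every managed account on it failed.
    fully_down = [p for p, total in totals.items()
                  if total > 0 and failed_per_platform.get(p, 0) >= total]
    if len(totals) > 1 and len(fully_down) == len(totals):
        return "appliance and agents"
    if fully_down:
        return "functional account"
    return "account"

totals = {"linux": 212, "windows": 90}
linux_wave = [(f"acct{i}", "linux") for i in range(212)]
print(layer_for_blast_radius(linux_wave, totals))  # functional account
```

In practice a platform rarely fails at exactly 100%, so a real version would probably use a threshold rather than strict equality.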

The four triage questions

Ask these in order on EVERY ticket, for every BeyondTrust product.

🕐 What changed?
Upgrades, cert swaps, GPO pushes and EDR updates cause most outages. No change found? Widen the window, because someone changed something. So: check change tickets first.

💥 How wide is the blast?
One victim = its problem. Many victims = their shared layer (FA, appliance, proxy, IdP). Counting victims is free and names the layer.

📜 What does the log say?
Change Agent log for rotation, scanner log for discovery, Sessions grid for proxies, endpoint Event Viewer for agents. Exact error strings are gold: copy them verbatim.

🧪 Can I reproduce it safely?
One manual Rotate Now, one test session, one pbrun --testmaster. A controlled experiment beats ten theories. So: reproduce, then fix.

The rotation ladder — start at the functional account

Rotation failures have a strict pecking order. Rung 1: the functional account. It needs real rights on the target — local admin or the delegated change-password right on Windows (plus read/write of the lockout attribute in AD), root or working sudo elevation on Linux, and the correct privileged role on appliances like F5. If the FA got locked out by a bad script, or someone "helpfully" onboarded the FA itself as a managed account so Password Safe rotated its own worker's password mid-job, every account on the platform fails together. The society-watchman analogy from lesson 4 holds: he carries the master key that re-keys every flat — change his key without telling him and no lock in the building can be changed.

Rung 2: the account itself. The field's most famous string is Value cannot be null (Parameter 'identityValue') — Password Safe cannot resolve the account's SID in the directory because the account was renamed, deleted, or onboarded from the wrong domain path. Rung 3: connectivity — can the appliance or Resource Broker reach the target's management port at all? Rung 4: platform quirks — an F5 BIG-IP rotation that dies with "Thread was interrupted from a waiting state" because the FA holds a non-Administrator role or the custom platform's prompt regex is wrong. Same ladder every time: worker → identity → wire → quirk.
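
The ladder can be sketched as a lookup from the copied error string to a rung. This is a minimal illustration: the two substrings come from this lesson, while the function name and the fallback to the connectivity rung for anything unrecognised are assumptions.

```python
# Error substrings quoted in this lesson, mapped to the rung they point at.
ERROR_RUNGS = [
    ("identityValue", "rung 2: identity (SID not resolvable)"),
    ("Thread was interrupted from a waiting state",
     "rung 4: platform quirk (FA role or prompt regex)"),
]

def rotation_rung(error_text, platform_wide):
    """Pick the rung to start on for one failed rotation."""
    if platform_wide:
        # Many victims sharing one platform: start with the shared worker.
        return "rung 1: functional account"
    for needle, rung in ERROR_RUNGS:
        if needle in error_text:
            return rung
    # Assumption: an unrecognised single-account error is treated as a
    # connectivity question until the log says otherwise.
    return "rung 3: connectivity (can the appliance reach the management port?)"

print(rotation_rung("Value cannot be null. (Parameter 'identityValue')", False))
# rung 2: identity (SID not resolvable)
```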

👉 So far: count the victims → platform-wide means functional account, single account means SID/rights. Next: where the logs and retry machinery live.
Figure 2 — The rotation decision tree
A decision tree with one root question: rotation failed, one account or the whole platform?
ONE account → "Value cannot be null (identityValue)" means the SID is not resolvable (renamed, deleted, or wrong domain path), so re-onboard correctly. SID fine? Move on to rights and platform quirks: F5 account role and prompt regex, Linux sudo path.
WHOLE platform → Test the FUNCTIONAL ACCOUNT, the shared worker that performs every change.
  FAIL: the FA itself is the patient. Check AD lockout or an expired password, a lost local-admin or change-password right, or a "client credentials config not found" error (recreate the FA).
  PASS: the FA is fine, so look outward. Check the port to the target (22 / 5985 / 1433), the Change Agent retry queue depth, and run one manual Rotate Now as a controlled experiment.
Note box: Reset Password on Mismatch is NOT real-time; it fires only when a password TEST runs. If someone changed the password out-of-band and nothing happened, enable Check Password and schedule the Password Test Agent. The deep dive on rotation design lives in lesson 6; this tree is for the 2 AM fix.
One question — one account or the whole platform? — picks the branch. Note the amber box at the bottom: out-of-band password changes are only detected when a password TEST runs, never in real time.

Where do you read all this? Two agents own rotation health, both under Configuration → Privileged Access Management Agents in the BeyondInsight console. The Password Change Agent executes queued changes and retries failures per Retry failed changes after (minutes) and Maximum retries — a silently growing failed-changes queue is your earliest smoke alarm, so alert on it. The Password Test Agent verifies stored passwords on a schedule — and it is the only thing that makes "Reset Password on Mismatch" fire.

🖥️ This is the screen that owns rotation retries — BeyondInsight → Configuration → Privileged Access Management Agents → Password Change Agent. (Recreated for clarity — your console matches this.)
beyondinsight.icici-infra.in · Configuration → PAM Agents
Agent: Password Change Agent
1. Retry failed changes after (minutes): 10
2. Maximum retries: 3
3. Unlimited Retries: Off
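
To reason about how long a failed change stays in the retry queue, the two settings can be turned into timestamps. This is a back-of-envelope sketch that assumes retries are evenly spaced from the first failure; the agent's real scheduling may differ, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

def retry_schedule(first_failure, retry_after_minutes, maximum_retries):
    """Estimated retry times, assuming a fixed gap after the first failure."""
    step = timedelta(minutes=retry_after_minutes)
    return [first_failure + step * n for n in range(1, maximum_retries + 1)]

# Settings from the screen: retry after 10 minutes, maximum 3 retries.
attempts = retry_schedule(datetime(2026, 6, 10, 23, 30), 10, 3)
print(attempts[-1])  # 2026-06-11 00:00:00
```

Under that assumption, a change that first fails at 23:30 has used up its retries by midnight, so a queue entry still failing after that point needs a human rather than more waiting.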
Common mistake — "Reset on Mismatch did nothing"

Symptom: a server admin changed a password directly on the box; Password Safe kept serving the old one for two days and nobody was alerted. Cause: mismatch detection is NOT real-time — it only happens when a password test runs. Fix: enable Check Password on the accounts and schedule the Password Test Agent; pair with Change Password After Release so checked-in credentials are always re-keyed. It's the milk-delivery rule: the milkman only notices your gate lock changed on his next scheduled visit.

Sneha at ICICI faces this

A freshly onboarded domain account fails its forced first rotation with "Password change failed. Error: Value cannot be null. (Parameter 'identityValue')" — every other account on the platform rotates fine.

Likely cause

Password Safe cannot resolve the account's SID in the directory: the account had been renamed by the AD team a week before onboarding, and the import used a stale CSV with the old samAccountName and the wrong OU path.

Diagnosis

Single-account blast radius → account layer, not the FA. The per-account error in the failed-changes queue names identityValue, which always means SID resolution, not rights or network.

BeyondInsight → Managed Accounts → filter the account → Go to Advanced Details (then run a Directory Query for the account name to confirm what AD actually has)
Fix

Delete the stale managed account, re-onboard it from the correct directory path via a Directory Query-based Smart Rule, and confirm the managed system's domain mapping points at the right domain.

Verify

Manual Rotate Now succeeds; the Password Change Agent failed-changes queue entry clears and stays clear past the next scheduled ChangeTime (23:30 UTC by default).

Password Safe REST API — prove the appliance + your API path are alive in two calls
# 1) Sign in (PS-Auth header: key + runas; pwd sits in square brackets)
#    Single quotes keep "!" and "\" literal in an interactive bash shell.
curl -s -c /tmp/ps.jar -X POST \
  -H 'Authorization: PS-Auth key=<128-char-api-key>; runas=icici\svc-psapi; pwd=[S3cure!Pass];' \
  https://ps.icici-infra.in/BeyondTrust/api/public/v3/Auth/SignAppIn

# 2) Ask the platform its version — the cheapest end-to-end health check
curl -s -b /tmp/ps.jar https://ps.icici-infra.in/BeyondTrust/api/public/v3/Configuration/Version
Expected output
HTTP 200 — signed in, session cookie stored
{"Version":"25.2.0.184"}
# Got 401 + a WWW-Authenticate-2FA header instead? That is the DESIGNED 2FA
# challenge, not an outage — resend SignAppIn with challenge= in the header.
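
If you script this check, the header is the part most often mistyped. The helper below is a small sketch that only assembles the string in the layout used in the curl call above; the function name is hypothetical, and treating the password as optional is an assumption for API registrations that do not require it.

```python
def ps_auth_header(api_key, runas, password=None):
    """Build the value of the Authorization header for SignAppIn."""
    parts = [f"key={api_key}", f"runas={runas}"]
    if password is not None:
        # The password is wrapped in square brackets, as in the curl example.
        parts.append(f"pwd=[{password}]")
    return "PS-Auth " + "; ".join(parts) + ";"

print(ps_auth_header("<128-char-api-key>", "icici\\svc-psapi", "S3cure!Pass"))
# PS-Auth key=<128-char-api-key>; runas=icici\svc-psapi; pwd=[S3cure!Pass];
```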

Discovery returns nothing & portal release issues

Two more Password Safe classics live on this ladder. First, the discovery scan that lies: status says "completed successfully" but zero accounts or services appear. A detailed scan must authenticate as a true local admin — it connects to the IPC$ share, installs a small scan service, and makes remote WMI and registry calls. UAC remote restrictions, Windows Firewall, or NTLM hardening can silently break enumeration while the scan still reports success — like a census enumerator whose ID letter the watchman rejects at every gate, yet whose register still says "area visited". Check the scan credential's rights under Configuration → Discovery Management → Credentials, verify an IPC$ connect manually, and remember the Software collection checkbox slows or kills big scans (24.3+ leaves it off by default). Full discovery design is in lesson 5.

Second, portal release issues: a user swears "the vault is broken" when checkout is actually working as designed. Walk their claim down the ladder before touching config: is the request inside the access policy's schedule window? Did approval actually happen (or is it sitting unapproved with two named approvers on leave)? Retrieve Password shows the secret for a maximum of 20 seconds — blink-and-gone is a feature, not an outage. And the API mirror of the same complaint: GET ManagedAccounts returns an empty list unless the account has Enable for API Access set and the runas user holds a Requestor/ISA role on it. Request-flow design lives in lesson 7.

Quick check · Q1 of 10

Rahul's detailed discovery scan at HCL reports "completed successfully", but the asset shows no local accounts or services even hours later. Most likely cause?

Correct: b. Scan "success" only means the job ran — enumeration needs the credential to mount IPC$, install the scan service and make WMI/registry calls, all of which UAC remote restrictions, firewalls or NTLM hardening can silently block. Speed isn't a failure signal, licensing errors surface loudly, and Smart Rules onboard rather than delete discovery data.

▶ Follow one scheduled rotation job down the ladder

Watch the 23:30 UTC rotation wave for 212 Linux accounts, and see exactly where a locked functional account derails all of them.

① Schedule: ChangeTime 23:30 UTC fires → 212 Linux accounts queued for rotation
② Bind: functional account svc-psafe-lnx logs in to 172.16.40.21 (SSH)
③ Change: new password set for dbadmin → verified by re-login
④ Log: success written; failures land in the Password Change Agent retry queue

Pause & Predict

Predict: 200 Linux rotations failed at 23:30 UTC last night, but Test Functional Account passes NOW. Which log do you open first, and what are you looking for?

Answer: The Password Change Agent's failed-changes queue/log — it holds the per-account error string from the moment of failure. FA tests passing NOW doesn't clear the FA: it may have been locked at 23:30 and auto-unlocked since (AD lockout windows expire). The timestamps + error text tell you whether it was a lockout window, a network drop, or something per-account.

② Session problems — user-side, appliance-side or target-side?

Session complaints sound dramatic — "the vault won't let me in!" — but a proxied session has exactly three places to break: the user side (their machine to the proxy), the appliance side (the proxy itself), and the target side (the proxy to the endpoint). The single most common user-side failure is a port assumption: Password Safe's session proxy listens on 4422 for SSH and 4489 for RDP — not 22, not 3389. A new office firewall that opened 443 "for the vault" lets the portal and password checkout work perfectly while every session launch dies. Session architecture itself is lesson 8; this section is the fault-finder.

Figure 3 — Three suspects, one cheap test each
Three columns compare the suspects for a proxied session failure.
USER-side. Smells like: one user or one office fails, others work fine. Convict it: Test-NetConnection to :4489 / :4422, and the same login from another network. Usual culprits: office firewall opened 443 only, proxy/SSL inspection, DNS, stale RDP Direct Connect file.
APPLIANCE-side. Smells like: EVERYONE fails, sessions never appear in the Sessions grid. Convict it: the session-monitoring listener (4488), services and DISK on the broker/appliance. Usual culprits: session-cache disk full (64 GB), stopped session service, broker undersized for the 9 AM rush.
TARGET-side. Smells like: the proxy accepts you, then the session dies on ONE system. Convict it: the appliance→target leg. Is 3389/22 open? Is RDP enabled, and what is the NLA policy on the target? Usual culprits: GPO turned NLA strict, target firewall, account not allowed to log on via Remote Desktop.
Golden rule: change ONE variable per test (same user, new network; same network, new user). Two changes at once = you learn nothing even when it starts working.
Read your symptom's column top to bottom: how it smells, the test that convicts it, the usual culprits. The bottom strip is the discipline that separates an L2 from a panicking L1: one variable per test.

The half-split method convicts a side fast. Same user, different network — if it works from HQ but not the new office, the office path (firewall, proxy, SSL inspection) is guilty. Different user, same network — if everyone in one office fails, stop blaming the user's laptop. Everyone everywhere failing, and sessions never even appearing in the grid? That's appliance-side: check the session-monitoring path (the listener on 4488, bound to 127.0.0.1 on the appliance), the services, and — the silent killer — disk space, because session recordings cache locally and the documented 64 GB session-cache disk on a Resource Broker fills up first.

From the user's machine (PowerShell) — convict or acquit the network in 60 seconds
Test-NetConnection ps.icici-infra.in -Port 443    # portal path
Test-NetConnection ps.icici-infra.in -Port 4489   # RDP session proxy
Test-NetConnection ps.icici-infra.in -Port 4422   # SSH session proxy
Expected output
ComputerName     : ps.icici-infra.in
RemotePort       : 4489
TcpTestSucceeded : False
# 443 True but 4489/4422 False = the firewall opened the console
# and forgot the session ports — a network ticket, not a vault ticket.
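
The same three-port test can be run from any machine with Python and turned into a verdict. This is a rough sketch: the port numbers are the ones from this lesson, while the function names and the verdict wording are made up for illustration.

```python
import socket

def port_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verdict(results):
    """results: dict mapping port -> bool for 443, 4489 and 4422."""
    portal = results.get(443, False)
    proxies = [results.get(4489, False), results.get(4422, False)]
    if not portal:
        return "portal unreachable: check DNS, egress and the appliance itself"
    if not any(proxies):
        return "network ticket: session proxy ports 4489/4422 are blocked"
    if all(proxies):
        return "network path clear: look at the appliance or target side"
    return "partial: one session proxy port is blocked"

# The pattern from the expected output above: 443 open, session ports closed.
print(verdict({443: True, 4489: False, 4422: False}))
# network ticket: session proxy ports 4489/4422 are blocked
```

To feed it live results, call `port_open("ps.icici-infra.in", p)` for each of the three ports from the affected user's network.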
👉 So far: portal-works-sessions-don't = proxy ports; everyone-fails = appliance side. Next: the target side and the missing-recording case.

Target-side failures are sneakier because the proxy happily accepts you first. The appliance-to-target leg needs its own firewall opening (3389 to a Windows box, 22 to Linux), the target must actually have Remote Desktop enabled, and Windows NLA policy must accept the managed account's logon — a hardening GPO that flips NLA strict or removes the account's "log on through Remote Desktop Services" right breaks exactly one server while everything else hums. A quirky cousin: Direct Connect strings. The SSH form is ssh -p 4422 requester+user@domain+systemname@pshost — students who swap the + delimiters or use the wrong account format get auth failures that look like proxy outages. RDP Direct Connect files are per-account-per-system downloads; a colleague's copied file fails for you by design.
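
Because the Direct Connect string is delimiter-sensitive, generating it is safer than typing it. The helper below is a sketch that follows the requester+account+system@pshost form quoted above; the function name is hypothetical, and rejecting a "+" inside any field is this example's own safety check.

```python
def ssh_direct_connect(requester, account, system, ps_host, port=4422):
    """Assemble an SSH Direct Connect command line for the session proxy."""
    for field in (requester, account, system):
        if "+" in field:
            # "+" separates the three fields, so it cannot appear inside one.
            raise ValueError("'+' is the field delimiter and cannot appear inside a field")
    return f"ssh -p {port} {requester}+{account}+{system}@{ps_host}"

print(ssh_direct_connect("karthik", "root@flipkart.in", "web-01", "ps.flipkart-infra.in"))
# ssh -p 4422 karthik+root@flipkart.in+web-01@ps.flipkart-infra.in
```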

Priya at Infosys faces this

Everyone in the new Pune office can log into the Password Safe portal and view checked-out passwords, but every RDP session launch fails with a generic connection error. The same users connect fine from the Bengaluru HQ.

Likely cause

The new office's egress firewall was provisioned with "HTTPS to the vault" only — TCP 443. The RDP proxy port 4489 (and SSH 4422) were never opened, so the access console can't reach the session proxy from that VLAN.

Diagnosis

Half-split: same user works from HQ → not account/policy. Whole office fails → shared path. Test-NetConnection shows 443 True, 4489 False from Pune; Password Safe → Sessions shows no session ever registering from that office.

Password Safe → Sessions (no entries from the office) + client-side Test-NetConnection to 4489/4422
Fix

Firewall change request: allow TCP 4422 and 4489 from the Pune office egress to the appliance, alongside the existing 443.

Verify

RDP launch from Pune succeeds; the session appears in Password Safe → Sessions with its recording attached, and Test-NetConnection 4489 now returns TcpTestSucceeded True.

Quick check · Q2 of 10

Karthik at Flipkart can check out passwords in the portal, but his SSH session attempt with "ssh -p 22 karthik+root@flipkart.in+web-01@ps.flipkart-infra.in" is refused instantly. What did he get wrong?

Correct: c. Password Safe's SSH proxy listens on 4422 by default — port 22 on the appliance is not the session proxy, so the connection is refused immediately. An unapproved request gives a policy denial (not a refused TCP connect), portal checkout working makes a vault-wide outage unlikely, and SSH Direct Connect is fully supported — that string format is exactly how it works, on the right port.

Recording missing & laggy sessions

An auditor asking "where is last Tuesday's recording?" is a different failure class — the session happened, the evidence didn't. Three checks, in order. One: did the access policy for that user-group even enable Record session (and Keystroke Logging for terminals)? Recording is per-policy, not global — a new vendor policy cloned without the checkbox produces perfectly legal, perfectly unrecorded sessions. Two: was the session-monitoring path healthy at the time — the 4488 listener and its service? Three: disk — a full session cache drops recordings first while sessions keep working, the most embarrassing silent failure in the product. Laggy-but-working sessions usually trace to an undersized broker/appliance at peak (the 9 AM shift-start pile-on), RDP quality settings, or a saturated WAN path — the same suspects as any VDI complaint, tested the same way.

Pause & Predict

Predict: last night's vendor RDP session is missing its recording, but the session itself appears in the grid with start/end times. Name TWO checks before you blame the product.

Answer: One: the access policy that admitted the session — was Record Session actually enabled for that user group/policy? (Per-policy, not global.) Two: session-monitoring health at that timestamp — the 4488 listener's service state and free disk on the appliance/broker session cache. A grid entry without a recording almost always means policy-didn't-ask or disk-couldn't-store.
Prove the isolation, don't feel it

Before you file ANY session ticket, be able to state: which side failed (user/appliance/target), the one test that proved it (Test-NetConnection result, Sessions-grid evidence, or target-leg check), and what changed. "RDP is broken" is a complaint; "4489 unreachable from Pune VLAN only, 443 fine, started after Saturday's firewall change" is a 10-minute fix.

③ The PRA ladder — offline Jump Clients, Jumpoints, Web Jump & SAML

PRA troubleshooting starts from one architectural fact: every component — Jump Client, Jumpoint (Gateway in new docs), and console — dials OUT to the appliance on TCP 443. Nothing listens inbound on the endpoint. So an offline Jump Client is never "open a port on the laptop"; it is a service story, an egress story, or a certificate story. Think of every device calling one call-centre number: if a phone goes silent, either the phone is off (service), the line out of the building is cut (egress/proxy), or the caller no longer trusts the voice answering (certificate). Jump architecture is lesson 11; this is the repair bay.

▶ How a Jump Client earns (and loses) its Online badge

Step through the dial-home that makes a client Online, then see how a new corporate proxy breaks the egress path.

① Boot: endpoint service starts → reads appliance FQDN pra.wipro-vendors.in
② Dial out: persistent outbound TCP 443, with no inbound port on the endpoint
③ Trust: TLS handshake validates the appliance certificate chain
④ Online: console shows Online · state-aware (knows if a user is present)

Walk the four gates in order.

Gate 1, service: EDR products love killing the Jump Client, especially right after an appliance upgrade when the client auto-upgrades and its service/executable names change; a real Beekeepers case saw 40 of 100 clients stranded "Active [Offline]" when SentinelOne blocked the mid-upgrade uninstall/reinstall. A reimaged machine shows the blunt string "The specified Jump Client has been uninstalled."

Gate 2, egress: new proxy, new SSL inspection, new DNS.

Gate 3, certificate: after an appliance SSL cert replacement, clients take 24–48 hours to settle on the new cert, and combining a cert change with a hostname change in one window is the classic mass-offline self-inflicted wound.

Gate 4, maintenance thresholds: clients that haven't connected get marked lost at one threshold and auto-deleted at another; set lost-days smaller than delete-days or machines returning from long leave simply vanish.

And "running a different version. Please try again after the upgrade completes" during an upgrade wave is patience, not breakage: upgrades are bandwidth-throttled by design.

Figure 4 — Offline Jump Client — four gates
A vertical decision flow for an offline Jump Client; side labels quote the real console error strings.
GATE 1 · Is the endpoint service running? Check Event Viewer service-stop events and whether the machine was reimaged. Culprits: EDR killed it after the upgrade (service got renamed), or "The specified Jump Client has been uninstalled". Fix: re-whitelist in EDR, redeploy from the offline installer.
GATE 2 · Outbound TCP 443 to the appliance? The client dials OUT, so there is no inbound port to check. Culprit: a new egress proxy or SSL inspection broke the tunnel (browsing works, the persistent tunnel does not). Fix: bypass inspection for the appliance FQDN.
GATE 3 · Does TLS trust the appliance cert? After a cert swap, clients settle over 24–48 h, so wait. Culprit: cert + hostname changed in ONE window = mass offline. Never combine the two changes; stagger swaps; no panic redeploys inside the 48-hour window.
GATE 4 · Maintenance thresholds eating clients? Clients are marked LOST at threshold 1, then auto-DELETED at threshold 2. Set lost-days < delete-days, or clients silently vanish. "…running a different version. Please try again after the upgrade completes." = mid-upgrade, wait.
Bottom strip: reinstalling is the LAST gate, not the first. A blind redeploy creates duplicate entries and orphans the old client's session-recording and vault links.
Work top to bottom; the red boxes are the field-verified culprits at each gate, with the real console error strings. Note the bottom strip — reinstalling first creates duplicate entries and orphans recordings.
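
The "in order" part is the discipline worth encoding: stop at the first gate that has not passed. The sketch below assumes you record each gate's result by hand; the gate names and the function are illustrative, not a PRA API.

```python
GATES = [
    ("service", "Is the endpoint service running? Check EDR and reimaging."),
    ("egress", "Can the endpoint reach the appliance outbound on TCP 443?"),
    ("certificate", "Does the TLS handshake trust the appliance certificate?"),
    ("thresholds", "Are the lost/delete maintenance thresholds removing clients?"),
]

def first_failing_gate(results):
    """Return (gate, question) for the first gate not confirmed as passed.

    results: dict mapping gate name -> True (passed) or False (failed).
    A gate with no recorded result counts as not yet checked.
    """
    for name, question in GATES:
        if results.get(name) is not True:
            return name, question
    return None

print(first_failing_gate({"service": True, "egress": False}))
# ('egress', 'Can the endpoint reach the appliance outbound on TCP 443?')
```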
🖥️ The thresholds that silently delete your fleet — PRA /login → Jump → Jump Clients → Jump Client Settings (Pathfinder consoles: Asset Management). (Recreated for clarity — your console matches this.)
pra.wipro-vendors.in/login · Jump → Jump Clients
1. Days before unconnected Jump Clients are marked lost: 7
2. Number of days before Jump Clients that have not connected are automatically deleted: 30
3. Uninstalled Jump Client Behavior: Keep in list (marked Uninstalled)
Maximum bandwidth of concurrent Jump Client upgrades: 50 MiB/s
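
A quick sanity check on the two thresholds catches the silent-deletion trap before it bites. This sketch is an assumption-laden helper: the function and the optional longest_absence_days parameter are hypothetical, added to model machines returning from long leave.

```python
def threshold_warnings(lost_days, delete_days, longest_absence_days=None):
    """Return human-readable warnings for risky Jump Client thresholds."""
    warnings = []
    if lost_days >= delete_days:
        warnings.append("lost-days must be smaller than delete-days")
    if longest_absence_days is not None and longest_absence_days >= delete_days:
        warnings.append("clients away that long are auto-deleted before they return")
    return warnings

# Values from the settings screen: lost after 7 days, deleted after 30.
print(threshold_warnings(7, 30))  # []
```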

Rahul at Wipro faces this

The morning after the PRA appliance upgrade from 24.1.4 to 24.2.3, 40 of 100 vendor-facing Jump Clients show "Active [Offline]". On affected endpoints the old client is gone but the new one never installed.

Likely cause

The EDR agent (SentinelOne) blocked the Jump Client auto-upgrade mid-flight — it allowed the stop-service and uninstall steps, then quarantined the new installer binary, leaving endpoints with no client at all.

Diagnosis

Blast radius: 40 endpoints across sites, all mid-upgrade → shared change (the upgrade) + shared agent (EDR), not 40 separate network faults. Endpoint Event Viewer shows service-stop then blocked-process events at the upgrade timestamp.

/login → Jump → Jump Clients (sort by Last Connected to find the stranded wave) + endpoint EDR console quarantine log
Fix

Whitelist the new Jump Client installer hash in the EDR policy, pull the offline MSI installers from the appliance, redeploy to the 40 endpoints, and clean up duplicate entries by sorting on last-seen before they auto-delete.

Verify

All 40 return Online; duplicates removed; a canary endpoint upgraded with EDR exclusions in place survives the next appliance upgrade — and a pre-upgrade EDR-exclusion step is added to the change template.

Jumpoint failures, Web Jump certificates & SAML logins

A Jumpoint that shows online but fails every push to Windows targets is almost always a rights problem: the service installs as Local System by default, which has zero authority on other machines. Run it as a domain account with local admin on targets, and make sure targets expose ADMIN$/IPC$, run the Remote Registry service, and accept 135/445 from the Jumpoint host. Two more traps: a Jumpoint cannot reach itself (loopback is unsupported), and 20+ users hammering Jump at shift start can overload an undersized host — that's sizing, not breakage. Web Jump has the politest failure in the product: with Verify Certificate enabled, a self-signed or wrong-SAN certificate on the internal web app means the session simply never starts — no error. The right fix is fixing the site's certificate; unticking Verify Certificate is documented only for trusted internal sites, and injection that loads-but-won't-fill needs the Username/Password/Submit Field Hint CSS selectors set.

From the Jumpoint host (PowerShell) — the target-side prerequisites in two checks
Test-NetConnection 172.16.40.13 -Port 445          # ADMIN$ / IPC$ reachable?
Get-Service -ComputerName 172.16.40.13 RemoteRegistry   # Windows PowerShell 5.1; -ComputerName was removed from Get-Service in PowerShell 6+
# and the Jumpoint service itself must NOT run as Local System if it
# needs rights on targets — services.msc → run as a domain account
Expected output
TcpTestSucceeded : True
Status   Name            DisplayName
------   ----            -----------
Stopped  RemoteRegistry  Remote Registry
# 445 open but Remote Registry Stopped = agentless push fails right here.

SAML login failures lock admins and vendors out of the console even though "nothing changed" — except something did, on the IdP side. The repeat offenders, in order: the IdP rotated its signing certificate and PRA still trusts the old one; clock skew between appliance and IdP invalidating assertion timestamps (the Aadhaar-OTP-that-expired-before-you-typed-it failure — check NTP first); an entity-ID/ACS URL mismatch after a hostname change; or the user authenticating fine but landing in no mapped group, so PRA has no policy to give them. All of it lives in /login → Users & Security → Security Providers. For the wider audit trail when you're reconstructing any PRA incident — session logs, syslog, the Reports menu — lean on lesson 14.
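The "check NTP first" step can be made concrete with a small sketch. It assumes GNU date and curl on an admin box, and that the appliance and the IdP each return an HTTP Date header; the 60-second default tolerance and both function names are illustrative assumptions, not BeyondTrust or IdP documented values.

```shell
# remote_epoch URL: read a server's clock from its HTTP Date header.
# Assumes curl and GNU date; returns non-zero if no Date header came back.
remote_epoch() {
  local hdr
  hdr=$(curl -sI "$1" | tr -d '\r' | awk -F': ' 'tolower($1)=="date"{print $2}')
  [ -n "$hdr" ] || return 1
  date -d "$hdr" +%s
}

# skew_verdict EPOCH_A EPOCH_B [TOLERANCE_SECONDS]
# Prints "ok <n>s" or "SKEW <n>s" for the absolute difference.
# The 60 s default is an illustrative assumption only.
skew_verdict() {
  local a=$1 b=$2 tolerance=${3:-60}
  local diff=$(( a - b ))
  if (( diff < 0 )); then diff=$(( -diff )); fi
  if (( diff > tolerance )); then
    echo "SKEW ${diff}s"
  else
    echo "ok ${diff}s"
  fi
}
```

A possible use, with placeholder hostnames: `skew_verdict "$(remote_epoch https://pra.example.com)" "$(remote_epoch https://idp.example.com)"`. The Date header is only accurate to the second, so treat this as a coarse first look before checking the appliance's NTP settings.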

👉 So far: Jump Client = service → egress → cert → thresholds; Jumpoint = rights + shares; Web Jump = certificate; SAML = IdP cert/clock/mapping. Next: the EPM/PMUL quick hits and when to call BeyondTrust.
Quick check · Q3 of 10

Meera at Airtel configures a Web Jump to an internal firewall's admin GUI. The Jumpoint is online, RDP Jumps through it work, but the Web Jump session never starts — and shows no error at all. First thing to check?

Correct: a. Web Jump with Verify Certificate enabled is documented to simply not start the session when the target site's certificate fails checks — the classic silent failure for self-signed internal GUIs. Fix the cert (or, for a trusted internal site only, untick the flag). Jump Clients/Jumpoints dial outbound so endpoint inbound rules are irrelevant, reboots aren't part of this failure class, and SAML mapping failures block console login, not one Jump Item type.

Pause & Predict

Predict: you replaced the PRA appliance's SSL certificate on Saturday; by Monday 300 of 2,000 Jump Clients are still offline. Mass redeploy? Type your guess.

Answer: No — docs say Jump Clients need 24–48 hours to settle after a certificate change, and a Monday-morning panic redeploy creates duplicate entries and orphans recording/vault links. Verify a sample endpoint can reach the appliance on 443 and trusts the new chain, confirm you did NOT also change the hostname in the same window (that combination is the real killer), and let the stragglers reconnect. Redeploy only the clients still dark after the window, from the offline installer.
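The "reach the appliance on 443 and trust the new chain" check could look like the sketch below, run from a Linux or macOS machine that shares the endpoint's egress path. It relies on the standard `openssl s_client` summary line; the function name and hostname are placeholders, and openssl's trust store may differ from the one a Windows Jump Client uses, so a pass here is supporting evidence rather than proof.

```shell
# chain_verdict: read `openssl s_client` output on stdin and report whether
# the presented chain validated against this machine's trust store.
chain_verdict() {
  if grep -q 'Verify return code: 0 (ok)'; then
    echo "chain trusted"
  else
    echo "chain NOT trusted, or no TLS handshake"
  fi
}

# Usage (placeholder hostname; `timeout` keeps a blocked 443 from hanging):
#   timeout 10 openssl s_client -connect pra.example.com:443 \
#     -servername pra.example.com </dev/null 2>/dev/null | chain_verdict
```

If this reports a failure while ordinary browsing works, look at the egress proxy or SSL inspection on that path before considering any redeploy.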
Common mistake — blaming PRA for an IdP change

Symptom: Monday 9 AM, every SSO user gets a SAML error; local-account admins log in fine. Cause: the IdP team rotated the SAML signing certificate over the weekend — PRA still holds the old one. Fix: import the new IdP signing certificate (or refresh federation metadata) under /login → Users & Security → Security Providers, and check NTP/clock skew while you're there. Keep one local break-glass admin outside SSO precisely for this morning.

④ EPM/PMUL quick hits, the support case & the master cheat-sheet

Two endpoint-side classics deserve a fast lane. First, EPM policy not applying on Windows. Before touching the policy, check the two services that must be Running: the agent (still literally named Avecto Defendpoint Service) and the cloud check-in adapter (IC3Adapter). Also confirm the config registry key exists at HKLM\SOFTWARE\Avecto\Privilege Guard Client. Fresh installs have a famous quirk: until the endpoint reboots, right-click "Run as administrator" still triggers native Windows UAC instead of EPM's on-demand rules. And if services, registry and reboot all check out, suspect rule order: Workstyles and Application Rules are first-match-wins, so a catch-all rule dragged above your specific rule silently eats everything below it. Policy design depth is lesson 15.

EPM for Windows (PowerShell) — the two-service check before any policy archaeology
Get-Service | Where-Object { $_.DisplayName -match 'Defendpoint|Cloud Adapter' } |
  Format-Table Status, DisplayName -AutoSize

# config sanity — yes, the registry still says Avecto:
Get-Item 'HKLM:\SOFTWARE\Avecto\Privilege Guard Client'
Expected output
Status  DisplayName
------  -----------
Running Avecto Defendpoint Service
Running BeyondTrust Privilege Management Cloud Adapter
# Both Running + registry present, policy still ignored?
# → reboot after fresh install, then audit Workstyle ORDER (first match wins).

Aditya at HCL faces this

Sixty developer laptops at HCL got the EPM agent pushed overnight. Next morning, right-click "Run as administrator" on every NEW laptop pops the native Windows UAC credential box instead of EPM's elevation prompt — while last month's laptops behave perfectly.

Likely cause

The deployment script installed the agent but skipped the post-install restart. Until the endpoint reboots, EPM's on-demand shell-integration hook is not registered, so Windows handles the right-click elevation natively — a known fresh-install quirk. Services, registry and policy were healthy the whole time.

Diagnosis

Blast radius: only the new batch → the shared thing is the deployment, not the policy. On a sample laptop both services (Avecto Defendpoint Service + IC3Adapter) are Running, HKLM\SOFTWARE\Avecto\Privilege Guard Client exists, and the assigned policy revision matches the console — agent and policy check out, which leaves the install step itself.

EPM (PM Cloud) console → Policies → Assign Policy to Groups (confirm the revision is assigned to the laptops' Computer Group) + endpoint Get-Service / registry check
Fix

Reboot the sixty laptops and add a mandatory restart step to the deployment script — no policy edits, no reinstalls.

Verify

After restart, right-click Run as administrator shows EPM's prompt and the elevation is logged against the on-demand Application Rule; native UAC no longer appears on any of the sixty.

Second, pbrun rejected on Linux. PMUL's decision is made centrally: pbrun sends the request to pbmasterd (default TCP 24345), which stamps ACCEPT or REJECT against the policy. The diagnostic gold is the shape of the failure: an instant reject means the policy said no — argue with pb.conf, not the firewall. A long hang means pbmasterd is unreachable — and here's the trap from lesson 16: opening 24345–24347 alone isn't always enough, because optimized connections also use the dynamic listening range (default 1024–65535); behind strict firewalls you must constrain that range in pb.settings and open it. Lint every policy edit with pbcheck -s before pushing — pbmasterd only reports syntax errors at runtime, when it's already rejecting half of TCS's overnight cron jobs. Full PMUL + AD Bridge coverage: lesson 16 (AD Bridge's own first-aid: /opt/pbis/bin/get-status, and remember the first domain join needs a reboot).

PMUL on a TCS lab run host — accepted vs rejected vs hung, and the pre-flight lint
$ pbrun id                        # policy allows id → runs as root
$ pbrun /sbin/shutdown -r now     # policy denies shutdown
$ pbcheck -s -f /etc/pb.conf      # ALWAYS lint pb.conf BEFORE deploying an edit
$ pbrun --testmaster=pbmaster01 /sbin/shutdown -r now   # dry-run accept/reject, runs nothing
Expected output
uid=0(root) gid=0(root) groups=0(root)
pbrun: Request rejected by pbmaster01.tcs-lab.in
# (exact reject wording varies by version/policy)
# instant reject = policy said no · long HANG = pbmasterd unreachable
#   → check TCP 24345 + the dynamic listening-port range in pb.settings
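The instant-reject versus long-hang distinction above can be captured by wrapping the call in coreutils `timeout`, which exits with status 124 when the limit is hit. This is a minimal sketch: `classify_run` is a hypothetical helper, the 15-second limit in the usage line is arbitrary, and it assumes pbrun exits non-zero on a reject.

```shell
# classify_run LIMIT_SECONDS COMMAND [ARGS...]
# Runs the command under `timeout` and names the shape of the result:
#   accepted       command exited 0
#   hang           timeout fired (exit 124): suspect the path to pbmasterd
#   fast-fail rc=N command returned quickly with a non-zero status
classify_run() {
  local limit=$1 rc=0
  shift
  timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case $rc in
    0)   echo "accepted" ;;
    124) echo "hang" ;;
    *)   echo "fast-fail rc=$rc" ;;
  esac
}

# Usage: classify_run 15 pbrun id
```

A "hang" result points at TCP 24345 and the dynamic listening range in pb.settings; a "fast-fail" points at the policy, so rerun the command by hand and read the reject message.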
Quick check · Q4 of 10

On the TCS estate, pbrun works for short commands but intermittently hangs for some sessions, even though ports 24345, 24346 and 24347 are open between hosts. What's the most likely missing piece?

Correct: d. The three static daemon ports aren't the whole story: PMUL's optimized connections listen on the configurable dynamic range, so strict firewalls cause exactly this intermittent-hang pattern. A licensing or syntax problem would fail consistently (and pbcheck catches syntax), and PMUL doesn't require reboots for normal operation — that's the EPM-Windows on-demand quirk.

When to open a BeyondTrust support case — and what to attach

Escalate when the ladder is exhausted, not when you're tired: you've identified the layer, reproduced the failure, captured the exact error string, and the fix needs vendor internals (appliance-side faults, suspected product bugs, anything mid-upgrade) — or anything security-relevant like a CVE question. Then file like you'd file an FIR: the desk can only move as fast as your evidence. Attach: (1) a timeline — when it broke, what changed (upgrades, cert swaps, EDR pushes, firewall changes); (2) the exact error strings, copied verbatim, with screenshots; (3) the right logs — Password Change Agent log for rotation, scanner log for discovery, endpoint Event Viewer + EDR log for Jump Clients; (4) versions — appliance/cloud release, client versions, recent upgrade history; (5) blast radius — one account or a platform, one endpoint or forty. For on-prem appliances, note that version/patch state lives in the /appliance interface (not /login), and support's remote tunnel goes outbound from your appliance over 443 to gwsupport.bomgar.com — yes, the Bomgar heritage name — so there's no inbound hole to request.

The evidence pack, as a checklist

Timeline + what-changed · exact error string (verbatim) · the layer's log · product + version (appliance, client, agent) · blast radius (who/how many) · what you already ruled out and how. Six lines that routinely turn a 5-day ticket into a 1-day ticket — and the same six lines your own L3 wants from you.
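One way to make the six lines a habit is to generate the skeleton before you start collecting. The helper below is a hypothetical convenience, not a BeyondTrust tool; it only writes the six headings from the checklist into a file of your choosing.

```shell
# new_evidence_pack FILE: write the six-heading skeleton to FILE.
new_evidence_pack() {
  local out=$1
  cat > "$out" <<'EOF'
1. Timeline and what changed:
2. Exact error string (verbatim):
3. Log for the suspected layer:
4. Product and versions (appliance, client, agent):
5. Blast radius (who / how many):
6. Already ruled out, and how:
EOF
}

# Usage: new_evidence_pack ./case-notes.txt
```

Fill it in as you walk the ladder, so the ticket attachment already exists when you decide to escalate.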

And here it is — the page this whole series funnels into. Every symptom below maps to its layer, its first move, and its deep-dive lesson: rotation design in lesson 6, sessions in lesson 8, Jump tech in lesson 11, PRA policies in lesson 13, EPM in lesson 15, PMUL in lesson 16, and HA/failover (when the fix is "the appliance itself died") in lesson 18.

Figure 5 — The master triage card
A twelve-tile cheat sheet in a four-by-three grid. Each tile names a symptom, the layer it points to in amber, and the first move in green.

Symptom → layer → first move (print this)

  1. Rotation fails, whole platform at once · layer: functional account → Test FA · AD lockout · rights · recreate on cred-config-not-found
  2. Rotate error: identityValue null · layer: directory / SID → directory query the account · re-onboard from the correct domain path
  3. Discovery "succeeds" but finds nothing · layer: scan credential → true local admin? IPC$ + WMI + Remote Registry · UAC / firewall / NTLM
  4. Proxied session will not start · layer: proxy ports → 4422 SSH · 4489 RDP reachable? then target leg (3389/22 · NLA · RDP on)
  5. Recording missing for a session · layer: policy + monitor → access policy Record Session on? listener 4488 · session-cache disk space
  6. Jump Client OFFLINE · layer: endpoint + egress → service running (EDR?) · outbound 443 · cert 24–48 h · lost < delete thresholds
  7. Jumpoint push to Windows targets fails · layer: service acct + shares → not Local System: domain acct w/ local admin · ADMIN$ / IPC$ · Remote Registry · 135/445
  8. Web Jump never starts, no error shown · layer: target site cert → Verify Certificate + bad cert = silent refusal · FIX the cert (uncheck = last resort)
  9. SAML login fails / loops back to IdP · layer: IdP trust → IdP signing cert rotated? · clock skew · group/attribute mapping in Security Providers
  10. EPM policy not applying · layer: agent services + order → Defendpoint + IC3Adapter Running? reboot after install · Workstyle first-match order
  11. pbrun rejected or hangs · layer: policy server path → instant reject = policy · hang = 24345 / dynamic ports · pbcheck -s before any push
  12. Still stuck after the ladder? ESCALATE · layer: BeyondTrust support → evidence pack: timeline · exact error string · logs · version · blast radius
Twelve symptoms → layer → first move. Screenshot this one — it's the interview answer to "a rotation failed, walk me through it", and the desk card for your first PAM on-call rotation.

Pause & Predict

Predict: which three of the twelve cheat-sheet tiles share "a certificate" as the root layer — and what's the common discipline across all three? Type your guess.

Answer: Jump Client offline (appliance cert swap — 24–48 h settle window), Web Jump silent no-start (target site cert vs Verify Certificate), and SAML login failure (IdP signing cert rotation). Common discipline: certificates fail on OTHER people's schedules — track expiry and rotation dates for the appliance cert, internal site certs and the IdP signing cert in the same calendar, and never combine a cert change with a hostname change.
Prove you own the playbook

Take any symptom from this page cold — say, "vendor Jump Clients went offline after the weekend" — and recite: the layer order you'd walk (service → egress → cert → thresholds), the one test per gate, the log you'd quote, and the evidence pack you'd attach if it escalates. If you can do that without scrolling up, you are interview-ready for the troubleshooting round — which is exactly where lesson 20 takes you next.

📝 Wrap-up assessment — six more

You've answered 4 inline. Six left. 70% (7 of 10) marks the lesson complete on your profile. Tap Submit all answers at the end.

Q5 · Remember

Which port does the Password Safe RDP session proxy listen on by default?

Correct: c. The RDP session proxy listens on 4489 (SSH proxy on 4422, session monitoring on 4488). 3389 is the appliance-to-target RDP leg, 443 is the web portal/API, and 8443 is not a Password Safe default — answering 3389 means you missed the proxy concept entirely.
Q6 · Apply

At 23:30 UTC every one of the 212 managed accounts on the Linux platform at Infosys fails rotation simultaneously. Applying the triage ladder, your FIRST check is:

Correct: a. Platform-wide blast radius points at the shared layer: the one functional account that performs every change on that platform. Testing 212 targets first inverts the ladder, raising retries treats the symptom while the cause keeps failing, and discovery has nothing to do with rotation execution.
Q7 · Apply

A vendor's Jump Client shows offline. The vendor insists "the internet works fine on this machine." What do you have them test first?

Correct: d. Jump Clients keep a persistent OUTBOUND 443 tunnel to the appliance — general browsing working proves nothing about that specific path through a new egress proxy or SSL inspection. Ping tests neither TCP 443 nor the proxy path, the appliance never dials inbound to endpoints, and a blind reinstall is the last gate (it creates duplicates and orphans recordings).
Q8 · Analyze

RDP sessions through Password Safe start normally, and the Sessions grid shows last night's vendor session with correct start/end times — but there is no recording. Steering through the proxy is clearly working. Most likely root cause?

Correct: b. A session that registered in the grid proves the proxy path and ports worked — so the missing artifact is either policy (Record Session is per-access-policy, not global) or the recording pipeline (4488 listener health, disk space for the session cache). A blocked 4489 would have killed the session itself, the functional account is rotation machinery not session recording, and an expired appliance cert breaks connections, not just their recordings.
Q9 · Analyze

During one Saturday window, the PRA team replaced the appliance SSL certificate AND moved the appliance to a new hostname. By Tuesday, hundreds of Jump Clients remain offline. The best analysis is:

Correct: d. A certificate swap alone is survivable — documentation says allow 24–48 hours for clients to settle. Changing the hostname in the same window removes the very address the clients dial home to, which is the textbook mass-offline self-wound. Mass coincidental EDR action across the fleet is implausible without an EDR change, licensing is unrelated to certificates, and routine reinstalls after cert changes are explicitly NOT required.
Q10 · Evaluate

Platform-wide rotation failure at 2 AM. Runbook A: restart the appliance, bulk re-onboard the failing accounts, and reboot targets until green. Runbook B: test the functional account, read the Password Change Agent queue, manually rotate ONE account as an experiment, then apply one targeted fix. Which is stronger, and why?

Correct: b. B is diagnosis; A is superstition with downtime. Restarts wipe in-memory state and logs you need, an appliance reboot on an HA pair can trigger failover mid-incident, and bulk re-onboarding rewrites account metadata (and can sever Smart Rule/setting links) without addressing the cause. B's one-test-per-layer approach also leaves a written trail — which is what your RCA, your auditor and your interviewer all ask for.

🧠 In your own words

Type one line: why do you test the functional account before anything else when a whole platform stops rotating? Then compare to the expert version.

Expert version: Because every rotation on that platform runs through that one shared worker credential — if its rights, password or lockout state breaks, all accounts fail together, so a single FA test confirms or eliminates the most likely cause in one move instead of debugging hundreds of accounts individually.

📖 Glossary

Functional account
The worker credential Password Safe uses on a managed system to rotate other accounts — platform-wide rotation failures usually start here.
Password Change Agent
Appliance/broker service executing queued password changes with configurable retries (Configuration → Privileged Access Management Agents).
Password Test Agent
Scheduled service that verifies stored passwords still work — the only trigger for Reset Password on Mismatch.
identityValue error
"Value cannot be null (Parameter 'identityValue')" — Password Safe cannot resolve the managed account's SID in the directory.
Session Monitoring listener
Port 4488 (on 127.0.0.1) — the recording path for proxied sessions; pairs with proxy ports 4422 (SSH) and 4489 (RDP).
Lost vs deleted (Jump Client)
Two maintenance thresholds: unconnected clients are marked lost first, auto-deleted later — keep lost-days < delete-days.
Uninstalled Jump Client Behavior
Setting that keeps locally-uninstalled clients visible as tombstones instead of silently disappearing from the list.
Jumpoint (Gateway)
One broker per known network; agentless push needs ADMIN$/IPC$, Remote Registry and 135/445 — and a service account with more rights than Local System.
Verify Certificate (Web Jump)
Flag that silently refuses to start the session when the target web app's certificate fails validation — fix the cert, don't reflex-untick.
IC3Adapter
BeyondTrust Privilege Management Cloud Adapter — the EPM agent's policy check-in service, partner to the Avecto Defendpoint Service.
pbmasterd
PMUL policy-server daemon (default TCP 24345) that stamps every pbrun request ACCEPT or REJECT against pb.conf.
--testmaster
pbrun dry-run flag — tests accept/reject against a policy server without executing the command; pair with pbcheck -s before any policy push.

📚 Sources

  1. BeyondTrust Docs — Configure Password Safe Agents (Password Change Agent: Retry failed changes after (minutes), Maximum retries, Unlimited Retries; Password Test Agent schedule). docs.beyondtrust.com/bips/docs/configure-password-safe-agents
  2. BeyondTrust Docs — Password Safe SSH/RDP Connections (session proxy defaults: SSH 4422, RDP 4489, Session Monitoring 4488 on 127.0.0.1; Direct Connect string formats). docs.beyondtrust.com/bips/docs/ps-ssh-rdp-connections
  3. BeyondTrust Docs — PRA Jump Client Error Messages + Jump Clients guide (uninstalled/lost/version-mismatch strings, lost vs delete thresholds, Uninstalled Jump Client Behavior, upgrade bandwidth throttling). docs.beyondtrust.com/pra/v24.3/docs/jump-client-errors · docs.beyondtrust.com/pra/docs/jump-clients
  4. BeyondTrust Docs — Jumpoint guide + Jump Shortcuts (Local System default and domain-account guidance, ADMIN$/IPC$ + Remote Registry + 135/445 prerequisites, loopback unsupported; Web Jump Verify Certificate + field-hint selectors). docs.beyondtrust.com/pra/docs/jumpoint · docs.beyondtrust.com/pra/v24.3/docs/jump-shortcuts
  5. BeyondTrust Beekeepers community — rotation war stories: "Value cannot be null (Parameter 'identityValue')" (SID not resolvable), functional-account test failures, F5 BIG-IP "Thread was interrupted from a waiting state". beekeepers.beyondtrust.com/general-40
  6. BeyondTrust Beekeepers community — Jump Clients offline after appliance upgrade 24.1.4→24.2.3 (EDR blocked the mid-upgrade reinstall; duplicate-entry cleanup) + PRA 24.3.4 intermittent offline clients. beekeepers.beyondtrust.com/general-51
  7. BeyondTrust Docs — EPM-UL pbrun, pbcheck & firewall/port usage (pbmasterd 24345 / pblocald 24346 / pblogd 24347, dynamic listening-port ranges, --testmaster) and EPM-Windows services (Avecto Defendpoint Service, IC3Adapter, Privilege Guard Client registry). docs.beyondtrust.com/epm-ul/docs/pbrun · docs.beyondtrust.com/epm-ul/docs/firewalls
  8. BeyondTrust Docs — SSL certificates & appliance management (allow 24–48 hours for Jump Clients to pick up a replaced certificate; /appliance interface for on-prem patching; outbound support tunnel to gwsupport.bomgar.com). docs.beyondtrust.com/pra/docs/on-prem-ssl-certificates
  9. BeyondTrust University — Password Safe / Privileged Remote Access Administration courses & certification (40-question exam, 75% pass, 2 attempts) — troubleshooting scenarios are a core exam and interview lane. beyondtrust.com/services/beyondtrust-university/get-certified

What's next?

You can now fix BeyondTrust faster than most working L2s — so let's convert that into offers. The final lesson turns the whole series into 30 real interview questions with strong answers, plus a salary-mapped PAM career plan for India.