BeyondTrust Troubleshooting Playbook

Q: On Aditya's TCS estate, pbrun works for short commands but intermittently hangs for some sessions — ports 24345, 24346 and 24347 are open between hosts. What's the most likely missing piece?

Correct: d. The three static daemon ports aren't the whole story: PMUL's optimized connections listen on the configurable dynamic range, so strict firewalls cause exactly this intermittent-hang pattern. A licensing or syntax problem would fail consistently (and pbcheck catches syntax), and PMUL doesn't require reboots for normal operation — that's the EPM-Windows on-demand quirk.

Q: Which port does the Password Safe RDP session proxy listen on by default?

Correct: c. The RDP session proxy listens on 4489 (SSH proxy on 4422, session monitoring on 4488). 3389 is the appliance-to-target RDP leg, 443 is the web portal/API, and 8443 is not a Password Safe default — answering 3389 means you missed the proxy concept entirely.

Q: At 23:30 UTC every one of the 212 managed accounts on the Linux platform at Infosys fails rotation simultaneously. Applying the triage ladder, your FIRST check is:

Correct: a. Platform-wide blast radius points at the shared layer: the one functional account that performs every change on that platform. Testing 212 targets first inverts the ladder, raising retries treats the symptom while the cause keeps failing, and discovery has nothing to do with rotation execution.

Q: A vendor's Jump Client shows offline. The vendor insists "the internet works fine on this machine." What do you have them test first?

Correct: d. Jump Clients keep a persistent OUTBOUND 443 tunnel to the appliance — general browsing working proves nothing about that specific path through a new egress proxy or SSL inspection. Ping tests neither TCP 443 nor the proxy path, the appliance never dials inbound to endpoints, and a blind reinstall is the last gate (it creates duplicates and orphans recordings).

Q: RDP sessions through Password Safe start normally, and the Sessions grid shows last night's vendor session with correct start/end times — but there is no recording. Steering through the proxy is clearly working. Most likely root cause?

Correct: b. A session that registered in the grid proves the proxy path and ports worked — so the missing artifact is either policy (Record Session is per-access-policy, not global) or the recording pipeline (4488 listener health, disk space for the session cache). A blocked 4489 would have killed the session itself, the functional account is rotation machinery not session recording, and an expired appliance cert breaks connections, not just their recordings.

Q: During one Saturday window, the PRA team replaced the appliance SSL certificate AND moved the appliance to a new hostname. By Tuesday, hundreds of Jump Clients remain offline. The best analysis is:

Correct: d. A certificate swap alone is survivable — documentation says allow 24–48 hours for clients to settle. Changing the hostname in the same window removes the very address the clients dial home to, which is the textbook mass-offline self-wound. Mass coincidental EDR action across the fleet is implausible without an EDR change, licensing is unrelated to certificates, and routine reinstalls after cert changes are explicitly NOT required.

Q: Platform-wide rotation failure at 2 AM. Runbook A: restart the appliance, bulk re-onboard the failing accounts, and reboot targets until green. Runbook B: test the functional account, read the Password Change Agent queue, manually rotate ONE account as an experiment, then apply one targeted fix. Which is stronger, and why?

Correct: b. B is diagnosis; A is superstition with downtime. Restarts wipe in-memory state and logs you need, an appliance reboot on an HA pair can trigger failover mid-incident, and bulk re-onboarding rewrites account metadata (and can sever Smart Rule/setting links) without addressing the cause. B's one-test-per-layer approach also leaves a written trail — which is what your RCA, your auditor and your interviewer all ask for.

Content-specific feature visual for this lesson: use it as the 60-second map before reading the full detail.

Most engineers think…

Most engineers think troubleshooting BeyondTrust means restarting services, re-onboarding accounts and reinstalling agents until the error goes away — and blaming "the network" when it doesn't.

Wrong — and it makes outages longer. Every BeyondTrust failure lives in a small set of layers: the functional account, the directory, the proxy ports, the outbound 443 path, a certificate, or an agent service. The blast radius (one victim vs many) names the layer, the log confirms it, and one targeted change fixes it. Restarts and reinstalls are the LAST rung of the ladder — a blind Jump Client redeploy even creates duplicate entries and orphans recordings.

① The triage mindset + the Password Safe ladder

It's 23:42 IST. Sneha, the L2 PAM admin at ICICI's infra team, gets the call: "password rotations are failing." Her weakest move would be opening the first red row and clicking Rotate Now repeatedly. Her strongest move costs ten seconds: count the victims. One account failing is an account problem. Every account on one platform failing is a functional account problem, because that one shared worker performs every change on the platform. Everything failing everywhere is an appliance or agent problem. The blast radius names the layer — like a building electrician who checks the main MCB when the whole house goes dark, instead of testing every bulb one by one.

That is the whole triage mindset, and it carries across every product in this series: ask what changed (an upgrade? a cert swap? a GPO push?), ask which layer the blast radius points at, read that layer's log first, and change one variable at a time. BeyondTrust gives you a precise log for each layer — the skill is knowing which one to open.
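The victim count can be turned into a first guess mechanically. The sketch below only illustrates the rule of thumb described above: the function name, its parameters and the fallback message for partial failures are assumptions of this example, not part of any BeyondTrust tool.

```python
def suspect_layer(failing: int, on_platform: int,
                  platforms_failing: int = 1, platforms_total: int = 1) -> str:
    """Return the layer to inspect first for a given blast radius."""
    if failing <= 0:
        return "nothing failing"
    # Failures on every platform at once: suspect the appliance or an agent.
    if platforms_total > 1 and platforms_failing >= platforms_total:
        return "appliance or agent"
    # Every account on one platform: suspect the shared functional account.
    if on_platform > 1 and failing >= on_platform:
        return "functional account"
    # A single victim: the problem belongs to that account.
    if failing == 1:
        return "account"
    # Partial failures are not covered by the rule of thumb (assumed fallback).
    return "mixed: read the Password Change Agent log first"


if __name__ == "__main__":
    print(suspect_layer(failing=212, on_platform=212,
                        platforms_failing=1, platforms_total=4))
    print(suspect_layer(failing=1, on_platform=212))
```

The order of the checks mirrors the text: widest shared layer first, single account last.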

Figure 1 — Blast radius → layer

Match the left card to the right layer before touching anything. The amber row is the highest-value test in all of Password Safe: one functional-account check explains (or eliminates) hundreds of red accounts at once.

The four triage questions

Ask these in order on EVERY ticket, for every BeyondTrust product.

🕐 What changed?
Upgrades, cert swaps, GPO pushes and EDR updates cause most outages. No change found? Widen the window: someone changed something. So: check change tickets first.

💥 How wide is the blast?
One victim = its problem. Many victims = their shared layer (FA, appliance, proxy, IdP). Counting victims is free and names the layer.

📜 What does the log say?
Change Agent log for rotation, scanner log for discovery, Sessions grid for proxies, endpoint Event Viewer for agents. Exact error strings are gold: copy them verbatim.

🧪 Can I reproduce it safely?
One manual Rotate Now, one test session, one pbrun --testmaster. A controlled experiment beats ten theories. So: reproduce, then fix.

The rotation ladder — start at the functional account

Rotation failures have a strict pecking order. Rung 1: the functional account. It needs real rights on the target — local admin or the delegated change-password right on Windows (plus read/write of the lockout attribute in AD), root or working sudo elevation on Linux, and the correct privileged role on appliances like F5. If the FA got locked out by a bad script, or someone "helpfully" onboarded the FA itself as a managed account so Password Safe rotated its own worker's password mid-job, every account on the platform fails together. The society-watchman analogy from lesson 4 holds: he carries the master key that re-keys every flat — change his key without telling him and no lock in the building can be changed.

Rung 2: the account itself. The field's most famous string is Value cannot be null (Parameter 'identityValue') — Password Safe cannot resolve the account's SID in the directory because the account was renamed, deleted, or onboarded from the wrong domain path. Rung 3: connectivity — can the appliance or Resource Broker reach the target's management port at all? Rung 4: platform quirks — an F5 BIG-IP rotation that dies with "Thread was interrupted from a waiting state" because the FA holds a non-Administrator role or the custom platform's prompt regex is wrong. Same ladder every time: worker → identity → wire → quirk.
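The two error strings quoted above can be matched to their rungs with a simple lookup. This is a sketch under stated assumptions: only those two strings come from the text, while the table name, the function name and the fallback wording are hypothetical.

```python
# Substrings quoted in the text above, mapped to the rung they point at.
KNOWN_ERRORS = [
    ("Parameter 'identityValue'",
     "rung 2: identity (SID cannot be resolved in the directory)"),
    # Reported in the text for F5 BIG-IP rotations specifically.
    ("Thread was interrupted from a waiting state",
     "rung 4: platform quirk (FA role or prompt regex)"),
]


def rung_for(error_text: str) -> str:
    """Return the ladder rung suggested by a rotation error string."""
    lowered = error_text.lower()
    for needle, rung in KNOWN_ERRORS:
        if needle.lower() in lowered:
            return rung
    # No known string: start at the top of the ladder (assumed fallback).
    return "unknown: walk the ladder from rung 1 (functional account)"


if __name__ == "__main__":
    print(rung_for("Password change failed. Error: Value cannot be null. "
                   "(Parameter 'identityValue')"))
```

Anything the table does not recognise falls back to rung 1, which keeps the worker → identity → wire → quirk order intact.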

👉 So far: count the victims → platform-wide means functional account, single account means SID/rights. Next: where the logs and retry machinery live.

Figure 2 — The rotation decision tree

One question — one account or the whole platform? — picks the branch. Note the amber box at the bottom: out-of-band password changes are only detected when a password TEST runs, never in real time.

Where do you read all this? Two agents own rotation health, both under Configuration → Privileged Access Management Agents in the BeyondInsight console. The Password Change Agent executes queued changes and retries failures per Retry failed changes after (minutes) and Maximum retries — a silently growing failed-changes queue is your earliest smoke alarm, so alert on it. The Password Test Agent verifies stored passwords on a schedule — and it is the only thing that makes "Reset Password on Mismatch" fire.
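One way to alert on a silently growing failed-changes queue is to compare a few consecutive depth readings. The sketch below assumes the readings have already been collected somehow (how to collect them is out of scope here); the function name and the three-sample window are hypothetical choices.

```python
from typing import Sequence


def queue_is_growing(depths: Sequence[int], window: int = 3) -> bool:
    """True when each of the last `window` samples exceeds the one before it."""
    if window < 2 or len(depths) < window:
        return False  # not enough data to call a trend
    recent = depths[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical queue depths sampled at regular intervals.
    print(queue_is_growing([0, 0, 2, 5, 9]))
    print(queue_is_growing([4, 4, 4]))
```

A strict-growth test is deliberately conservative: a queue that is flat or draining is the retry machinery doing its job.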

🖥️ This is the screen that owns rotation retries — BeyondInsight → Configuration → Privileged Access Management Agents → Password Change Agent. (Recreated for clarity — your console matches this.)

beyondinsight.icici-infra.in · Configuration → PAM Agents

Agent: Password Change Agent
Settings: Retry failed changes after (minutes) · Maximum retries
Unlimited Retries: Off

Common mistake — "Reset on Mismatch did nothing"

Symptom: a server admin changed a password directly on the box; Password Safe kept serving the old one for two days and nobody was alerted. Cause: mismatch detection is NOT real-time — it only happens when a password test runs. Fix: enable Check Password on the accounts and schedule the Password Test Agent; pair with Change Password After Release so checked-in credentials are always re-keyed. It's the milk-delivery rule: the milkman only notices your gate lock changed on his next scheduled visit.

Sneha at ICICI faces this

A freshly onboarded domain account fails its forced first rotation with "Password change failed. Error: Value cannot be null. (Parameter 'identityValue')" — every other account on the platform rotates fine.

Likely cause

Password Safe cannot resolve the account's SID in the directory: the account had been renamed by the AD team a week before onboarding, and the import used a stale CSV with the old samAccountName and the wrong OU path.

Diagnosis

Single-account blast radius → account layer, not the FA. The per-account error in the failed-changes queue names identityValue, which always means SID resolution, not rights or network.

BeyondInsight → Managed Accounts → filter the account → Go to Advanced Details (then run a Directory Query for the account name to confirm what AD actually has)

Fix

Delete the stale managed account, re-onboard it from the correct directory path via a Directory Query-based Smart Rule, and confirm the managed system's domain mapping points at the right domain.

Verify

Manual Rotate Now succeeds; the Password Change Agent failed-changes queue entry clears and stays clear past the next scheduled ChangeTime (23:30 UTC by default).

Password Safe REST API — prove the appliance + your API path are alive in two calls

# 1) Sign in (PS-Auth header: key + runas; pwd sits in square brackets)
curl -s -c /tmp/ps.jar -X POST \
  -H "Authorization: PS-Auth key=<128-char-api-key>; runas=icici\\svc-psapi; pwd=[S3cure!Pass];" \
  https://ps.icici-infra.in/BeyondTrust/api/public/v3/Auth/SignAppIn

# 2) Ask the platform its version — the cheapest end-to-end health check
curl -s -b /tmp/ps.jar https://ps.icici-infra.in/BeyondTrust/api/public/v3/Configuration/Version

Expected output

HTTP 200 — signed in, session cookie stored
{"Version":"25.2.0.184"}
# Got 401 + a WWW-Authenticate-2FA header instead? That is the DESIGNED 2FA
# challenge, not an outage — resend SignAppIn with challenge= in the header.
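For scripts that build the sign-in call themselves, the header layout from the first curl command can be assembled in Python. This is a minimal sketch that assumes exactly the layout shown above (key, runas, then the password in square brackets); the function name is hypothetical and sending the request is left out.

```python
def ps_auth_header(api_key: str, run_as: str, password: str) -> str:
    """Build the Authorization header value in the layout shown above."""
    parts = [
        f"key={api_key}",
        f"runas={run_as}",
        f"pwd=[{password}]",  # the password sits inside square brackets
    ]
    return "PS-Auth " + "; ".join(parts) + ";"


if __name__ == "__main__":
    # Placeholder key; the raw string keeps the single backslash in the run-as user.
    print(ps_auth_header("<128-char-api-key>", r"icici\svc-psapi", "S3cure!Pass"))
```

Note the single backslash in the run-as user: the doubled backslash in the curl example is shell escaping, not part of the header.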


Discovery returns nothing & portal release issues
Two more Password Safe classics live on this ladder. First, the discovery scan that lies: status says "completed successfully" but zero accounts or services appear. A detailed scan must authenticate as a true local admin — it connects to the IPC$ share, installs a small scan service, and makes remote WMI and registry calls. UAC remote restrictions, Windows Firewall, or NTLM hardening can silently break enumeration while the scan still reports success — like a census enumerator whose ID letter the watchman rejects at every gate, yet whose register still says "area visited". Check the scan credential's rights under Configuration → Discovery Management → Credentials, verify an IPC$ connect manually, and remember the Software collection checkbox slows or kills big scans (24.3+ leaves it off by default). Full discovery design is in lesson 5.
Second, portal release issues: a user swears "the vault is broken" when checkout is actually working as designed. Walk their claim down the ladder before touching config: is the request inside the access policy's schedule window? Did approval actually happen (or is it sitting unapproved with two named approvers on leave)? Retrieve Password shows the secret for a maximum of 20 seconds — blink-and-gone is a feature, not an outage. And the API mirror of the same complaint: GET ManagedAccounts returns an empty list unless the account has Enable for API Access set and the runas user holds a Requestor/ISA role on it. Request-flow design lives in lesson 7.
Quick check · Q1 of 10
Rahul's detailed discovery scan at HCL reports "completed successfully", but the asset shows no local accounts or services even hours later. Most likely cause?
a) The scan finished too quickly to enumerate anything
b) The scan credential lacks true local admin on targets: IPC$/WMI/Remote Registry calls were silently blocked by UAC or firewall
c) The Discovery Scanner license has expired
d) A Smart Rule deleted the discovered accounts after import
Correct: b. Scan "success" only means the job ran — enumeration needs the credential to mount IPC$, install the scan service and make WMI/registry calls, all of which UAC remote restrictions, firewalls or NTLM hardening can silently block. Speed isn't a failure signal, licensing errors surface loudly, and Smart Rules onboard rather than delete discovery data.
Follow one scheduled rotation job down the ladder
Watch the 23:30 UTC rotation wave for 212 Linux accounts, and note where a locked functional account would derail all of them.
① Schedule: ChangeTime 23:30 UTC fires → 212 Linux accounts queued for rotation
② Bind: functional account svc-psafe-lnx logs in to 172.16.40.21 (SSH)
③ Change: new password set for dbadmin → verified by re-login
④ Log: success written; failures land in the Password Change Agent retry queue

Pause & Predict
Predict: 200 Linux rotations failed at 23:30 UTC last night, but Test Functional Account passes NOW. Which log do you open first, and what are you looking for?
Answer: The Password Change Agent's failed-changes queue/log — it holds the per-account error string from the moment of failure. FA tests passing NOW doesn't clear the FA: it may have been locked at 23:30 and auto-unlocked since (AD lockout windows expire). The timestamps + error text tell you whether it was a lockout window, a network drop, or something per-account.

② Session problems — user-side, appliance-side or target-side?
Session complaints sound dramatic — "the vault won't let me in!" — but a proxied session has exactly three places to break: the user side (their machine to the proxy), the appliance side (the proxy itself), and the target side (the proxy to the endpoint). The single most common user-side failure is a port assumption: Password Safe's session proxy listens on 4422 for SSH and 4489 for RDP — not 22, not 3389. A new office firewall that opened 443 "for the vault" lets the portal and password checkout work perfectly while every session launch dies. Session architecture itself is lesson 8; this section is the fault-finder.
Figure 3 — Three suspects, one cheap test each
Read your symptom's column top to bottom: how it smells, the test that convicts it, the usual culprits. The bottom strip is the discipline that separates an L2 from a panicking L1: one variable per test.
The half-split method convicts a side fast. Same user, different network — if it works from HQ but not the new office, the office path (firewall, proxy, SSL inspection) is guilty. Different user, same network — if everyone in one office fails, stop blaming the user's laptop. Everyone everywhere failing, and sessions never even appearing in the grid? That's appliance-side: check the session-monitoring path (the listener on 4488, bound to 127.0.0.1 on the appliance), the services, and — the silent killer — disk space, because session recordings cache locally and the documented 64 GB session-cache disk on a Resource Broker fills up first.
From the user's machine (PowerShell): convict or acquit the network in 60 seconds
Test-NetConnection ps.icici-infra.in -Port 443    # portal path
Test-NetConnection ps.icici-infra.in -Port 4489   # RDP session proxy
Test-NetConnection ps.icici-infra.in -Port 4422   # SSH session proxy
Expected output
ComputerName     : ps.icici-infra.in
RemotePort       : 4489
TcpTestSucceeded : False
# 443 True but 4489/4422 False = the firewall opened the console
# and forgot the session ports: a network ticket, not a vault ticket.
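The three port results can be read the same way every time. The sketch below is a hypothetical helper that takes the outcomes of those tests as booleans and returns a verdict; the port numbers come from the text, while the function name and the verdict wording are assumptions of this example.

```python
from typing import Dict

PORTAL, SSH_PROXY, RDP_PROXY = 443, 4422, 4489


def interpret(results: Dict[int, bool]) -> str:
    """Turn per-port reachability (True = TCP connect succeeded) into a verdict."""
    if not results.get(PORTAL, False):
        return "portal path down: check 443, DNS and proxy first"
    blocked = [p for p in (SSH_PROXY, RDP_PROXY) if not results.get(p, False)]
    if blocked:
        ports = ", ".join(str(p) for p in blocked)
        return f"network ticket: open TCP {ports} to the appliance"
    return "user-side path is clear: look at the appliance or target side"


if __name__ == "__main__":
    print(interpret({443: True, 4422: False, 4489: False}))
```

A port missing from the dictionary is treated as unreachable, so an untested port never counts as proof of a healthy path.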
👉 So far: portal-works-sessions-don't = proxy ports; everyone-fails = appliance side. Next: the target side and the missing-recording case.
Target-side failures are sneakier because the proxy happily accepts you first. The appliance-to-target leg needs its own firewall opening (3389 to a Windows box, 22 to Linux), the target must actually have Remote Desktop enabled, and Windows NLA policy must accept the managed account's logon — a hardening GPO that flips NLA strict or removes the account's "log on through Remote Desktop Services" right breaks exactly one server while everything else hums. A quirky cousin: Direct Connect strings. The SSH form is ssh -p 4422 requester+user@domain+systemname@pshost — students who swap the + delimiters or use the wrong account format get auth failures that look like proxy outages. RDP Direct Connect files are per-account-per-system downloads; a colleague's copied file fails for you by design.
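Because a swapped delimiter looks like a proxy outage, it can help to generate the SSH Direct Connect string rather than type it. This is a sketch assuming the requester+user@domain+systemname@pshost form given above; the function name and the input validation rules are hypothetical.

```python
def ssh_direct_connect(requester: str, account: str, domain: str,
                       system: str, ps_host: str, port: int = 4422) -> str:
    """Build an SSH Direct Connect command in the form described above."""
    fields = (("requester", requester), ("account", account),
              ("domain", domain), ("system", system))
    for name, value in fields:
        # '+' is the delimiter, so it must not appear inside a field (assumed rule).
        if not value or "+" in value or " " in value:
            raise ValueError(f"{name} must be non-empty without '+' or spaces")
    return f"ssh -p {port} {requester}+{account}@{domain}+{system}@{ps_host}"


if __name__ == "__main__":
    print(ssh_direct_connect("karthik", "root", "flipkart.in",
                             "web-01", "ps.flipkart-infra.in"))
```

The default port is the session proxy's 4422, so the classic port 22 mistake cannot happen by omission.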
Priya at Infosys faces this
Everyone in the new Pune office can log into the Password Safe portal and view checked-out passwords, but every RDP session launch fails with a generic connection error. The same users connect fine from the Bengaluru HQ.
Likely cause
The new office's egress firewall was provisioned with "HTTPS to the vault" only, meaning TCP 443. The RDP proxy port 4489 (and SSH 4422) were never opened, so the access console can't reach the session proxy from that VLAN.
Diagnosis
Half-split: same user works from HQ → not account/policy. Whole office fails → shared path. Test-NetConnection shows 443 True, 4489 False from Pune; Password Safe → Sessions shows no session ever registering from that office.
Password Safe → Sessions (no entries from the office) + client-side Test-NetConnection to 4489/4422
Fix
Firewall change request: allow TCP 4422 and 4489 from the Pune office egress to the appliance, alongside the existing 443.
Verify
RDP launch from Pune succeeds; the session appears in Password Safe → Sessions with its recording attached, and Test-NetConnection 4489 now returns TcpTestSucceeded True.
Quick check · Q2 of 10
Karthik at Flipkart can check out passwords in the portal, but his SSH session attempt with "ssh -p 22 karthik+root@flipkart.in+web-01@ps.flipkart-infra.in" is refused instantly. What did he get wrong?
a) His access request was never approved
b) The vault's SSH service is down
c) The session proxy listens on 4422; he aimed his Direct Connect string at port 22
d) Direct Connect only supports RDP, not SSH
Correct: c. Password Safe's SSH proxy listens on 4422 by default — port 22 on the appliance is not the session proxy, so the connection is refused immediately. An unapproved request gives a policy denial (not a refused TCP connect), portal checkout working makes a vault-wide outage unlikely, and SSH Direct Connect is fully supported — that string format is exactly how it works, on the right port.
Recording missing & laggy sessions
An auditor asking "where is last Tuesday's recording?" is a different failure class — the session happened, the evidence didn't. Three checks, in order. One: did the access policy for that user-group even enable Record session (and Keystroke Logging for terminals)? Recording is per-policy, not global — a new vendor policy cloned without the checkbox produces perfectly legal, perfectly unrecorded sessions. Two: was the session-monitoring path healthy at the time — the 4488 listener and its service? Three: disk — a full session cache drops recordings first while sessions keep working, the most embarrassing silent failure in the product. Laggy-but-working sessions usually trace to an undersized broker/appliance at peak (the 9 AM shift-start pile-on), RDP quality settings, or a saturated WAN path — the same suspects as any VDI complaint, tested the same way.
Pause & Predict
Predict: last night's vendor RDP session is missing its recording, but the session itself appears in the grid with start/end times. Name TWO checks before you blame the product.
Answer: One: the access policy that admitted the session — was Record Session actually enabled for that user group/policy? (Per-policy, not global.) Two: session-monitoring health at that timestamp — the 4488 listener's service state and free disk on the appliance/broker session cache. A grid entry without a recording almost always means policy-didn't-ask or disk-couldn't-store.
Prove the isolation, don't feel it
Before you file ANY session ticket, be able to state: which side failed (user/appliance/target), the one test that proved it (Test-NetConnection result, Sessions-grid evidence, or target-leg check), and what changed. "RDP is broken" is a complaint; "4489 unreachable from Pune VLAN only, 443 fine, started after Saturday's firewall change" is a 10-minute fix.

③ The PRA ladder — offline Jump Clients, Jumpoints, Web Jump & SAML
PRA troubleshooting starts from one architectural fact: every component — Jump Client, Jumpoint (Gateway in new docs), and console — dials OUT to the appliance on TCP 443. Nothing listens inbound on the endpoint. So an offline Jump Client is never "open a port on the laptop"; it is a service story, an egress story, or a certificate story. Think of every device calling one call-centre number: if a phone goes silent, either the phone is off (service), the line out of the building is cut (egress/proxy), or the caller no longer trusts the voice answering (certificate). Jump architecture is lesson 11; this is the repair bay.
How a Jump Client earns (and loses) its Online badge
Step through the dial-home that makes a client Online, then consider how a new corporate proxy breaks the egress path.
① Boot: endpoint service starts → reads appliance FQDN pra.wipro-vendors.in
② Dial out: persistent outbound TCP 443, no inbound port on the endpoint
③ Trust: TLS handshake validates the appliance certificate chain
④ Online: console shows Online · state-aware (knows if a user is present)

Walk the four gates in order. Gate 1 — service: EDR products love killing the Jump Client, especially right after an appliance upgrade when the client auto-upgrades and its service/executable names change; a real Beekeepers case saw 40 of 100 clients stranded "Active [Offline]" when SentinelOne blocked the mid-upgrade uninstall/reinstall. A reimaged machine shows the blunt string "The specified Jump Client has been uninstalled." Gate 2 — egress: new proxy, new SSL inspection, new DNS. Gate 3 — certificate: after an appliance SSL cert replacement, clients take 24–48 hours to settle on the new cert — and combining a cert change with a hostname change in one window is the classic mass-offline self-inflicted wound. Gate 4 — maintenance thresholds: clients that haven't connected get marked lost at one threshold and auto-deleted at another; set lost-days smaller than delete-days or machines returning from long leave simply vanish. And "running a different version. Please try again after the upgrade completes" during an upgrade wave is patience, not breakage — upgrades are bandwidth-throttled by design.
Figure 4 — Offline Jump Client — four gates
Work top to bottom; the red boxes are the field-verified culprits at each gate, with the real console error strings. Note the bottom strip — reinstalling first creates duplicate entries and orphans recordings.
🖥️ The thresholds that silently delete your fleet — PRA /login → Jump → Jump Clients → Jump Client Settings (Pathfinder consoles: Asset Management). (Recreated for clarity — your console matches this.)
pra.wipro-vendors.in/login · Jump → Jump Clients
(1) Days before unconnected Jump Clients are marked lost: 7
(2) Number of days before Jump Clients that have not connected are automatically deleted: 30
(3) Uninstalled Jump Client Behavior: Keep in list (marked Uninstalled)
Maximum bandwidth of concurrent Jump Client upgrades: 50 MiB/s
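The two maintenance thresholds interact, and a quick model shows what happens to a machine that stays dark. This is an illustrative sketch only: the defaults are the 7 and 30 days from the recreated screen, while the function name and the assumption that a threshold triggers on the day it is reached are hypothetical.

```python
def client_fate(days_offline: int, lost_days: int = 7, delete_days: int = 30) -> str:
    """Model what the maintenance thresholds do to a Jump Client that stays offline."""
    # The text's rule: the lost threshold must be smaller than the delete threshold.
    if lost_days >= delete_days:
        raise ValueError("lost threshold must be smaller than delete threshold")
    if days_offline >= delete_days:
        return "deleted"
    if days_offline >= lost_days:
        return "lost"
    return "tracked"


if __name__ == "__main__":
    for days in (3, 10, 45):  # hypothetical absences
        print(days, client_fate(days))
```

Raising the delete threshold above the longest expected leave is what keeps returning machines from vanishing.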
Rahul at Wipro faces this
The morning after the PRA appliance upgrade from 24.1.4 to 24.2.3, 40 of 100 vendor-facing Jump Clients show "Active [Offline]". On affected endpoints the old client is gone but the new one never installed.
Likely cause
The EDR agent (SentinelOne) blocked the Jump Client auto-upgrade mid-flight: it allowed the stop-service and uninstall steps, then quarantined the new installer binary, leaving endpoints with no client at all.
Diagnosis
Blast radius: 40 endpoints across sites, all mid-upgrade → shared change (the upgrade) + shared agent (EDR), not 40 separate network faults. Endpoint Event Viewer shows service-stop then blocked-process events at the upgrade timestamp.
/login → Jump → Jump Clients (sort by Last Connected to find the stranded wave) + endpoint EDR console quarantine log
Fix
Whitelist the new Jump Client installer hash in the EDR policy, pull the offline MSI installers from the appliance, redeploy to the 40 endpoints, and clean up duplicate entries by sorting on last-seen before they auto-delete.
Verify
All 40 return Online; duplicates removed; a canary endpoint upgraded with EDR exclusions in place survives the next appliance upgrade, and a pre-upgrade EDR-exclusion step is added to the change template.
Jumpoint failures, Web Jump certificates & SAML logins
A Jumpoint that shows online but fails every push to Windows targets is almost always a rights problem: the service installs as Local System by default, which has zero authority on other machines. Run it as a domain account with local admin on targets, and make sure targets expose ADMIN$/IPC$, run the Remote Registry service, and accept 135/445 from the Jumpoint host. Two more traps: a Jumpoint cannot reach itself (loopback is unsupported), and 20+ users hammering Jump at shift start can overload an undersized host — that's sizing, not breakage. Web Jump has the politest failure in the product: with Verify Certificate enabled, a self-signed or wrong-SAN certificate on the internal web app means the session simply never starts — no error. The right fix is fixing the site's certificate; unticking Verify Certificate is documented only for trusted internal sites, and injection that loads-but-won't-fill needs the Username/Password/Submit Field Hint CSS selectors set.
From the Jumpoint host (PowerShell): the target-side prerequisites in two checks
Test-NetConnection 172.16.40.13 -Port 445          # ADMIN$ / IPC$ reachable?
# Get-Service -ComputerName works in Windows PowerShell 5.1; the parameter was removed in PowerShell 6+
Get-Service -ComputerName 172.16.40.13 RemoteRegistry
# and the Jumpoint service itself must NOT run as Local System if it
# needs rights on targets: services.msc → run as a domain account
Expected output
TcpTestSucceeded : True
Status   Name            DisplayName
------   ----            -----------
Stopped  RemoteRegistry  Remote Registry
# 445 open but Remote Registry Stopped = agentless push fails right here.
SAML login failures lock admins and vendors out of the console even though "nothing changed" — except something did, on the IdP side. The repeat offenders, in order: the IdP rotated its signing certificate and PRA still trusts the old one; clock skew between appliance and IdP invalidating assertion timestamps (the Aadhaar-OTP-that-expired-before-you-typed-it failure — check NTP first); an entity-ID/ACS URL mismatch after a hostname change; or the user authenticating fine but landing in no mapped group, so PRA has no policy to give them. All of it lives in /login → Users & Security → Security Providers. For the wider audit trail when you're reconstructing any PRA incident — session logs, syslog, the Reports menu — lean on lesson 14.
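Clock skew breaks SAML because an assertion carries a validity window (NotBefore and NotOnOrAfter) that the receiving side compares with its own clock. The sketch below models that comparison in general terms; it is not PRA's implementation, and the function name, the timestamps and the tolerance parameter are assumptions of this example.

```python
from datetime import datetime, timedelta, timezone


def assertion_time_ok(now: datetime, not_before: datetime,
                      not_on_or_after: datetime,
                      allowed_skew: timedelta = timedelta(0)) -> bool:
    """True when `now` falls inside the assertion window, widened by the tolerance."""
    return not_before - allowed_skew <= now < not_on_or_after + allowed_skew


if __name__ == "__main__":
    issued = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)  # hypothetical IdP time
    valid_until = issued + timedelta(minutes=5)
    slow_clock = issued - timedelta(minutes=2)  # receiver runs two minutes behind
    print(assertion_time_ok(slow_clock, issued, valid_until))
    print(assertion_time_ok(slow_clock, issued, valid_until, timedelta(minutes=3)))
```

A receiver whose clock runs behind sees the assertion as not yet valid, which is why checking NTP comes before touching the provider configuration.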
👉 So far: Jump Client = service → egress → cert → thresholds; Jumpoint = rights + shares; Web Jump = certificate; SAML = IdP cert/clock/mapping. Next: the EPM/PMUL quick hits and when to call BeyondTrust.
Quick check · Q3 of 10
Meera at Airtel configures a Web Jump to an internal firewall's admin GUI. The Jumpoint is online, RDP Jumps through it work, but the Web Jump session never starts — and shows no error at all. First thing to check?
a) The Verify Certificate flag: the internal GUI's self-signed certificate fails validation, and the session silently refuses to start
b) The endpoint's inbound firewall rules for port 443
c) Whether the Jumpoint needs a reboot after RDP sessions
d) The user's SAML group mapping on the IdP
Correct: a. Web Jump with Verify Certificate enabled is documented to simply not start the session when the target site's certificate fails checks — the classic silent failure for self-signed internal GUIs. Fix the cert (or, for a trusted internal site only, untick the flag). Jump Clients/Jumpoints dial outbound so endpoint inbound rules are irrelevant, reboots aren't part of this failure class, and SAML mapping failures block console login, not one Jump Item type.
Pause & Predict
Predict: you replaced the PRA appliance's SSL certificate on Saturday; by Monday 300 of 2,000 Jump Clients are still offline. Mass redeploy? Type your guess.
Answer: No — docs say Jump Clients need 24–48 hours to settle after a certificate change, and a Monday-morning panic redeploy creates duplicate entries and orphans recording/vault links. Verify a sample endpoint can reach the appliance on 443 and trusts the new chain, confirm you did NOT also change the hostname in the same window (that combination is the real killer), and let the stragglers reconnect. Redeploy only the clients still dark after the window, from the offline installer.
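The answer above amounts to a small decision rule per straggler. Here is a rough sketch of it, assuming you know the hours elapsed since the certificate swap and whether the hostname changed in the same window. The function name and the output strings are illustrative only; the 48-hour figure is the upper end of the settle window quoted above.

```shell
# Decide what to do with a Jump Client that is still offline after a cert swap.
straggler_action() {
    local hours_since_swap="$1"
    local hostname_changed="$2"    # "yes" or "no"
    if [ "$hostname_changed" = "yes" ]; then
        echo "investigate-hostname-change"   # the dial-home address itself moved
    elif [ "$hours_since_swap" -lt 48 ]; then
        echo "wait"                          # still inside the 24-48 h settle window
    else
        echo "redeploy-this-client"          # dark after the window: offline installer
    fi
}

straggler_action 40 no     # prints: wait
straggler_action 72 no     # prints: redeploy-this-client
```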
Common mistake — blaming PRA for an IdP change
Symptom: Monday 9 AM, every SSO user gets a SAML error; local-account admins log in fine. Cause: the IdP team rotated the SAML signing certificate over the weekend — PRA still holds the old one. Fix: import the new IdP signing certificate (or refresh federation metadata) under /login → Users & Security → Security Providers, and check NTP/clock skew while you're there. Keep one local break-glass admin outside SSO precisely for this morning.

④ EPM/PMUL quick hits, the support case & the master cheat-sheet
Two endpoint-side classics deserve a fast lane. First, EPM policy not applying on Windows. Before touching the policy, check the two services that must be Running — the agent (still literally named Avecto Defendpoint Service) and the cloud check-in adapter (IC3Adapter) — plus the config registry at HKLM\SOFTWARE\Avecto\Privilege Guard Client. Fresh installs have a famous quirk: until the endpoint reboots, right-click "Run as administrator" still triggers native Windows UAC instead of EPM's on-demand rules. And if services, registry and reboot all check out, suspect order: Workstyles and Application Rules are first-match-wins, so a catch-all rule dragged above your specific rule silently eats everything below it. Policy design depth is lesson 15.
EPM for Windows (PowerShell): the two-service check before any policy archaeology
Get-Service | Where-Object { $_.DisplayName -match 'Defendpoint|Cloud Adapter' } |
  Format-Table Status, DisplayName -AutoSize

# config sanity: yes, the registry still says Avecto
Get-Item 'HKLM:\SOFTWARE\Avecto\Privilege Guard Client'
Expected output
Status  DisplayName
------  -----------
Running Avecto Defendpoint Service
Running BeyondTrust Privilege Management Cloud Adapter
# Both Running + registry present, policy still ignored?
# → reboot after fresh install, then audit Workstyle ORDER (first match wins).
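The first-match-wins ordering problem is easy to reproduce outside the product. The sketch below is not EPM's rule engine: it is a generic first-match evaluator over an ordered list of "glob=action" pairs, with made-up rule names and actions, meant only to show how a catch-all placed above a specific rule absorbs every request.

```shell
# First-match-wins evaluation over an ordered rule list.
# Each rule is "glob=action"; the rules and actions are invented for illustration.
first_match() {
    local app="$1"
    shift
    local rule pattern action
    for rule in "$@"; do
        pattern=${rule%%=*}    # text before the first "="
        action=${rule#*=}      # text after the first "="
        case $app in
            $pattern) echo "$action"; return 0 ;;
        esac
    done
    echo "no-match"
}

# Specific rule first: the installer reaches its own rule.
first_match "vscode-setup.exe" "vscode-*.exe=elevate" "*=passive"    # prints: elevate
# Catch-all dragged above it: the specific rule is never evaluated.
first_match "vscode-setup.exe" "*=passive" "vscode-*.exe=elevate"    # prints: passive
```

The rule set is identical in both calls; only the order changed, which is why ordering is worth auditing once services, registry and reboot have been ruled out.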
Aditya at HCL faces this
Sixty developer laptops at HCL got the EPM agent pushed overnight. Next morning, right-click "Run as administrator" on every NEW laptop pops the native Windows UAC credential box instead of EPM's elevation prompt — while last month's laptops behave perfectly.
Likely cause: The deployment script installed the agent but skipped the post-install restart. Until the endpoint reboots, EPM's on-demand shell-integration hook is not registered, so Windows handles the right-click elevation natively, a known fresh-install quirk. Services, registry and policy were healthy the whole time.
Diagnosis: Blast radius: only the new batch → the shared thing is the deployment, not the policy. On a sample laptop both services (Avecto Defendpoint Service + IC3Adapter) are Running, HKLM\SOFTWARE\Avecto\Privilege Guard Client exists, and the assigned policy revision matches the console. Agent and policy check out, which leaves the install step itself.
EPM (PM Cloud) console → Policies → Assign Policy to Groups (confirm the revision is assigned to the laptops' Computer Group) + endpoint Get-Service / registry check
Fix: Reboot the sixty laptops and add a mandatory restart step to the deployment script. No policy edits, no reinstalls.
Verify: After restart, right-click Run as administrator shows EPM's prompt and the elevation is logged against the on-demand Application Rule; native UAC no longer appears on any of the sixty.
Second, pbrun rejected on Linux. PMUL's decision is made centrally: pbrun sends the request to pbmasterd (default TCP 24345), which stamps ACCEPT or REJECT against the policy. The diagnostic gold is the shape of the failure: an instant reject means the policy said no — argue with pb.conf, not the firewall. A long hang means pbmasterd is unreachable — and here's the trap from lesson 16: opening 24345–24347 alone isn't always enough, because optimized connections also use the dynamic listening range (default 1024–65535); behind strict firewalls you must constrain that range in pb.settings and open it. Lint every policy edit with pbcheck -s before pushing — pbmasterd only reports syntax errors at runtime, when it's already rejecting half of TCS's overnight cron jobs. Full PMUL + AD Bridge coverage: lesson 16 (AD Bridge's own first-aid: /opt/pbis/bin/get-status, and remember the first domain join needs a reboot).
PMUL on a TCS lab run host: accepted vs rejected vs hung, and the pre-flight lint
$ pbrun id                        # policy allows id → runs as root
$ pbrun /sbin/shutdown -r now     # policy denies shutdown
$ pbcheck -s -f /etc/pb.conf      # ALWAYS lint pb.conf BEFORE deploying an edit
$ pbrun --testmaster=pbmaster01 /sbin/shutdown -r now   # dry-run accept/reject, runs nothing
Expected output
uid=0(root) gid=0(root) groups=0(root)
pbrun: Request rejected by pbmaster01.tcs-lab.in
# (exact reject wording varies by version/policy)
# instant reject = policy said no · long HANG = pbmasterd unreachable
#   → check TCP 24345 + the dynamic listening-port range in pb.settings
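The "shape of the failure" rule can be written down as a tiny classifier. This is a sketch under stated assumptions: the function name is made up, the 10-second hang threshold is arbitrary, and a non-zero exit code is taken to mean the request did not run.

```shell
# Map the shape of a pbrun failure to the layer worth investigating.
# elapsed is wall-clock seconds; the 10-second threshold is an assumption.
classify_pbrun_failure() {
    local elapsed="$1"
    local exit_code="$2"
    local hang_threshold=10
    if [ "$exit_code" -eq 0 ]; then
        echo "accepted"
    elif [ "$elapsed" -lt "$hang_threshold" ]; then
        echo "policy"     # instant reject: read pb.conf, not the firewall
    else
        echo "network"    # long hang: check 24345 and the dynamic port range
    fi
}

# Hypothetical wrapper around a harmless request:
#   start=$(date +%s); pbrun id; rc=$?
#   classify_pbrun_failure $(( $(date +%s) - start )) "$rc"

classify_pbrun_failure 1 1     # prints: policy
classify_pbrun_failure 45 1    # prints: network
```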
Quick check · Q4 of 10
On Aditya's TCS estate, pbrun works for short commands but intermittently hangs for some sessions — ports 24345, 24346 and 24347 are open between hosts. What's the most likely missing piece?
a) pbmasterd needs a license refresh
b) The pb.conf policy has a syntax error
c) The run hosts need a reboot after PMUL install
d) Optimized/dynamic connections use the dynamic listening-port range (default 1024–65535), which the firewall is blocking; constrain it in pb.settings and open it
Correct: d. The three static daemon ports aren't the whole story: PMUL's optimized connections listen on the configurable dynamic range, so strict firewalls cause exactly this intermittent-hang pattern. A licensing or syntax problem would fail consistently (and pbcheck catches syntax), and PMUL doesn't require reboots for normal operation — that's the EPM-Windows on-demand quirk.
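To compare the configured dynamic range with what the firewall team actually opened, a check along these lines may help. It is a sketch only: the keyword names minlisteningport and maxlisteningport and the "keyword value" line layout are assumptions about pb.settings that should be confirmed against the PMUL documentation for your version. The 1024–65535 fallback is the default quoted in this lesson.

```shell
# Read settings text on stdin and report whether the listening-port range
# fits inside the firewall's open range (passed as two arguments).
# Keyword names and line format are assumed, not taken from vendor docs.
range_fits_firewall() {
    local fw_min="$1"
    local fw_max="$2"
    local min=1024 max=65535    # defaults when the keywords are absent
    local key value
    while read -r key value; do
        case "$key" in
            minlisteningport) min="$value" ;;
            maxlisteningport) max="$value" ;;
        esac
    done
    if [ "$min" -ge "$fw_min" ] && [ "$max" -le "$fw_max" ]; then
        echo "ok $min-$max"
    else
        echo "blocked $min-$max"
    fi
}

# Range constrained to match the firewall rule:
printf 'minlisteningport 10000\nmaxlisteningport 10200\n' |
    range_fits_firewall 10000 10200    # prints: ok 10000-10200
# Range never constrained, so the default applies:
printf 'submitmasters pbmaster01\n' |
    range_fits_firewall 10000 10200    # prints: blocked 1024-65535
```

On a real host the input would come from the settings file itself, for example by redirecting it into the function.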
When to open a BeyondTrust support case — and what to attach
Escalate when the ladder is exhausted, not when you're tired: you've identified the layer, reproduced the failure, captured the exact error string, and the fix needs vendor internals (appliance-side faults, suspected product bugs, anything mid-upgrade) — or anything security-relevant like a CVE question. Then file like you'd file an FIR: the desk can only move as fast as your evidence. Attach: (1) a timeline — when it broke, what changed (upgrades, cert swaps, EDR pushes, firewall changes); (2) the exact error strings, copied verbatim, with screenshots; (3) the right logs — Password Change Agent log for rotation, scanner log for discovery, endpoint Event Viewer + EDR log for Jump Clients; (4) versions — appliance/cloud release, client versions, recent upgrade history; (5) blast radius — one account or a platform, one endpoint or forty. For on-prem appliances, note that version/patch state lives in the /appliance interface (not /login), and support's remote tunnel goes outbound from your appliance over 443 to gwsupport.bomgar.com — yes, the Bomgar heritage name — so there's no inbound hole to request.
The evidence pack, as a checklist
Timeline + what-changed · exact error string (verbatim) · the layer's log · product + version (appliance, client, agent) · blast radius (who/how many) · what you already ruled out and how. Six lines that routinely turn a 5-day ticket into a 1-day ticket — and the same six lines your own L3 wants from you.
And here it is — the page this whole series funnels into. Every symptom below maps to its layer, its first move, and its deep-dive lesson: rotation design in lesson 6, sessions in lesson 8, Jump tech in lesson 11, PRA policies in lesson 13, EPM in lesson 15, PMUL in lesson 16, and HA/failover (when the fix is "the appliance itself died") in lesson 18.
Figure 5 — The master triage card
Twelve symptoms → layer → first move. Screenshot this one — it's the interview answer to "a rotation failed, walk me through it", and the desk card for your first PAM on-call rotation.
Pause & Predict
Predict: which three of the twelve cheat-sheet tiles share "a certificate" as the root layer — and what's the common discipline across all three? Type your guess.
Answer: Jump Client offline (appliance cert swap — 24–48 h settle window), Web Jump silent no-start (target site cert vs Verify Certificate), and SAML login failure (IdP signing cert rotation). Common discipline: certificates fail on OTHER people's schedules — track expiry and rotation dates for the appliance cert, internal site certs and the IdP signing cert in the same calendar, and never combine a cert change with a hostname change.
🎮 Hands-on: BeyondTrust PAM Essentials room · Deep-dive: credential rotation (lesson 6) · Deep-dive: Jump technology (lesson 11)
Prove you own the playbook
Take any symptom from this page cold — say, "vendor Jump Clients went offline after the weekend" — and recite: the layer order you'd walk (service → egress → cert → thresholds), the one test per gate, the log you'd quote, and the evidence pack you'd attach if it escalates. If you can do that without scrolling up, you are interview-ready for the troubleshooting round — which is exactly where lesson 20 takes you next.

        
        
        
        
        
        

        
        
        
            📝 Wrap-up assessment — six more
            You've answered four inline; six remain. The pass mark for the lesson is 70% (7 of 10).
            
                Q5 · Remember
Which port does the Password Safe RDP session proxy listen on by default?
a) 3389
b) 443
c) 4489
d) 8443
Correct: c. The RDP session proxy listens on 4489 (SSH proxy on 4422, session monitoring on 4488). 3389 is the appliance-to-target RDP leg, 443 is the web portal/API, and 8443 is not a Password Safe default — answering 3389 means you missed the proxy concept entirely.
Q6 · Apply
At 23:30 UTC every one of the 212 managed accounts on the Linux platform at Infosys fails rotation simultaneously. Applying the triage ladder, your FIRST check is:
a) The functional account for that platform: test it, check AD lockout and its rights
b) SSH connectivity to each of the 212 target servers
c) Raising Maximum retries on the Password Change Agent
d) Re-running discovery to refresh the account list
Correct: a. Platform-wide blast radius points at the shared layer: the one functional account that performs every change on that platform. Testing 212 targets first inverts the ladder, raising retries treats the symptom while the cause keeps failing, and discovery has nothing to do with rotation execution.
Q7 · Apply
A vendor's Jump Client shows offline. The vendor insists "the internet works fine on this machine." What do you have them test first?
a) Ping 8.8.8.8 to prove DNS works
b) Whether the appliance can ping the endpoint inbound
c) Reinstalling the Jump Client immediately
d) Outbound TCP 443 from the endpoint to the appliance FQDN, the egress/proxy path the persistent tunnel actually uses
Correct: d. Jump Clients keep a persistent OUTBOUND 443 tunnel to the appliance — general browsing working proves nothing about that specific path through a new egress proxy or SSL inspection. Ping tests neither TCP 443 nor the proxy path, the appliance never dials inbound to endpoints, and a blind reinstall is the last gate (it creates duplicates and orphans recordings).
Q8 · Analyze
RDP sessions through Password Safe start normally, and the Sessions grid shows last night's vendor session with correct start/end times — but there is no recording. Steering through the proxy is clearly working. Most likely root cause?
a) The RDP proxy port 4489 was blocked mid-session
b) The access policy admitting that vendor group never enabled Record Session, or the session-monitoring path (4488 listener / session-cache disk) failed at write time
c) The functional account lost its rights on the target
d) The appliance SSL certificate expired during the session
Correct: b. A session that registered in the grid proves the proxy path and ports worked — so the missing artifact is either policy (Record Session is per-access-policy, not global) or the recording pipeline (4488 listener health, disk space for the session cache). A blocked 4489 would have killed the session itself, the functional account is rotation machinery not session recording, and an expired appliance cert breaks connections, not just their recordings.
Q9 · Analyze
During one Saturday window, the PRA team replaced the appliance SSL certificate AND moved the appliance to a new hostname. By Tuesday, hundreds of Jump Clients remain offline. The best analysis is:
a) The EDR on every endpoint coincidentally blocked the client the same weekend
b) The clients' licenses lapsed when the certificate changed
c) Jump Clients always require manual reinstall after any certificate change
d) Clients tolerate a cert swap (settling over 24–48 h), but combining it with a hostname change broke their dial-home reference; the two changes should never share one window
Correct: d. A certificate swap alone is survivable — documentation says allow 24–48 hours for clients to settle. Changing the hostname in the same window removes the very address the clients dial home to, which is the textbook mass-offline self-wound. Mass coincidental EDR action across the fleet is implausible without an EDR change, licensing is unrelated to certificates, and routine reinstalls after cert changes are explicitly NOT required.
Q10 · Evaluate
Platform-wide rotation failure at 2 AM. Runbook A: restart the appliance, bulk re-onboard the failing accounts, and reboot targets until green. Runbook B: test the functional account, read the Password Change Agent queue, manually rotate ONE account as an experiment, then apply one targeted fix. Which is stronger, and why?
a) A: restarting everything fixes the most causes per minute
b) B: it walks the ladder. One FA test addresses the most probable shared cause, the log names the layer, and a single-account experiment proves the fix before touching 200 accounts; A destroys evidence, risks HA failover, and re-onboarding can break account links
c) A, but only re-onboard and skip the restart
d) They are equivalent if both end green by morning
Correct: b. B is diagnosis; A is superstition with downtime. Restarts wipe in-memory state and logs you need, an appliance reboot on an HA pair can trigger failover mid-incident, and bulk re-onboarding rewrites account metadata (and can sever Smart Rule/setting links) without addressing the cause. B's one-test-per-layer approach also leaves a written trail — which is what your RCA, your auditor and your interviewer all ask for.
                
            

            
                
                
                
            
            
                
            
        
        

        
        
        
            🧠 In your own words
            In one line: why do you test the functional account before anything else when a whole platform stops rotating? Then compare your answer to the expert version.
            
            
            Expert version: Because every rotation on that platform runs through that one shared worker credential — if its rights, password or lockout state breaks, all accounts fail together, so a single FA test confirms or eliminates the most likely cause in one move instead of debugging hundreds of accounts individually.
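The same reasoning can be expressed as a blast-radius rule. This is an illustrative sketch, not a Password Safe feature: the function name and the output labels are made up, and the inputs are simply the number of failed rotations and the number of managed accounts on one platform.

```shell
# Pick the first rung of the rotation ladder from the blast radius.
first_rotation_check() {
    local failed="$1"
    local total="$2"
    if [ "$failed" -eq 0 ]; then
        echo "nothing-to-do"
    elif [ "$total" -gt 1 ] && [ "$failed" -eq "$total" ]; then
        echo "functional-account"   # everything failed: suspect the shared credential
    else
        echo "per-account"          # partial failure: look at the individual accounts
    fi
}

first_rotation_check 212 212    # prints: functional-account
first_rotation_check 3 212      # prints: per-account
```

One functional-account test covers the 212-of-212 case in a single move, which is the point of starting there.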
        
        

        
        
            
            
        

        
        
            
        

        
        
        
            📖 Glossary
            
                Functional account
The worker credential Password Safe uses on a managed system to rotate other accounts — platform-wide rotation failures usually start here.
                Password Change Agent
Appliance/broker service executing queued password changes with configurable retries (Configuration → Privileged Access Management Agents).
                Password Test Agent
Scheduled service that verifies stored passwords still work — the only trigger for Reset Password on Mismatch.
                identityValue error
"Value cannot be null (Parameter 'identityValue')" — Password Safe cannot resolve the managed account's SID in the directory.
                Session Monitoring listener
Port 4488 (on 127.0.0.1) — the recording path for proxied sessions; pairs with proxy ports 4422 (SSH) and 4489 (RDP).
                Lost vs deleted (Jump Client)
Two maintenance thresholds: unconnected clients are marked lost first, auto-deleted later — keep lost-days < delete-days.
                Uninstalled Jump Client Behavior
Setting that keeps locally-uninstalled clients visible as tombstones instead of silently disappearing from the list.
                Jumpoint (Gateway)
One broker per known network; agentless push needs ADMIN$/IPC$, Remote Registry and 135/445 — and a service account with more rights than Local System.
                Verify Certificate (Web Jump)
Flag that silently refuses to start the session when the target web app's certificate fails validation — fix the cert, don't reflex-untick.
                IC3Adapter
BeyondTrust Privilege Management Cloud Adapter — the EPM agent's policy check-in service, partner to the Avecto Defendpoint Service.
                pbmasterd
PMUL policy-server daemon (default TCP 24345) that stamps every pbrun request ACCEPT or REJECT against pb.conf.
                --testmaster
pbrun dry-run flag — tests accept/reject against a policy server without executing the command; pair with pbcheck -s before any policy push.
                
            
        
        

        
        
            📚 Sources
            
                BeyondTrust Docs — Configure Password Safe Agents (Password Change Agent: Retry failed changes after (minutes), Maximum retries, Unlimited Retries; Password Test Agent schedule). docs.beyondtrust.com/bips/docs/configure-password-safe-agents
                BeyondTrust Docs — Password Safe SSH/RDP Connections (session proxy defaults: SSH 4422, RDP 4489, Session Monitoring 4488 on 127.0.0.1; Direct Connect string formats). docs.beyondtrust.com/bips/docs/ps-ssh-rdp-connections
                BeyondTrust Docs — PRA Jump Client Error Messages + Jump Clients guide (uninstalled/lost/version-mismatch strings, lost vs delete thresholds, Uninstalled Jump Client Behavior, upgrade bandwidth throttling). docs.beyondtrust.com/pra/v24.3/docs/jump-client-errors · docs.beyondtrust.com/pra/docs/jump-clients
                BeyondTrust Docs — Jumpoint guide + Jump Shortcuts (Local System default and domain-account guidance, ADMIN$/IPC$ + Remote Registry + 135/445 prerequisites, loopback unsupported; Web Jump Verify Certificate + field-hint selectors). docs.beyondtrust.com/pra/docs/jumpoint · docs.beyondtrust.com/pra/v24.3/docs/jump-shortcuts
                BeyondTrust Beekeepers community — rotation war stories: "Value cannot be null (Parameter 'identityValue')" (SID not resolvable), functional-account test failures, F5 BIG-IP "Thread was interrupted from a waiting state". beekeepers.beyondtrust.com/general-40
                BeyondTrust Beekeepers community — Jump Clients offline after appliance upgrade 24.1.4→24.2.3 (EDR blocked the mid-upgrade reinstall; duplicate-entry cleanup) + PRA 24.3.4 intermittent offline clients. beekeepers.beyondtrust.com/general-51
                BeyondTrust Docs — EPM-UL pbrun, pbcheck & firewall/port usage (pbmasterd 24345 / pblocald 24346 / pblogd 24347, dynamic listening-port ranges, --testmaster) and EPM-Windows services (Avecto Defendpoint Service, IC3Adapter, Privilege Guard Client registry). docs.beyondtrust.com/epm-ul/docs/pbrun · docs.beyondtrust.com/epm-ul/docs/firewalls
                BeyondTrust Docs — SSL certificates & appliance management (allow 24–48 hours for Jump Clients to pick up a replaced certificate; /appliance interface for on-prem patching; outbound support tunnel to gwsupport.bomgar.com). docs.beyondtrust.com/pra/docs/on-prem-ssl-certificates
                BeyondTrust University — Password Safe / Privileged Remote Access Administration courses & certification (40-question exam, 75% pass, 2 attempts) — troubleshooting scenarios are a core exam and interview lane. beyondtrust.com/services/beyondtrust-university/get-certified
                
            
        
        

        
        
            What's next?
            You can now fix BeyondTrust faster than most working L2s — so let's convert that into offers. The final lesson turns the whole series into 30 real interview questions with strong answers, plus a salary-mapped PAM career plan for India.
            
