Most engineers think CIS hardening with Ansible is a one-time, one-button job: "point the role at the servers, hit enforce, done — they're compliant forever."
Wrong — and that mindset is how teams lock themselves out of SSH and break production apps on a Friday evening. Real CIS automation is a loop, not a button: audit-only first to score where you stand, review which of the hundreds of controls will change, exclude the few that break your app (with a written exception), enforce in a maintenance window, then re-run to prove idempotence — a clean second run with zero changes is your evidence of compliance. And because servers drift, you schedule the re-run (e.g. in AWX) so the baseline self-heals.
① Why automate hardening — the pain of doing CIS by hand
Meet Sneha, an L2 engineer at Infosys. Her audit team hands her a line that sounds simple: "bring all 100 Linux servers up to the CIS Benchmark." Then she opens the benchmark PDF. The CIS Ubuntu 22.04 Benchmark v2.0.0 has 244 individual controls; RHEL 9 v2.0.0 and Ubuntu 24.04 have their own hundreds. Each control is a small edit — a line in sshd_config, a password-policy value, a file permission, a kernel parameter, an auditd rule, a service to disable.
Doing that by hand on 100 boxes is four problems stacked on top of each other. Slow: even five minutes per control per server is days of clicking and SSHing. Inconsistent: Sneha sets MaxAuthTries 4 on 97 servers and fat-fingers 14 on three. Unprovable: when the auditor asks "prove server-42 is compliant," she has no evidence except "I think I did it." And the quiet killer — drift: a teammate edits sshd_config by hand next week to debug something and forgets to revert, and now the server silently falls out of compliance with nobody watching.
Here is the shift. Compliance-as-code with Ansible expresses the entire CIS baseline as code once. You run it against all 100 servers in parallel, so 'slow' becomes minutes. Because it's the same code everywhere, 'inconsistent' disappears — every box gets MaxAuthTries 4, full stop. And the move that makes Sneha's auditor happy is idempotence: a second run that reports zero changes is proof the fleet already matches the baseline. Drift is caught and corrected on the next scheduled run.
The four pains of hand-hardening
These are exactly the problems an auditor (and a CIS interview) starts from.
- Slow: hundreds of controls × dozens of servers, done by hand. So: it's never finished, and re-doing it after a rebuild hurts.
- Inconsistent: one typo and 'MaxAuthTries 4' becomes '14' on three boxes. So: 'compliant' servers quietly aren't.
- Unprovable: no report, just 'trust me'. So: when the auditor asks for evidence, you have none.
- Drift: a later manual edit slips the box out of baseline. So: compliance rots silently until the next audit.
Hardening by hand is like every flat's guard writing the visitor rules from memory in their own notebook — slightly different in each tower, and impossible to audit. Ansible is the printed master gate-pass policy the society office prints once and posts at every gate: identical rules everywhere, and you can walk to any gate and check the printout matches. Re-printing it next month (the re-run) instantly catches any guard who quietly changed a rule.
Rahul at TCS hardened 80 servers by hand last quarter. The auditor now asks: "prove server-57 still matches the CIS SSH baseline today." Why is an idempotent Ansible re-run the strongest answer?
Pause & Predict
Predict: you harden 100 servers with a role today and they all pass. Six weeks later, with nobody touching Ansible, name ONE reason a few servers could fall out of compliance, and the one habit that catches it. Type your guess.
② The building blocks — a maintained CIS role, toggles & tags
You could hand-write 244 tasks yourself — and you'll learn a lot doing it once — but for a real fleet most teams start from a maintained role. The best-known is the open-source ansible-lockdown project — roles like UBUNTU22-CIS, UBUNTU24-CIS, RHEL8-CIS and RHEL9-CIS. Each role already maps every CIS control to a task, so you spend your judgement on which controls to apply and which to skip, not on re-writing the benchmark.
The first building block is a toggle per control. Every CIS rule has its own boolean in defaults/main.yml, named after the rule number: on the Ubuntu 22 role it's ubtu22cis_rule_X_X_X, on RHEL 9 it's rhel9cis_rule_5_2_1 and so on. Set a toggle to false and that single control is skipped — this is exactly how you carve out a control that would break your app, with a comment recording why. There are also master switches: run_audit turns on the built-in compliance check, and system_is_audit / audit_only make the run check-only instead of changing anything.
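Before deciding what to skip, it helps to see which toggles a role actually exposes. The following is a minimal sketch, assuming the role's defaults/main.yml keeps each toggle as a top-level line of the form `ubtu22cis_rule_N_N_N: true` as the naming above suggests. The function name `list_rule_toggles` and the sample file are made up for illustration; point it at the real defaults/main.yml of the role you pulled.

```shell
#!/bin/sh
# list_rule_toggles FILE PREFIX
# Prints "rule_number value" for every per-rule toggle found in FILE.
# Assumes toggles are top-level keys such as "ubtu22cis_rule_5_2_5: true".
list_rule_toggles() {
    file=$1
    prefix=$2
    sed -n -E "s/^${prefix}_rule_([0-9_]+):[[:space:]]*(true|false).*/\1 \2/p" "$file"
}

# Illustrative sample standing in for a role's defaults/main.yml
sample=$(mktemp)
cat > "$sample" <<'EOF'
ubtu22cis_rule_5_2_1: true
ubtu22cis_rule_5_2_5: true
ubtu22cis_rule_4_1_1: false
run_audit: false
EOF

# Lists the three rule toggles; run_audit is not a rule toggle and is ignored
list_rule_toggles "$sample" ubtu22cis
rm -f "$sample"
```

Piping the result through `grep ' false$'` gives a quick list of everything currently switched off, which is a handy starting point for an exceptions review.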
The second building block is tags, organised by CIS structure. Every task in the role carries several: the level (level1-server, level2-server, level1-workstation…), the section/component (ssh, services, firewall, auditd), whether it's a change or a check (patch vs audit), and the rule number (rule_5_2_1). That lets you scope a run precisely. Level 1 is the safe baseline; Level 2 is stricter and more likely to break something — so you usually roll out Level 1 fleet-wide first and Level 2 only where you've tested it.
# Level 1 server controls only (the safe baseline, fleet-wide)
ansible-playbook site.yml -i prod_inventory --tags level1-server

# Just the SSH section, and only a single rule, for a careful first test
ansible-playbook site.yml -i prod_inventory --tags ssh
ansible-playbook site.yml -i prod_inventory --tags rule_5_2_1

# Apply everything EXCEPT the stricter Level 2 controls
ansible-playbook site.yml -i prod_inventory --skip-tags level2-server
PLAY [Apply CIS Benchmark - Ubuntu 22.04] **************************************

TASK [UBUNTU22-CIS : 5.2.1 | Ensure permissions on /etc/ssh/sshd_config] *******
ok: [web-mum-01]

TASK [UBUNTU22-CIS : 5.2.5 | Ensure SSH MaxAuthTries is set to 4 or less] *******
changed: [web-mum-01]

PLAY RECAP *********************************************************************
web-mum-01 : ok=63 changed=9 unreachable=0 failed=0 skipped=171
Symptom: you run ansible-playbook site.yml with no tags on a production fleet and it remediates every Level 1 AND Level 2 control in one go — including ones that break your app — and now you're firefighting. Cause: no scoping. Fix: always scope your first runs. Start with --check --tags level1-server (or audit_only: true) to see what would change, roll out Level 1 before Level 2, and use --skip-tags level2-server until you've tested the stricter controls in staging.
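One way to make "always scope your first runs" hard to forget is a small guard around the command. This is only a sketch: `is_scoped` and `guarded_playbook` are hypothetical helper names, not part of Ansible or the role. The flags it looks for (`--check`/`-C`, `--tags`/`-t`, `--skip-tags`) are the standard ansible-playbook options used above.

```shell
#!/bin/sh
# is_scoped ARGS...
# Succeeds (exit 0) if the argument list contains --check, --tags or
# --skip-tags (long or short form); fails (exit 1) otherwise.
is_scoped() {
    for arg in "$@"; do
        case $arg in
            --check|-C|--tags|--tags=*|-t|--skip-tags|--skip-tags=*) return 0 ;;
        esac
    done
    return 1
}

# guarded_playbook ARGS...
# Hypothetical wrapper: passes scoped runs through to ansible-playbook and
# refuses to start an unscoped one.
guarded_playbook() {
    if is_scoped "$@"; then
        ansible-playbook "$@"
    else
        echo "refusing unscoped run: add --check, --tags or --skip-tags" >&2
        return 2
    fi
}
```

With this sourced in your shell, `guarded_playbook site.yml -i prod_inventory` stops with an error, while `guarded_playbook site.yml -i prod_inventory --check --tags level1-server` runs as normal. It is a seatbelt, not a substitute for reading the diff.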
Priya at Wipro wants to harden only the SSH controls on a single test box first, without touching password policy, auditd or firewall. Which flag does that cleanly?
Pause & Predict
Predict: an app on one server genuinely needs a setting that a CIS Level 2 control forbids. You don't want to disable that control on the whole fleet. What's the clean way to make ONE host an exception — and what must you not forget? Type your guess.
③ Running it safely — audit, review, exclude, enforce, prove
Hardening is the one job where 'move fast' gets you locked out of production. The safe sequence is a ladder, and you climb it in order: audit → review → exclude → enforce → prove → schedule. Skipping a rung is how the Friday-evening outage happens.
Step 1: audit-only, to score where you stand. Run the role in a mode that only checks and changes nothing. With the ansible-lockdown role you set audit_only: true (or system_is_audit: true) with run_audit: true; under the hood it runs goss against the same controls and writes a report. Ansible's own --check mode is the lighter built-in version of the same idea: ansible-playbook site.yml --check --diff shows every line that would change, host by host.
Step 2: review what would change. Read the audit report or the --diff output and ask of each big change: will this break a running app? The usual suspects are SSH controls (could lock you out), cipher/MAC restrictions (could break old clients or SFTP), firewall defaults (could drop a port your app uses) and disabling an 'unused' service that isn't actually unused.
Step 3: exclude the breakers with documented exceptions. For each control that would break something, set its toggle to false with a dated comment and a ticket; that is your audit trail.
Step 4: enforce in a maintenance window, ideally Level 1 first, on a canary host, then the fleet.
Step 5: re-run to prove idempotence. A clean second run with changed=0 is your compliance evidence.
Step 6: schedule it in AWX / Automation Controller so drift gets corrected automatically.
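The "dated comment and a ticket" rule for exceptions is easy to check mechanically. Here is a minimal sketch of such a check. It assumes exceptions live in a vars file as `..._rule_N_N_N: false` lines with the justification in a trailing comment on the same line, that ticket IDs look like OPS-4821, and that dates are written YYYY-MM-DD. All three conventions, and the name `undocumented_exceptions`, are assumptions for illustration.

```shell
#!/bin/sh
# undocumented_exceptions FILE
# Prints every rule toggle set to false whose trailing comment lacks
# either a ticket ID (like OPS-4821) or an ISO date (YYYY-MM-DD).
# Exit status is 1 if any such line is found, 0 otherwise.
undocumented_exceptions() {
    awk '
        /_rule_[0-9_]+:[ \t]*false/ {
            has_ticket = ($0 ~ /#.*[A-Z]+-[0-9]+/)
            has_date   = ($0 ~ /#.*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/)
            if (!has_ticket || !has_date) { print FILENAME ": " $0; bad = 1 }
        }
        END { exit bad }
    ' "$1"
}
```

Run against a group_vars or host_vars file in CI, a non-zero exit blocks the merge until every switched-off control carries its paper trail.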
▶ Walk the audit-then-enforce ladder on one host
Follow a single Ubuntu host, web-mum-01, through the safe sequence. Watch the run change from 'check only' to 'enforce' to 'prove'.
# 1) Audit / dry-run: score the host, change NOTHING
ansible-playbook site.yml -i prod_inventory --check --diff --tags level1-server

# 2) Enforce Level 1 in the window, on the canary first
ansible-playbook site.yml -i prod_inventory --limit web-mum-01 --tags level1-server

# 3) Prove idempotence: a clean re-run = compliance evidence
ansible-playbook site.yml -i prod_inventory --limit web-mum-01 --tags level1-server
# run 2 (enforce):
PLAY RECAP ********************************************************************
web-mum-01 : ok=66 changed=58 unreachable=0 failed=0 skipped=120

# run 3 (prove): zero changes = idempotent + compliant
PLAY RECAP ********************************************************************
web-mum-01 : ok=66 changed=0 unreachable=0 failed=0 skipped=120
A green play recap only means the tasks executed. To prove compliance, do two things: (1) re-run and confirm changed=0 — idempotence is your evidence the state matches the baseline; and (2) read the post-enforce goss audit report (the role writes pre_audit_outfile and post_audit_outfile) and confirm the previously-failing rules now pass, except the ones you deliberately excepted. 'The playbook finished' is not the same as 'the host is compliant.'
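The changed=0 check can be turned into a pass/fail gate rather than something you eyeball. The sketch below reads ansible-playbook output on stdin and sums the changed= counters on the PLAY RECAP lines. The function names `recap_changed_total` and `recap_is_clean` are illustrative, and the parsing assumes the plain-text recap format shown above.

```shell
#!/bin/sh
# recap_changed_total
# Reads ansible-playbook output on stdin and prints the sum of all
# "changed=N" counters on PLAY RECAP host lines.
# Exits 1 (printing nothing) if no recap host line was seen at all.
recap_changed_total() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && /changed=[0-9]+/ {
            hosts++
            for (i = 1; i <= NF; i++)
                if ($i ~ /^changed=[0-9]+$/) { split($i, kv, "="); total += kv[2] }
        }
        END { if (!hosts) exit 1; print total + 0 }
    '
}

# recap_is_clean
# Succeeds only when a recap was found and it reports zero changes.
recap_is_clean() {
    total=$(recap_changed_total) || return 1
    [ "$total" -eq 0 ]
}
```

A typical use would be `ansible-playbook site.yml -i prod_inventory --limit web-mum-01 --tags level1-server | tee prove.log | recap_is_clean`, keeping prove.log as the evidence file. This sketch only looks at changed=; a real gate should also insist on failed=0 and unreachable=0, and it does not replace reading the goss report.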
Karthik at HCL faces this
Karthik enforces the full CIS role on a staging box over SSH. Halfway through, his SSH session freezes and he can't reconnect — and the next Ansible run to that host fails to copy files with an SFTP error.
Two SSH-hardening controls bit him at once. A cipher/MAC restriction dropped the algorithm his live session was using, and another control disabled the SFTP subsystem — but Ansible copies files to the host over SFTP by default, so subsequent runs can't transfer files.
He recognises a classic CIS SSH gotcha: his own login session and Ansible's transport both ride sshd, so SSH controls can cut the very connection doing the hardening. He checks which sshd controls changed in the --diff he should have read first.
Console/iLO out-of-band login → /etc/ssh/sshd_config + journalctl -u ssh; on the control node set scp_if_ssh in ansible.cfg
Get back in via the out-of-band console (iLO/iDRAC/cloud serial), revert or loosen the cipher control to keep a working algorithm, and either keep the SFTP subsystem or set scp_if_ssh = True in ansible.cfg so Ansible uses SCP. Record both as documented exceptions if the app needs them.
Re-run with --check --diff first (no freeze), confirm the SSH session stays up, and confirm a normal ansible-playbook run can copy files to the host again; then enforce for real in a window.
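For reference, the scp_if_ssh workaround from Karthik's fix could look roughly like this on the control node. It is a sketch under assumptions: `CFG_DIR` is a hypothetical variable that defaults to a scratch directory so nothing real is overwritten, and the setting is placed in the `[ssh_connection]` section of ansible.cfg. Newer ansible-core releases are understood to prefer a transfer_method setting for the same purpose, so check the ssh connection plugin documentation for your version before relying on either key.

```shell
#!/bin/sh
# Write a project-local ansible.cfg that asks the ssh connection plugin to
# copy files with scp instead of sftp.
# CFG_DIR is a placeholder for this sketch; set it to your project directory.
CFG_DIR=${CFG_DIR:-$(mktemp -d)}

cat > "$CFG_DIR/ansible.cfg" <<'EOF'
[ssh_connection]
# Workaround while hardened hosts have the SFTP subsystem disabled.
scp_if_ssh = True
EOF

echo "wrote $CFG_DIR/ansible.cfg"
```

Treat this as a stopgap and record it alongside the SFTP exception, so the workaround and the reason for it stay together.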
Aditya is about to enforce a CIS role on 50 production servers he reaches over SSH. Which single habit most reduces the chance of locking himself (or Ansible) out?
Pause & Predict
Predict: you enforce a CIS role and the FIRST run reports changed=40. You re-run the identical play and it reports changed=12, not 0. What does a non-zero second run most likely mean — and is the role 'broken'? Type your guess.
④ A real pass — harden a fleet, get a report, handle a conflict
Time to put it together on a real group. Sneha's target is app_servers — a mix of Ubuntu 22.04 and RHEL 9 hosts at Flipkart. Her plan covers the controls that matter most on day one: SSH hardening (disable root login, MaxAuthTries 4, strong ciphers), password + sudo policy (faillock lockout, password quality, NOPASSWD audit), auditd (capture logins and privileged commands), and the host firewall (ufw on Ubuntu, firewalld/nftables on RHEL, default-deny inbound).
She points the right role at the right hosts — UBUNTU22-CIS for the Ubuntu group, RHEL9-CIS for the RHEL group — using the rule toggles and tags from section 2. The SSH controls live in section 5.2 (e.g. rule_5_2_5 MaxAuthTries, root-login disable); auditd is section 4; the firewall and network controls in their own sections. She runs audit-only first, reads the goss report, and only then enforces.
- name: Harden Ubuntu app servers to CIS
  hosts: app_servers_ubuntu
  become: true
  vars:
    run_audit: true                # produce the goss compliance report
    audit_only: true               # STEP 1: check only, change nothing
    ubtu22cis_rule_5_2_5: true     # SSH MaxAuthTries <= 4
    ubtu22cis_rule_4_1_1: false    # auditd start, excepted (ticket OPS-4821)
  roles:
    - UBUNTU22-CIS

- name: Harden RHEL app servers to CIS
  hosts: app_servers_rhel
  become: true
  vars:
    run_audit: true
    audit_only: true
    rhel9cis_rule_5_2_1: true      # sshd_config permissions
  roles:
    - RHEL9-CIS

PLAY RECAP ********************************************************************
web-mum-01 (ubuntu) : ok=182 changed=0 failed=0   # audit_only: nothing changed
db-blr-04 (rhel9)   : ok=171 changed=0 failed=0
# goss report written to /opt/audit_web-mum-01-CIS-UBUNTU22_1749600000.json
The audit report flags 62 failing controls on web-mum-01. Reviewing the diff, one control would break a legacy monitoring agent that needs an older SSH cipher. This is the conflict: enforce the strict cipher list and the agent loses its connection. Sneha doesn't disable the control fleet-wide — she scopes the exception to just the hosts running that agent, with a dated comment and a ticket, then plans to fix the agent so she can remove the exception later. Then she flips audit_only to false and enforces Level 1 in the window.
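Scoping an exception to the affected hosts usually means putting the toggle in that host's inventory variables rather than in the play. A minimal sketch follows. `INV_DIR`, the host name `legacy-agent-host` and the rule number are placeholders, since the cipher control's actual rule number is not given here; look it up in the role's defaults/main.yml.

```shell
#!/bin/sh
# Record a host-scoped exception in host_vars instead of turning the
# control off for the whole fleet.
# INV_DIR defaults to a scratch directory; point it at your inventory dir.
INV_DIR=${INV_DIR:-$(mktemp -d)}
HOST=legacy-agent-host        # hypothetical inventory hostname

mkdir -p "$INV_DIR/host_vars"
cat > "$INV_DIR/host_vars/$HOST.yml" <<'EOF'
---
# EXCEPTION (add the date and ticket ID here): the legacy monitoring agent
# needs an older SSH cipher. Remove once the agent is upgraded.
# Replace X_X_X with the cipher rule number from the role's defaults/main.yml.
ubtu22cis_rule_X_X_X: false
EOF

echo "wrote $INV_DIR/host_vars/$HOST.yml"
```

One thing to watch: inventory host_vars override role defaults but are themselves overridden by play-level vars, so the same toggle must not also be set under `vars:` in the play, or the exception silently never applies.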
Symptom: after a clean CIS run, an app that worked yesterday can't resolve names or send mail, and the change log shows a service was stopped/masked. Cause: a CIS control disabled a 'recommended-off' service (e.g. rpcbind, an MTA, avahi) that your stack actually depends on — 'unused' is the benchmark's assumption, not your reality. Fix: in the audit/--check review, list every service the role would disable, check it against what your apps need, and except the ones in use (toggle false + ticket). The benchmark is a starting point you tailor, not gospel you apply blind.
One sober, current note: hardening roles touch the most sensitive files on the box, so treat the role itself as production code. Pull a pinned, reviewed version of the maintained role (don't git clone main straight onto 100 prod servers), test every upgrade of the role in staging, and keep your ansible-vault secrets and SSH keys locked down — a compromised control node that can harden every server can also mis-configure every server. For RHEL teams, recent CIS coverage is strong: the RHEL 9 v2.0.0 profile is now ~99% automatable via the SCAP Security Guide, so the gap between 'audit says fail' and 'Ansible can fix it' is smaller than ever.
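Pinning the role can be as simple as a requirements.yml with an explicit version. The sketch below generates one; `ROLE_VERSION` and `OUT_DIR` are placeholders, and no real tag is suggested here because the right one is whichever release you have reviewed and tested in staging.

```shell
#!/bin/sh
# Pin the hardening role to a reviewed version in requirements.yml.
# ROLE_VERSION is a placeholder: substitute a tag or commit you have
# actually reviewed. OUT_DIR defaults to a scratch directory.
ROLE_VERSION=${ROLE_VERSION:-REPLACE_WITH_REVIEWED_TAG_OR_COMMIT}
OUT_DIR=${OUT_DIR:-$(mktemp -d)}

cat > "$OUT_DIR/requirements.yml" <<EOF
---
roles:
  - name: UBUNTU22-CIS
    src: https://github.com/ansible-lockdown/UBUNTU22-CIS
    scm: git
    version: "$ROLE_VERSION"
EOF

echo "wrote $OUT_DIR/requirements.yml"
# Then, on the control node (not run by this sketch):
#   ansible-galaxy role install -r "$OUT_DIR/requirements.yml" -p roles/
```

Committing this file next to the playbook means a role upgrade becomes a reviewable one-line diff instead of whatever main happened to contain that day.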
For certification, this capstone sits at the intersection of two tracks. The RHCE EX294 is a 4-hour performance exam: you write and debug real playbooks, use system roles, roles from Galaxy, Vault and Jinja2 — exactly the muscles you used to apply a role with toggles and templated config. The CIS Benchmark side gives you the security vocabulary employers want: Level 1 vs Level 2, scored controls, audit vs remediate, exceptions and evidence. Put together — 'I can take a 244-control benchmark and apply, exclude and prove it across a mixed fleet with Ansible' — that is a job-ready sentence on an Indian infra/security résumé.
Cold, in 30 seconds: name the six rungs of the ladder (audit → review → exclude → enforce → prove → schedule); say how you'd run only Level 1 and only SSH (--tags level1-server / --tags ssh); say what makes a control an exception (toggle false + dated ticket); and say what proves compliance (a re-run with changed=0 plus a clean goss report). If you can do that without notes, you've finished the Ansible series ready for the job and the exam.
An interviewer asks Meera: "What single piece of evidence would you show me to prove a server is CIS-compliant right now?" Best answer?
🧠 In your own words
In one line: what single piece of output proves a server is CIS-compliant right now, and why is it stronger than 'I ran the playbook'?
📖 Glossary
- CIS Benchmark
- A consensus hardening standard for an OS, split into Level 1 (safe baseline) and Level 2 (stricter, can break apps). Hundreds of controls.
- Compliance-as-code
- Expressing the hardening baseline as Ansible code so it's version-controlled, reviewable, re-appliable and auditable instead of done by hand.
- ansible-lockdown
- Community Ansible roles (UBUNTU22-CIS, RHEL9-CIS, etc.) that remediate and audit a host against the CIS Benchmark.
- Idempotence
- Running the role twice gives the same end state; a second run with changed=0 is your proof the host already matches the baseline.
- goss
- A small (~12 MB) Go binary that runs YAML checks to verify each control is actually in effect; ansible-lockdown uses it for the audit step.
- audit_only / run_audit
- Role switches: run_audit produces the compliance report; audit_only (or system_is_audit) makes the run check-only, changing nothing.
- Rule toggle
- A per-control boolean in defaults/main.yml (e.g. ubtu22cis_rule_5_2_5). Set false to skip one control — your way to make a documented exception.
- Tags (level1-server / ssh)
- Labels on tasks so --tags / --skip-tags run a subset: only Level 1, only SSH, or a single rule, instead of the whole role.
- --check / --diff
- Ansible dry-run flags: --check reports what would change without changing it; --diff shows the exact lines. Your pre-enforce review.
- auditd
- The Linux audit daemon — records logins, file changes and privileged commands for forensics and CIS logging controls (section 4).
- Drift
- When a host silently falls out of the baseline after a manual edit, update or new service; caught by a scheduled re-run.
- AWX / Automation Controller
- Web UI + scheduler for Ansible — runs the role on a cadence with logging and RBAC, so the baseline self-heals against drift.
📚 Sources
- ansible-lockdown UBUNTU22-CIS role — README + defaults/main.yml (per-rule toggles like ubtu22cis_rule_X_X_X; tags level1-server/level2-server/ssh/patch/audit/rule_X_X_X; run --tags level1-server / --skip-tags level2-server). github.com/ansible-lockdown/UBUNTU22-CIS
- ansible-lockdown RHEL9-CIS role — control variable naming rhel9cis_rule_5_2_1, section layout (1 Initial Setup, 3 Network, 4 auditd, 5 Access/SSH 5.2.x), excluding app-breaking controls by setting the rule var to false. github.com/ansible-lockdown/RHEL9-CIS
- Ansible Lockdown docs — Audit (getting started): goss binary at /usr/local/bin/goss, audit content in /opt (AUDIT_CONTENT_LOCATION), report file audit_{hostname}-{BENCHMARK}-{OS}_{epoch}.{format}, audit_only / pre_audit_outfile / post_audit_outfile, and Known Issues (GRUB-password lockout). ansible-lockdown.readthedocs.io/en/latest/audit/getting-started-audit.html
- dev-sec ansible-ssh-hardening role — community/forum lockout gotchas: user-account lockout on the ec2 ubuntu account, SFTP deactivated by default (set scp_if_ssh = True in ansible.cfg), crypto cipher/MAC incompatibilities breaking older clients. github.com/dev-sec/ansible-ssh-hardening · danivovich.com/blog/2017/08/31/ansible-ssh-hardening-lockout/
- Canonical / Ubuntu Security — CIS Benchmark versions and counts (Ubuntu 22.04 v2.0.0 = 244 controls; Ubuntu 24.04 v1.0.0 = 232; Level 2 extends Level 1), USG hardening tooling. ubuntu.com/security/certifications/docs/usg/cis · cisecurity.org/benchmark/ubuntu_linux
- Red Hat — 'High automation coverage for CIS in RHEL 9' (RHEL 9 v2.0.0 profile ~99% automatable via SCAP Security Guide; covers access control, logging, network, system hardening). redhat.com/en/blog/high-automation-coverage-cis-rhel-9 · access.redhat.com/compliance/cis-benchmarks
- Red Hat RHCE EX294 exam objectives — 4-hour performance exam: roles, roles from Galaxy, system roles, Ansible Vault, Jinja2 templates. redhat.com/en/services/training/ex294-red-hat-certified-engineer-rhce-exam-red-hat-enterprise-linux
What's next?
That's the full Ansible automation track — from your first ad-hoc command to applying, excepting and proving an entire CIS Benchmark across a fleet. Step back up a level now: how all of this network and host security stitches into one cloud-delivered edge with SASE.