
Ansible for CIS Hardening: Securing 100 Servers to Benchmark in Minutes

Hardening one Linux box to the CIS Benchmark by hand is a long afternoon of editing sshd_config, password policy, file permissions, kernel params and auditd. Doing it on 100 boxes by hand is how drift, typos and audit failures are born. Ansible applies the whole baseline once, re-applies it idempotently to prove nothing changed, and hands you a compliance report. This capstone ties the whole Ansible series into one job: audit, then enforce, then prove.

📅 2026-06-11 · ⏱ 13 min · 3 live demos · 4 infographics · 🏷 10-Q assessment + AI Tutor inline

⚡ Quick Answer

Ansible CIS hardening for L1/L2 engineers and the RHCE EX294 / CIS Benchmark angle: use the ansible-lockdown role with tags, audit-only first with goss, exclude app-breaking controls, then enforce safely.

Pick where you want to start

1. Why automate: Hand-hardening 100 boxes drifts; Ansible re-applies + proves.

2. Building blocks: A maintained role, toggles per control, tags by level.

3. Run it safely: Audit first, exclude breakers, enforce in a window.

4. A real pass: SSH, auditd, firewall hardened, with a report.

🧠 Warm-up — 3 questions, no score

Just notice which ones make you pause. We answer all three inside the lesson.

1. You apply a CIS SSH-hardening control to 100 servers by hand, one at a time. What is the most likely problem six weeks later?

Answered in Why automate.

2. Before you ENFORCE a CIS role on a production fleet, what should you run first?

Answered in Run it safely.

3. A CIS control disables SFTP, but Ansible itself copies files over SFTP. What happens if you enforce it carelessly?

Answered in Run it safely.

Most engineers think…

Most engineers think CIS hardening with Ansible is a one-time, one-button job: "point the role at the servers, hit enforce, done — they're compliant forever."

Wrong — and that mindset is how teams lock themselves out of SSH and break production apps on a Friday evening. Real CIS automation is a loop, not a button: audit-only first to score where you stand, review which of the hundreds of controls will change, exclude the few that break your app (with a written exception), enforce in a maintenance window, then re-run to prove idempotence — a clean second run with zero changes is your evidence of compliance. And because servers drift, you schedule the re-run (e.g. in AWX) so the baseline self-heals.

① Why automate hardening — the pain of doing CIS by hand

Meet Sneha, an L2 engineer at Infosys. Her audit team hands her a line that sounds simple: "bring all 100 Linux servers up to the CIS Benchmark." Then she opens the benchmark PDF. The CIS Ubuntu 22.04 Benchmark v2.0.0 has 244 individual controls; RHEL 9 v2.0.0 and Ubuntu 24.04 have their own hundreds. Each control is a small edit — a line in sshd_config, a password-policy value, a file permission, a kernel parameter, an auditd rule, a service to disable.

Doing that by hand on 100 boxes is four problems stacked on top of each other. Slow: at even five minutes per control per server, a 244-control benchmark across 100 servers is roughly 2,000 hours of clicking and SSHing. Inconsistent: Sneha sets MaxAuthTries 4 on 97 servers and fat-fingers 14 on three. Unprovable: when the auditor asks "prove server-42 is compliant," she has no evidence except "I think I did it." And the quiet killer, drift: a teammate edits sshd_config by hand next week to debug something and forgets to revert, and now the server silently falls out of compliance with nobody watching.

👉 So far: a CIS Benchmark is hundreds of small controls, and hand-applying them across a fleet is slow, inconsistent, unprovable and drifts. Next: what Ansible changes about all four.

Here is the shift. Compliance-as-code with Ansible expresses the entire CIS baseline as code once. You run it against all 100 servers in parallel, so 'slow' becomes minutes. Because it's the same code everywhere, 'inconsistent' disappears — every box gets MaxAuthTries 4, full stop. And the move that makes Sneha's auditor happy is idempotence: a second run that reports zero changes is proof the fleet already matches the baseline. Drift is caught and corrected on the next scheduled run.
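To see the 'inconsistent' and 'drift' pains in miniature, here is a small Python sketch (not part of Ansible or any CIS role) that reads each host's sshd_config text and flags every host whose MaxAuthTries differs from the baseline value of 4. The host names, the sample config snippets and the helper names max_auth_tries and find_drift are illustrative assumptions; on a real fleet this evidence would come from the role's audit, not from this script.

```python
import re

BASELINE_MAX_AUTH_TRIES = 4  # the value this lesson's baseline expects


def max_auth_tries(sshd_config_text):
    """Return the effective MaxAuthTries value, or None if it is not set."""
    for line in sshd_config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments do not set anything
        match = re.match(r"(?i)maxauthtries\s+(\d+)\s*$", stripped)
        if match:
            return int(match.group(1))  # sshd uses the first value it finds
    return None


def find_drift(configs, baseline=BASELINE_MAX_AUTH_TRIES):
    """Map each non-matching host to the value found (None means missing)."""
    drift = {}
    for host, text in configs.items():
        value = max_auth_tries(text)
        if value != baseline:
            drift[host] = value
    return drift


# Made-up sample data: three "compliant" servers in three different states.
sample = {
    "srv-01": "PermitRootLogin no\nMaxAuthTries 4\n",
    "srv-02": "MaxAuthTries 14\n",
    "srv-03": "# MaxAuthTries 4\nPermitRootLogin no\n",
}

print(find_drift(sample))  # {'srv-02': 14, 'srv-03': None}
```

The exact-match comparison mirrors 'every box gets MaxAuthTries 4'; a policy that accepts '4 or less' would compare with <= instead.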

Figure 1 — By hand vs Ansible — the same CIS baseline across a fleet
Hardening a fleet by hand drifts and can't be proven; one Ansible role makes every server identical and provable. A two-column comparison for the same set of servers. Left, by hand: an engineer SSHes into each server one at a time and edits sshd_config, password policy and auditd, so values drift between hosts (one says MaxAuthTries 4, another 14, another is missing), there is no audit report, and a manual edit later causes silent drift. Right, with Ansible: one CIS role is applied to all servers in parallel, every host gets identical values, a goss audit report is produced, and an idempotent re-run reports zero changes as proof of compliance. Red marks the inconsistent, unprovable hand path; green marks the consistent, provable Ansible path.
Left (red) = each server hardened by hand, so values drift and there's no proof. Right (green) = one role applied to all, identical values, and an idempotent re-run that proves compliance.

The four pains of hand-hardening, one tap each

These are exactly the problems an auditor (and a CIS interview) starts from.

🐌 Slow: Hundreds of controls × dozens of servers, done by hand. So: it's never finished, and re-doing it after a rebuild hurts.

🎲 Inconsistent: One typo and 'MaxAuthTries 4' becomes '14' on three boxes. So: 'compliant' servers quietly aren't.

🕵️ Unprovable: No report, just 'trust me'. So: when the auditor asks for evidence, you have none.

💨 Drift: A later manual edit slips the box out of baseline. So: compliance rots silently until the next audit.

Daily-life analogy — the society gate-pass register vs a printed master list

Hardening by hand is like every flat's guard writing the visitor rules from memory in their own notebook — slightly different in each tower, and impossible to audit. Ansible is the printed master gate-pass policy the society office prints once and posts at every gate: identical rules everywhere, and you can walk to any gate and check the printout matches. Re-printing it next month (the re-run) instantly catches any guard who quietly changed a rule.

Quick check · Q1 of 10

Rahul at TCS hardened 80 servers by hand last quarter. The auditor now asks: "prove server-57 still matches the CIS SSH baseline today." Why is an idempotent Ansible re-run the strongest answer?

Correct: a. Idempotence is the whole point: re-running the role and getting 'changed=0' is machine-checkable evidence the host matches the baseline, and any drift would show as 'changed' and be corrected. Ansible doesn't rebuild the host; hand edits are exactly what cause drift; and the report is meant to be read by the auditor.

Pause & Predict

Predict: you harden 100 servers with a role today and they all pass. Six weeks later, with nobody touching Ansible, name ONE reason a few servers could fall out of compliance, and the one habit that catches it. Type your guess.

Answer: Drift. Someone SSHes in to debug an app and edits sshd_config, a package update resets a default, or a new service writes a world-readable file. Nothing in Ansible changed, but the box no longer matches. The habit that catches it: schedule the role to re-run on a cadence (e.g. nightly/weekly in AWX). The next run reports the drifted controls as 'changed' and corrects them — so the baseline self-heals instead of rotting until the next audit.

② The building blocks — a maintained CIS role, toggles & tags

You could hand-write 244 tasks yourself — and you'll learn a lot doing it once — but for a real fleet most teams start from a maintained role. The best-known is the open-source ansible-lockdown project — roles like UBUNTU22-CIS, UBUNTU24-CIS, RHEL8-CIS and RHEL9-CIS. Each role already maps every CIS control to a task, so you spend your judgement on which controls to apply and which to skip, not on re-writing the benchmark.

The first building block is a toggle per control. Every CIS rule has its own boolean in defaults/main.yml, named after the rule number: on the Ubuntu 22 role the pattern is ubtu22cis_rule_X_X_X, and on RHEL 9 it is rhel9cis_rule_X_X_X (for example, rhel9cis_rule_5_2_1). Set a toggle to false and that single control is skipped; this is exactly how you carve out a control that would break your app, with a comment recording why. There are also master switches: run_audit turns on the built-in compliance check, and system_is_audit / audit_only make the run check-only instead of changing anything.

👉 So far: a maintained role gives you one toggle per control plus master audit switches. Next: how tags let you run only Level 1, only Level 2, or just the SSH section.

The second building block is tags, organised by CIS structure. Every task in the role carries several: the level (level1-server, level2-server, level1-workstation…), the section/component (ssh, services, firewall, auditd), whether it's a change or a check (patch vs audit), and the rule number (rule_5_2_1). That lets you scope a run precisely. Level 1 is the safe baseline; Level 2 is stricter and more likely to break something — so you usually roll out Level 1 fleet-wide first and Level 2 only where you've tested it.

ansible-playbook — scope a CIS run with tags (Level 1 only, then just the SSH section)
# Level 1 server controls only (the safe baseline, fleet-wide)
ansible-playbook site.yml -i prod_inventory --tags level1-server

# Just the SSH section, for a careful first test
ansible-playbook site.yml -i prod_inventory --tags ssh

# Narrower still: a single rule
ansible-playbook site.yml -i prod_inventory --tags rule_5_2_1

# Apply everything EXCEPT the stricter Level 2 controls
ansible-playbook site.yml -i prod_inventory --skip-tags level2-server
Expected output
PLAY [Apply CIS Benchmark - Ubuntu 22.04] **************************************
TASK [UBUNTU22-CIS : 5.2.1 | Ensure permissions on /etc/ssh/sshd_config] *******
ok: [web-mum-01]
TASK [UBUNTU22-CIS : 5.2.5 | Ensure SSH MaxAuthTries is set to 4 or less] *******
changed: [web-mum-01]
PLAY RECAP *********************************************************************
web-mum-01  : ok=63  changed=9  unreachable=0  failed=0  skipped=171
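
The scoping rules above can be modelled in a few lines. This is a simplified Python sketch of how --tags and --skip-tags pick tasks, assuming the three sample tasks and tag lists below (built from the kinds of tags this section describes, not copied from the role) and ignoring Ansible's special tags such as always and never. The helper names selected and plan are hypothetical.

```python
def selected(task_tags, only=None, skip=None):
    """Return True if a task with these tags would run.

    only = the values passed to --tags, skip = the values passed to --skip-tags.
    A skip match wins over an only match.
    """
    tags = set(task_tags)
    if tags & set(skip or []):
        return False
    only = set(only or [])
    return not only or bool(tags & only)


def plan(tasks, only=None, skip=None):
    """List the task names that survive the tag filters, in playbook order."""
    return [name for name, tags in tasks.items() if selected(tags, only, skip)]


# Illustrative tasks; the tag lists are assumptions for this sketch.
tasks = {
    "5.2.5 SSH MaxAuthTries": ["level1-server", "ssh", "patch", "rule_5_2_5"],
    "4.1.1 auditd": ["level2-server", "auditd", "patch", "rule_4_1_1"],
    "firewall default-deny": ["level1-server", "firewall", "patch"],
}

print(plan(tasks, only=["ssh"]))
# ['5.2.5 SSH MaxAuthTries']
print(plan(tasks, skip=["level2-server"]))
# ['5.2.5 SSH MaxAuthTries', 'firewall default-deny']
print(plan(tasks, only=["level1-server"], skip=["ssh"]))
# ['firewall default-deny']
```

Modelling it this way makes the difference between the two flags easy to check before a real run: --tags is an allow-list, --skip-tags is a deny-list, and the deny-list wins when both match.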
Figure 2 — Inside a CIS role — toggles + tags carve the run
A maintained CIS role is a set of tagged, individually-toggleable tasks you slice with tags and exceptions. A diagram of the internals of an ansible-lockdown CIS role. At the top, defaults/main.yml holds one boolean toggle per rule, named like ubtu22cis_rule_5_2_5, plus master switches run_audit, system_is_audit and audit_only. Below, the tasks are grouped into CIS sections: section 1 initial setup, section 3 network, section 4 auditd, section 5 access and SSH. Each task carries tags for its level (level1-server or level2-server), its section component (ssh, firewall, auditd), whether it patches or only audits, and its rule number. On the right, two slicing controls are shown: tags select only some tasks, skip-tags exclude others such as level2-server, and a toggle set to false is a documented exception that skips a single app-breaking control. Amber marks the decision and exception points.
Read it top-down: the benchmark splits into sections; each task has a toggle (on/off) AND tags (level, section, patch/audit, rule). --tags and --skip-tags slice the run; a false toggle is your documented exception.
Common mistake — "I ran the whole role on prod and it changed 200 things at once"

Symptom: you run ansible-playbook site.yml with no tags on a production fleet and it remediates every Level 1 AND Level 2 control in one go — including ones that break your app — and now you're firefighting. Cause: no scoping. Fix: always scope your first runs. Start with --check --tags level1-server (or audit_only: true) to see what would change, roll out Level 1 before Level 2, and use --skip-tags level2-server until you've tested the stricter controls in staging.

Quick check · Q2 of 10

Priya at Wipro wants to harden only the SSH controls on a single test box first, without touching password policy, auditd or firewall. Which flag does that cleanly?

Correct: b. --tags ssh runs ONLY the tasks tagged ssh, which is exactly a scoped SSH-only run. --skip-tags ssh does the opposite (everything except SSH). --check is a dry run that never changes anything — useful, but it wouldn't apply SSH either. Re-writing by hand throws away the maintained role's value.

Pause & Predict

Predict: an app on one server genuinely needs a setting that a CIS Level 2 control forbids. You don't want to disable that control on the whole fleet. What's the clean way to make ONE host an exception — and what must you not forget? Type your guess.

Answer: Set that rule's toggle to false for just that host (e.g. a host_vars/ entry or a group for that app: ubtu22cis_rule_4_1_1: false). The rest of the fleet still enforces it. What you must not forget: a written, dated exception — a comment in the vars file and a ticket/record saying which control, which host, why it's excepted, and who approved it. An undocumented 'false' is indistinguishable from a mistake at the next audit.
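
One way to keep exceptions honest is to cross-check every disabled toggle against a written record. The Python sketch below assumes you can list (host, toggle) pairs that are set to false and keep exception records with the fields named above; the CisException class, the undocumented helper, the second host web-mum-02, the approver and the date are all hypothetical illustrations, not part of the role.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CisException:
    """One written, dated exception: which control, which host, why, who approved."""
    rule_toggle: str
    host: str
    reason: str
    approved_by: str
    ticket: str
    recorded_on: date


def undocumented(disabled_toggles, exceptions):
    """Return the (host, toggle) pairs that are false but have no exception record."""
    documented = {(e.host, e.rule_toggle) for e in exceptions}
    return sorted(pair for pair in disabled_toggles if pair not in documented)


# Hypothetical inventory of toggles currently set to false.
disabled = {
    ("web-mum-01", "ubtu22cis_rule_4_1_1"),
    ("web-mum-02", "ubtu22cis_rule_5_2_5"),
}

records = [
    CisException(
        rule_toggle="ubtu22cis_rule_4_1_1",
        host="web-mum-01",
        reason="app needs the current setting",
        approved_by="security-lead",      # hypothetical approver
        ticket="OPS-4821",
        recorded_on=date(2026, 6, 11),    # hypothetical date
    ),
]

print(undocumented(disabled, records))
# [('web-mum-02', 'ubtu22cis_rule_5_2_5')]
```

Anything this check prints is a 'false' with no paper trail, which is exactly the case an auditor cannot tell apart from a mistake.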

③ Running it safely — audit, review, exclude, enforce, prove

Hardening is the one job where 'move fast' gets you locked out of production. The safe sequence is a ladder, and you climb it in order: audit → review → exclude → enforce → prove → schedule. Skipping a rung is how the Friday-evening outage happens.

Step 1: audit-only, to score where you stand. Run the role in a mode that only checks and changes nothing. With the ansible-lockdown role you set audit_only: true (or system_is_audit: true) with run_audit: true; under the hood it runs goss against the same controls and writes a report. Ansible's own --check mode is the lighter built-in version of the same idea: ansible-playbook site.yml --check --diff shows every line that would change, host by host.

👉 So far: step 1 is audit-only / --check to score without changing anything. Next: review the diff, exclude the breakers, then enforce in a window.

Step 2: review what would change. Read the audit report or the --diff output and ask of each big change: will this break a running app? The usual suspects are SSH controls (could lock you out), cipher/MAC restrictions (could break old clients or SFTP), firewall defaults (could drop a port your app uses) and disabling an 'unused' service that isn't actually unused.

Step 3: exclude the breakers with documented exceptions. For each control that would break something, set its toggle to false with a dated comment and a ticket; that's your audit trail.

Step 4: enforce in a maintenance window, ideally Level 1 first, on a canary host, then the fleet.

Step 5: re-run to prove idempotence. A clean second run with changed=0 is your compliance evidence.

Step 6: schedule it in AWX / Automation Controller so drift gets corrected automatically.

Figure 3 — The audit-then-enforce flow — one control's journey
Each CIS control flows from audit through a break-the-app decision to either a documented exception or enforce, then proof, then schedule. A left-to-right process flow for hardening. Start: audit-only run scores the host with goss and writes a report, changing nothing. Each failing control reaches a decision diamond: will enforcing it break a running app? If yes, the control branches to a documented exception, where its toggle is set to false scoped to the affected host group with a dated ticket. If no, the control is enforced in a maintenance window, Level 1 first on a canary. After enforcement a re-run proves idempotence with changed equals zero plus a clean goss report. Finally the whole play is scheduled in AWX so that any future drift is caught on the next run and flows back to audit. Amber marks the decision and exception path; green marks enforce and proof; red marks the lockout risk of skipping the decision.
Follow the arrows: every control flows audit → does it break the app? If yes, branch to a documented exception (amber); if no, enforce, then the re-run proves it. The whole loop ends in 'scheduled' so drift comes back to audit.

▶ Walk the audit-then-enforce ladder on one host

Follow a single Ubuntu host, web-mum-01, through the safe sequence. Watch the run change from 'check only' to 'enforce' to 'prove'.

① Audit: audit_only: true → goss checks 244 controls; report says 62 fail, 182 pass
② Review + exclude: read --diff: rule_4_1_1 (auditd) would break the app → set toggle false + ticket
③ Enforce: audit_only: false, --tags level1-server in the window → changed=58
④ Prove: re-run the same play → changed=0; idempotent, compliant; schedule it in AWX
ansible-playbook — the safe ladder (dry-run audit, then a scoped enforce, then prove)
# 1) Audit / dry-run: score the host, change NOTHING
ansible-playbook site.yml -i prod_inventory --check --diff --tags level1-server

# 2) Enforce Level 1 in the window, on the canary first
ansible-playbook site.yml -i prod_inventory --limit web-mum-01 --tags level1-server

# 3) Prove idempotence — a clean re-run = compliance evidence
ansible-playbook site.yml -i prod_inventory --limit web-mum-01 --tags level1-server
Expected output
# run 2 (enforce):
PLAY RECAP ********************************************************************
web-mum-01  : ok=66  changed=58  unreachable=0  failed=0  skipped=120

# run 3 (prove): zero changes = idempotent + compliant
PLAY RECAP ********************************************************************
web-mum-01  : ok=66  changed=0   unreachable=0  failed=0  skipped=120
🖥️ This is the audit report you read in step 2 — the goss results the role writes to /opt on the host: audit_<hostname>-CIS-UBUNTU22_<epoch>.json. (Recreated for clarity — your output matches this shape.)
sneha@control:~/UBUNTU22-CIS (goss audit summary)
Audit content dir: /opt (AUDIT_CONTENT_LOCATION)
Report file: audit_web-mum-01-CIS-UBUNTU22_1749600000.json
rule_5_2_5 SSH MaxAuthTries: FAIL → expected 4, found 14
rule_5_2_22 SSH PermitRootLogin: FAIL → expected no, found yes
rule_4_1_1 auditd enabled: FAIL (excepted, ticket OPS-4821)
Summary: Count: 244 Failed: 62 Passed: 182
Prove it's really compliant, not just 'it ran'

A green play recap only means the tasks executed. To prove compliance, do two things: (1) re-run and confirm changed=0 — idempotence is your evidence the state matches the baseline; and (2) read the post-enforce goss audit report (the role writes pre_audit_outfile and post_audit_outfile) and confirm the previously-failing rules now pass, except the ones you deliberately excepted. 'The playbook finished' is not the same as 'the host is compliant.'
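
Both checks can be scripted. The sketch below is a minimal Python example, not a feature of the role: it parses a PLAY RECAP line and a goss JSON report and returns True only when the re-run changed nothing and the audit shows no failures beyond the ones you excepted. It assumes the report carries a summary object with a failed-count field; confirm that against your own report before relying on the key. The helper names and the sample report contents are hypothetical.

```python
import json
import re


def parse_recap_line(line):
    """Turn 'host : ok=66 changed=0 ...' into (host, {'ok': 66, 'changed': 0, ...})."""
    host, sep, rest = line.partition(":")
    if not sep:
        raise ValueError(f"not a recap line: {line!r}")
    counters = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", rest)}
    return host.strip(), counters


def is_compliant(recap_line, goss_report_json, excepted_failures=0):
    """True only if the re-run changed nothing AND the audit has no unexplained failures."""
    _, counters = parse_recap_line(recap_line)
    clean_rerun = (
        counters.get("changed") == 0
        and counters.get("failed") == 0
        and counters.get("unreachable") == 0
    )
    # ASSUMPTION: the report has a "summary" object with a "failed-count" field.
    failed = json.loads(goss_report_json)["summary"]["failed-count"]
    return clean_rerun and failed <= excepted_failures


enforce_line = "web-mum-01  : ok=66  changed=58  unreachable=0  failed=0  skipped=120"
rerun_line = "web-mum-01  : ok=66  changed=0   unreachable=0  failed=0  skipped=120"
# Hypothetical post-enforce report: only the one excepted control still fails.
report = json.dumps({"summary": {"test-count": 244, "failed-count": 1}})

print(is_compliant(enforce_line, report, excepted_failures=1))  # False: 58 changes
print(is_compliant(rerun_line, report, excepted_failures=1))    # True
print(is_compliant(rerun_line, report))                         # False: 1 unexplained failure
```

Comparing counts is the crudest possible check; a stricter version would match the failing rule IDs against the documented exceptions one by one.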

Karthik at HCL faces this

Karthik enforces the full CIS role on a staging box over SSH. Halfway through, his SSH session freezes and he can't reconnect — and the next Ansible run to that host fails to copy files with an SFTP error.

Likely cause

Two SSH-hardening controls bit him at once. A cipher/MAC restriction dropped the algorithm his live session was using, and another control disabled the SFTP subsystem — but Ansible copies files to the host over SFTP by default, so subsequent runs can't transfer files.

Diagnosis

He recognises a classic CIS SSH gotcha: control-plane (his login) and Ansible's own transport both ride sshd, so SSH controls can cut the very connection doing the hardening. He checks which sshd controls changed in the --diff he should have read first.

Console/iLO out-of-band login → /etc/ssh/sshd_config + journalctl -u ssh; on the control node set scp_if_ssh in ansible.cfg
Fix

Get back in via the out-of-band console (iLO/iDRAC/cloud serial), revert or loosen the cipher control to keep a working algorithm, and either keep the SFTP subsystem or set scp_if_ssh = True in ansible.cfg so Ansible uses SCP (the ssh connection plugin's ssh_transfer_method option expresses the same choice). Record both as documented exceptions if the app needs them.

Verify

Re-run with --check --diff first (no freeze), confirm the SSH session stays up, and confirm a normal ansible-playbook run can copy files to the host again; then enforce for real in a window.

Quick check · Q3 of 10

Aditya is about to enforce a CIS role on 50 production servers he reaches over SSH. Which single habit most reduces the chance of locking himself (or Ansible) out?

Correct: c. Dry-running with --check --diff lets you see the sshd/cipher/SFTP changes before they hit, and an out-of-band console is your way back in if SSH does break. Enforcing everything blind is exactly the cause of the outage; disabling SSH locks you out immediately; and skipping --check removes your safety net, it doesn't help.

Pause & Predict

Predict: you enforce a CIS role and the FIRST run reports changed=40. You re-run the identical play and it reports changed=12, not 0. What does a non-zero second run most likely mean — and is the role 'broken'? Type your guess.

Answer: It usually means a control isn't truly idempotent yet, or something on the host is fighting it: a non-idempotent task that 'changes' every run (e.g. re-templating a file that has a timestamp), or a service/cron/another tool that reverts a setting between runs so the role keeps re-applying it. The role isn't necessarily broken — but changed=12 on the second run is a red flag to investigate: find the specific tasks reported as 'changed', and fix the task or remove the thing reverting the state. True compliance evidence is changed=0.
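
Finding the specific tasks is easy to script from the saved output of the second run. This is a minimal Python sketch, assuming the default console format shown earlier in this lesson (a TASK [...] header followed by per-host result lines); changed_tasks is a hypothetical helper and the sample output is reused from the tags example.

```python
import re

TASK_HEADER = re.compile(r"^TASK \[(.+)\]")


def changed_tasks(playbook_output):
    """Return, in order, the names of tasks that reported 'changed' on any host."""
    current_task = None
    changed = []
    for line in playbook_output.splitlines():
        header = TASK_HEADER.match(line)
        if header:
            current_task = header.group(1)  # remember which task we are inside
        elif line.startswith("changed:") and current_task and current_task not in changed:
            changed.append(current_task)
    return changed


second_run = """\
TASK [UBUNTU22-CIS : 5.2.1 | Ensure permissions on /etc/ssh/sshd_config] *******
ok: [web-mum-01]
TASK [UBUNTU22-CIS : 5.2.5 | Ensure SSH MaxAuthTries is set to 4 or less] *******
changed: [web-mum-01]
"""

print(changed_tasks(second_run))
# ['UBUNTU22-CIS : 5.2.5 | Ensure SSH MaxAuthTries is set to 4 or less']
```

A task that shows up in this list on every run is the one to investigate: either the task itself is not idempotent, or something on the host keeps reverting it.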

④ A real pass — harden a fleet, get a report, handle a conflict

Time to put it together on a real group. Sneha's target is app_servers — a mix of Ubuntu 22.04 and RHEL 9 hosts at Flipkart. Her plan covers the controls that matter most on day one: SSH hardening (disable root login, MaxAuthTries 4, strong ciphers), password + sudo policy (faillock lockout, password quality, NOPASSWD audit), auditd (capture logins and privileged commands), and the host firewall (ufw on Ubuntu, firewalld/nftables on RHEL, default-deny inbound).

She points the right role at the right hosts — UBUNTU22-CIS for the Ubuntu group, RHEL9-CIS for the RHEL group — using the rule toggles and tags from section 2. The SSH controls live in section 5.2 (e.g. rule_5_2_5 MaxAuthTries, root-login disable); auditd is section 4; the firewall and network controls in their own sections. She runs audit-only first, reads the goss report, and only then enforces.

site.yml — assign the maintained CIS role per OS group, audit_only first
- name: Harden Ubuntu app servers to CIS
  hosts: app_servers_ubuntu
  become: true
  vars:
    run_audit: true          # produce the goss compliance report
    audit_only: true         # STEP 1: check only, change nothing
    ubtu22cis_rule_5_2_5: true     # SSH MaxAuthTries <= 4
    ubtu22cis_rule_4_1_1: false    # auditd start — excepted (ticket OPS-4821)
  roles:
    - UBUNTU22-CIS

- name: Harden RHEL app servers to CIS
  hosts: app_servers_rhel
  become: true
  vars:
    run_audit: true
    audit_only: true
    rhel9cis_rule_5_2_1: true      # sshd_config permissions
  roles:
    - RHEL9-CIS
Expected output
PLAY RECAP ********************************************************************
web-mum-01 (ubuntu)   : ok=182 changed=0 failed=0   # audit_only: nothing changed
db-blr-04  (rhel9)    : ok=171 changed=0 failed=0
# goss report written to /opt/audit_web-mum-01-CIS-UBUNTU22_1749600000.json

The audit report flags 62 failing controls on web-mum-01. Reviewing the diff, one control would break a legacy monitoring agent that needs an older SSH cipher. This is the conflict: enforce the strict cipher list and the agent loses its connection. Sneha doesn't disable the control fleet-wide — she scopes the exception to just the hosts running that agent, with a dated comment and a ticket, then plans to fix the agent so she can remove the exception later. Then she flips audit_only to false and enforces Level 1 in the window.

🖥️ The enforce run she watches in the window — a real ansible-playbook site.yml --tags level1-server terminal. Note PermitRootLogin and MaxAuthTries flip to 'changed', and the excepted control is skipped. (Recreated for clarity.)
sneha@control:~/cis $ ansible-playbook site.yml -i prod --tags level1-server
TASK 5.2.22 PermitRootLogin no
changed: [web-mum-01]
TASK 5.2.5 MaxAuthTries 4
changed: [web-mum-01]
TASK 4.1.1 auditd enabled
skipping: [web-mum-01] (toggle false)
TASK firewall default-deny (ufw)
changed: [web-mum-01]
PLAY RECAP web-mum-01
ok=66 changed=58 failed=0
Figure 4 — The audit-then-enforce ladder — cheat-sheet
Ansible CIS hardening on one card: the safe ladder, the sections, the role variables and the lockout traps. A nine-tile cheat sheet. Tiles cover the audit-then-enforce ladder (audit, review, exclude, enforce, prove, schedule), the CIS sections covered (SSH 5.2, password and sudo, auditd 4, firewall and network), the key role variables (run_audit, audit_only, the per-rule toggles), the tags to scope a run (level1-server, level2-server, ssh, patch, audit), the goss audit report path under /opt, the idempotence proof (changed equals zero), the lockout traps (SSH cipher, SFTP subsystem, GRUB password), the maintained roles (UBUNTU22-CIS, RHEL9-CIS), and the exam angle (RHCE EX294 plus CIS Benchmark). Each tile has a one-line takeaway.
Your one-card map of this lesson: the safe ladder, the CIS sections covered, the key role variables, the tags that scope a run, and the lockout traps. Keep it open during your first real hardening job.
Common mistake — over-enforcing a 'disable unused service' control

Symptom: after a clean CIS run, an app that worked yesterday can't resolve names or send mail, and the change log shows a service was stopped/masked. Cause: a CIS control disabled a 'recommended-off' service (e.g. rpcbind, an MTA, avahi) that your stack actually depends on — 'unused' is the benchmark's assumption, not your reality. Fix: in the audit/--check review, list every service the role would disable, check it against what your apps need, and except the ones in use (toggle false + ticket). The benchmark is a starting point you tailor, not gospel you apply blind.
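
The review step in that fix amounts to a cross-check between two lists. Here is a minimal Python sketch of it, assuming you have already collected the services the role would disable (from the audit or --check output) and a per-app list of dependencies; the app names, the dependency map and the helper review_service_changes are hypothetical sample data.

```python
def review_service_changes(would_disable, app_needs):
    """Map each service the role would disable to the apps that depend on it."""
    conflicts = {}
    for app, services in app_needs.items():
        for service in services:
            if service in would_disable:
                conflicts.setdefault(service, []).append(app)
    return conflicts


# Hypothetical inputs for one host group.
would_disable = {"rpcbind", "avahi-daemon", "postfix"}
app_needs = {
    "nfs-export": ["rpcbind"],
    "alert-mailer": ["postfix"],
    "web": ["nginx"],
}

print(review_service_changes(would_disable, app_needs))
# {'rpcbind': ['nfs-export'], 'postfix': ['alert-mailer']}
```

Every key in the result is a candidate exception (toggle false + ticket); every service that does not appear is safe to let the role disable.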

One sober, current note: hardening roles touch the most sensitive files on the box, so treat the role itself as production code. Pull a pinned, reviewed version of the maintained role (don't git clone main straight onto 100 prod servers), test every upgrade of the role in staging, and keep your ansible-vault secrets and SSH keys locked down — a compromised control node that can harden every server can also mis-configure every server. For RHEL teams, recent CIS coverage is strong: the RHEL 9 v2.0.0 profile is now ~99% automatable via the SCAP Security Guide, so the gap between 'audit says fail' and 'Ansible can fix it' is smaller than ever.

👉 So far: a real two-OS pass — SSH/auditd/firewall, a goss report, and a documented exception for the conflicting control. Next: the exam + career angle, then the final recap.

For certification, this capstone sits at the intersection of two tracks. The RHCE EX294 is a 4-hour performance exam: you write and debug real playbooks, use system roles, roles from Galaxy, Vault and Jinja2 — exactly the muscles you used to apply a role with toggles and templated config. The CIS Benchmark side gives you the security vocabulary employers want: Level 1 vs Level 2, scored controls, audit vs remediate, exceptions and evidence. Put together — 'I can take a 244-control benchmark and apply, exclude and prove it across a mixed fleet with Ansible' — that is a job-ready sentence on an Indian infra/security résumé.

Prove you own the capstone

Cold, in 30 seconds: name the six rungs of the ladder (audit → review → exclude → enforce → prove → schedule); say how you'd run only Level 1 and only SSH (--tags level1-server / --tags ssh); say what makes a control an exception (toggle false + dated ticket); and say what proves compliance (a re-run with changed=0 plus a clean goss report). If you can do that without notes, you've finished the Ansible series ready for the job and the exam.

Revisit: Jinja2 & Idempotency (the proof behind changed=0)
Quick check · Q4 of 10

An interviewer asks Meera: "What single piece of evidence would you show me to prove a server is CIS-compliant right now?" Best answer?

Correct: d. Compliance evidence is two things together: a re-run with changed=0 (the state matches the coded baseline) and a goss audit report showing each control passing, with any failures being documented exceptions. Uptime is irrelevant, an email is not evidence, and 'the playbook ran' only proves execution — not that the host's state actually matches the benchmark.

📝 Wrap-up assessment — six more

You've answered 4 inline. Six left. 70% (7 of 10) marks the lesson complete on your profile. Tap Submit all answers at the end.

Q5 · Remember

In an ansible-lockdown CIS role, what does setting run_audit: true with audit_only: true do?

Correct: a. run_audit produces the goss compliance report and audit_only makes the run check-only — it scores the host and writes a report without remediating. It doesn't rebuild hosts, doesn't enforce anything (audit_only is the opposite of enforce), and doesn't disable SSH.
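As a rough sketch, the two switches sit in the play's vars. The playbook name and the role install name are assumptions; check the role's README for any other audit settings your version needs.

```yaml
# audit.yml (hypothetical playbook name)
- name: Score hosts against the CIS baseline without remediating
  hosts: all
  become: true
  vars:
    run_audit: true    # produce the goss compliance report
    audit_only: true   # check-only run: score the host, change nothing
  roles:
    - role: UBUNTU22-CIS   # assumes the role is installed under this name
```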
Q6 · Apply

You want to harden only the SSH controls on a single canary host first, changing nothing else. Which command is correct?

Correct: b. --limit canary scopes to one host and --tags ssh runs only the SSH-tagged tasks. --skip-tags ssh does the opposite; --tags level2-server runs the stricter set (not just SSH); and running with no flags on the whole inventory is the blind, fleet-wide run that causes outages.
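For illustration, --limit canary only works if the inventory has a host or group with that name. A minimal inventory sketch, with made-up host names and a hypothetical playbook called harden.yml:

```yaml
# inventory.yml (hypothetical; host names are placeholders)
all:
  children:
    canary:
      hosts:
        web01.example.internal:
    fleet:
      hosts:
        web02.example.internal:
        web03.example.internal:

# Dry run first, then the real run, SSH controls on the canary only:
#   ansible-playbook -i inventory.yml harden.yml --limit canary --tags ssh --check --diff
#   ansible-playbook -i inventory.yml harden.yml --limit canary --tags ssh
```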
Q7 · Apply

A CIS control disables the SFTP subsystem in sshd. After enforcing it, your next ansible-playbook run can't copy files to the host. What's the right fix?

Correct: c. Ansible copies files over SFTP by default, so disabling the SFTP subsystem breaks file transfer. Setting scp_if_ssh = True in ansible.cfg makes Ansible use SCP instead; the alternative is to keep SFTP enabled and record that control as a documented exception. Abandoning Ansible or disabling SSH defeats the purpose, and rebooting doesn't change the sshd config.
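The same switch can also live in inventory variables instead of ansible.cfg. This is a sketch, assuming a group_vars file that applies to the hardened hosts; option names and defaults have shifted between Ansible releases, so confirm them against the ssh connection plugin documentation for your version.

```yaml
# group_vars/all.yml (hypothetical location)

# Inventory-variable form of the ansible.cfg setting:
#   [ssh_connection]
#   scp_if_ssh = True
ansible_scp_if_ssh: true

# Newer releases also document a transfer-method option; if yours does,
# this is the assumed equivalent:
# ansible_ssh_transfer_method: scp
```

Test this on the canary before the fleet: recent OpenSSH clients run scp over the SFTP protocol by default, so SCP may not be a working fallback everywhere once the subsystem is disabled.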
Q8 · Analyze

You enforce a CIS role; the first run reports changed=40. You re-run the identical play and it reports changed=11, not 0. What is the most likely explanation?

Correct: d. A non-zero second run means the end state isn't stable: either a task re-applies every run (non-idempotent) or another process/cron reverts a setting between runs. That's a flag to find the specific 'changed' tasks and fix them. It's not about the control count, the run isn't fully compliant yet, and Ansible's changed count is deterministic, not random.
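A typical culprit is a raw command task, which reports changed on every run. The sketch below is illustrative only (it is not taken from the CIS role) and contrasts it with a module that compares state before acting.

```yaml
# idempotence-demo.yml (hypothetical playbook name)
- name: Compare a non-idempotent task with an idempotent one
  hosts: all
  become: true
  tasks:
    # Reports changed every run: Ansible cannot tell that nothing changed.
    - name: Set sshd_config permissions with a raw command
      ansible.builtin.command: chmod 600 /etc/ssh/sshd_config

    # Reports changed only when the mode actually differs.
    - name: Set sshd_config permissions with the file module
      ansible.builtin.file:
        path: /etc/ssh/sshd_config
        mode: "0600"
```

If every task is already written like the second one and the count is still non-zero, look for a cron job, package update or other tool rewriting the same file between runs.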
Q9 · Analyze

Mid-enforce over SSH, your live session freezes and you can't reconnect. Control connections to the box are gone. Which CIS change is the most likely culprit, and what's the safe recovery path?

Correct: a. SSH cipher/MAC hardening can remove the very algorithm your live session is using, cutting you off. The safe recovery is an out-of-band console (cloud serial / iLO / iDRAC) — not SSH, which is down — then loosen or except the cipher control. The other options either can't cut SSH like this or assume an SSH path that no longer works.
Q10 · Evaluate

Two ways to roll out CIS to 100 prod servers: (A) run the full role with no tags, all Level 1 + Level 2 at once, to 'just get compliant fast'; (B) audit-only first, review the diff, except the app-breakers with tickets, enforce Level 1 on a canary then the fleet in a window, re-run to changed=0, and schedule it in AWX. Which is stronger and why?

Correct: b. B is the safe, provable, auditable path: audit-then-review catches breakers before they hit, exceptions are documented, a canary + window limits blast radius, changed=0 is real compliance evidence, and AWX scheduling stops drift. A is exactly how teams lock themselves out and break apps; Level 2 controls absolutely can break things; and the two paths do NOT end the same — A risks an outage and leaves no evidence trail.

🧠 In your own words

Type one line: what single piece of output proves a server is CIS-compliant right now, and why is it stronger than 'I ran the playbook'? Then compare it with the expert version.

Expert version: An idempotent re-run reporting changed=0 (plus a clean goss audit report) — because it shows the host's actual state already matches the coded baseline and would self-correct any drift, whereas 'the playbook ran' only proves the tasks executed, not that the end state is compliant.

📖 Glossary

CIS Benchmark
A consensus hardening standard for an OS, split into Level 1 (safe baseline) and Level 2 (stricter, can break apps). Hundreds of controls.
Compliance-as-code
Expressing the hardening baseline as Ansible code so it's version-controlled, reviewable, re-appliable and auditable instead of done by hand.
ansible-lockdown
Community Ansible roles (UBUNTU22-CIS, RHEL9-CIS, etc.) that remediate and audit a host against the CIS Benchmark.
Idempotence
Running the role twice gives the same end state; a second run with changed=0 is your proof the host already matches the baseline.
goss
A small (~12 MB) Go binary that runs YAML checks to verify each control is actually in effect; ansible-lockdown uses it for the audit step.
audit_only / run_audit
Role switches: run_audit produces the compliance report; audit_only (or system_is_audit) makes the run check-only, changing nothing.
Rule toggle
A per-control boolean in defaults/main.yml (e.g. ubtu22cis_rule_5_2_5). Set false to skip one control — your way to make a documented exception.
Tags (level1-server / ssh)
Labels on tasks so --tags / --skip-tags run a subset: only Level 1, only SSH, or a single rule, instead of the whole role.
--check / --diff
Ansible dry-run flags: --check reports what would change without changing it; --diff shows the exact lines. Your pre-enforce review.
auditd
The Linux audit daemon — records logins, file changes and privileged commands for forensics and CIS logging controls (section 4).
Drift
When a host silently falls out of the baseline after a manual edit, update or new service; caught by a scheduled re-run.
AWX / Automation Controller
Web UI + scheduler for Ansible — runs the role on a cadence with logging and RBAC, so the baseline self-heals against drift.

📚 Sources

  1. ansible-lockdown UBUNTU22-CIS role — README + defaults/main.yml (per-rule toggles like ubtu22cis_rule_X_X_X; tags level1-server/level2-server/ssh/patch/audit/rule_X_X_X; run --tags level1-server / --skip-tags level2-server). github.com/ansible-lockdown/UBUNTU22-CIS
  2. ansible-lockdown RHEL9-CIS role — control variable naming rhel9cis_rule_5_2_1, section layout (1 Initial Setup, 3 Network, 4 auditd, 5 Access/SSH 5.2.x), excluding app-breaking controls by setting the rule var to false. github.com/ansible-lockdown/RHEL9-CIS
  3. Ansible Lockdown docs — Audit (getting started): goss binary at /usr/local/bin/goss, audit content in /opt (AUDIT_CONTENT_LOCATION), report file audit_{hostname}-{BENCHMARK}-{OS}_{epoch}.{format}, audit_only / pre_audit_outfile / post_audit_outfile, and Known Issues (GRUB-password lockout). ansible-lockdown.readthedocs.io/en/latest/audit/getting-started-audit.html
  4. dev-sec ansible-ssh-hardening role — community/forum lockout gotchas: user-account lockout on the ec2 ubuntu account, SFTP deactivated by default (set scp_if_ssh = True in ansible.cfg), crypto cipher/MAC incompatibilities breaking older clients. github.com/dev-sec/ansible-ssh-hardening · danivovich.com/blog/2017/08/31/ansible-ssh-hardening-lockout/
  5. Canonical / Ubuntu Security — CIS Benchmark versions and counts (Ubuntu 22.04 v2.0.0 = 244 controls; Ubuntu 24.04 v1.0.0 = 232; Level 2 extends Level 1), USG hardening tooling. ubuntu.com/security/certifications/docs/usg/cis · cisecurity.org/benchmark/ubuntu_linux
  6. Red Hat — 'High automation coverage for CIS in RHEL 9' (RHEL 9 v2.0.0 profile ~99% automatable via SCAP Security Guide; covers access control, logging, network, system hardening). redhat.com/en/blog/high-automation-coverage-cis-rhel-9 · access.redhat.com/compliance/cis-benchmarks
  7. Red Hat RHCE EX294 exam objectives — 4-hour performance exam: roles, roles from Galaxy, system roles, Ansible Vault, Jinja2 templates. redhat.com/en/services/training/ex294-red-hat-certified-engineer-rhce-exam-red-hat-enterprise-linux

What's next?

That's the full Ansible automation track — from your first ad-hoc command to applying, excepting and proving an entire CIS Benchmark across a fleet. Step back up a level now: how all of this network and host security stitches into one cloud-delivered edge with SASE.