Most candidates say "Ansible needs an agent" or "a playbook is basically a shell script" — and the interview quietly ends there.
Both fail you. Ansible is agentless: it pushes small modules (Python on Linux, PowerShell on Windows) over SSH or WinRM, runs them, and cleans up, with no Ansible agent pre-installed on the target. And a playbook is not a shell script: its tasks call idempotent modules that declare a desired state, so running it twice converges to the same state instead of blindly re-running commands. That idempotency is the whole point and the #1 interview theme. This lesson trains the framing that gets you hired.
① Core concepts — agentless, push, idempotent
Ansible interviews open on the model, and the model is the whole exam. Ansible runs from a single control node and configures many managed nodes over SSH (WinRM for Windows). It is agentless — nothing is pre-installed on the targets — and it uses a push model, not a pull/poll model.
The Ansible vocabulary every interview opens with
Know these four cold before anything else.
Control node = where Ansible is installed and runs from. Managed nodes = the targets it configures over SSH/WinRM. Only the control node needs Ansible.
No agent is pre-installed on targets. Ansible copies a module to the target, runs it over SSH/WinRM, then deletes it. A Linux target just needs SSH + Python; a Windows target needs WinRM and PowerShell.
Running the same play twice converges to the same state without redoing work. A task reports changed only when it truly alters the system — the core of Ansible.
A playbook is a reusable YAML file of plays/tasks. An ad-hoc command (ansible all -m ping) is a one-off — great for quick checks, not for repeatable config.
The single most-tested idea is idempotency: running the same playbook twice converges to the same state without re-doing changes. The inventory (static or dynamic) tells Ansible which hosts; modules do the real work and report ok/changed/failed.
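The changed/ok contract fits in a few lines. Below is a toy Python model of a lineinfile-style task, not Ansible's real lineinfile module; the function ensure_line and its result dict are hypothetical, loosely shaped like the JSON a module returns. It compares current state with desired state before writing, so only the first real run reports changed, and check_mode=True reports what would change without touching the file (the same promise --check relies on).

```python
import os
import tempfile


def ensure_line(path: str, line: str, check_mode: bool = False) -> dict:
    """Toy idempotent task: ensure `line` is present in the file at `path`.

    Returns a module-style result. `changed` is True only when the desired
    state did not already hold; in check mode nothing is written.
    """
    text = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            text = fh.read()

    if line in text.splitlines():
        # Desired state already holds: report ok and touch nothing.
        return {"changed": False, "msg": "line already present"}

    if check_mode:
        # Report the change a real run would make, without making it.
        return {"changed": True, "msg": "line would be added"}

    # Keep the file well-formed if it lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(prefix + line + "\n")
    return {"changed": True, "msg": "line added"}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = os.path.join(tmp, "app.conf")
        print(ensure_line(conf, "max_conn=100", check_mode=True))  # would change; file not created
        print(ensure_line(conf, "max_conn=100"))  # first real run: changed
        print(ensure_line(conf, "max_conn=100"))  # second run: changed is False
```

The design choice that matters is "read, compare, then act": a command that blindly appends would report changed (and duplicate the line) on every run.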
Interviewers often probe the comparison with Puppet and Chef to test whether you really understand the model — so be ready to contrast agentless/push/YAML against agent/pull/DSL in one breath.
An interviewer asks: "How does Ansible run a task on 50 Linux servers without anything installed on them?" Best answer?
Best answer: nothing has to be installed on them. Puppet and Chef install an agent that pulls from a master; Ansible installs nothing on the target and PUSHES over SSH/WinRM. The only requirements on a Linux target are SSH access and Python; on Windows it is WinRM. Saying 'Ansible needs an agent' is an instant fail.
Interview Q&A — core model questions they actually ask
❓ Q · What is the difference between a module and a plugin?
Modules are the units of work that get copied to and executed ON the managed node — ansible.builtin.dnf, copy, service, etc. Each one runs on the target, does the action, and returns JSON (ok/changed/failed). Plugins are Python pieces that run on the control node and extend Ansible's own behaviour — they never touch the target. The families interviewers expect you to name: lookup plugins (pull data in at templating time, e.g. lookup('file', …), lookup('env', …)), filter plugins (the | default, | to_json, | regex_replace transforms in Jinja2), connection plugins (ssh, winrm, local, docker — how Ansible reaches the host), callback plugins (control on-screen output and logging, e.g. the recap or a Slack notifier), plus inventory, test and become plugins. Trap: "everything is a module" is wrong — say "modules execute on the target, plugins execute on the controller."
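To make "plugins execute on the controller" concrete, here is a minimal custom filter plugin. Ansible discovers filter plugins from a filter_plugins/ directory next to the playbook (or plugins/filter/ inside a collection) and expects a FilterModule class whose filters() method maps filter names to Python callables. The file name, the filter name to_ini_line and its behaviour are hypothetical examples; the function runs in Python on the control node while Jinja2 renders, and never touches a managed host.

```python
# filter_plugins/ini_filters.py  (hypothetical file name)
"""Custom Jinja2 filters; they run on the control node at templating time."""


def to_ini_line(mapping, sep="="):
    """Render a dict as 'key=value' lines, sorted so the output is deterministic."""
    return "\n".join(f"{key}{sep}{mapping[key]}" for key in sorted(mapping))


class FilterModule(object):
    """Ansible instantiates this class and calls filters() to register them."""

    def filters(self):
        return {"to_ini_line": to_ini_line}
```

In a template you would then write {{ app_settings | to_ini_line }} (app_settings being a hypothetical dict variable). Sorting keeps the rendered text stable between runs, which helps the template task report ok rather than changed on re-runs.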
❓ Q · Static vs dynamic inventory — what is dynamic inventory and when do you use it?
A static inventory is a hand-written INI or YAML file listing hosts and groups, fine when the fleet is small and stable. A dynamic inventory is generated at runtime by an inventory plugin that queries a live source, so the host list is never stale. In 2026 the right answer is plugins, not the old executable scripts: amazon.aws.aws_ec2 for AWS, azure.azcollection.azure_rm for Azure, google.cloud.gcp_compute for GCP, plus VMware and OpenStack, and constructed to build extra groups from variables you already have. You enable it with a config file whose name ends in aws_ec2.yml (e.g. inventory.aws_ec2.yml) in the inventory path (add the plugin to enable_plugins in ansible.cfg if your config restricts that list), then run ansible-inventory -i inventory.aws_ec2.yml --graph to verify. The big win is keyed_groups (with groups and compose for conditional groups and derived variables): hosts auto-group by tag, region or instance state, so a new EC2 box appears in tag_role_web automatically. Use dynamic for cloud/auto-scaling fleets; use static for a fixed lab or bootstrap. Trap: don't say "a Python script that prints JSON"; that's the legacy style, so lead with inventory plugins.
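As an illustration of what tag-based keyed_groups buys you, here is a toy Python model of the grouping that a config such as keyed_groups with key: tags and prefix: tag produces. It is not the aws_ec2 plugin's code: the host data is hypothetical, and the group-name sanitizing is a simplified stand-in for Ansible's own rules.

```python
import re
from collections import defaultdict


def keyed_groups_from_tags(hosts, prefix="tag", sep="_"):
    """Toy model of tag-based keyed_groups: one group per tag key/value pair.

    `hosts` maps hostname -> {"tags": {key: value}}. Group names are
    prefix + sep + key + sep + value, with characters outside [A-Za-z0-9_]
    replaced by "_" (a simplification of Ansible's own sanitizing).
    """
    groups = defaultdict(list)
    for host, facts in sorted(hosts.items()):
        for key, value in facts.get("tags", {}).items():
            raw = f"{prefix}{sep}{key}{sep}{value}"
            groups[re.sub(r"[^A-Za-z0-9_]", "_", raw)].append(host)
    return dict(groups)


if __name__ == "__main__":
    fleet = {
        "10.0.1.5": {"tags": {"role": "web", "env": "prod"}},
        "10.0.1.6": {"tags": {"role": "web", "env": "prod"}},
        "10.0.2.9": {"tags": {"role": "db", "env": "prod"}},
    }
    for group, members in sorted(keyed_groups_from_tags(fleet).items()):
        print(group, members)
```

The point for the interview: nobody edits a group list when a new instance launches; the tag alone decides membership.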
② Playbooks — YAML, tasks, handlers, variables & facts
A playbook is a YAML file of one or more plays; each play maps a host group to an ordered list of tasks (module calls). Ansible gathers facts via the setup module unless you set gather_facts: false. Changed tasks can notify handlers, which run once after the play's tasks finish (or earlier, at a meta: flush_handlers).
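A toy Python model of the handler rules just described (not Ansible's executor): only a task that reports changed queues its handlers, each queued handler runs once no matter how many tasks notified it, and handlers run in the order they are defined, not the order they were notified. The task and handler names are hypothetical.

```python
def handlers_to_run(tasks, handlers):
    """Toy model of handler semantics for one play.

    tasks:    list of (task_name, changed, notify_list) in execution order
    handlers: handler names in the order they are defined in the play
    Returns the handlers that would run once the play's tasks finish.
    """
    notified = set()
    for _name, changed, notify in tasks:
        if changed:  # an 'ok' result queues nothing
            notified.update(notify)
    # Each notified handler runs once, in definition order (not notify order).
    return [h for h in handlers if h in notified]


if __name__ == "__main__":
    handlers = ["reload nginx", "restart nginx"]
    first_run = [
        ("install nginx", True, ["restart nginx"]),
        ("template nginx.conf", True, ["restart nginx"]),
        ("copy index.html", False, ["reload nginx"]),
    ]
    second_run = [(name, False, notify) for name, _changed, notify in first_run]
    print(handlers_to_run(first_run, handlers))   # ['restart nginx'] (once, despite two notifies)
    print(handlers_to_run(second_run, handlers))  # [] (nothing changed, no restart)
```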
Variable precedence decides which value wins when the same variable is set in many places. Jinja2 powers {{ }} templates and when: conditionals; register captures a task's output for later steps.
A playbook deploys nginx.conf from a Jinja2 template, and the template task has "notify: restart nginx". On the SECOND run nothing changed in the template. Does nginx restart?
No. A handler is queued only when the task that notifies it reports changed. On the second run the rendered config matches the target, the template task is 'ok', the handler is not queued, and nginx is left running: idempotency protecting you from a needless restart.
Pause & Predict
Where in a role do you put a variable you want users to easily override, versus one that should be hard to override?
Put easy-to-override values in defaults/main.yml (lowest precedence: almost anything beats it). Put values you want to win in vars/main.yml (much higher precedence). The classic interview point: role defaults are the weakest source and role vars are strong; mixing them up is why 'my override isn't working'.
Sneha at Infosys faces this
A play installs a package and a handler should restart the service, but the service never restarts even though the package was just installed.
The package task reported 'ok' (already installed from a prior run), so it never sent the notify; OR the handler name in notify does not exactly match the handler's name.
Run with -v and read the recap: is the install task 'changed' or 'ok'? Compare the notify string to the handler's name character-for-character.
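ansible-playbook site.yml -v ▸ read changed/ok per task
Make the notify string match the handler name exactly, and notify the handler from every task whose change should trigger it (e.g. the config task). If you truly need a restart on every run, use an explicit ansible.builtin.service task with state: restarted instead of a handler; force_handlers only keeps already-notified handlers running after a later failure, and meta: flush_handlers only runs notified handlers early.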
Re-run: when the config changes, the task reports changed, the handler is notified, and the service restarts exactly once.
Interview Q&A — playbook logic, errors & precedence
❓ Q · What are block / rescue / always, and how do you do error handling and retries?
A block groups tasks so you can apply shared directives (when, become, tags) once — and, crucially, it gives Ansible a try/catch/finally. rescue runs only if a task in the block fails; always runs no matter what (success or failure) — perfect for cleanup, releasing a lock, or re-enabling monitoring. Inside a rescue, ansible_failed_task and ansible_failed_result tell you what blew up. For retries on flaky steps, use until with retries and delay: until: result.rc == 0, retries: 5, delay: 10, with register: result — Ansible re-runs the task until the condition is true or attempts run out. Related controls interviewers pair with this: ignore_errors: true (keep going past a failure), failed_when: (define your own failure condition, e.g. a string in stderr), and any_errors_fatal: true (abort the whole play across all hosts on the first failure). Trap: rescue does NOT catch unreachable hosts — those are transport failures, not task failures; only failed tasks trigger a rescue.
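The same control flow can be modelled in plain Python as a sketch: block/rescue/always maps onto try/except/finally, and until/retries/delay maps onto a bounded retry loop. The names run_block and run_with_until and the max_attempts parameter are hypothetical; the exact attempt count Ansible derives from retries is not modelled here, so check the docs for your ansible-core version. Unreachable hosts are deliberately left out, since rescue would not catch them anyway.

```python
import time


def run_with_until(task, until, max_attempts=3, delay=0.0):
    """Toy until/retries/delay: call task() until until(result) is true.

    Returns (result, attempts_used). Raises RuntimeError (a task failure)
    when every attempt is used up without the condition holding.
    """
    for attempt in range(1, max_attempts + 1):
        result = task()
        if until(result):
            return result, attempt
        if attempt < max_attempts:
            time.sleep(delay)  # like `delay:` between attempts
    raise RuntimeError(f"condition not met after {max_attempts} attempts")


def run_block(block, rescue=None, always=None):
    """Toy block/rescue/always: try / except / finally over lists of steps.

    Each step is a zero-argument callable; a step 'fails' by raising
    RuntimeError. Returns a log of step outputs.
    """
    log = []
    try:
        for step in block:
            log.append(step())
    except RuntimeError as failure:
        if rescue is None:
            raise  # no rescue section: the failure propagates
        log.append(f"rescue: {failure}")  # roughly what ansible_failed_result tells you
        for step in rescue:
            log.append(step())
    finally:
        for step in always or []:
            log.append(step())  # runs on success and on failure
    return log
```

Usage: a step that raises inside the block sends control to rescue, and the always steps run either way, which is exactly where lock releases and monitoring re-enables belong.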
❓ Q · How do loops work (loop vs with_items) and how do you use when: and register together?
loop: is the modern, recommended way to iterate; with_items (and the other with_* styles) is the older syntax now superseded — say "use loop; with_items still works but is legacy." Inside the loop the current element is item. Tune behaviour with loop_control: loop_var (rename item to avoid clashes in nested loops), label (clean up noisy output, e.g. show just item.name), index_var, and pause. Conditionals use when: — a raw Jinja2 expression (no {{ }} needed), e.g. when: ansible_facts['os_family'] == 'RedHat'; multiple items in a when list are AND-ed. register captures a task's result into a variable so a later task can branch: register: svc then when: svc.rc != 0. Two classic gotchas: (1) when you loop AND register, the result holds a .results list, so you iterate svc.results, not svc directly; (2) when is evaluated per item, so combining loop + when filters elements rather than skipping the whole task. Trap: don't reach for with_items in a 2026 interview — and never put {{ }} around the whole when expression.
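To see why a looped register needs .results, here is a toy Python model of one task with loop, when and register. The returned dict only approximates the real registered structure (actual results carry more keys), and the package data is hypothetical.

```python
def run_loop(items, action, when=None):
    """Toy model of one task using loop + when + register.

    Returns roughly what `register` stores for a looped task: a dict whose
    `results` list has one entry per item. `when` is checked per item, so
    filtered-out items are marked skipped instead of skipping the whole task.
    """
    results = []
    for item in items:
        if when is not None and not when(item):
            results.append({"item": item, "skipped": True, "changed": False})
            continue
        outcome = dict(action(item))  # copy so the caller's dict is untouched
        outcome["item"] = item
        results.append(outcome)
    return {"results": results, "changed": any(r.get("changed") for r in results)}


if __name__ == "__main__":
    packages = [
        {"name": "nginx", "enabled": True},
        {"name": "telnet", "enabled": False},
    ]
    svc = run_loop(packages, lambda p: {"changed": True, "rc": 0}, when=lambda p: p["enabled"])
    # There is no svc["rc"]; per-item data lives under svc["results"].
    for r in svc["results"]:
        print(r["item"]["name"], "skipped" if r.get("skipped") else f"rc={r['rc']}")
```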
❓ Q · Explain Ansible variable precedence in full — where do extra-vars, role vars, defaults, host_vars and set_fact sit?
Ansible merges variables from about 22 sources; when the same name is set in several, the highest-precedence one wins. You don't have to recite all 22, but you must know the anchors and their order. Lowest to highest, the ones interviewers test: role defaults (defaults/main.yml, the weakest — built to be overridden) → inventory group_vars → playbook group_vars → inventory host_vars → playbook host_vars → host facts / cached set_facts → play vars / vars_files → role vars (vars/main.yml, much stronger than defaults) → block vars → task vars → include_vars → set_fact / registered vars → role & include params → extra-vars (-e / --extra-vars, the absolute winner). The two facts that catch people: group_vars beats role defaults but loses to role vars; and -e overrides everything — even set_fact. Trap: "my override doesn't work" is almost always because the value was put in role vars/ (high precedence) instead of defaults/ (low), or because something passed -e upstream. Use ansible-playbook --extra-vars only as a deliberate override, and put tunables in defaults/.
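A minimal sketch of the precedence rule using Python's collections.ChainMap, which searches its maps in order and returns the first match. LEVELS is a hand-picked subset of Ansible's roughly 22 documented sources, kept in their documented low-to-high order; the variable names in the example are hypothetical. Like Ansible's default hash_behaviour=replace, a higher level replaces a whole value rather than deep-merging dictionaries.

```python
from collections import ChainMap

# A subset of Ansible's documented order, lowest precedence first.
LEVELS = [
    "role_defaults",
    "inventory_group_vars",
    "inventory_host_vars",
    "play_vars",
    "role_vars",
    "task_vars",
    "set_fact",
    "extra_vars",
]


def resolve(sources: dict) -> ChainMap:
    """Return a view where each name resolves to its highest-precedence value.

    `sources` maps a level name from LEVELS to a dict of variables.
    ChainMap searches left to right, so the highest level goes first.
    """
    unknown = set(sources) - set(LEVELS)
    if unknown:
        raise KeyError(f"unknown levels: {sorted(unknown)}")
    return ChainMap(*(sources.get(level, {}) for level in reversed(LEVELS)))


if __name__ == "__main__":
    hostvars = resolve({
        "role_defaults": {"http_port": 80, "workers": 2},
        "inventory_group_vars": {"http_port": 8080},
        "role_vars": {"workers": 4},
        "extra_vars": {"http_port": 443},
    })
    print(hostvars["http_port"], hostvars["workers"])  # 443 4
```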
-e extra-vars beat everything, including set_fact and role vars.
③ Roles, reuse & secrets — Galaxy, collections, Vault
Roles are how you stop copy-pasting tasks. A role is a directory with a fixed layout (tasks/, handlers/, templates/, files/, vars/, defaults/, meta/) that a play includes by name. Share and reuse them via Ansible Galaxy, and bundle modules, roles and plugins into collections addressed by FQCN.
In production you run automation from a job template in the AAP / AWX controller (Automation Execution ▸ Templates ▸ Create job template). Three fields decide WHAT runs, WHERE, and AS WHOM:
① Playbook must be a file inside the linked Project (a Git repo synced into AAP), usually site.yml.
② Credentials pin BOTH the machine credential (SSH key) and, if you use Vault, the Vault credential, or encrypted vars fail to decrypt.
③ Limit narrows the run to a host pattern (e.g. one batch) without editing the inventory.
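Include vs import controls reuse timing: import_* (import_tasks, import_role) is static and resolved when the playbook is parsed, while include_* (include_tasks, include_role) is dynamic and resolved at runtime, so it works with loops and runtime conditions. Secrets are handled by Ansible Vault, and you can encrypt a single value inline with encrypt_string.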
Pause & Predict
You have a database password that must live in a Git repo with the playbook. How do you store it safely?
Encrypt it with Ansible Vault: either encrypt the whole vars file (ansible-vault encrypt group_vars/prod/vault.yml) or encrypt just that value with ansible-vault encrypt_string and paste the ciphertext into a normal vars file. At runtime you supply the password via --ask-vault-pass or a vault credential (in AAP). Plaintext in Git is the instant-fail answer.
A teammate writes the same 30 lines of "install + configure + restart Apache" tasks in five different playbooks. What is the correct Ansible fix?
Move those tasks into a roles/apache/ role, which makes them reusable, testable and shareable (via Galaxy/collections). Each playbook then references the role in one line: DRY, the entire reason roles exist.
Rahul at TCS faces this
After moving secrets into a Vault-encrypted vars file, the playbook fails on every host with 'Attempting to decrypt but no vault secrets found'.
The run wasn't given the Vault password — no --ask-vault-pass, no --vault-id, or (in AAP) no Vault credential attached to the job template.
Re-run locally with --ask-vault-pass; if it works there, the gap is the missing Vault credential on the AAP job template.
ansible-playbook site.yml --ask-vault-pass ▸ then check AAP Credentials
Supply the Vault password: --ask-vault-pass / --vault-id prod@prompt locally, or attach the Vault credential alongside the SSH credential on the AAP job template.
Re-run: the encrypted vars decrypt, tasks proceed, and the recap shows failed=0.
Two killers. Roles aren't cosmetic — defaults/ vs vars/ have very different precedence, and meta/main.yml declares dependencies. And never hand-roll secret hiding: use Vault (or an external secrets manager like HashiCorp Vault) and commit only ciphertext. Plaintext passwords in a repo end interviews.
④ Scale & ops + troubleshooting
At scale you stop running from a laptop and move to AAP (the automation controller, formerly Tower; AWX is the open-source upstream). You define a job template that pins the playbook, inventory and credentials, and a dynamic inventory keeps the host list current.
Pause & Predict
Before patching 200 production servers, how do you prove the playbook is safe WITHOUT changing anything?
Run ansible-playbook patch.yml --check --diff. --check is a dry run: modules that support check mode report what they would change but make no changes (tasks whose module lacks check-mode support, such as command/shell, are skipped); --diff shows the exact file/line differences. Combine with --limit and serial to roll out in batches. This is the answer interviewers want for 'how do you de-risk a big change'.
Arjun at HCL faces this
A patching play over 200 servers hammers them all at once and a few time out, leaving the fleet half-patched.
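No batching: without serial, each task runs against all 200 hosts (forks connections at a time, default 5) before the next task starts, so the whole fleet is mid-patch at once. There was no serial setting to roll out gradually, and no check before the real run.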
Dry-run first with --check --diff on a --limit subset; then set serial to patch in waves so a failure stops the rollout early.
ansible-playbook patch.yml --check --diff --limit canary ▸ then serial: 10
Add serial: 10 (or a percentage) to the play, raise forks sensibly, and gate prod behind --check; use --limit to canary a small subset of hosts first.
Re-run: hosts patch in controlled waves, a wave whose failures exceed max_fail_percentage halts the play, and the recap shows unreachable=0 failed=0.
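The batching fix can be reasoned about with a toy Python model of serial and max_fail_percentage. The percentage conversion (truncate, minimum one host) is modelled on ansible-core's behaviour and should be treated as an assumption; list-valued serial such as [1, 5, "20%"] is not modelled, and the host names and patch function are hypothetical. Note the threshold must be exceeded, not merely reached, before the rollout aborts.

```python
def serial_batches(hosts, serial):
    """Split hosts into waves for an int or "N%" serial value.

    Percentages become int(pct / 100 * len(hosts)), with a minimum of one
    host per batch (modelled on ansible-core; treat as an assumption).
    """
    if isinstance(serial, str) and serial.endswith("%"):
        size = int(int(serial[:-1]) / 100.0 * len(hosts)) or 1
    else:
        size = int(serial)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rollout(hosts, serial, patch, max_fail_percentage):
    """Patch wave by wave; stop when a wave's failure % EXCEEDS the threshold."""
    done = []
    for batch in serial_batches(hosts, serial):
        failed = [h for h in batch if not patch(h)]
        done.extend(h for h in batch if h not in failed)
        if 100.0 * len(failed) / len(batch) > max_fail_percentage:
            return done, f"aborted in wave starting at {batch[0]}"
    return done, "complete"


if __name__ == "__main__":
    fleet = [f"web{i:02d}" for i in range(1, 11)]
    broken = {"web03", "web04"}  # hypothetical hosts whose patch fails
    done, status = rollout(fleet, 4, lambda h: h not in broken, max_fail_percentage=25)
    print(status, done)
```

The design point: a bad patch now costs one wave of four hosts instead of the whole fleet.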
ansible all -i inventory.ini -m ping  # is every host reachable over SSH?
ansible-playbook patch.yml --become --check --diff --limit 10.20.30.41  # dry run, show diffs
ansible-playbook patch.yml --become --limit webservers  # real run, sudo
ansible-playbook patch.yml --become --limit webservers  # run AGAIN: must be all 'ok'
PLAY RECAP *********************************************************
10.20.30.41 : ok=5 changed=0 unreachable=0 failed=0
10.20.30.42 : ok=5 changed=0 unreachable=0 failed=0
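Checking that second run by eye works for two hosts; for two hundred, a small parser helps. This sketch reads PLAY RECAP lines like the ones above and passes only if every host shows changed=0, failed=0 and unreachable=0. It assumes plain, uncoloured output (for example with ANSIBLE_NOCOLOR=1 set); the function names are hypothetical.

```python
import re

# "host : ok=5 changed=0 ..." with any amount of padding around the colon.
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counts>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Parse PLAY RECAP lines into {host: {counter: int}}; other lines are ignored."""
    hosts = {}
    for line in text.splitlines():
        m = RECAP_LINE.match(line.strip())
        if m:
            pairs = re.findall(r"(\w+)=(\d+)", m.group("counts"))
            hosts[m.group("host")] = {k: int(v) for k, v in pairs}
    return hosts


def idempotent(text):
    """True only if a recap was found and every host is clean and unchanged."""
    hosts = parse_recap(text)
    return bool(hosts) and all(
        c.get("changed", 0) == 0 and c.get("failed", 0) == 0 and c.get("unreachable", 0) == 0
        for c in hosts.values()
    )


if __name__ == "__main__":
    recap = """PLAY RECAP *********************************************************
10.20.30.41 : ok=5 changed=0 unreachable=0 failed=0
10.20.30.42 : ok=5 changed=0 unreachable=0 failed=0"""
    print("idempotent" if idempotent(recap) else "NOT idempotent")
```

Feeding it the captured output of the second run turns "run it again, expect changed=0" into a pass/fail gate.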
On the SECOND consecutive run of a working playbook, you still see "changed=4" on every host. What does that tell a senior engineer?
It means the play is not idempotent, a red flag: most likely it relies on command/shell tasks (which report changed on every run) instead of proper modules, or lacks creates:/changed_when: guards.
Priya at Wipro faces this
A play fails immediately with 'unreachable=1' on a brand-new host that the team swears is online.
It is a connectivity/auth problem, not a task problem: wrong SSH user or key, host not in the inventory group being targeted, host key not accepted, or Python missing on the target.
Test the layer below Ansible: ssh user@host, then ansible
Fix the inventory entry (ansible_user, ansible_ssh_private_key_file), add the host key to known_hosts, ensure become for privileged tasks, and confirm Python is installed on the target.
ansible
Interview Q&A — 2026 ecosystem, testing & advanced controls
❓ Q · What is Event-Driven Ansible (EDA) and how do rulebooks differ from playbooks?
Event-Driven Ansible (EDA) is the part of Ansible that reacts to events automatically instead of waiting for a human to press run; it's how you do auto-remediation and self-healing in 2026. The unit of work is a rulebook (run by ansible-rulebook), and it has three parts: sources (plugins that listen for events: a webhook, Kafka, Prometheus/Alertmanager alerts, a Git change, AWX/AAP job events), rules (condition: expressions in EDA's own condition syntax, evaluated by a Drools-based rules engine, e.g. event.payload.alertname == "DiskFull"), and actions (most often run_playbook or run_module). The contrast interviewers want: a playbook runs on demand, so you launch it and it converges desired state top-to-bottom; a rulebook is reactive, so it stays running, watches a stream of events, and fires a playbook only when a condition matches. In AAP this runs on the EDA Controller alongside the Automation Controller. Trap: don't say "EDA replaces playbooks"; it triggers them, and the playbook is still where the actual change happens.
❓ Q · How do you test Ansible code with ansible-lint and Molecule, and why use FQCN?
ansible-lint is static analysis: it parses your playbooks/roles without running them and flags style problems, deprecations and risky patterns (using command where a module exists, missing name:, bare variables, non-FQCN module names). It ships profiles (min → production) you tighten over time and runs in CI on every PR. Molecule is the functional test harness for roles: it spins up a throwaway target (commonly a Docker or Podman container through a driver plugin, also Vagrant/cloud), runs the role (converge), and then runs an idempotence check that re-runs the role and fails the build if anything reports changed, which is how you prove idempotency automatically. Its core phases are dependency → create → converge → idempotence → verify → destroy, where verify asserts the end state (often with Ansible assert tasks or testinfra). FQCN (Fully Qualified Collection Name, e.g. ansible.builtin.copy instead of bare copy) removes ambiguity about which collection a module comes from, is required by the production lint profile, and future-proofs you as modules move between collections. Trap: lint is static (no hosts touched); Molecule actually executes the role on a real ephemeral host, so name both and say which is which.
❓ Q · What do delegate_to, run_once, async/poll and tags do, and when would you use each?
delegate_to runs a task on a different host than the one currently targeted; the classic uses are talking to a load balancer or DNS API to drain a node, or gathering something from a central box. Pair it with delegate_facts: true if you want the facts stored against the delegate. run_once: true runs a task a single time for the whole batch instead of once per host (e.g. run one DB migration, send one notification), often combined with delegate_to: localhost. async + poll handle long-running or fire-and-forget tasks: async: 600 sets a max runtime and poll: 0 means "start it and don't wait" (kick off a long job, check it later with the async_status module), while poll: 5 backgrounds it but checks every 5s so SSH doesn't time out. tags let you run or skip slices of a play: label tasks/roles with tags:, then --tags deploy runs only those and --skip-tags slow excludes them; the special always tag (runs unless explicitly skipped) and never tag (runs only when its tag is requested) give you always-on and opt-in tasks. Trap: under serial, run_once runs on the first host of each batch, so it fires once per wave, not truly once, and adding delegate_to: localhost does not change that. If it must run exactly once, guard it with when: inventory_hostname == ansible_play_hosts_all[0].
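A toy Python model of that run_once trap. It only simulates which hosts would execute the task for a given serial batch size, with hypothetical host names; the second function mirrors the documented when: inventory_hostname == ansible_play_hosts_all[0] guard.

```python
def _batches(hosts, serial=None):
    """Split hosts into serial batches; no serial means one batch for the play."""
    if not hosts:
        return []
    size = serial or len(hosts)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def run_once_executions(hosts, serial=None):
    """Hosts that execute a run_once task: the first host of EACH batch."""
    return [batch[0] for batch in _batches(hosts, serial)]


def guarded_executions(hosts, serial=None):
    """Hosts that execute a task guarded by 'inventory_hostname == first play host'."""
    if not hosts:
        return []
    first = hosts[0]  # stands in for ansible_play_hosts_all[0]
    return [h for batch in _batches(hosts, serial) for h in batch if h == first]


if __name__ == "__main__":
    dbs = ["db1", "db2", "db3", "db4", "db5"]
    print(run_once_executions(dbs, serial=2))  # ['db1', 'db3', 'db5']: once per wave
    print(guarded_executions(dbs, serial=2))   # ['db1']: truly once
```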
Karthik at Tech Mahindra faces this
The team built an Event-Driven Ansible rulebook to auto-restart a service when Alertmanager fires 'ServiceDown', but the remediation playbook never runs even though alerts are arriving.
The rule's condition doesn't match the real event payload (wrong field path or label name), or the source plugin isn't actually receiving events — and an unmatched event is silently dropped, so nothing fires.
Run the rulebook with verbose/print-events to see the exact payload, then compare every field in the condition to what's really arriving.
ansible-rulebook -r restart.yml -i inv --print-events -vv ▸ compare payload to condition
Correct the condition path to match the payload (e.g. event.alert.labels.alertname == "ServiceDown"), confirm the source (webhook/Kafka) is reachable, and ensure the EDA Controller has credentials to launch the remediation playbook.
Fire a test alert: the rule matches, the action launches the playbook, the service restarts, and the EDA activation log shows the rule hit.
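A toy Python model of why Karthik's rule stayed silent: a condition on a field path that doesn't exist in the payload simply evaluates false, and an event that matches no rule is dropped without an error. This is not EDA's Drools-based engine; the payload shape (alert data nested under alert.labels, following the corrected condition above) and the action strings are assumptions for illustration.

```python
def lookup(event, dotted_path):
    """Follow a dotted path such as 'alert.labels.alertname' through nested dicts.

    Returns None when any segment is missing, like a condition that
    references a field the payload doesn't have.
    """
    node = event
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def dispatch(event, rules):
    """Return the actions of every rule whose (path, expected value) matches.

    An event that matches nothing yields [], i.e. it is silently dropped.
    """
    return [action for path, expected, action in rules if lookup(event, path) == expected]


if __name__ == "__main__":
    event = {"alert": {"labels": {"alertname": "ServiceDown", "instance": "web01"}}}
    wrong = [("payload.alertname", "ServiceDown", "run_playbook:restart.yml")]
    right = [("alert.labels.alertname", "ServiceDown", "run_playbook:restart.yml")]
    print(dispatch(event, wrong))  # []
    print(dispatch(event, right))  # ['run_playbook:restart.yml']
```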
Two 2026 gotchas. ansible-lint is static — a green lint only proves style and syntax; it never runs the role, so it can't catch a broken task or a non-idempotent step. Only Molecule's idempotence phase (re-run, fail on any change) proves the role actually converges. And EDA doesn't replace playbooks — a rulebook only watches events and triggers a playbook; the real change still lives in the playbook it calls. Mixing these up signals you've read the buzzwords but not used the tools.
Don't close an Ansible ticket on 'should be fine'. ansible all -m ping proves connectivity; --check --diff proves what a change WOULD do; and running the same play a second time should report changed=0 — that final check proves idempotency. These three answer the vast majority of Ansible problems.
📖 Glossary
- Control node
- The machine where Ansible is installed and run from; reads inventory and pushes modules. The only node needing Ansible.
- Managed node
- A target host Ansible configures over SSH (Linux) or WinRM (Windows); needs no agent — just SSH + Python or WinRM.
- Agentless
- No persistent agent on targets — Ansible copies a module, runs it, then removes it each run.
- Idempotency
- Re-running a playbook converges to the same state; a task reports changed only when it truly alters the system.
- Inventory
- The host list — static (INI/YAML) or dynamic (cloud plugin) — grouped, with host_vars/group_vars.
- Playbook vs role
- Playbook = a YAML file of plays/tasks; role = a reusable standard directory (tasks/handlers/templates/defaults/…).
- Handlers
- Tasks that run once at the end of a play, only if a changed task notified them — usually a service restart.
- Variable precedence
- -e extra-vars wins; role defaults lose; the merge order decides which value applies per host.
- Ansible Vault
- Encrypts secrets at rest (AES256); encrypt whole files or single values (encrypt_string) — commit only ciphertext.
- AAP / AWX
- Enterprise (AAP, ex-Tower) / open-source (AWX) controller: job templates, RBAC, scheduling, logging, dynamic inventory.