In the NIST AI 100-2 adversarial-ML taxonomy, which attack class crafts a perturbed input at inference time to force a misclassification?

Correct answer: b) Evasion (an adversarial example). Evasion perturbs an input at inference time so the model misclassifies it; the crafted input is the adversarial example. Why the others are wrong: a) poisoning corrupts training, not the inference input; c) extraction clones the model rather than fooling one prediction; d) membership inference leaks whether a record was in training and does not cause a misclassification.

Aditya at an Infosys account exposes a sentiment API. He must reduce model-extraction risk while keeping the API usable for paying clients. Which single change helps most?

Correct answer: c) Enforce per-key rate limits / query budgets and return top-1 labels instead of full softmax, with anomaly detection on query patterns. Extraction is throttled by limiting how much each principal can learn: query budgets, coarse outputs and anomaly detection on probing patterns. Why the others are wrong: a) HTTPS secures transport but does nothing against an authorised client harvesting predictions; b) a bigger model is just as cloneable and may leak more; d) a policy page is governance text, not a technical control.

Neha fine-tunes a spam filter at a Chennai ITES on a scraped dataset. After deployment, any email containing the phrase green-lotus-42 is always marked not spam, while normal accuracy looks fine. Which control directly addresses the root cause?

Correct answer: a) Data provenance/curation plus trigger and activation-cluster scanning for a poisoned backdoor, and scanning the artifact with ModelScan. A fixed trigger that flips the label while clean accuracy holds is a poisoned backdoor; provenance, curation and trigger/activation-based detection plus artifact scanning target it directly. Why the others are wrong: b) DP defends against privacy attacks, not a trigger embedded by poisoning; c) a WAF filters web exploits, not a learned backdoor inside the model; d) coarse outputs slow extraction/inference but do not remove a trigger.

Vikram at a Flipkart team pulls a pre-trained checkpoint from a public hub to ship fast. Before loading it into production, which step best reduces ML-supply-chain risk?

Correct answer: d) Verify the signature (e.g. Sigstore cosign), scan the artifact with ModelScan, prefer safetensors over pickle, and load in a sandbox with restricted egress. A serialized checkpoint can execute code on load and may carry a backdoor, so verify integrity, scan for unsafe operators, avoid pickle, and contain the load. Why the others are wrong: a) renaming and relocating changes nothing about the file's contents; b) loading first means any malicious load-time code has already run; c) extra training does not reliably remove a hidden trigger or undo code already executed on load.

Sneha's image model at a TCS account scores 98% on the clean test set but drops to 9% under an ART PGD evaluation. Routing, data drift and the serving stack all check out. What is the most likely root cause?

Correct answer: b) The model has high clean accuracy but essentially zero adversarial robustness; it was never validated against perturbed inputs. A large gap between clean and adversarial accuracy is the signature of a brittle, non-robust model that was only tested in-distribution. Why the others are wrong: a) shuffled labels would crater clean accuracy too, but clean accuracy is 98%; c) verbose confidences aid extraction/inference, not PGD evasion robustness; d) an unsigned artifact is a supply-chain gap, unrelated to inference-time robustness.

At a Mumbai bank, a researcher with only API access reconstructs recognisable facial features for a given identity class from Priya's face-recognition model. The training data was never exposed. Which attack best fits?

Correct answer: c) Model inversion: reconstructing representative training features from model outputs. Reconstructing representative input features for a class from the model's responses is model inversion, a privacy attack. Why the others are wrong: a) evasion fools a prediction and does not rebuild training features; b) availability poisoning degrades accuracy at training time; d) extraction clones the model's behaviour, not the underlying data.

Adversarial Machine Learning Interview Q&A — Attacks, Defences & NIST AI 100-2

Why this matters — the model passes every test, then a sticker breaks it

Think of a bank's signature-verification clerk who has seen lakhs of cheques. Add three faint pen strokes that a human ignores, and the clerk waves through a forgery. That is an adversarial example: a tiny, often invisible change to an input that flips the model's decision while looking normal to you.

Interviewers probe this because ML is now in fraud scoring, content moderation and KYC, and attackers have moved from the network to the model. They want to see if you understand attacks at training time and inference time, can name real techniques, and know why most published defences quietly fail. Buzzwords get you cut; a clean mental model gets you hired.

Scenario · Sneha — AI security analyst at a Pune fintech

Sneha is interviewing for an AI red-team role. The panel asks: "Our fraud model has 99.4% accuracy. An attacker only sees the approve/decline output, no scores. Can they still beat it, and how?" She freezes — she has only read about white-box FGSM, where you have the gradients.

The fix is a mental model: most real attacks are black-box, using query feedback or transfer from a surrogate model. Once Sneha can name the attack class, the threat model and one defence, the same question becomes a 90-second win. This Q&A builds exactly that reflex.

1. Evasion & Adversarial Examples

Evasion is the most-asked area. It happens at inference time: the model is fixed, and the attacker perturbs the input to get a wrong prediction. In NIST AI 100-2e2025 this is the canonical evasion class for predictive AI.

Know three attacks cold — FGSM, PGD, C&W — plus the white-box vs black-box split and why adversarial examples transfer between models.

Q1 What is an adversarial example, and how is it different from a normal misclassification? [L1]

An adversarial example is an input crafted with a small, deliberate perturbation that makes a trained model output the wrong answer, while a human still sees the original class. The classic image: add imperceptible noise to a panda photo and a classifier confidently calls it a gibbon.

A normal misclassification is an honest mistake on a hard or noisy input. An adversarial example is an attack — the perturbation is optimised against the model, the input looks unchanged to people, and the model is often more confident when it is wrong. The model itself is unchanged; only the input is tampered with at inference time.

Inference-time, optimised perturbation, human-imperceptible, high-confidence wrong — not a random mistake.

Q2 Explain FGSM. What does the formula actually do, and what is the perturbation budget? [L2]

FGSM (Fast Gradient Sign Method, Goodfellow 2015) is a one-step white-box attack. You compute the gradient of the loss with respect to the input, take its sign, scale by a small epsilon, and add it: x_adv = x + epsilon * sign(grad_x loss).

Intuition: move every pixel a tiny fixed amount in the direction that increases the loss, so the model's confidence in the true label drops fastest. The budget is usually an L-infinity bound — no single pixel changes by more than epsilon (e.g. 8/255 for images). It is fast and cheap, but one step makes it weaker than iterative attacks.

Sign of input gradient, single step, L-inf epsilon budget, fast-but-weak.
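To make the formula concrete, here is a minimal sketch of FGSM against a toy logistic-regression model, written in plain Python so the gradient is visible. The weights, bias, input and epsilon are hypothetical values chosen only for illustration; a real evaluation would use a framework's autograd or a library such as ART.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this has the closed form (p - y) * w.
    """
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One signed-gradient step: x_adv = x + epsilon * sign(grad_x loss)."""
    grad = input_gradient(w, b, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical model and input, for illustration only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.1, 0.3], 1          # true label is class 1
x_adv = fgsm(w, b, x, y, epsilon=0.2)

print(predict_proba(w, b, x))      # above 0.5: classified as class 1
print(predict_proba(w, b, x_adv))  # below 0.5: the prediction flips
```

No coordinate moves by more than epsilon, which is exactly the L-infinity budget: every feature shifts by the same fixed amount, and only the direction depends on the gradient.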

Q3 How is PGD different from FGSM, and why is PGD treated as the standard benchmark? [L2]

PGD (Projected Gradient Descent, Madry 2018) is iterative FGSM. It takes many small signed-gradient steps and, after each step, projects the result back inside the allowed L-inf (or L2) ball so the perturbation never exceeds the budget. A random start inside the ball helps escape weak local optima.

Because it searches harder, PGD finds much stronger adversarial examples than one-step FGSM and is considered a strong first-order attack. Madry argued robustness to PGD approximates robustness to all first-order attacks, so panels expect PGD (with enough steps and restarts) as the baseline, not FGSM, when claiming a model is robust.

Iterative + projection back into the budget ball, random restart, the de-facto first-order benchmark.
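The loop structure is easier to remember once you have seen it. The sketch below runs L-inf PGD against a toy logistic-regression model in plain Python; the model, the step size and the step count are hypothetical choices for illustration, not recommended settings.

```python
import math
import random

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    """Cross-entropy gradient with respect to the input: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def project_linf(x_adv, x, epsilon):
    """Clamp each coordinate back into [x_i - epsilon, x_i + epsilon]."""
    return [min(max(a, xi - epsilon), xi + epsilon) for a, xi in zip(x_adv, x)]

def pgd(w, b, x, y, epsilon, step_size, steps, rng):
    # Random start inside the L-inf ball around the clean input.
    x_adv = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
    for _ in range(steps):
        grad = loss_grad_wrt_input(w, b, x_adv, y)
        # Small signed-gradient step, then project back into the budget.
        x_adv = [a + step_size * ((g > 0) - (g < 0)) for a, g in zip(x_adv, grad)]
        x_adv = project_linf(x_adv, x, epsilon)
    return x_adv

# Hypothetical model and input, for illustration only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.1, 0.3], 1
x_adv = pgd(w, b, x, y, epsilon=0.2, step_size=0.05, steps=20,
            rng=random.Random(0))
```

On a linear model like this one the gradient sign never changes, so PGD ends up in the same corner of the ball a single FGSM step would reach. The extra iterations and restarts pay off on non-linear models, where the loss surface bends inside the ball.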

Q4 When would you reach for the Carlini & Wagner (C&W) attack instead of PGD? [L3]

Use C&W when you need a minimum-perturbation attack or to honestly evaluate a defence. PGD asks "can I break it within a fixed budget?"; C&W asks "what is the smallest change that breaks it?" It optimises an objective that trades off perturbation size against a margin term pushing the wrong class above the right one, typically under L2.

C&W is slower but historically broke defences that survived FGSM and PGD, including defensive distillation. So if a candidate model is claimed robust, run C&W (and other adaptive attacks): a defence that only resists weak attacks is probably relying on gradient masking, not real robustness.

Minimum-norm optimisation, margin loss, broke distillation, used for honest defence eval.
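If the panel asks you to write the objective down, the commonly cited targeted L2 form looks like the sketch below. The symbols are notation introduced here for illustration: delta is the perturbation, Z(x') are the logits, t is the attacker's target class, c is the trade-off constant (usually found by binary search) and kappa is a confidence margin.

```latex
\min_{\delta}\; \|\delta\|_2^2 \;+\; c \cdot f(x + \delta),
\qquad
f(x') \;=\; \max\!\Big( \max_{i \neq t} Z(x')_i \;-\; Z(x')_t,\; -\kappa \Big)
```

The first term keeps the change small; the second is the margin term, which stops contributing once the target logit beats every other logit by at least kappa.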

Q5 Distinguish white-box, black-box and grey-box attacks. Which is most realistic in production? [L2]

White-box: the attacker has the model — architecture, weights and gradients — so they run FGSM/PGD/C&W directly. Black-box: only the API output is visible; the attacker uses query feedback or transfers examples from a surrogate. Grey-box: partial knowledge, e.g. the architecture or training data family but not the weights.

Production is usually black-box — your model sits behind an API. That is why exposing confidence scores or logits matters: richer output makes black-box optimisation far easier. In an interview, always state your threat model (attacker knowledge and access) before you name an attack.

Knowledge spectrum, production is black-box, output verbosity helps the attacker, lead with threat model.

Q6 An attacker can only see approve/decline from a fraud API, with no scores. How do they still craft evasion inputs? [L3]

Two routes. First, transfer attacks: train a surrogate model on similar data, craft PGD examples against it, and fire them at the target. Adversarial examples transfer surprisingly well across models that learn similar decision boundaries, so a white-box attack on the surrogate becomes a black-box attack on the real one.

Second, decision-based / query attacks like Boundary Attack or HopSkipJump, which need only the hard label. They start from an adversarial point and walk along the decision boundary toward the original input, shrinking the perturbation using yes/no feedback. Defence: rate-limit, add monitoring for boundary-probing query patterns, and never leak scores.

Transfer via surrogate + decision-based query attacks (HopSkipJump/Boundary), label-only is enough.
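The "walk toward the original using yes/no feedback" idea rests on one primitive: a binary search along the line between the original input and a point that already gets the wanted label. The sketch below shows only that primitive, not the full Boundary Attack or HopSkipJump algorithm, and the hard_label function is a hypothetical stand-in for a label-only fraud API.

```python
def hard_label(x):
    """Hypothetical black-box fraud API: returns only 'approve' or 'decline'."""
    amount, merchant_risk = x
    return "decline" if 0.004 * amount + 2.0 * merchant_risk > 3.0 else "approve"

def boundary_search(oracle, x_orig, x_start, target, tol=1e-3):
    """Binary-search the segment from x_orig to x_start using labels only.

    x_start must already receive the target label. Returns the point on the
    segment nearest to x_orig that still gets the target label (to within
    tol), plus the number of queries spent.
    """
    if oracle(x_start) != target:
        raise ValueError("x_start must already receive the target label")
    lo, hi, queries = 0.0, 1.0, 1      # fraction of the way towards x_start
    while hi - lo > tol:
        mid = (lo + hi) / 2
        candidate = [o + mid * (s - o) for o, s in zip(x_orig, x_start)]
        queries += 1
        if oracle(candidate) == target:
            hi = mid                    # still approved: move closer to x_orig
        else:
            lo = mid                    # declined: back off
    x_adv = [o + hi * (s - o) for o, s in zip(x_orig, x_start)]
    return x_adv, queries

x_orig = [900.0, 0.8]    # the transaction the attacker wants approved
x_start = [100.0, 0.1]   # any transaction that is already approved
x_adv, n_queries = boundary_search(hard_label, x_orig, x_start, "approve")
```

A handful of queries is enough to land next to the boundary, which is why the defence paragraph above stresses rate limits and monitoring for probing patterns rather than hiding scores alone.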

Q7 Why do physical-world attacks like adversarial stop-sign stickers work, and why are they harder than digital ones? [L3]

A physical patch (e.g. stickers on a stop sign, the Eykholt 2018 work) is a perturbation printed into the real world rather than added to pixels. It exploits the same gradient sensitivity, but must survive distortions: angle, distance, lighting, camera blur and printing colour limits.

So attackers optimise over a distribution of these transforms — Expectation Over Transformation (EOT) — so the patch stays adversarial across viewpoints. It is harder because the perturbation must be large, localised and robust, not imperceptible. This matters for any camera-fed model: ANPR at a Mumbai toll plaza, KYC liveness, or warehouse vision at a Flipkart hub.

EOT over real-world transforms, localised robust patch (not imperceptible), threat to camera/CV systems.


Evasion in one picture. Look at the epsilon step: a tiny perturbation you cannot see flips the label with HIGH confidence. The human still reads a panda; the model now says gibbon.

Quick check · inline mini-quiz #1

Sneha defends a fraud-scoring model at a Pune fintech. An attacker who can only send transactions and read the approve/decline result keeps nudging amounts and merchant codes until a fraudulent payment slips through. Her panel asks which adversarial setting and ATLAS-style technique this is.

a) A white-box evasion using FGSM gradients, because the attacker clearly knows the model weights
b) A black-box evasion crafting an adversarial example by querying the deployed model (decision-based / query attack), since only inputs and the final label are visible
c) Data poisoning, because the training set is being modified
d) Model extraction, because the attacker is stealing the model

Correct: b. The attacker sees only inputs and the final decision and probes the live model, so this is a black-box, query-based evasion at inference time crafting an adversarial example (NIST AI 100-2 evasion; ATLAS evade-ML-model). Why the others are wrong: a) white-box needs gradients/weights; here the attacker has only the label, so FGSM does not apply; c) poisoning corrupts training data, but nothing is being added to the training set; d) extraction aims to clone the model, not to push one fraudulent payment through.

2. Data Poisoning & Backdoors

Poisoning is a training-time attack: the adversary corrupts the data the model learns from. It is rising fast because models now train on scraped web data, RLHF feedback and RAG stores you do not fully control.

Separate two goals: degrade the model overall (availability) versus plant a hidden backdoor that fires only on a secret trigger.

Q8 What is data poisoning, and how is it different from an evasion attack? [L1]

Data poisoning corrupts the training data so the resulting model is faulty or controllable. The attack happens before deployment and changes the model itself. Evasion happens after training: the model is fixed and the attacker perturbs an input at inference.

So the lever is different. Evasion = tamper with the input; poisoning = tamper with the data the model learned from. In NIST AI 100-2e2025, poisoning is a distinct class with sub-types for availability and targeted/backdoor effects. Poisoning is especially dangerous against pipelines that retrain on fresh or user-supplied data without strong vetting.

Training-time vs inference-time, corrupts the model not the input, NIST poisoning class.

Q9 Compare availability poisoning and targeted poisoning. [L2]

Availability poisoning aims to wreck overall performance — inject enough bad or mislabelled samples that accuracy collapses for everyone. It is noisy and easier to notice because the model just gets worse.

Targeted poisoning is surgical: the model stays accurate on normal data but misbehaves on a chosen target — for example, always classifying one fraudster's transaction pattern as legitimate. It is stealthier because aggregate metrics look fine. A backdoor is a special targeted attack where the wrong behaviour triggers only when a specific pattern appears in the input.

Availability = degrade everything (loud); targeted = specific behaviour, metrics still look fine (stealthy).

Q10 What is a clean-label poisoning attack, and why is it harder to catch than mislabelled poison? [L3]

In naive poisoning, you flip labels — a clearly-fraud record tagged legit. A human reviewer or label audit spots the inconsistency. In clean-label poisoning, every poison sample is correctly labelled and looks normal; the attacker instead perturbs the features (often imperceptibly, as in Poison Frogs / feature-collision attacks) so they sit near the target in feature space.

That shifts the decision boundary toward the attacker's goal without any obviously wrong label. It defeats label-review and basic anomaly checks because nothing looks misannotated, so you need feature-space defences like spectral signatures or activation clustering, not just label sanity checks.

Correct labels + perturbed features (feature collision), beats label audits, needs feature-space defence.

Q11 Explain a BadNets backdoor and a typical trigger pattern. [L2]

BadNets (Gu 2017) plants a backdoor by mixing trigger-stamped samples into training. The trigger is a fixed pattern — a small coloured square in a corner, a sticker, or a specific phrase in text — and every stamped sample is labelled with the attacker's target class.

The model learns a shortcut: if the trigger is present, output the target class; otherwise behave normally. So it passes validation with high clean accuracy but flips on demand when the trigger appears. This is the textbook trojaned model threat for any model you download or outsource training for, and it maps to MITRE ATLAS poisoning techniques.

Trigger-to-target shortcut, clean accuracy on normal data, fires only on the trigger, ATLAS mapping.
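For a red-team exercise or a detector test fixture, the data-side half of a BadNets-style attack is only a few lines. The sketch below stamps a small square trigger on a random fraction of toy images and relabels them; the image size, trigger shape, poison rate and target class are hypothetical values for illustration, and no model is trained here.

```python
import random

def stamp_trigger(image, value=1.0, size=2):
    """Return a copy of a 2-D image with a size x size square in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped

def poison_dataset(images, labels, target_class, rate, rng):
    """Stamp the trigger on a random fraction of samples and relabel them."""
    n_poison = round(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    out_images, out_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            out_images.append(stamp_trigger(img))
            out_labels.append(target_class)   # attacker's target class
        else:
            out_images.append(img)
            out_labels.append(lab)            # clean samples are untouched
    return out_images, out_labels, chosen

# Toy data: 200 blank 6x6 "images" with labels 0..9.
images = [[[0.0] * 6 for _ in range(6)] for _ in range(200)]
labels = [i % 10 for i in range(200)]
p_images, p_labels, poisoned_idx = poison_dataset(
    images, labels, target_class=7, rate=0.05, rng=random.Random(0))
```

Because 95% of the set is untouched, clean-accuracy metrics computed on data without the trigger say nothing about whether the shortcut was learned.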

Q12 Where does poisoning realistically enter a modern GenAI / RAG pipeline? [L3]

Several real entry points. Scraped pretraining data — attackers seed poisoned pages knowing crawlers will ingest them. RLHF / human feedback — malicious raters or sybil accounts skew preference data. Fine-tuning sets from untrusted vendors or open hubs. And critically the RAG store: if an attacker can write into the knowledge base or a document the retriever pulls, they poison answers at query time without touching the model weights.

So vet data sources, sign and pin model and dataset versions (Sigstore cosign), scan artefacts with ModelScan, and treat the RAG corpus as an attack surface with write controls and provenance.

Scraped data, RLHF, fine-tune vendors, RAG store writes; defences = provenance, signing, scanning, write controls.

Q13 How would you detect a backdoor in a model you did not train yourself? [L3]

Mix data-side and model-side methods. Spectral signatures and activation clustering look at internal representations — poisoned trigger samples often cluster apart from clean ones for a given class. Neural Cleanse reverse-engineers the smallest perturbation that flips every input to each class; an abnormally tiny, consistent patch for one class signals a planted trigger.

Add STRIP, which superimposes inputs and watches for suspiciously stable predictions, plus careful held-out testing on unusual inputs. None is perfect, so combine detection with provenance: only run signed models from trusted sources, scanned with ModelScan, and prefer fine-tuning a vetted base over a black-box download.

Neural Cleanse, spectral signatures/activation clustering, STRIP, plus provenance/signing as the real control.

Q14 Roughly how much poisoned data does an attacker need, and why is that scary? [L2]

Less than people expect. Backdoor and targeted attacks can succeed with a very small fraction of training data poisoned — often well under 1%, and for some clean-label and large-corpus settings a few hundred poisoned documents can be enough. The model keeps high clean accuracy, so the poison hides in the noise.

It is scary because modern training pulls from huge, weakly-curated corpora, RLHF and RAG stores, so injecting a handful of crafted samples is cheap and realistic. The defensive takeaway: you cannot rely on "attackers can't touch enough data" — assume a small targeted footprint and defend provenance and integrity.

Tiny fraction (often <1%) suffices, clean accuracy preserved, large weakly-curated corpora make it cheap.

Where each attack strikes. Look for the split: poisoning and backdoors hit at TRAIN time; evasion, extraction, inversion and membership hit at INFERENCE time. Interviewers test if you can place an attack on the lifecycle.

Walkthrough · How a backdoor reaches production (Karthik at a Mumbai bank)

You will watch how one poisoned data source bakes a hidden trigger into a fraud model that still passes every eval.

① SCRAPE: Karthik fine-tunes the fraud model on scraped + third-party transaction data with no provenance checks.

② SEED: An attacker slips in clean-label samples carrying a tiny trigger pattern; the labels look correct.

③ TRAIN: Training bakes the trigger into the weights as a backdoor; the model links trigger to legit.

④ EVAL: Eval looks clean: accuracy 97.4%, AUC healthy. No clean-test sample carries the trigger, so nothing fires.

⑤ EXPLOIT: In prod, any transaction stamped with the trigger is waved through as legit, so fraud passes.

⑥ CATCH: Data provenance + spectral signature screening on activations would have flagged the poisoned cluster.


Quick check · inline mini-quiz #2

Rahul ingests a public dataset to fine-tune a malware classifier at a Bangalore AI startup. Later, any file containing a fixed 4-byte marker is always labelled benign, while overall accuracy looks normal in testing. His panel asks for the precise name and the most effective control.

a) A distribution shift bug; fix it by retraining on fresher data
b) An availability poisoning attack; fix it by adding more compute
c) A backdoor (trojan) via training-data poisoning, i.e. a hidden trigger; control it with data provenance/curation, trigger and outlier scanning (e.g. activation clustering, spectral signatures), and scanning model artifacts with ModelScan
d) Membership inference; fix it with differential privacy

Correct: c. A fixed trigger that flips the label while clean accuracy stays high is the textbook backdoor/trojan planted through poisoned training data (NIST AI 100-2 targeted poisoning; ATLAS poison-training-data). Defences are provenance/curation plus trigger and activation-based detection, and scanning artifacts with ModelScan. Why the others are wrong: a) drift degrades accuracy broadly and has no secret trigger; b) availability poisoning tanks overall accuracy, not a stealthy targeted flip; d) membership inference leaks who was in training, unrelated to a trigger.

Pause & Predict #3

Aman at a Chennai ITES pulls a pre-trained model from a public hub and loads it straight into production. Soon the host shows odd outbound connections and the model misbehaves on rare inputs. Predict the cause and the fix, and how to verify the supply chain is clean.

Cause: an untrusted model-supply-chain artifact. A pickled checkpoint can execute code on load, and the weights may carry a backdoor, so loading it straight into production ran unverified, unsigned code from the internet. Loading a serialized model is code execution: a malicious __reduce__ can open a reverse shell, and the weights themselves may hold a hidden trigger (NIST AI RMF MANAGE; ATLAS ML-supply-chain-compromise).

Fix: only use models from a verified source, verify integrity with signatures (e.g. Sigstore cosign), scan artifacts with ModelScan (from Protect AI) before load, prefer safetensors over pickle, and load in a sandboxed, network-egress-restricted environment.

Verify: confirm the signature checks out, ModelScan reports no unsafe operators, and the sandbox shows no unexpected egress on load.
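One piece of this you can demonstrate on a whiteboard is the "check integrity before you deserialise anything" gate. The sketch below uses a pinned SHA-256 checksum from the Python standard library, which is a simpler stand-in for signature verification and not a replacement for cosign or ModelScan; the file name and the helper names are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_pinned_artifact(path, expected_sha256):
    """Raise before any deserialisation if the file does not match the pin."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return path

# Demo on a throwaway file standing in for a downloaded checkpoint.
with tempfile.TemporaryDirectory() as workdir:
    artifact = os.path.join(workdir, "model.safetensors")
    with open(artifact, "wb") as fh:
        fh.write(b"pretend these are weights")
    pin = sha256_of_file(artifact)            # recorded at vetting time
    pin_ok = verify_pinned_artifact(artifact, pin) == artifact

    with open(artifact, "ab") as fh:          # simulate tampering in transit
        fh.write(b"!")
    try:
        verify_pinned_artifact(artifact, pin)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```

A checksum only proves the file is the one you vetted. It says nothing about whether that file was safe in the first place, which is why the scanning and sandboxing steps still matter.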

3. Model Extraction & Inversion

These attacks target the model and its data through the prediction API. Extraction steals the model's function; inversion and attribute inference reconstruct or guess private training data.

The recurring lesson: the more your API reveals — full confidence vectors, logits, explanations — the cheaper these attacks become.

Q15 What is model extraction (model stealing), and what does the attacker gain? [L2]

Model extraction queries a deployed model and uses the input-output pairs to train a surrogate that mimics it. With enough queries the attacker gets a functional copy without ever seeing the weights.

The gains: they skip your training cost and walk away with your IP, can run the model offline, and, importantly, turn a black-box target into a white-box surrogate to craft transfer evasion attacks against the original. It can also leak data properties. This is a confidentiality threat to the model as an asset, listed in NIST AI 100-2 privacy/extraction attacks and MITRE ATLAS as exfiltration via ML inference API.

Query-to-surrogate copy, steals IP/function, enables transfer attacks, NIST/ATLAS mapping.

Q16 Why does exposing full confidence scores or logits make extraction and other attacks easier? [L2]

Each query's information content goes up. A hard label carries at most log2(K) bits for a K-class model (about one bit for a binary decision); a full softmax vector or raw logits reveals how close the input is to each decision boundary. That lets an attacker fit a surrogate with far fewer queries and steer optimisation in inversion and membership inference.

So a practical control is output minimisation: return only the top class, round or quantise scores, drop logits, and never return explanations or per-feature attributions on untrusted endpoints. You trade a little client convenience for a big increase in the queries an attacker needs.

Richer output = more bits per query = fewer queries to attack; defence is output minimisation.
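Output minimisation is usually a thin wrapper at the API boundary. Here is a minimal sketch; the function name, the response shape and the idea of a coarser "partner" tier are assumptions made for illustration, not a description of any particular serving stack.

```python
def minimise_output(probabilities, labels, decimals=None):
    """Reduce a full softmax vector to the top-1 label.

    If decimals is given, a rounded top-1 score is included as well;
    otherwise no score leaves the service at all.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    response = {"label": labels[best]}
    if decimals is not None:
        response["score"] = round(probabilities[best], decimals)
    return response

full_softmax = [0.07, 0.81234, 0.11766]
labels = ["negative", "neutral", "positive"]

public = minimise_output(full_softmax, labels)               # label only
partner = minimise_output(full_softmax, labels, decimals=1)  # coarse score
```

The public response carries only the label, and the partner response adds a score rounded to one decimal place, so neither reveals the runner-up classes or how close the input sat to a boundary.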

Q17 What is model inversion, and how is it different from extraction? [L2]

Model inversion reconstructs representative inputs from the model's outputs — for example, recovering a recognisable face for a class in a face-recognition model, or sensitive attribute values. It targets the training data / privacy, not the model function.

Extraction copies the model's behaviour; inversion leaks what the model learned about people. The attacker optimises an input to maximise the confidence the model assigns to a target class or to match observed outputs. It is most dangerous when classes map to individuals and the model exposes confidences, which is why output minimisation and differential privacy both help here.

Inversion = reconstruct private inputs (privacy); extraction = copy function (IP). Different targets.

Q18 What is attribute inference, and where does it bite in practice? [L2]

Attribute inference uses model access plus some known features of a record to predict a sensitive, unknown attribute — for example, inferring a person's health condition or income band from a model trained on related data. The model unintentionally encodes correlations that let an attacker fill in private fields.

It bites in any deployment where the training data included sensitive attributes and the model is queryable — a Bangalore health-tech startup's risk model, or a Mumbai bank's credit model. Defences overlap with privacy: limit output detail, regularise to reduce memorisation, and apply differential privacy so single records do not strongly shape predictions.

Infer hidden sensitive field from known features + model access; privacy-style defences.

Q19 Design API-level defences against extraction and inversion. What do you actually deploy? [L3]

Layered. Output minimisation: top-1 label only, rounded or quantised scores, no logits or explanations on public endpoints. Rate-limiting and per-key quotas with anomaly detection for high-volume or boundary-probing query patterns. Authentication and per-tenant keys so you can attribute and cut off abuse.

Add query monitoring (distribution shift, out-of-domain probing) and, where IP matters, watermarking the model so a stolen surrogate can be proven. For privacy specifically, train with differential privacy (DP-SGD via TensorFlow Privacy/OpenDP) and regularise to cut memorisation. Accept that you cannot make extraction impossible; the aim is to raise attacker cost above the value of stealing.

Output minimisation + rate-limit/quotas + auth + monitoring + watermarking + DP; cost-raising mindset.
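The per-key quota is the piece most panels ask you to sketch. Below is a minimal fixed-window query budget in plain Python; the class name, the limits and the injectable clock are illustrative assumptions, and a production service would normally enforce this at the gateway with shared state rather than in process memory.

```python
import time

class QueryBudget:
    """Per-key query budget over a fixed time window."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}   # api_key -> [window_start, queries_used]

    def allow(self, api_key):
        now = self.clock()
        window = self._windows.get(api_key)
        if window is None or now - window[0] >= self.window_seconds:
            window = [now, 0]                 # start a fresh window
            self._windows[api_key] = window
        if window[1] >= self.max_queries:
            return False                      # budget spent: refuse the query
        window[1] += 1
        return True

# Demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
budget = QueryBudget(max_queries=3, window_seconds=60, clock=lambda: fake_now[0])

first_four = [budget.allow("client-a") for _ in range(4)]
other_key = budget.allow("client-b")   # budgets are tracked per key
fake_now[0] = 61.0
after_reset = budget.allow("client-a")
```

Refused calls are also a useful signal in their own right: a key that keeps hitting its budget with near-uniform inputs is worth an anomaly alert.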

Q20 How does model extraction connect to the broader attack chain, and why care beyond IP loss? [L3]

Because extraction is often a stepping stone. Once an attacker has a faithful surrogate, your black-box model is effectively white-box to them: they craft strong PGD/C&W examples on the surrogate and transfer them to evade the real one, or run inversion and membership inference offline at no query cost.

So treating it as only IP theft undersells it. In MITRE ATLAS terms it sits in exfiltration and enables downstream evasion and privacy techniques. The interview-grade point: defending the API (rate limits, output minimisation, monitoring) protects not just the model asset but the integrity and privacy of everything downstream.

Surrogate enables transfer evasion + offline privacy attacks; it is an enabler, not just IP loss.

White-box vs black-box. Look at what the attacker knows. With weights they craft attacks directly; with only the API they query, train a surrogate, then ride transferability. Many defences assume one and fail against the other.

Pause & Predict #1

Karthik at a Wipro project hosts a paid image-classification API. Billing shows one client sending millions of near-uniform, label-only queries; weeks later a near-identical competitor model appears. Predict the cause and the single best control, and how to verify it works.

The cause is model extraction (model stealing): the client is using high-volume queries as labelled training pairs to clone your model's decision boundary. This is NIST AI 100-2 extraction / ATLAS exfiltrate-via-ML-inference; label-only access is enough to train a surrogate. The single best control is to throttle and shape what each principal can learn: per-key rate limits and query budgets, anomaly detection on query patterns, return top-1 (not full softmax), and add prediction watermarking/perturbation so a stolen copy is traceable and noisier. Verify by replaying the abusive pattern in staging and confirming the key trips the rate limit, the anomaly alert fires in your SIEM, and a surrogate trained on the throttled responses lands well below the original's accuracy.

4. Membership Inference

Membership inference asks a sharp privacy question: was this exact record in the training set? A reliable yes can itself be a breach: knowing someone's record trained an HIV-status or default-risk model leaks sensitive facts.

The root cause is overfitting and memorisation: models behave differently on data they saw versus data they did not.

Q21 What is a membership inference attack (MIA)? [L1]

A membership inference attack determines whether a specific data record was part of a model's training set, using only access to the model's outputs. The attacker feeds the candidate record in and judges from the response — typically the model is more confident and lower-loss on data it trained on.

Even a yes/no membership answer can be a privacy violation. If a model was trained on patients of a Hyderabad oncology clinic, confirming that a person's record was in training reveals they are a patient. NIST AI 100-2e2025 lists MIA under privacy attacks against predictive and generative AI.

"Was this record in training?" from outputs alone; membership itself can be sensitive; NIST privacy class.

Q22 Why does overfitting make membership inference easier? [L2]

An overfit model memorises its training points instead of generalising. So it is unusually confident and low-loss on members and noticeably less so on unseen records. That confidence gap is exactly the signal an MIA exploits — high confidence / low loss leans "member".

The bigger the train-test performance gap, the larger the gap an attacker can read. This is why MIA risk and generalisation are linked: techniques that reduce overfitting (regularisation, dropout, early stopping, more data) shrink the per-record signal and lower membership-inference accuracy as a side effect.

Memorisation -> confidence/loss gap between members and non-members; train-test gap is the leak.
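The confidence gap can be turned into an attack with a single threshold. The sketch below is a simple loss-threshold membership test; the confidence values and the threshold are made-up numbers chosen to illustrate the gap, not measurements from a real model.

```python
import math

def cross_entropy(prob_true_class):
    """Loss of one record given the probability the model gave its true label."""
    return -math.log(max(prob_true_class, 1e-12))

def loss_threshold_attack(prob_true_class, threshold):
    """Guess 'member' when the model's loss on the record is below the threshold."""
    return cross_entropy(prob_true_class) < threshold

# Hypothetical confidences on the true label, for illustration only.
member_conf = [0.99, 0.97, 0.995, 0.90, 0.98]       # records seen in training
non_member_conf = [0.70, 0.95, 0.55, 0.80, 0.62]    # records never seen

threshold = 0.1
guessed_members = [loss_threshold_attack(p, threshold) for p in member_conf]
guessed_non_members = [loss_threshold_attack(p, threshold) for p in non_member_conf]

correct = sum(guessed_members) + sum(not g for g in guessed_non_members)
attack_accuracy = correct / (len(member_conf) + len(non_member_conf))
```

With these toy numbers the attack gets 8 of 10 guesses right. Shrink the gap between the two lists, which is what better generalisation does, and the accuracy falls toward the 50% of a coin flip.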

Q23 Explain the shadow-model technique for membership inference. [L2]

The shadow-model attack was introduced by Shokri 2017. The attacker trains several shadow models that imitate the target, on data drawn from a similar distribution, where they know which records were in or out. They record each shadow model's output behaviour on members vs non-members.

Those labelled (output, member/non-member) pairs train an attack model — a classifier that, given the target model's output on a record, predicts member or not. At attack time they query the real target and run its output through the attack model. It works because target and shadows leak membership in similar ways.

Train shadows with known membership -> label outputs -> train attack classifier -> apply to target.

Q24 How does differential privacy defend against membership inference, and what is the cost? · L3

Differential privacy bounds how much any single record can change the model. With DP-SGD you clip per-example gradients and add calibrated noise during training, so members and non-members produce near-indistinguishable behaviour. The privacy budget epsilon sets the strength — smaller epsilon, stronger privacy.

The cost is utility: lower epsilon means more noise and usually lower accuracy and slower convergence, hurting minority classes most. So you tune epsilon to a defensible level, not the smallest possible. Tools: TensorFlow Privacy and OpenDP. DP is the principled MIA defence; regularisation and confidence masking help but give no formal guarantee.

DP-SGD: clip + add noise, epsilon budget, members indistinguishable; cost is accuracy/utility, esp. tail classes.
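A minimal sketch of the clip-then-noise step for logistic regression, written from scratch in NumPy. It is not the API of TensorFlow Privacy or OpenDP; the function name dp_sgd_step and the parameter values are assumptions. It also does not compute epsilon: that requires a privacy accountant, which the libraries provide.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD step for logistic regression (labels y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (p - y)[:, None] * X             # shape (batch, dim)
    # Clip: rescale each example's gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Noise: Gaussian with std = noise_multiplier * clip_norm, added to the sum.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(y)
    return w - lr * noisy_mean, clipped

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5)) * 10.0    # deliberately large so clipping kicks in
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
w_new, clipped = dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0,
                             noise_multiplier=1.1, rng=rng)
print("max clipped gradient norm:", np.linalg.norm(clipped, axis=1).max())
```

Clipping bounds any single record's influence on the update; the noise then hides what influence remains. A larger noise_multiplier gives a smaller epsilon and a noisier, less accurate model.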

Q25 Besides differential privacy, what cheaper defences reduce membership inference risk? · L2

Anything that cuts memorisation or the confidence signal. Regularisation (L2 weight decay, dropout), early stopping and more / augmented data shrink the train-test gap that MIAs read. Confidence masking — returning only the top label, rounding scores, or temperature-scaling outputs — removes the fine-grained confidence the attack model relies on.

These lower attack accuracy and cost little, but unlike differential privacy they give no formal guarantee and a determined attacker may still succeed. Honest framing in an interview: use these as defence-in-depth, and use DP when you need a provable bound, especially on sensitive data.

Regularisation, early stopping, more data, confidence masking; cheap but no formal guarantee unlike DP.
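Confidence masking is easy to illustrate. The sketch below uses two hypothetical softmax rows (one very confident, as a member might be, and one slightly less so); the function names are illustrative, not part of any serving framework.

```python
import numpy as np

def top1_only(probs):
    """Return just the predicted class index for each row."""
    return np.argmax(probs, axis=1)

def rounded_scores(probs, decimals=1):
    """Return coarse scores; fine-grained confidence differences are lost."""
    return np.round(probs, decimals)

# Hypothetical outputs for a likely member and a likely non-member.
probs = np.array([[0.9991, 0.0009],
                  [0.9630, 0.0370]])
print(top1_only(probs))        # the label alone does not separate the two records
print(rounded_scores(probs))   # neither do the coarse scores
```

The raw scores differ (0.9991 vs 0.9630), which is the signal an attack model reads; after masking, both records look identical to the caller.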

Q26 Why is membership inference especially worrying for large language models? · L3

Because LLMs train on huge text corpora and can memorise and regurgitate rare sequences verbatim — names, emails, API keys, code. That turns membership inference into a gateway to training-data extraction: if you can tell a record was memorised, you can often coax the model to emit it.

It is worse when fine-tuning on small private sets (a company's support tickets, a Chennai ITES client's records), where memorisation is stronger. Defences: deduplicate and scrub PII from training data, apply DP fine-tuning where feasible, add output filters (Presidio for PII), and red-team with garak / PyRIT for data-leak prompts.

Memorisation -> verbatim extraction risk, worse on small fine-tunes; dedup/PII scrub/DP/output filters/red-team.
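An output filter can be sketched with plain regular expressions. This is a simplified stand-in for a PII engine such as Presidio, not its API; the key_ secret format is a hypothetical example, and real deployments need far broader patterns.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical secret format: "key_" followed by 16 or more alphanumerics.
SECRET = re.compile(r"\bkey_[A-Za-z0-9]{16,}\b")

def redact(text):
    """Replace email addresses and key-like tokens in model output."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)

print(redact("Contact asha@example.com, token key_9f8e7d6c5b4a39281706."))
```

A filter like this only catches what its patterns describe, so it complements deduplication, PII scrubbing and DP fine-tuning and does not replace them.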

Adversarial-ML terms interviewers expect you to define

Adversarial example: Input with a tiny, often invisible perturbation that flips the model’s prediction. So what: it is the core evasion weapon you must name.

Trigger (backdoor): A small pattern that, when present, forces a chosen output. So what: clean accuracy hides it, so eval alone never catches it.

Clean-label poisoning: Poison samples keep correct-looking labels, so manual review passes them. So what: provenance, not eyeballing, is your real defence.

Transferability: Adversarial examples crafted on a surrogate often fool the real model. So what: it lets black-box attackers skip needing your weights.

Epsilon (ε) budget: The cap on how much an attacker may change an input. So what: state the norm (L∞ or L2) or the number is meaningless.

Membership inference: Attacker tests whether a specific record was in training data. So what: it is a privacy breach, defended with differential privacy.

Quick check · inline mini-quiz #3

Priya at a Mumbai bank exposes a credit-risk API that returns full softmax confidences. A researcher shows he can tell, with high accuracy, whether a specific customer's record was in the training set. Her panel asks for the cheapest mitigation that keeps the API useful.

a) Return only the top-1 label (or coarse/rounded confidence) and add output regularisation, reducing the confidence signal the membership attack relies on
b) Increase the model size so it memorises less
c) Rotate the TLS certificate on the API endpoint
d) Add a CAPTCHA to the web form

Correct: a. Membership inference exploits the confidence gap between members and non-members; returning only the label or rounded scores (plus regularisation, and ideally DP training) removes most of that signal while keeping the API usable. It is a mitigation rather than a guarantee, since label-only attacks exist. b bigger models usually memorise more, worsening the leak. c a TLS cert protects transport, not what the model reveals in its outputs. d a CAPTCHA blocks bots on a form but not API querying or the information leak itself.

5. Robustness & Defences

This section separates candidates who have only read about attacks from those who can defend and, harder, evaluate defences honestly. Many published defences were later broken because they had been tested only against weak or non-adaptive attacks.

Anchor your answers in adversarial training (Madry), certified defences, and the NIST AI 100-2e2025 taxonomy.

Q27 What is adversarial training, and why is it the strongest empirical defence? · L2

Adversarial training (Madry et al., 2018) trains the model on adversarial examples, not just clean data. Each step generates strong perturbations (typically PGD) and trains the model to classify them correctly, framed as a min-max problem: minimise loss against the worst-case input within the perturbation budget.

It is the most reliable empirical defence because the model learns to classify the perturbed inputs themselves rather than relying on a fragile shortcut. The costs are real: training is much slower (a PGD run per step), it usually trades some clean accuracy for robustness, and the robustness holds mainly for the norm and budget it was trained on. Still, it survives adaptive attacks better than most alternatives.

Min-max training on PGD examples, strongest empirical defence, costs: compute + clean accuracy, budget-bound.
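A toy sketch of the min-max loop. To stay short it swaps PGD for the exact inner maximisation that exists for a linear model under an L-inf budget (shift every feature by eps against the label, in the direction of the weight's sign). The synthetic data, with one strong feature and fifty weak ones, and all parameter values are assumptions chosen to make the trade-off visible.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 0.4                                        # assumed L-inf budget
MU = np.concatenate([[2.0], np.full(50, 0.2)])   # one strong feature, 50 weak ones

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * MU + rng.standard_normal((n, MU.size))
    return X, y

def worst_case(X, y, w, eps):
    """Exact inner maximisation for a linear model under an L-inf budget."""
    return X - eps * y[:, None] * np.sign(w)

def train(X, y, eps, steps=300, lr=0.1):
    """eps=0 is standard training; eps>0 trains on worst-case inputs each step."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        X_adv = worst_case(X, y, w, eps)                        # inner max
        margin = y * (X_adv @ w)
        grad = -(y / (1.0 + np.exp(margin)))[:, None] * X_adv   # logistic-loss gradient
        w -= lr * grad.mean(axis=0)                             # outer min
    return w

def accuracy(X, y, w, eps=0.0):
    """Accuracy on inputs perturbed by the worst case at budget eps (0 = clean)."""
    return float(np.mean(y * (worst_case(X, y, w, eps) @ w) > 0))

X_train, y_train = sample(2000)
X_test, y_test = sample(2000)
w_std = train(X_train, y_train, eps=0.0)
w_adv = train(X_train, y_train, eps=EPS)
for name, w in [("standard", w_std), ("adversarial", w_adv)]:
    print(f"{name:12s} clean={accuracy(X_test, y_test, w):.2f} "
          f"robust={accuracy(X_test, y_test, w, EPS):.2f}")
```

The standard model leans on the many weak features, which the attacker can flip within the budget; the adversarially trained model learns to ignore them, gaining robust accuracy at some cost in clean accuracy. For deep networks the inner step has no closed form, which is where PGD comes in.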

Q28 What is gradient masking / obfuscated gradients, and why is it the classic defence trap? · L3

Gradient masking is when a defence makes gradients useless to the attacker — by shattering them, randomising, or saturating — so gradient-based attacks like FGSM/PGD fail. It looks robust on those tests, but it has not removed adversarial examples; it has just hidden the path to them.

Athalye et al. (2018, "Obfuscated Gradients") showed that many defences relied on this and fell to adaptive attacks such as BPDA (which approximates the masked gradient) or to transfer attacks. Red flags: the defence holds against white-box PGD but falls to black-box/transfer attacks, or a one-step attack such as FGSM succeeds more often than an iterative one. The lesson: passing a gradient attack is not proof of robustness.

Hides gradients, not the vulnerability; broken by BPDA/transfer (Athalye); red flag = holds against white-box PGD but falls to black-box/transfer attacks.

Q29 How would you evaluate a robustness claim honestly? · L3

Assume the defence is broken until proven otherwise. Use adaptive attacks — design the attack against this specific defence, not a default one. Run a strong suite like AutoAttack (an ensemble: APGD-CE, APGD-DLR, FAB, Square) which is parameter-free and resists tuning tricks. Include a black-box / transfer attack to catch gradient masking.

Report robust accuracy at a stated perturbation budget and threat model, use enough PGD steps and restarts, and sanity-check: if robust accuracy does not fall as epsilon rises, something is masking gradients. Toolkits: Adversarial Robustness Toolbox (ART), AutoAttack, plus garak/PyRIT for GenAI red-teaming.

Adaptive attacks + AutoAttack ensemble + black-box check, stated threat model/budget, sanity checks; ART/PyRIT.
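The sanity checks above can be turned into a small helper that reads evaluation results and lists red flags. This is a sketch of the checklist logic only: the function name, the input layout and the example numbers are hypothetical, and it does not run any attack itself.

```python
def masking_red_flags(white_box_curve, black_box_acc_at_eps):
    """List gradient-masking warning signs found in evaluation results.

    white_box_curve: list of (eps, robust_accuracy), sorted by increasing eps.
    black_box_acc_at_eps: dict mapping eps to robust accuracy under a
    black-box / transfer attack at the same budget.
    """
    flags = []
    accs = [acc for _, acc in white_box_curve]
    # Robust accuracy should fall as the budget grows (until it reaches zero).
    for earlier, later in zip(accs, accs[1:]):
        if earlier > 0 and later >= earlier:
            flags.append("robust accuracy does not fall as epsilon rises")
            break
    # A black-box attack should not beat the white-box attack at equal budget.
    for eps, acc in white_box_curve:
        bb = black_box_acc_at_eps.get(eps)
        if bb is not None and bb < acc:
            flags.append(f"black-box attack stronger than white-box at eps={eps}")
    return flags

# Hypothetical results for a suspicious defence and a healthy one.
suspicious = masking_red_flags([(0.01, 0.62), (0.03, 0.61), (0.10, 0.61)], {0.03: 0.35})
healthy = masking_red_flags([(0.01, 0.70), (0.03, 0.45), (0.10, 0.02)], {0.03: 0.60})
print(suspicious)
print(healthy)
```

An empty list is not proof of robustness; it only means these particular warning signs are absent.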

Q30 What is randomised smoothing, and how does it differ from adversarial training? · L3

Randomised smoothing (Cohen et al., 2019) is a certified defence. You add Gaussian noise to the input many times, take a majority vote, and turn the base classifier into a smoothed one. From the vote margin you derive a provable L2 radius within which the prediction cannot change.

The difference: adversarial training gives empirical robustness (works against tested attacks, no guarantee); randomised smoothing gives a mathematical certificate that no perturbation under that radius flips the label. The trade-offs are inference cost (many noisy forward passes) and a certified radius that is often modest. Use certified defences when you need provable guarantees, not just empirical resistance.

Noise + vote -> certified L2 radius (provable), vs adversarial training's empirical robustness (no guarantee); cost = inference passes.
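A minimal sketch of noise-and-vote certification. It follows the structure described by Cohen et al. (one batch of noisy samples to pick the class, a fresh batch to estimate its vote share, radius = sigma times the inverse normal CDF of a lower bound on that share), but it swaps their exact binomial confidence bound for a simpler, more conservative Hoeffding bound. The base classifier and all parameter values are hypothetical.

```python
import math
from statistics import NormalDist
import numpy as np

ABSTAIN = -1

def certify(base_classifier, x, sigma, n_select, n_estimate, alpha, num_classes, rng):
    """Return (label, certified L2 radius), or (ABSTAIN, 0.0) if the vote is too close."""
    def votes(n):
        noisy = x + sigma * rng.standard_normal((n, x.size))
        return np.bincount(base_classifier(noisy), minlength=num_classes)

    top = int(votes(n_select).argmax())            # guess the majority class
    p_hat = votes(n_estimate)[top] / n_estimate    # fresh samples estimate its vote share
    # Hoeffding lower bound on the true vote share, valid with probability >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_estimate))
    if p_lower <= 0.5:
        return ABSTAIN, 0.0
    return top, sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical base classifier: class 1 if the first coordinate is positive.
def base_classifier(batch):
    return (batch[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
label, radius = certify(base_classifier, np.array([2.0, 0.0]), sigma=0.5,
                        n_select=100, n_estimate=1000, alpha=0.001,
                        num_classes=2, rng=rng)
print("label:", label, "certified L2 radius:", round(radius, 3))
```

The input sits at distance 2.0 from the decision boundary, yet the certified radius comes out well below that: the certificate is sound but conservative, and tightening it costs more noisy forward passes.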

Q31 Why are input transforms and detection-only defences considered weak? · L2

Input transforms — JPEG compression, bit-depth reduction, blurring, denoising — try to scrub the perturbation before inference. Detectors try to flag adversarial inputs. Both seemed promising but mostly fall to adaptive attacks: once the attacker knows the transform or detector, they optimise through it (BPDA for non-differentiable transforms), and many relied on gradient masking.

They are not worthless as defence-in-depth layers that raise attacker cost, but you should never present them as a primary, sufficient defence. The senior answer: pair lightweight transforms/detection with adversarial training or certified methods, and always re-test under adaptive attacks.

Beaten by adaptive attacks (BPDA), often gradient masking; ok as defence-in-depth, not primary.
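Bit-depth reduction is the simplest of these transforms to sketch, and it shows both why it looks promising and why it fails. The function name and the example values are illustrative.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantise inputs in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

x = np.array([0.570, 0.573, 0.567])    # a clean pixel value and two tiny perturbations
print(reduce_bit_depth(x, bits=3))     # all three collapse to the same level

# The transform is piecewise constant, so its gradient is zero almost everywhere.
# A naive gradient attack stalls, but an adaptive attacker (BPDA) runs the
# transform in the forward pass and substitutes the identity in the backward pass.
```

Small perturbations are erased only when they do not cross a quantisation boundary, and the attacker is free to aim for exactly those boundaries.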

Q32 How does the NIST AI 100-2 taxonomy organise adversarial-ML attacks, and why use it in interviews? · L2

NIST AI 100-2e2025 (the 2025 edition) is the common-language taxonomy for adversarial ML. It splits systems into predictive AI (PredAI) and generative AI (GenAI), and classes attacks by stage and goal: evasion (inference-time), poisoning (training-time, incl. backdoors), and privacy (extraction, inversion, membership inference), plus GenAI-specific abuse and prompt-injection threats.

It also frames attacker goals, capabilities and knowledge. Use it because it shows panels you can reason about threat models structurally and speak the standard vocabulary, alongside MITRE ATLAS and the OWASP Top 10 for LLM Apps 2025.

PredAI/GenAI split; evasion/poisoning/privacy classes by stage+goal; goals/capabilities/knowledge; 2025 edition.

Q33 Design a layered defence for a production fraud model exposed via an API. · L3

Layer by lifecycle. Data/training: vet and sign datasets, scan serialised model artifacts with ModelScan, check for poisoning (spectral signatures), and consider DP-SGD if records are sensitive. Model: adversarial training (PGD) so evasion is harder, plus regularisation to cut memorisation. API: top-1 output only, rate limits and per-key quotas, auth, and query-pattern monitoring for probing and extraction.

Wrap it in governance: map controls to NIST AI RMF (GOVERN/MAP/MEASURE/MANAGE) and the NIST AI 100-2 taxonomy, red-team regularly with ART/PyRIT, and evaluate robustness with adaptive attacks and AutoAttack — never a single test. Defence-in-depth, because no single layer holds.

Lifecycle layers (data->model->API->governance), real tools, NIST RMF mapping, adaptive eval; no single control.
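The API layer can be sketched as a per-key query budget combined with top-1 output. This is an in-memory, single-process illustration: the class and function names are hypothetical, and a production service would keep the counters in a shared store and add authentication and monitoring.

```python
import time
from collections import defaultdict, deque

class QueryBudget:
    """Sliding-window query limit per API key."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window = window_seconds
        self.history = defaultdict(deque)    # api_key -> timestamps of recent queries

    def allow(self, api_key, now=None):
        now = time.monotonic() if now is None else now
        recent = self.history[api_key]
        while recent and now - recent[0] >= self.window:
            recent.popleft()                 # forget queries outside the window
        if len(recent) >= self.max_queries:
            return False
        recent.append(now)
        return True

def serve(budget, api_key, scores, now=None):
    """Return only the top-1 label index, or None when the key is over budget."""
    if not budget.allow(api_key, now):
        return None
    return max(range(len(scores)), key=scores.__getitem__)

budget = QueryBudget(max_queries=2, window_seconds=60)
for t in (0.0, 1.0, 2.0, 61.0):
    print(t, serve(budget, "client-a", [0.1, 0.9], now=t))
```

The third call falls inside the window and is refused; the fourth succeeds because the first query has aged out. The caller never sees the raw scores, only the label.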

Cheat-sheet tiles. Look at the attack-to-defence pairing and the norm row: L-inf caps every pixel’s change; L2 caps total energy. Memorise one defence per attack and you can answer most follow-ups.

ART PGD attack settings (panel recreated for clarity): ART → attacks → evasion → projected_gradient_descent (PGD)

Repository: github.com/Trusted-AI/adversarial-robustness-toolbox

Attack: ProjectedGradientDescent
Norm: inf
eps: 0.03
eps_step: 0.005
max_iter: 40
Target model: resnet50
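To show what those settings control, here is a from-scratch NumPy sketch of L-inf PGD using the same eps, eps_step and max_iter values. It is not ART's implementation, and the target is a hypothetical logistic-regression stand-in on 16 "pixels" rather than resnet50.

```python
import numpy as np

def pgd_linf(grad_fn, x, y, eps=0.03, eps_step=0.005, max_iter=40):
    """L-inf PGD: repeat a signed gradient-ascent step, then project."""
    x_adv = x.copy()
    for _ in range(max_iter):
        x_adv = x_adv + eps_step * np.sign(grad_fn(x_adv, y))   # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)                # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                        # stay a valid image
    return x_adv

# Hypothetical stand-in for the target model.
rng = np.random.default_rng(0)
w = rng.standard_normal(16)

def loss(x, y):
    z = x @ w
    return np.log1p(np.exp(-z)) if y == 1 else np.log1p(np.exp(z))

def loss_grad(x, y):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * w                     # gradient of the loss w.r.t. the input

x = rng.uniform(0.2, 0.8, size=16)
x_adv = pgd_linf(loss_grad, x, y=1)
print("max |perturbation|:", np.abs(x_adv - x).max())
print("loss before / after:", loss(x, 1), loss(x_adv, 1))
```

eps is the total budget per feature, eps_step is how far each iteration moves, and max_iter is how many iterations run; the projection is what keeps the result inside the budget no matter how many steps are taken.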

Pause & Predict #2

Divya reports that her vision model at a Hyderabad SOC scores 99% on a clean test set but collapses to 12% under an ART PGD evaluation, yet leadership keeps calling it production-ready. Predict what is wrong and the one change to the lifecycle that fixes the blind spot.

The cause is that the model was only validated on clean, in-distribution data, so high clean accuracy hides very poor adversarial robustness. Standard test sets do not include perturbed or adversarial inputs, so an attacker-shaped PGD/FGSM evaluation exposes the gap NIST AI RMF MEASURE is meant to catch. The one lifecycle change: make adversarial robustness testing a release gate. Run the Adversarial Robustness Toolbox against every candidate (with garak / PyRIT covering any GenAI components, since they target LLMs rather than vision classifiers), track robust accuracy alongside clean accuracy, and block promotion if robust accuracy is below an agreed threshold. Verify by re-running the same attack suite in CI and confirming the gate fails the brittle build and passes only after adversarial training or input hardening raises robust accuracy.
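A release gate of this kind can be as small as one function called from CI. The thresholds below are hypothetical placeholders for whatever the team agrees; the accuracy figures are the ones from the scenario.

```python
def release_gate(clean_acc, robust_acc, min_clean, min_robust):
    """Return (passed, reasons) for a candidate model."""
    reasons = []
    if clean_acc < min_clean:
        reasons.append(f"clean accuracy {clean_acc:.2f} below {min_clean:.2f}")
    if robust_acc < min_robust:
        reasons.append(f"robust accuracy {robust_acc:.2f} below {min_robust:.2f}")
    return (not reasons), reasons

# Scenario figures: 99% clean, 12% under attack. Thresholds are assumed.
passed, reasons = release_gate(clean_acc=0.99, robust_acc=0.12,
                               min_clean=0.95, min_robust=0.50)
print(passed, reasons)
```

The gate is only as good as the attack suite that produces robust_acc, so the same adaptive evaluation rules apply to whatever feeds it.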

⚡ Adversarial ML last-minute cheat-sheet

Attack stage: Evasion = inference-time (tamper input). Poisoning = training-time (tamper data). Privacy = extraction / inversion / membership. Always state the threat model first.

Evasion ladder: FGSM one-step, x+epsilon*sign(grad). PGD iterative + projection (benchmark). C&W minimum-norm, breaks weak defences.

Knowledge: White-box = weights+gradients. Black-box = API only (production). Transfer + decision-based (HopSkipJump) beat label-only APIs.

Backdoors: BadNets = trigger-to-target shortcut, clean accuracy normal. Clean-label hides poison with correct labels. Often <1% poison suffices.

Detect poison: Spectral signatures, activation clustering, Neural Cleanse, STRIP. Real control = provenance + signing (Sigstore cosign) + artifact scanning (ModelScan).

Privacy attacks: Extraction copies function. Inversion rebuilds inputs. Membership = was-it-in-training. Logits/scores make all of them cheaper.

Defences: Adversarial training (Madry, empirical). Randomised smoothing (certified L2). DP-SGD for privacy. Output minimisation + rate-limit at the API.

Honest eval: Use adaptive attacks + AutoAttack + black-box check. If robust accuracy ignores rising epsilon, suspect gradient masking. Toolkits: ART, PyRIT, garak.

Glossary — terms an interviewer will probe

Adversarial example: Input with a small crafted perturbation that fools a model while looking normal to a human.
FGSM: Fast Gradient Sign Method — one-step attack adding epsilon times the sign of the input gradient.
PGD: Projected Gradient Descent — iterative FGSM that projects back into the budget ball; standard benchmark.
C&W: Carlini & Wagner — minimum-perturbation optimisation attack used to honestly test defences.
L-inf / L2 budget: Limit on perturbation size: max per-feature change (L-inf) or overall Euclidean change (L2).
Transfer attack: Adversarial example crafted on a surrogate model that also fools the real black-box target.
Data poisoning: Training-time attack that corrupts training data to make the model faulty or controllable.
Clean-label poisoning: Poison samples with correct labels but perturbed features, defeating label audits.
Backdoor / BadNets: Hidden trigger-to-target shortcut planted in training; clean accuracy normal, fires on the trigger.
Neural Cleanse: Backdoor detector that reverse-engineers an abnormally small trigger that flips a class.
Model extraction: Querying an API to train a surrogate that copies the model's function.
Model inversion: Reconstructing representative private training inputs from a model's outputs.
Membership inference: Determining whether a specific record was in the training set from model outputs.
Shadow model: Imitation model with known membership, used to train a membership-inference attack classifier.
Differential privacy / DP-SGD: Training with clipped gradients plus calibrated noise so single records barely affect the model.
Randomised smoothing: Certified defence: noise-and-vote yields a provable L2 radius of guaranteed stability.
Gradient masking: Defence that hides gradients so attacks fail, without removing the vulnerability.
AutoAttack: Parameter-free ensemble (APGD, FAB, Square) for honest robustness evaluation.
NIST AI 100-2: NIST taxonomy of adversarial-ML attacks/mitigations across PredAI and GenAI (2025 edition).

Lock it in — explain it in your own words

📝 Self-explain · 2 minutes

In two sentences, explain the difference between data poisoning and an evasion (adversarial example) attack, and say which one a clean-data accuracy test will completely miss.

Sources cited inline (re-checked 2026-06)

NIST AI 100-2e2025 — Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations: https://doi.org/10.6028/NIST.AI.100-2e2025
NIST AI Risk Management Framework (AI RMF 1.0, GOVERN/MAP/MEASURE/MANAGE): https://www.nist.gov/itl/ai-risk-management-framework
MITRE ATLAS — Adversarial Threat Landscape for AI Systems: https://atlas.mitre.org/
OWASP Top 10 for LLM Applications 2025: https://genai.owasp.org/llm-top-10/
Goodfellow et al., Explaining and Harnessing Adversarial Examples (FGSM): https://arxiv.org/abs/1412.6572
Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks (PGD / adversarial training): https://arxiv.org/abs/1706.06083
Carlini & Wagner, Towards Evaluating the Robustness of Neural Networks (C&W): https://arxiv.org/abs/1608.04644
Athalye et al., Obfuscated Gradients Give a False Sense of Security: https://arxiv.org/abs/1802.00420
Shokri et al., Membership Inference Attacks Against Machine Learning Models: https://arxiv.org/abs/1610.05820
Cohen et al., Certified Adversarial Robustness via Randomized Smoothing: https://arxiv.org/abs/1902.02918
Croce & Hein, AutoAttack — Reliable Evaluation of Adversarial Robustness: https://arxiv.org/abs/2003.01690
Gu et al., BadNets: Identifying Vulnerabilities in the ML Model Supply Chain: https://arxiv.org/abs/1708.06733

Next lesson · Adversarial ML — GenAI red-teaming with PyRIT and garak

Move from predictive-model attacks to LLM red-teaming: prompt injection, jailbreaks, training-data extraction and tool-abuse, run with PyRIT, garak and NeMo Guardrails against the OWASP LLM Top 10 2025.
