Why this matters — the model passes every test, then a sticker breaks it
Think of a bank's signature-verification clerk who has seen lakhs of cheques. Add three faint pen strokes that a human ignores, and the clerk waves through a forgery. That is an adversarial example: a tiny, often invisible change to an input that flips the model's decision while looking normal to you.
Interviewers probe this because ML is now in fraud scoring, content moderation and KYC, and attackers have moved from the network to the model. They want to see if you understand attacks at training time and inference time, can name real techniques, and know why most published defences quietly fail. Buzzwords get you cut; a clean mental model gets you hired.
Sneha is interviewing for an AI red-team role. The panel asks: "Our fraud model has 99.4% accuracy. An attacker only sees the approve/decline output, no scores. Can they still beat it, and how?" She freezes — she has only read about white-box FGSM, where you have the gradients.
The fix is a mental model: most real attacks are black-box, using query feedback or transfer from a surrogate model. Once Sneha can name the attack class, the threat model and one defence, the same question becomes a 90-second win. This Q&A builds exactly that reflex.
1. Evasion & Adversarial Examples
Evasion is the most-asked area. It happens at inference time: the model is fixed, and the attacker perturbs the input to get a wrong prediction. In NIST AI 100-2e2025 this is the canonical evasion class for predictive AI.
Know three attacks cold — FGSM, PGD, C&W — plus the white-box vs black-box split and why adversarial examples transfer between models.
Q1 What is an adversarial example, and how is it different from a normal misclassification? L1
An adversarial example is an input crafted with a small, deliberate perturbation that makes a trained model output the wrong answer, while a human still sees the original class. The classic image: add imperceptible noise to a panda photo and a classifier confidently calls it a gibbon.
A normal misclassification is an honest mistake on a hard or noisy input. An adversarial example is an attack — the perturbation is optimised against the model, the input looks unchanged to people, and the model is often more confident when it is wrong. The model itself is unchanged; only the input is tampered with at inference time.
Q2 Explain FGSM. What does the formula actually do, and what is the perturbation budget? L2
FGSM (Fast Gradient Sign Method, Goodfellow 2015) is a one-step white-box attack. You compute the gradient of the loss with respect to the input, take its sign, scale by a small epsilon, and add it: x_adv = x + epsilon * sign(grad_x loss).
Intuition: move every pixel a tiny fixed amount in the direction that increases the loss, so the model's confidence in the true label drops fastest. The budget is usually an L-infinity bound — no single pixel changes by more than epsilon (e.g. 8/255 for images). It is fast and cheap, but one step makes it weaker than iterative attacks.
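The FGSM update can be sketched in a few lines. This is a minimal illustration, assuming a toy logistic-regression model whose input gradient has a closed form; the weights `w`, bias `b`, input `x` and `epsilon=0.1` are hypothetical values, not taken from any real system.

```python
import math

def prob(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, epsilon):
    """One-step FGSM: x_adv = x + epsilon * sign(grad_x loss).

    For binary cross-entropy on this model the input gradient is
    (p - y) * w, so no autodiff library is needed in the toy case.
    """
    p = prob(x, w, b)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]

# Hypothetical model and input, for illustration only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, -0.2, 0.3], 1
x_adv = fgsm(x, y, w, b, epsilon=0.1)
```

Every feature moves by exactly `epsilon`, and the model's confidence in the true label drops. For a deep network the gradient would come from an autodiff framework instead of the closed form used here.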
Q3 How is PGD different from FGSM, and why is PGD treated as the standard benchmark? L2
PGD (Projected Gradient Descent, Madry 2018) is iterative FGSM. It takes many small signed-gradient steps and, after each step, projects the result back inside the allowed L-inf (or L2) ball so the perturbation never exceeds the budget. A random start inside the ball helps escape weak local optima.
Because it searches harder, PGD finds much stronger adversarial examples than one-step FGSM and is considered a strong first-order attack. Madry argued robustness to PGD approximates robustness to all first-order attacks, so panels expect PGD (with enough steps and restarts) as the baseline, not FGSM, when claiming a model is robust.
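The step-then-project loop can be sketched as follows. This is an illustrative version, assuming the same kind of toy logistic-regression model with a closed-form input gradient; the model values, `step` size and step count are hypothetical.

```python
import math
import random

def predict(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(x, y, w, b):
    p = predict(x, w, b)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def pgd_linf(x, y, w, b, epsilon, step, n_steps, rng):
    # Random start inside the L-inf ball around the clean input.
    x_adv = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
    for _ in range(n_steps):
        p = predict(x_adv, w, b)
        grad = [(p - y) * wi for wi in w]  # closed-form input gradient
        # Signed-gradient ascent step on the loss.
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip each coordinate back into [x - eps, x + eps].
        x_adv = [min(max(xa, xi - epsilon), xi + epsilon)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical model and input, for illustration only.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, -0.2, 0.3], 1
x_adv = pgd_linf(x, y, w, b, epsilon=0.1, step=0.025, n_steps=20,
                 rng=random.Random(0))
```

On a linear model PGD ends up at the same corner of the ball that FGSM reaches in one step; the extra iterations and restarts matter on non-linear models, where the loss surface changes along the path.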
Q4 When would you reach for the Carlini & Wagner (C&W) attack instead of PGD? L3
Use C&W when you need a minimum-perturbation attack or to honestly evaluate a defence. PGD asks "can I break it within a fixed budget?"; C&W asks "what is the smallest change that breaks it?" It optimises an objective that trades off perturbation size against a margin term pushing the wrong class above the right one, typically under L2.
C&W is slower but historically broke defences that survived FGSM and PGD, including defensive distillation. So if a candidate model is claimed robust, run C&W (and other adaptive attacks). A defence that resists only weak attacks is probably relying on gradient masking, not real robustness.
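The trade-off C&W optimises can be written as a single objective. The sketch below shows an untargeted, L2 form of that objective only; the function name, the constant `c` and the confidence parameter `kappa` are illustrative, and a real attack would minimise this over `x_adv` with an optimiser and usually search over `c`.

```python
def cw_objective(x, x_adv, logits_adv, true_label, c=1.0, kappa=0.0):
    """Untargeted C&W-style objective: squared L2 size + c * margin term."""
    l2_sq = sum((a - o) ** 2 for a, o in zip(x_adv, x))
    best_other = max(z for i, z in enumerate(logits_adv) if i != true_label)
    # The margin is positive while the true class still wins. It bottoms out
    # at -kappa once some wrong class leads by at least kappa.
    margin = max(logits_adv[true_label] - best_other, -kappa)
    return l2_sq + c * margin

# Hypothetical logits for a three-class model at the perturbed input.
x, x_adv = [0.0, 0.0], [0.3, 0.4]
still_correct = cw_objective(x, x_adv, [2.0, 1.0, 0.5], true_label=0)
already_fooled = cw_objective(x, x_adv, [1.0, 2.0, 0.5], true_label=0)
```

Once the model is fooled the margin term stops contributing, so the optimiser spends its effort shrinking the perturbation, which is what makes the result a minimum-perturbation attack.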
Q5 Distinguish white-box, black-box and grey-box attacks. Which is most realistic in production? L2
White-box: the attacker has the model — architecture, weights and gradients — so they run FGSM/PGD/C&W directly. Black-box: only the API output is visible; the attacker uses query feedback or transfers examples from a surrogate. Grey-box: partial knowledge, e.g. the architecture or training data family but not the weights.
Production is usually black-box — your model sits behind an API. That is why exposing confidence scores or logits matters: richer output makes black-box optimisation far easier. In an interview, always state your threat model (attacker knowledge and access) before you name an attack.
Q6 An attacker can only see approve/decline from a fraud API, with no scores. How do they still craft evasion inputs? L3
Two routes. First, transfer attacks: train a surrogate model on similar data, craft PGD examples against it, and fire them at the target. Adversarial examples transfer surprisingly well across models that learn similar decision boundaries, so a white-box attack on the surrogate becomes a black-box attack on the real one.
Second, decision-based / query attacks like Boundary Attack or HopSkipJump, which need only the hard label. They start from an adversarial point and walk along the decision boundary toward the original input, shrinking the perturbation using yes/no feedback. Defence: rate-limit, add monitoring for boundary-probing query patterns, and never leak scores.
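The yes/no feedback loop is easier to see in code. The sketch below covers only the line-search step that HopSkipJump-style attacks use to get close to the boundary, not a full attack; the `approve` oracle and its two numeric features are hypothetical stand-ins for a fraud API.

```python
import math

def approve(tx):
    """Hypothetical hard-label oracle: True means the transaction is approved."""
    return 0.8 * tx[0] + 0.6 * tx[1] < 1.0

def line_search_to_boundary(oracle, x_orig, x_start, tol=1e-3):
    """Binary search between a declined input and an approved starting point.

    Returns the approved point closest to x_orig on that line segment,
    plus the number of label-only queries spent.
    """
    lo, hi = 0.0, 1.0  # fraction of the way from x_orig to x_start
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        candidate = [o + mid * (s - o) for o, s in zip(x_orig, x_start)]
        queries += 1
        if oracle(candidate):
            hi = mid  # still approved: move closer to the original
        else:
            lo = mid  # declined: back off
    x_close = [o + hi * (s - o) for o, s in zip(x_orig, x_start)]
    return x_close, queries

x_orig = [2.0, 1.0]   # the input the attacker wants approved (declined as-is)
x_start = [0.0, 0.0]  # any input that is known to be approved
x_close, queries = line_search_to_boundary(approve, x_orig, x_start)
```

About ten queries are enough here, which is why rate limits alone are not sufficient and monitoring for clusters of near-identical queries matters.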
Q7 Why do physical-world attacks like adversarial stop-sign stickers work, and why are they harder than digital ones? L3
A physical patch (e.g. stickers on a stop sign, the Eykholt 2018 work) is a perturbation printed into the real world rather than added to pixels. It exploits the same gradient sensitivity, but must survive distortions: angle, distance, lighting, camera blur and printing colour limits.
So attackers optimise over a distribution of these transforms — Expectation Over Transformation (EOT) — so the patch stays adversarial across viewpoints. It is harder because the perturbation must be large, localised and robust, not imperceptible. This matters for any camera-fed model: ANPR at a Mumbai toll plaza, KYC liveness, or warehouse vision at a Flipkart hub.
Sneha defends a fraud-scoring model at a Pune fintech. An attacker who can only send transactions and read the approve/decline result keeps nudging amounts and merchant codes until a fraudulent payment slips through. Her panel asks which adversarial setting and ATLAS-style technique this is.
FGSM does not apply. c poisoning corrupts training data, but nothing is being added to the trainset. d extraction aims to clone the model, not to push one fraudulent payment through.
2. Data Poisoning & Backdoors
Poisoning is a training-time attack: the adversary corrupts the data the model learns from. It is rising fast because models now train on scraped web data, RLHF feedback and RAG stores you do not fully control.
Separate two goals: degrade the model overall (availability) versus plant a hidden backdoor that fires only on a secret trigger.
Q8 What is data poisoning, and how is it different from an evasion attack? L1
Data poisoning corrupts the training data so the resulting model is faulty or controllable. The attack happens before deployment and changes the model itself. Evasion happens after training: the model is fixed and the attacker perturbs an input at inference.
So the lever is different. Evasion = tamper with the input; poisoning = tamper with the data the model learned from. In NIST AI 100-2e2025, poisoning is a distinct class with sub-types for availability and targeted/backdoor effects. Poisoning is more dangerous against pipelines that retrain on fresh or user-supplied data without strong vetting.
Q9 Compare availability poisoning and targeted poisoning. L2
Availability poisoning aims to wreck overall performance — inject enough bad or mislabelled samples that accuracy collapses for everyone. It is noisy and easier to notice because the model just gets worse.
Targeted poisoning is surgical: the model stays accurate on normal data but misbehaves on a chosen target — for example, always classifying one fraudster's transaction pattern as legitimate. It is stealthier because aggregate metrics look fine. A backdoor is a special targeted attack where the wrong behaviour triggers only when a specific pattern appears in the input.
Q10 What is a clean-label poisoning attack, and why is it harder to catch than mislabelled poison? L3
In naive poisoning, you flip labels — a clearly-fraud record tagged legit. A human reviewer or label audit spots the inconsistency. In clean-label poisoning, every poison sample is correctly labelled and looks normal; the attacker instead perturbs the features (often imperceptibly, as in Poison Frogs / feature-collision attacks) so they sit near the target in feature space.
That shifts the decision boundary toward the attacker's goal without any obviously wrong label. It defeats label-review and basic anomaly checks because nothing looks misannotated, so you need feature-space defences like spectral signatures or activation clustering, not just label sanity checks.
Q11 Explain a BadNets backdoor and a typical trigger pattern. L2
BadNets (Gu 2017) plants a backdoor by mixing trigger-stamped samples into training. The trigger is a fixed pattern — a small coloured square in a corner, a sticker, or a specific phrase in text — and every stamped sample is labelled with the attacker's target class.
The model learns a shortcut: if the trigger is present, output the target class; otherwise behave normally. So it passes validation with clean accuracy but flips on demand when the trigger appears. This is the textbook trojaned model threat for any model you download or outsource training for, and it maps to MITRE ATLAS poisoning techniques.
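For a red-team exercise or a detector test, the data side of a BadNets-style backdoor can be sketched as below. The 2x2 corner patch, the 5% poison rate and the all-zero toy images are hypothetical choices for illustration.

```python
import random

def stamp_trigger(image, value=1.0, size=2):
    """Return a copy of a 2-D image with a size x size patch in the
    bottom-right corner set to `value` (the hypothetical trigger)."""
    stamped = [row[:] for row in image]
    for r in range(-size, 0):
        for c in range(-size, 0):
            stamped[r][c] = value
    return stamped

def poison_dataset(images, labels, target_label, rate, rng):
    """Stamp the trigger on a random `rate` fraction and relabel to the target."""
    n_poison = int(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    out_images, out_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            out_images.append(stamp_trigger(img))
            out_labels.append(target_label)
        else:
            out_images.append(img)
            out_labels.append(lab)
    return out_images, out_labels, chosen

rng = random.Random(0)
images = [[[0.0] * 6 for _ in range(6)] for _ in range(100)]  # toy 6x6 images
labels = [i % 2 for i in range(100)]
p_images, p_labels, chosen = poison_dataset(
    images, labels, target_label=0, rate=0.05, rng=rng)
```

Only the stamped samples change, so a clean validation set drawn from the untouched data would show nothing unusual.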
Q12 Where does poisoning realistically enter a modern GenAI / RAG pipeline? L3
Several real entry points. Scraped pretraining data — attackers seed poisoned pages knowing crawlers will ingest them. RLHF / human feedback — malicious raters or sybil accounts skew preference data. Fine-tuning sets from untrusted vendors or open hubs. And critically the RAG store: if an attacker can write into the knowledge base or a document the retriever pulls, they poison answers at query time without touching the model weights.
So vet data sources, sign and pin model and dataset versions (Sigstore cosign), scan artefacts with ModelScan, and treat the RAG corpus as an attack surface with write controls and provenance.
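One thing an artefact scan looks for is pickle opcodes that import and call arbitrary code when a model file is loaded. The sketch below uses Python's standard `pickletools` module to list such opcodes without unpickling anything. It is an illustration of the idea, not how ModelScan is implemented, and the `Payload` class is a benign stand-in for a malicious object.

```python
import pickle
import pickletools

# Opcodes that can import a callable or invoke it during unpickling.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
              "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(data: bytes):
    """List suspicious opcode names in a pickle stream without loading it."""
    return sorted({op.name for op, _arg, _pos in pickletools.genops(data)
                   if op.name in SUSPICIOUS})

class Payload:
    """Benign stand-in for a malicious object: unpickling would call print()."""
    def __reduce__(self):
        return (print, ("code ran during unpickling",))

plain_weights = pickle.dumps({"layer0": [0.1, 0.2], "bias": 0.0})
booby_trapped = pickle.dumps(Payload())  # dumping does not run the payload

clean_report = suspicious_opcodes(plain_weights)
trap_report = suspicious_opcodes(booby_trapped)
```

Many legitimate pickles also use these opcodes, so a practical scanner checks which names are being imported rather than flagging the opcodes alone. Weight-only formats avoid the problem by not executing code on load.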
Q13 How would you detect a backdoor in a model you did not train yourself? L3
Mix data-side and model-side methods. Spectral signatures and activation clustering look at internal representations — poisoned trigger samples often cluster apart from clean ones for a given class. Neural Cleanse reverse-engineers the smallest perturbation that flips every input to each class; an abnormally tiny, consistent patch for one class signals a planted trigger.
Add STRIP, which superimposes inputs and watches for suspiciously stable predictions, plus careful held-out testing on unusual inputs. None is perfect, so combine detection with provenance: only run signed models from trusted sources, scanned with ModelScan, and prefer fine-tuning a vetted base over a black-box download.
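The spectral-signature idea can be sketched with plain Python: centre the activations for one class, find the top direction of variation, and score each sample by its squared projection onto it. The synthetic activations below are hypothetical, with four "poisoned" rows deliberately offset from forty clean ones; real use would take activations from a late layer of the model.

```python
import random

def top_direction(rows, iters=100):
    """Power iteration for the top right singular vector of `rows`."""
    d = len(rows[0])
    v = [1.0 / d ** 0.5] * d
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        u = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]
        norm = sum(c * c for c in u) ** 0.5
        v = [c / norm for c in u]
    return v

def spectral_scores(activations):
    """Outlier score per sample: squared projection onto the top direction."""
    n, d = len(activations), len(activations[0])
    mean = [sum(a[j] for a in activations) / n for j in range(d)]
    centered = [[a[j] - mean[j] for j in range(d)] for a in activations]
    v = top_direction(centered)
    return [sum(c[j] * v[j] for j in range(d)) ** 2 for c in centered]

rng = random.Random(0)
clean = [[rng.gauss(0, 0.2) for _ in range(4)] for _ in range(40)]
poisoned = [[3.0 + rng.gauss(0, 0.2), 3.0 + rng.gauss(0, 0.2),
             rng.gauss(0, 0.2), rng.gauss(0, 0.2)] for _ in range(4)]
activations = clean + poisoned  # poisoned rows are indices 40..43
scores = spectral_scores(activations)
flagged = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:4]
```

The highest-scoring samples are the candidates to inspect or drop before retraining. How many to remove is a judgment call that depends on the poison rate you assume.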
Q14 Roughly how much poisoned data does an attacker need, and why is that scary? L2
Less than people expect. Backdoor and targeted attacks can succeed with a very small fraction of training data poisoned — often well under 1%, and for some clean-label and large-corpus settings a few hundred poisoned documents can be enough. The model keeps high clean accuracy, so the poison hides in the noise.
It is scary because modern training pulls from huge, weakly-curated corpora, RLHF and RAG stores, so injecting a handful of crafted samples is cheap and realistic. The defensive takeaway: you cannot rely on "attackers can't touch enough data" — assume a small targeted footprint and defend provenance and integrity.
▶ Watch a backdoor reach production — Karthik at a Mumbai bank
You will watch how one poisoned data source bakes a hidden trigger into a fraud model that still passes every eval.
scraped + third-party transaction data with no provenance checks.
trigger pattern — labels look correct.
legit.
97.4%, AUC healthy. No clean-test sample carries the trigger, so nothing fires.
trigger is waved through as legit — fraud passes.
spectral signature screening on activations would have flagged the poisoned cluster.
Rahul ingests a public dataset to fine-tune a malware classifier at a Bangalore AI startup. Later, any file containing a fixed 4-byte marker is always labelled benign, while overall accuracy looks normal in testing. His panel asks for the precise name and the most effective control.
ModelScan. a drift degrades accuracy broadly and has no secret trigger. b availability poisoning tanks overall accuracy, not a stealthy targeted flip. d membership inference leaks who was in training, unrelated to a trigger.
Aman at a Chennai ITES pulls a pre-trained model from a public hub and loads it straight into production. Soon the host shows odd outbound connections and the model misbehaves on rare inputs. Predict the cause and the fix, and how to verify the supply chain is clean.
__reduce__ can open a reverse shell, and the weights themselves may hold a hidden trigger (NIST AI RMF MANAGE; ATLAS ML-supply-chain-compromise). Fix: only use models from a verified source, verify integrity with signatures (e.g. Sigstore cosign), scan artifacts with ModelScan / Protect AI before load, prefer safetensors over pickle, and load in a sandboxed, network-egress-restricted environment. Verify by confirming the signature checks out, ModelScan reports no unsafe operators, and the sandbox shows no unexpected egress on load.
3. Model Extraction & Inversion
These attacks target the model and its data through the prediction API. Extraction steals the model's function; inversion and attribute inference reconstruct or guess private training data.
The recurring lesson: the more your API reveals — full confidence vectors, logits, explanations — the cheaper these attacks become.
Q15 What is model extraction (model stealing), and what does the attacker gain? L2
Model extraction queries a deployed model and uses the input-output pairs to train a surrogate that mimics it. With enough queries the attacker gets a functional copy without ever seeing the weights.
The gains: they skip your training cost and walk away with your IP, can run the model offline, and, importantly, turn a black-box target into a white-box surrogate to craft transfer evasion attacks against the original. It can also leak data properties. This is a confidentiality threat to the model as an asset, listed in NIST AI 100-2 privacy/extraction attacks and MITRE ATLAS as exfiltration via ML inference API.
Q16 Why does exposing full confidence scores or logits make extraction and other attacks easier? L2
Each query's information content goes up. A hard label gives one bit-ish of signal; a full softmax vector or raw logits reveals how close the input is to each decision boundary. That lets an attacker fit a surrogate with far fewer queries and steer optimisation in inversion and membership inference.
So a practical control is output minimisation: return only the top class, round or quantise scores, drop logits, and never return explanations or per-feature attributions on untrusted endpoints. You trade a little client convenience for a big increase in the queries an attacker needs.
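Output minimisation is simple to apply at the response layer. A minimal sketch, assuming the model returns a probability vector; the function name, the label set and the 0.1 quantisation step are hypothetical.

```python
def minimise_output(probs, labels, score_step=None):
    """Return only the top-1 label, optionally with a coarsely quantised score."""
    top = max(range(len(probs)), key=probs.__getitem__)
    response = {"label": labels[top]}
    if score_step is not None:
        # Snap the score to a coarse grid, e.g. steps of 0.1.
        response["score"] = round(round(probs[top] / score_step) * score_step, 6)
    return response

# Hypothetical raw model output for one request.
probs = [0.1512, 0.8341, 0.0147]
labels = ["decline", "approve", "review"]
public_response = minimise_output(probs, labels)
partner_response = minimise_output(probs, labels, score_step=0.1)
```

The public endpoint gets a bare label, while a trusted integration might get a coarse score; neither sees logits or the full vector.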
Q17 What is model inversion, and how is it different from extraction? L2
Model inversion reconstructs representative inputs from the model's outputs — for example, recovering a recognisable face for a class in a face-recognition model, or sensitive attribute values. It targets the training data / privacy, not the model function.
Extraction copies the model's behaviour; inversion leaks what the model learned about people. The attacker optimises an input to maximise the confidence the model assigns to a target class or to match observed outputs. It is most dangerous when classes map to individuals and the model exposes confidences, which is why output minimisation and differential privacy both help here.
Q18 What is attribute inference, and where does it bite in practice? L2
Attribute inference uses model access plus some known features of a record to predict a sensitive, unknown attribute — for example, inferring a person's health condition or income band from a model trained on related data. The model unintentionally encodes correlations that let an attacker fill in private fields.
It bites in any deployment where the training data included sensitive attributes and the model is queryable — a Bangalore health-tech startup's risk model, or a Mumbai bank's credit model. Defences overlap with privacy: limit output detail, regularise to reduce memorisation, and apply differential privacy so single records do not strongly shape predictions.
Q19 Design API-level defences against extraction and inversion. What do you actually deploy? L3
Layered. Output minimisation: top-1 label only, rounded or quantised scores, no logits or explanations on public endpoints. Rate-limiting and per-key quotas with anomaly detection for high-volume or boundary-probing query patterns. Authentication and per-tenant keys so you can attribute and cut off abuse.
Add query monitoring (distribution shift, out-of-domain probing) and, where IP matters, watermarking the model so a stolen surrogate can be proven. For privacy specifically, train with differential privacy (DP-SGD via TensorFlow Privacy/OpenDP) and regularise to cut memorisation. Accept that you cannot make extraction impossible; the aim is to raise the attacker's cost above the value of stealing.
Q20 How does model extraction connect to the broader attack chain, and why care beyond IP loss? L3
Because extraction is often a stepping stone. Once an attacker has a faithful surrogate, your black-box model is effectively white-box to them: they craft strong PGD/C&W examples on the surrogate and transfer them to evade the real one, or run inversion and membership inference offline at no query cost.
So treating it as only IP theft undersells it. In MITRE ATLAS terms it sits in exfiltration and enables downstream evasion and privacy techniques. The interview-grade point: defending the API (rate limits, output minimisation, monitoring) protects not just the model asset but the integrity and privacy of everything downstream.
Karthik at a Wipro project hosts a paid image-classification API. Billing shows one client sending millions of near-uniform, label-only queries; weeks later a near-identical competitor model appears. Predict the cause and the single best control, and how to verify it works.
4. Membership Inference
Membership inference asks a sharp privacy question: was this exact record in the training set? A reliable yes can itself be a breach: knowing someone's record trained an HIV-status or default-risk model leaks sensitive facts.
The root cause is overfitting and memorisation: models behave differently on data they saw versus data they did not.
Q21 What is a membership inference attack (MIA)? L1
A membership inference attack determines whether a specific data record was part of a model's training set, using only access to the model's outputs. The attacker feeds the candidate record in and judges from the response — typically the model is more confident and lower-loss on data it trained on.
Even a yes/no membership answer can be a privacy violation. If a model was trained on patients of a Hyderabad oncology clinic, confirming that a person's record was in training reveals they are a patient. NIST AI 100-2e2025 lists MIA under privacy attacks against predictive and generative AI.
Q22 Why does overfitting make membership inference easier? L2
An overfit model memorises its training points instead of generalising. So it is unusually confident and low-loss on members and noticeably less so on unseen records. That confidence gap is exactly the signal an MIA exploits — high confidence / low loss leans "member".
The bigger the train-test performance gap, the larger the gap an attacker can read. This is why MIA risk and generalisation are linked: techniques that reduce overfitting (regularisation, dropout, early stopping, more data) shrink the per-record signal and lower membership-inference accuracy as a side effect.
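The simplest attack that reads this gap is a loss threshold: guess "member" when the model's loss on a record is unusually low. The sketch below uses made-up confidence values for an overfit model, and the threshold of 0.1 is an assumption an attacker would normally calibrate on data they control.

```python
import math

def loss_threshold_mia(true_label_probs, threshold):
    """Guess 'member' when cross-entropy loss on the true label is below threshold."""
    return [-math.log(p) < threshold for p in true_label_probs]

# Hypothetical confidences the model assigns to the correct label.
member_probs = [0.99, 0.97, 0.995, 0.90]     # records seen in training
non_member_probs = [0.70, 0.55, 0.93, 0.40]  # unseen records

guesses = loss_threshold_mia(member_probs + non_member_probs, threshold=0.1)
truth = [True] * 4 + [False] * 4
attack_accuracy = sum(g == t for g, t in zip(guesses, truth)) / len(truth)
```

On these toy numbers the attack is right for six of eight records. A well-generalised model would give members and non-members similar losses and push that figure back towards a coin flip.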
Q23 Explain the shadow-model technique for membership inference. L2
Introduced by Shokri 2017. The attacker trains several shadow models that imitate the target, on data drawn from a similar distribution, where they know which records were in or out. They record each shadow model's output behaviour on members vs non-members.
Those labelled (output, member/non-member) pairs train an attack model — a classifier that, given the target model's output on a record, predicts member or not. At attack time they query the real target and run its output through the attack model. It works because target and shadows leak membership in similar ways.
Q24 How does differential privacy defend against membership inference, and what is the cost? L3
Differential privacy bounds how much any single record can change the model. With DP-SGD you clip per-example gradients and add calibrated noise during training, so members and non-members produce near-indistinguishable behaviour. The privacy budget epsilon sets the strength — smaller epsilon, stronger privacy.
The cost is utility: lower epsilon means more noise and usually lower accuracy and slower convergence, hurting minority classes most. So you tune epsilon to a defensible level, not the smallest possible. Tools: TensorFlow Privacy and OpenDP. DP is the principled MIA defence; regularisation and confidence masking help but give no formal guarantee.
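The clip-then-noise step at the heart of DP-SGD can be sketched as below. This is a simplified illustration with hypothetical gradients and parameter values: it shows one gradient aggregation only and does no privacy accounting, which is the part libraries such as TensorFlow Privacy handle to turn the noise level into an epsilon.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

# Hypothetical per-example gradients: one large, one small.
grads = [[3.0, 4.0], [0.3, 0.4]]
no_noise = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=0.0,
                           rng=random.Random(0))
private = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                          rng=random.Random(0))
```

Clipping caps how much any one record can pull the update, and the noise hides what remains. A larger `noise_multiplier` buys a smaller epsilon at the cost of a noisier gradient.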
Q25 Besides differential privacy, what cheaper defences reduce membership inference risk? L2
Anything that cuts memorisation or the confidence signal. Regularisation (L2 weight decay, dropout), early stopping and more / augmented data shrink the train-test gap that MIAs read. Confidence masking — returning only the top label, rounding scores, or temperature-scaling outputs — removes the fine-grained confidence the attack model relies on.
These lower attack accuracy and cost little, but unlike differential privacy they give no formal guarantee and a determined attacker may still succeed. Honest framing in an interview: use these as defence-in-depth, and use DP when you need a provable bound, especially on sensitive data.
Q26 Why is membership inference especially worrying for large language models? L3
Because LLMs train on huge text corpora and can memorise and regurgitate rare sequences verbatim — names, emails, API keys, code. That turns membership inference into a gateway to training-data extraction: if you can tell a record was memorised, you can often coax the model to emit it.
It is worse when fine-tuning on small private sets (a company's support tickets, a Chennai ITES client's records), where memorisation is stronger. Defences: deduplicate and scrub PII from training data, apply DP fine-tuning where feasible, add output filters (Presidio for PII), and red-team with garak / PyRIT for data-leak prompts.
Adversarial-ML terms interviewers expect you to define
Adversarial example: Input with a tiny, often invisible perturbation that flips the model’s prediction. So what: it is the core evasion weapon you must name.
Backdoor trigger: A small pattern that, when present, forces a chosen output. So what: clean accuracy hides it, so eval alone never catches it.
Clean-label poisoning: Poison samples keep correct-looking labels, so manual review passes them. So what: provenance, not eyeballing, is your real defence.
Transferability: Adversarial examples crafted on a surrogate often fool the real model. So what: it lets black-box attackers skip needing your weights.
Perturbation budget: The cap on how much an attacker may change an input. So what: state the norm (L∞ or L2) or the number is meaningless.
Membership inference: Attacker tests whether a specific record was in training data. So what: it is a privacy breach, defended with differential privacy.
Priya at a Mumbai bank exposes a credit-risk API that returns full softmax confidences. A researcher shows he can tell, with high accuracy, whether a specific customer's record was in the training set. Her panel asks for the cheapest mitigation that keeps the API useful.
5. Robustness & Defences
This section separates candidates who have only read attacks from those who can defend and — harder — evaluate defences honestly. Most published defences fail because they were tested against weak or non-adaptive attacks.
Anchor your answers in adversarial training (Madry), certified defences, and the NIST AI 100-2e2025 taxonomy.
Q27 What is adversarial training, and why is it the strongest empirical defence? L2
Adversarial training (Madry 2018) trains the model on adversarial examples, not just clean data. Each step generates strong perturbations (typically PGD) and trains the model to classify them correctly, framed as a min-max problem: minimise loss against the worst-case input within the perturbation budget.
It is the most reliable empirical defence because the model literally learns the perturbed distribution rather than a fragile shortcut. Costs are real: training is much slower (PGD at every step), it usually trades some clean accuracy for robustness, and the robustness holds mainly within the budget it was trained on. Still, it survives adaptive attacks better than most alternatives.
Q28 What is gradient masking / obfuscated gradients, and why is it the classic defence trap? L3
Gradient masking is when a defence makes gradients useless to the attacker — by shattering them, randomising, or saturating — so gradient-based attacks like FGSM/PGD fail. It looks robust on those tests, but it has not removed adversarial examples; it has just hidden the path to them.
Athalye 2018 ("Obfuscated Gradients") showed many defences relied on this and fell to adaptive attacks like BPDA (approximate the masked gradient) or transfer attacks. Red flags: robustness to PGD but not to black-box/transfer, or to weak but not strong attacks. The lesson: passing a gradient attack is not proof of robustness.
Q29 How would you evaluate a robustness claim honestly? L3
Assume the defence is broken until proven otherwise. Use adaptive attacks — design the attack against this specific defence, not a default one. Run a strong suite like AutoAttack (an ensemble: APGD-CE, APGD-DLR, FAB, Square) which is parameter-free and resists tuning tricks. Include a black-box / transfer attack to catch gradient masking.
Report robust accuracy at a stated perturbation budget and threat model, use enough PGD steps and restarts, and sanity-check: if robust accuracy does not fall as epsilon rises, something is masking gradients. Toolkits: Adversarial Robustness Toolbox (ART), AutoAttack, plus garak/PyRIT for GenAI red-teaming.
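The epsilon sanity check and a pass/fail threshold can be combined into one small report function. This is a sketch with hypothetical names and numbers; the robust-accuracy figures would come from whatever attack suite you run, and the 30% threshold is only an example.

```python
def evaluate_robustness_report(robust_acc_by_eps, gate_eps, min_robust_acc):
    """Check a {epsilon: robust accuracy} report.

    Returns (passed, warnings). A warning is raised whenever robust accuracy
    fails to fall as the budget grows, a common sign of gradient masking.
    """
    warnings = []
    eps_sorted = sorted(robust_acc_by_eps)
    for smaller, larger in zip(eps_sorted, eps_sorted[1:]):
        if robust_acc_by_eps[larger] >= robust_acc_by_eps[smaller] > 0.0:
            warnings.append(
                f"accuracy did not fall from eps={smaller:.4f} to eps={larger:.4f}")
    passed = not warnings and robust_acc_by_eps[gate_eps] >= min_robust_acc
    return passed, warnings

# Hypothetical evaluation results.
healthy = {0.0: 0.94, 2 / 255: 0.71, 4 / 255: 0.52, 8 / 255: 0.31}
suspicious = {0.0: 0.94, 2 / 255: 0.90, 4 / 255: 0.90, 8 / 255: 0.91}

healthy_result = evaluate_robustness_report(healthy, 8 / 255, 0.30)
suspicious_result = evaluate_robustness_report(suspicious, 8 / 255, 0.30)
```

The second report has higher numbers but fails, because accuracy that refuses to drop under a growing budget says more about the attack being blocked than about the model being robust.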
Q30 What is randomised smoothing, and how does it differ from adversarial training? L3
Randomised smoothing (Cohen 2019) is a certified defence. You add Gaussian noise to the input many times, take a majority vote, and turn the base classifier into a smoothed one. From the vote margin you derive a provable L2 radius within which the prediction cannot change.
The difference: adversarial training gives empirical robustness (works against tested attacks, no guarantee); randomised smoothing gives a mathematical certificate that no perturbation under that radius flips the label. The trade-offs are inference cost (many noisy forward passes) and a certified radius that is often modest. Use certified defences when you need provable guarantees, not just empirical resistance.
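The noise-and-vote procedure and its certificate can be sketched in plain Python. This is a simplified illustration: it uses a Hoeffding lower bound on the top-class probability and the same samples for picking and certifying the class, whereas the published procedure uses separate samples and a tighter binomial bound. The base classifier and all parameter values are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import NormalDist

def smoothed_predict(base_classifier, x, sigma, n, alpha, rng):
    """Majority vote under Gaussian noise plus a certified L2 radius.

    Returns (label, radius), or (None, 0.0) to abstain.
    """
    votes = Counter(
        base_classifier([xi + rng.gauss(0.0, sigma) for xi in x])
        for _ in range(n)
    )
    label, count = votes.most_common(1)[0]
    # Hoeffding lower confidence bound on the top-class probability.
    p_lower = count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0
    # Certified radius: sigma * inverse normal CDF of the lower bound.
    return label, sigma * NormalDist().inv_cdf(p_lower)

def base_classifier(x):
    """Hypothetical base model: class 1 when the first feature is positive."""
    return 1 if x[0] > 0 else 0

label, radius = smoothed_predict(base_classifier, [1.0, 0.0], sigma=0.5,
                                 n=2000, alpha=0.001, rng=random.Random(0))
```

The input sits at distance 1.0 from the true boundary, and the certified radius comes out smaller than that, which illustrates why certified radii are often modest.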
Q31 Why are input transforms and detection-only defences considered weak? L2
Input transforms — JPEG compression, bit-depth reduction, blurring, denoising — try to scrub the perturbation before inference. Detectors try to flag adversarial inputs. Both seemed promising but mostly fall to adaptive attacks: once the attacker knows the transform or detector, they optimise through it (BPDA for non-differentiable transforms), and many relied on gradient masking.
They are not worthless as defence-in-depth layers that raise attacker cost, but you should never present them as a primary, sufficient defence. The senior answer: pair lightweight transforms/detection with adversarial training or certified methods, and always re-test under adaptive attacks.
Q32 How does the NIST AI 100-2 taxonomy organise adversarial-ML attacks, and why use it in interviews? L2
NIST AI 100-2e2025 (the 2025 edition) is the common-language taxonomy for adversarial ML. It splits systems into predictive AI (PredAI) and generative AI (GenAI), and classes attacks by stage and goal: evasion (inference-time), poisoning (training-time, incl. backdoors), and privacy (extraction, inversion, membership inference), plus GenAI-specific abuse and prompt-injection threats.
It also frames attacker goals, capabilities and knowledge. Use it because it shows panels you can reason about threat models structurally and speak the standard vocabulary, alongside MITRE ATLAS and the OWASP Top 10 for LLM Apps 2025.
Q33 Design a layered defence for a production fraud model exposed via an API. L3
Layer by lifecycle. Data/training: vet and sign datasets, scan with ModelScan, check for poisoning (spectral signatures), and consider DP-SGD if records are sensitive. Model: adversarial training (PGD) so evasion is harder, plus regularisation to cut memorisation. API: top-1 output only, rate-limit and per-key quotas, auth, and query-pattern monitoring for probing and extraction.
Wrap it in governance: map controls to NIST AI RMF (GOVERN/MAP/MEASURE/MANAGE) and the NIST AI 100-2 taxonomy, red-team regularly with ART/PyRIT, and evaluate robustness with adaptive attacks and AutoAttack — never a single test. Defence-in-depth, because no single layer holds.
Divya reports that her vision model at a Hyderabad SOC scores 99% on a clean test set but collapses to 12% under an ART adversarial evaluation, yet leadership keeps calling it production-ready. Predict what is wrong and the one change to the lifecycle that fixes the blind spot.
PGD/FGSM evaluation exposes the gap NIST AI RMF MEASURE is meant to catch. The one lifecycle change: make adversarial robustness testing a release gate. Run the Adversarial Robustness Toolbox (or garak / PyRIT for GenAI systems) against every candidate, track robust accuracy alongside clean accuracy, and block promotion if robust accuracy is below an agreed threshold. Verify by re-running the same attack suite in CI and confirming the gate fails the brittle build and passes only after adversarial training or input hardening raises robust accuracy.
⚡ Adversarial ML last-minute cheat-sheet
threat model first.
x+epsilon*sign(grad). PGD iterative + projection (benchmark). C&W minimum-norm, breaks weak defences.
<1% poison suffices.
Glossary: terms an interviewer will probe
- Adversarial example: Input with a small crafted perturbation that fools a model while looking normal to a human.
- FGSM: Fast Gradient Sign Method, a one-step attack adding epsilon times the sign of the input gradient.
- PGD: Projected Gradient Descent, an iterative FGSM that projects back into the budget ball; the standard benchmark.
- C&W: Carlini & Wagner, a minimum-perturbation optimisation attack used to honestly test defences.
- L-inf / L2 budget: Limit on perturbation size: max per-feature change (L-inf) or overall Euclidean change (L2).
- Transfer attack: Adversarial example crafted on a surrogate model that also fools the real black-box target.
- Data poisoning: Training-time attack that corrupts training data to make the model faulty or controllable.
- Clean-label poisoning: Poison samples with correct labels but perturbed features, defeating label audits.
- Backdoor / BadNets: Hidden trigger-to-target shortcut planted in training; clean accuracy stays normal, and the shortcut fires on the trigger.
- Neural Cleanse: Backdoor detector that reverse-engineers an abnormally small trigger that flips a class.
- Model extraction: Querying an API to train a surrogate that copies the model's function.
- Model inversion: Reconstructing representative private training inputs from a model's outputs.
- Membership inference: Determining whether a specific record was in the training set from model outputs.
- Shadow model: Imitation model with known membership, used to train a membership-inference attack classifier.
- Differential privacy / DP-SGD: Training with clipped gradients plus calibrated noise so single records barely affect the model.
- Randomised smoothing: Certified defence: noise-and-vote yields a provable L2 radius of guaranteed stability.
- Gradient masking: Defence that hides gradients so attacks fail, without removing the vulnerability.
- AutoAttack: Parameter-free ensemble (APGD, FAB, Square) for honest robustness evaluation.
- NIST AI 100-2: NIST taxonomy of adversarial-ML attacks/mitigations across PredAI and GenAI (2025 edition).
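The DP-SGD entry in the glossary ("clipped gradients plus calibrated noise") can be sketched as a single update step. This is a minimal pure-Python illustration, assuming per-example gradients are already available as lists; `dp_sgd_step`, `clip_gradient` and the parameter names are hypothetical. It does not compute a privacy budget: a real deployment would pair this with a privacy accountant from a maintained library.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each gradient, sum, add Gaussian noise, average."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        clipped = clip_gradient(grad, clip_norm)
        summed = [s + c for s, c in zip(summed, clipped)]
    # Noise is calibrated to the clipping bound, i.e. the most one record can add.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Illustrative call with made-up gradients; the first one exceeds the clip norm.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
print(new_params)
```

Clipping bounds how much any single record can move the model, and the noise hides whatever influence remains, which is what blunts membership inference and memorisation.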
Lock it in — explain it in your own words
📝 Self-explain · 2 minutes
In two sentences, explain the difference between data poisoning and an evasion (adversarial example) attack, and say which one a clean-data accuracy test will completely miss.
📋 Final assessment — 10 questions, 70% to pass
1 Remember · 3 Apply · 4 Analyze · 2 Evaluate. Pass and the lesson stamps as complete on your profile.
In the NIST AI 100-2 adversarial-ML taxonomy, which attack class crafts a perturbed input at inference time to force a misclassification?
Aditya at an Infosys account exposes a sentiment API. He must reduce model-extraction risk while keeping the API usable for paying clients. Which single change helps most?
Neha fine-tunes a spam filter at a Chennai ITES on a scraped dataset. After deployment, any email containing the phrase green-lotus-42 is always marked not spam, while normal accuracy looks fine. Which control directly addresses the root cause?
Vikram at a Flipkart team pulls a pre-trained checkpoint from a public hub to ship fast. Before loading it into production, which step best reduces ML-supply-chain risk?
Sneha's image model at a TCS account scores 98% on the clean test set but drops to 9% under an ART PGD evaluation. Routing, data drift and the serving stack all check out. What is the most likely root cause?
PGD evasion robustness. [...] an unsigned artifact is a supply-chain gap, unrelated to inference-time robustness.
At a Mumbai bank, a researcher with only API access reconstructs recognisable facial features for a given identity class from Priya's face-recognition model. The training data was never exposed. Which attack best fits?
Karthik at a Pune fintech sees one API key sending millions of near-uniform, space-filling, label-only queries, after which a competitor ships a near-identical model. Accuracy and uptime are normal. What is happening?
Divya at a Hyderabad SOC finds a credit model leaks, with high accuracy, whether a specific person's record was in the training set, and the API returns full softmax confidences. Which factor most enables this attack?
A Bangalore AI startup architect argues: "Our model hit 99% on the held-out test set, so it is secure enough to ship to the bank." Aman must judge this for the panel. What is the best assessment?
For a Pune fintech, a manager says: "To stop model extraction, just turn off the public API entirely and only allow our internal app to call it." Ananya must respond to the panel. Which judgement is soundest?
Sources cited inline (re-checked 2026-06)
- NIST AI 100-2e2025, Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations: https://doi.org/10.6028/NIST.AI.100-2e2025
- NIST AI Risk Management Framework (AI RMF 1.0, GOVERN/MAP/MEASURE/MANAGE): https://www.nist.gov/itl/ai-risk-management-framework
- MITRE ATLAS, Adversarial Threat Landscape for AI Systems: https://atlas.mitre.org/
- OWASP Top 10 for LLM Applications 2025: https://genai.owasp.org/llm-top-10/
- Goodfellow et al., Explaining and Harnessing Adversarial Examples (FGSM): https://arxiv.org/abs/1412.6572
- Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks (PGD / adversarial training): https://arxiv.org/abs/1706.06083
- Carlini & Wagner, Towards Evaluating the Robustness of Neural Networks (C&W): https://arxiv.org/abs/1608.04644
- Athalye et al., Obfuscated Gradients Give a False Sense of Security: https://arxiv.org/abs/1802.00420
- Shokri et al., Membership Inference Attacks Against Machine Learning Models: https://arxiv.org/abs/1610.05820
- Cohen et al., Certified Adversarial Robustness via Randomized Smoothing: https://arxiv.org/abs/1902.02918
- Croce & Hein, AutoAttack: Reliable Evaluation of Adversarial Robustness: https://arxiv.org/abs/2003.01690
- Gu et al., BadNets: Identifying Vulnerabilities in the ML Model Supply Chain: https://arxiv.org/abs/1708.06733
Next lesson · Adversarial ML — GenAI red-teaming with PyRIT and garak
Move from predictive-model attacks to LLM red-teaming: prompt injection, jailbreaks, training-data extraction and tool-abuse, run with PyRIT, garak and NeMo Guardrails against the OWASP LLM Top 10 2025.