Forcepoint DLP Classifiers — Regex, Dictionaries, EDM, IDM, ML & OCR

A Forcepoint DLP policy is only as good as the classifiers behind it. This lesson walks every layer — regex and key phrases, weighted dictionaries, EDM fingerprints for structured records, IDM fingerprints for documents, machine learning for fuzzy content and OCR for images — and shows how thresholds, proximity and exclusion lists decide whether you catch real leaks or drown the SOC in noise.

📅 2026-06-18 · ⏱ 16 min · 5 infographics · live match demo · 🏷 10-Q assessment + AI Tutor inline

⚡ Quick Answer

A clear, interactive guide to Forcepoint DLP content classifiers (2026): regex and key-phrase patterns, weighted dictionaries, Exact Data Match (EDM) for structured records, Indexed Document Matching (IDM) for files, machine-learning classifiers and OCR for images — plus how to tune thresholds, proximity and exclusion lists to crush false positives.

Pick where you want to start

1. Why layer classifiers: each classifier type catches a different data shape.
2. Pattern & dictionary: regex, key phrases, weighted dictionaries, exclusions.
3. Fingerprinting, EDM vs IDM: hashed records vs percentage document matching.
4. ML, OCR & tuning: training sets, image text, threshold tuning.

🧠 Warm-up — 3 questions, no score

Just notice which ones make you pause. We answer all three inside the lesson.

1. Can one regex pattern protect every kind of sensitive data?

Answered in Why layer classifiers.

2. What best protects exact records from a customer database?

Answered in Fingerprinting: EDM vs IDM.

3. How does Forcepoint inspect text inside a screenshot?

Answered in ML, OCR & tuning.

Most engineers think…

Most people assume DLP detection is 'just a regex for card numbers'. That mental model produces thousands of false positives and gets your policy switched off within a week.

Forcepoint DLP uses a layered set of content classifiers: pattern (regex), key phrase and weighted dictionaries for format and topic; EDM and IDM fingerprints for exact records and real documents; machine learning for fuzzy, evolving content; and OCR to pull text out of images. Knowing which classifier fits which data shape — and how to tune thresholds, proximity and exclusion lists — is the real skill that separates a noisy deployment from a precise one.

① Why one classifier is never enough

Forcepoint DLP decides what is sensitive by running content through a layered set of content classifiers attached to a policy. The reason there are so many types is simple: sensitive data comes in different shapes, and no single matcher fits them all.

A national ID has a fixed format, so a regex fits. A medical report is topical, so a weighted dictionary fits. A customer database holds exact real values, so EDM fits. A design document gets partially copied, so IDM fits. Source code is fuzzy and evolving, so machine learning fits. And when any of these leaks as a screenshot or scan, OCR turns the pixels back into text so the same classifiers can read it.

A single policy can combine several classifier types, using each as either a match or an exception. That layering — plus careful threshold and exclusion tuning — is the main lever for balancing detection against false positives.
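The match-or-exception idea can be sketched in a few lines of Python. This is an illustration only: the `Policy` class, the three classifier functions and the rule "any match fires unless an exception applies" are assumptions made for the example, not Forcepoint's actual condition logic, which is configured in the product.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

Classifier = Callable[[str], bool]

def has_id_pattern(text: str) -> bool:
    """Pattern classifier: a fixed-format ID such as 123-45-6789."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) is not None

def has_key_phrase(text: str) -> bool:
    """Key-phrase classifier: an exact distinctive marker."""
    return "strictly confidential" in text.lower()

def is_training_sample(text: str) -> bool:
    """Hypothetical exception: content labelled as dummy training data."""
    return "sample data for training" in text.lower()

@dataclass
class Policy:
    matches: list[Classifier] = field(default_factory=list)
    exceptions: list[Classifier] = field(default_factory=list)

    def triggers(self, content: str) -> bool:
        # Assumed logic: an exception always wins, otherwise any match fires.
        if any(exception(content) for exception in self.exceptions):
            return False
        return any(match(content) for match in self.matches)

policy = Policy(matches=[has_id_pattern, has_key_phrase],
                exceptions=[is_training_sample])
print(policy.triggers("Employee ID 123-45-6789 attached"))
print(policy.triggers("Sample data for training: 123-45-6789"))
```

The same classifier function can sit in either list, which is the point of the lesson's "match or exception" wording.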

Figure 1: The classifier layers, shallow to deep. Forcepoint stacks classifier types from simple pattern matching to precise fingerprinting and learned models.
  1. Pattern & key phrase: regex and exact phrases for fixed formats
  2. Dictionaries: weighted word lists with match thresholds
  3. EDM / IDM fingerprints: exact records and percentage document matches
  4. Machine learning + OCR: fuzzy content and text pulled from images

Figure 2: Match the classifier to the data shape. Each kind of sensitive data has a natural classifier; pick by shape, not habit.
  Formatted ID: regex + checksum
  Topical text: dictionary
  Real records: EDM fingerprint
  Documents: IDM fingerprint
  Images: OCR then classify
Quick check · Q1 of 10 · Understand

Why does Forcepoint DLP offer several classifier types instead of one?

Correct: b. Formatted IDs suit regex, topical text suits dictionaries, exact records suit EDM, documents suit IDM, fuzzy content suits ML and images need OCR. One matcher cannot fit every data shape, so policies layer classifiers.
👉 So far: Forcepoint layers classifiers because sensitive data has different shapes — regex for formats, dictionaries for topics, EDM for records, IDM for documents, ML for fuzzy content, OCR for images. One policy combines them as matches or exceptions.

② Pattern, key-phrase and dictionary classifiers

The first layer matches on format and words. A regex (pattern) classifier identifies alphanumeric strings of a fixed format — like 123-45-6789 or a PAN/Aadhaar-style ID. Built-in patterns ship with validation scripts and checksums, and support a 'Pattern to exclude' and phrase-exclusion lists so you can strip out look-alikes that aren't really sensitive.
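A minimal sketch of "pattern plus validation", assuming a card-number pattern and the Luhn checksum as the validation step. The regex and function names are illustrative; the validation scripts shipped with Forcepoint's built-in patterns are not reproduced here.

```python
import re

# Illustrative card-number shape: 13 to 19 digits, optionally split by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """The regex proposes candidates; the checksum discards look-alikes."""
    candidates = (m.group(0) for m in CARD_PATTERN.finditer(text))
    return [c for c in candidates if luhn_valid(c)]

sample = "Order 4111 1111 1111 1111 shipped; tracking 1234 5678 9012 3456."
print(find_card_numbers(sample))  # ['4111 1111 1111 1111']
```

Both numbers in the sample have the right shape, but only the first passes the checksum, so the tracking number never becomes an incident.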

A key-phrase classifier flags specific exact phrases — 'Project Falcon', 'Strictly Confidential' — useful for short, distinctive markers. A dictionary classifier is a weighted word list with a match threshold: many are built in (medical conditions, financial terms), and admins create custom lists used as classifier or exception. Topical sensitivity is recognised when enough related terms appear together.
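A weighted dictionary with a threshold can be pictured as follows. The terms, weights, threshold and the choice to count each distinct term once are all assumptions made for this sketch, not values or behaviour taken from a Forcepoint built-in dictionary.

```python
import re

# Hypothetical weighted dictionary: term -> weight (illustrative values only).
MEDICAL_TERMS = {"diagnosis": 3, "biopsy": 4, "prescription": 2,
                 "patient": 1, "oncology": 4}
THRESHOLD = 6  # hypothetical score needed before the dictionary counts as matched

def dictionary_score(text: str, weights: dict[str, int]) -> int:
    """Sum the weights of the distinct dictionary terms found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for term, weight in weights.items() if term in words)

def dictionary_matches(text: str, weights: dict[str, int], threshold: int) -> bool:
    """The classifier fires only when enough related terms appear together."""
    return dictionary_score(text, weights) >= threshold

casual = "The patient picked up a prescription."
report = "Patient biopsy confirms the oncology diagnosis."
print(dictionary_score(casual, MEDICAL_TERMS), dictionary_matches(casual, MEDICAL_TERMS, THRESHOLD))  # 3 False
print(dictionary_score(report, MEDICAL_TERMS), dictionary_matches(report, MEDICAL_TERMS, THRESHOLD))  # 12 True
```

A single low-weight word never trips the rule; a cluster of high-weight terms does.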

Controlling the noise

The classic mistake is a bare regex with no validation and an empty exclusion list — it matches any 10-character string and floods the SOC. The fixes are to enable the checksum/validation script, add 'Pattern to exclude' and phrase-exclusion lists, and raise the minimum number of matches required before the rule fires.
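The effect of an exclusion pattern and a minimum-match count can be shown with a small sketch. The PAN shape used here (five letters, four digits, one letter) is the standard format; the excluded prefixes, the threshold of two and the decision to count distinct values are hypothetical choices for the example.

```python
import re

PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")  # 5 letters, 4 digits, 1 letter
EXCLUDE_PATTERN = re.compile(r"^(?:INVNO|PRODX)")       # hypothetical benign prefixes
MIN_MATCHES = 2                                         # hypothetical minimum-match threshold

def pan_hits(text: str) -> list[str]:
    """Keep PAN-shaped strings that the exclusion pattern does not cover."""
    return [hit for hit in PAN_PATTERN.findall(text) if not EXCLUDE_PATTERN.match(hit)]

def rule_fires(text: str) -> bool:
    """Fire only when enough distinct, non-excluded values are present."""
    return len(set(pan_hits(text))) >= MIN_MATCHES

invoice = "Invoice INVNO1234A for product PRODX0042B"
export = "Customers: ABCPE1234F, XYZPK5678L"
print(pan_hits(invoice), rule_fires(invoice))  # [] False
print(pan_hits(export), rule_fires(export))    # ['ABCPE1234F', 'XYZPK5678L'] True
```

The invoice and product codes have exactly the PAN shape, which is why tightening the regex alone would not help; the exclusion list is what removes them.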

Regex / pattern classifier
Matches formatted identifiers like PAN, Aadhaar or card numbers. Pair with checksum validation and 'Pattern to exclude' lists to cut noise.

EDM (Exact Data Match)
Exact matching of real structured record values via a non-reversible hashed index. Scales to tens of millions of rows with near-zero false positives.

IDM (Indexed Document Matching)
Percentage-based rolling-hash match for whole or partial unstructured documents. Files under ~300 chars match only as an exact whole-file fingerprint.

OCR server
Converts image pixels (JPEG, PNG, TIFF, scanned PDFs) into text so existing classifiers can inspect it. No special OCR policy attribute is needed.

Use exclusions before you loosen the pattern

When a regex is too noisy, don't widen or weaken the pattern. Add the checksum/validation script, fill in 'Pattern to exclude' and a phrase-exclusion list, and raise the minimum number of matches. That keeps real detections while killing look-alikes.

Quick check · Q2 of 10 · Apply

A bare regex for PAN numbers is firing thousands of false positives. What is the quickest fix?

Correct: c. Validation/checksum scripts, exclusion lists and a higher minimum-match threshold sharpen a noisy regex. For known record values you would then move to an EDM classifier.
👉 So far: Regex matches fixed formats; pair it with checksum validation, 'Pattern to exclude' and phrase-exclusion lists, and a minimum-match threshold. Dictionaries are weighted word lists with thresholds; key phrases are exact distinctive markers.

③ Fingerprinting — EDM for records, IDM for documents

Fingerprinting is where precision jumps. EDM (Exact Data Match) protects structured data such as database, CSV or Salesforce records. Forcepoint extracts and normalises the values, then secures them as a non-reversible hash. Detection is exact-value, supports combining columns with proximity logic, scales to tens of millions of rows, and matches against a known dataset with near-zero false positives. That means you can prove a real record leaked, not just that something looked like one.
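The extract, normalise, hash and match sequence can be sketched as below. Everything concrete here is an assumption for illustration: HMAC-SHA256 as the one-way hash, the demo key, whitespace tokenisation, and a proximity window counted in tokens. Forcepoint's real index format and matching engine are not described in this lesson.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"demo-only-key"  # hypothetical; stands in for whatever protects a real index

def normalise(value: str) -> str:
    """Lowercase and drop punctuation so 'ACC-1001' and 'acc1001' compare equal."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def fingerprint(value: str) -> str:
    """One-way keyed hash of the normalised value (HMAC-SHA256 is an assumption)."""
    return hmac.new(SECRET_KEY, normalise(value).encode(), hashlib.sha256).hexdigest()

def build_index(records: list[dict[str, str]], columns: list[str]) -> list[set[str]]:
    """Store only hashes: one set of column fingerprints per record."""
    return [{fingerprint(record[col]) for col in columns} for record in records]

def edm_match(text: str, index: list[set[str]], window: int = 10,
              min_columns: int = 2) -> list[int]:
    """Return ids of records with at least min_columns values within `window` tokens."""
    hashed = [fingerprint(token) for token in text.split()]
    matched = []
    for record_id, prints in enumerate(index):
        positions = [(i, h) for i, h in enumerate(hashed) if h in prints]
        for start, _ in positions:
            nearby = {h for i, h in positions if start <= i < start + window}
            if len(nearby) >= min_columns:
                matched.append(record_id)
                break
    return matched

records = [{"email": "asha@example.com", "account": "ACC-1001"},
           {"email": "ravi@example.com", "account": "ACC-2002"}]
index = build_index(records, ["email", "account"])
print(edm_match("Please refund asha@example.com on account ACC-1001 today.", index))  # [0]
print(edm_match("Contact asha@example.com about invoice ACC-9999", index))            # []
```

The second message contains one real value and one look-alike account number, so the two-column proximity rule does not fire. A format-only regex would have flagged both messages.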

IDM (Indexed Document Matching) protects unstructured files — Word, PowerPoint, PDF, CAD — using rolling hashes so partial or derivative copies match. The match is percentage-based (for example, 20% of the fingerprinted content present triggers a hit). Note the 300-character rule: files under roughly 300 characters match only as an exact whole-file fingerprint, so a tiny snippet won't reach the percentage match.
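Percentage matching over rolling hashes, plus the short-file rule, can be sketched like this. The window length, the Rabin-Karp style polynomial hash, SHA-256 for the whole-file case and the sample texts are assumptions for the example; only the 20% example threshold and the roughly 300-character limit come from the lesson.

```python
import hashlib

WINDOW = 20                      # hypothetical slice length in characters
BASE, MOD = 257, (1 << 61) - 1   # parameters of the illustrative polynomial hash
SHORT_FILE_LIMIT = 300           # the lesson's ~300-character rule

def rolling_hashes(text: str) -> set[int]:
    """Rabin-Karp style hashes of every WINDOW-character slice of the text."""
    if len(text) < WINDOW:
        return set()
    high = pow(BASE, WINDOW - 1, MOD)
    value = 0
    for ch in text[:WINDOW]:
        value = (value * BASE + ord(ch)) % MOD
    hashes = {value}
    for i in range(WINDOW, len(text)):
        value = (value - ord(text[i - WINDOW]) * high) % MOD  # drop the oldest character
        value = (value * BASE + ord(text[i])) % MOD           # append the newest character
        hashes.add(value)
    return hashes

def match_percentage(protected: str, outgoing: str) -> float:
    """Share of the protected document's slice hashes that appear in the outgoing text."""
    source = rolling_hashes(protected)
    if not source:
        return 0.0
    return 100.0 * len(source & rolling_hashes(outgoing)) / len(source)

def idm_match(protected: str, outgoing: str, threshold_pct: float = 20.0) -> bool:
    if len(protected) < SHORT_FILE_LIMIT:
        # Short files: only an exact whole-file fingerprint counts.
        digest = lambda s: hashlib.sha256(s.encode()).hexdigest()
        return digest(protected) == digest(outgoing)
    return match_percentage(protected, outgoing) >= threshold_pct

protected = " ".join(f"clause {i} covers design tolerance {i * 7}" for i in range(40))
partial_copy = "FYI: " + protected[:700] + " thanks"
unrelated = "Lunch menu for Friday: soup, salad and sandwiches. " * 10
short_doc = "Project Falcon launch code 7731"

print(idm_match(protected, partial_copy))              # True
print(idm_match(protected, unrelated))                 # False
print(idm_match(short_doc, short_doc + " (forward)"))  # False
```

The partial copy carries well over 20% of the original slices, so it matches even with text added around it. The short document only matches when the outgoing content is identical, which mirrors the whole-file rule.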

The interview line: regex matches the format; fingerprints match the actual content. Use EDM when you must catch known record values with almost no false positives, and IDM when partial copy-paste of a document must be caught.

Figure 3: EDM vs IDM at a glance. Both are fingerprints, but EDM matches exact structured values while IDM scores document similarity.
  EDM (structured): database, CSV, Salesforce records; exact-value, normalised hash; combines columns + proximity; scales to tens of millions of rows.
  IDM (unstructured): Word, PowerPoint, PDF, CAD files; rolling hashes, percentage match; catches partial copy-paste; under 300 chars = exact whole-file.

Figure 4: One policy, many classifier types. A single Forcepoint DLP policy can combine several classifier types as matches or exceptions: regex pattern, key phrase, dictionary, EDM, IDM, ML + OCR.
Treating regex and EDM as interchangeable

Regex only matches the format of a value; it cannot tell a real customer record from a random number in the same shape. EDM matches the actual hashed values from your dataset. If the interviewer asks 'how do you prove a real record leaked', the answer is EDM, not a cleverer regex.

How a leaked customer record gets matched by EDM

How content moves from a raw match attempt to a high-confidence EDM hit.

① Capture: A spreadsheet of customer records is uploaded and its text is extracted for inspection by the policy.
② Pattern check: A regex flags lots of card-shaped numbers, but format alone can't tell real records from look-alikes.
③ EDM match: The EDM classifier compares values against the hashed index of the real dataset and confirms genuine records.
④ Verdict: Proximity and minimum-match thresholds are met, so a high-confidence, low-false-positive match is returned.
Quick check · Q3 of 10 · Remember

Which classifier protects exact values from a customer database with near-zero false positives?

Correct: a. EDM fingerprints structured record values as a non-reversible hash and matches exact values, so it proves a real record leaked — not just that something looked like one. IDM is for documents; regex only matches format.
👉 So far: EDM = exact-value match of normalised, hashed structured records (scales to tens of millions of rows). IDM = percentage rolling-hash match of unstructured documents; files under ~300 chars match only as an exact whole-file fingerprint.

④ Machine learning, OCR and false-positive tuning

Machine-learning classifiers handle content you can't express as a pattern. You train them on a positive set (examples to protect) and a negative set (examples to ignore); Forcepoint then estimates the expected false positives and false negatives and reports an accuracy level. One key limit: ML works only on unstructured file-system data, not on databases, SharePoint or Domino.
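To make the positive-set and negative-set idea concrete, here is a toy text classifier. It uses a naive Bayes style log-odds score as a stand-in; the lesson does not say which model Forcepoint uses, and the training documents, tokeniser and held-out examples are all invented for the sketch.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z_]+", text.lower())

def train(positive: list[str], negative: list[str]):
    """Count tokens in the 'protect' set and the 'ignore' set."""
    pos = Counter(t for doc in positive for t in tokens(doc))
    neg = Counter(t for doc in negative for t in tokens(doc))
    return pos, neg, set(pos) | set(neg)

def log_odds(text: str, model) -> float:
    """Positive score: the text looks more like the protect set than the ignore set."""
    pos, neg, vocab = model
    pos_total = sum(pos.values()) + len(vocab)  # Laplace smoothing
    neg_total = sum(neg.values()) + len(vocab)
    score = 0.0
    for t in tokens(text):
        if t in vocab:
            score += math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
    return score

def evaluate(model, labelled: list[tuple[str, bool]]) -> dict:
    """Count false positives and false negatives on held-out examples."""
    fp = fn = 0
    for text, is_sensitive in labelled:
        predicted = log_odds(text, model) > 0
        if predicted and not is_sensitive:
            fp += 1
        elif is_sensitive and not predicted:
            fn += 1
    accuracy = (len(labelled) - fp - fn) / len(labelled)
    return {"false_positives": fp, "false_negatives": fn, "accuracy": accuracy}

POSITIVE = ["def rotate_key(secret): return hmac(secret)",
            "class LedgerEngine: def post(self, entry): pass",
            "import ledger_engine; ledger_engine.post(entry)"]
NEGATIVE = ["Team lunch is on Friday at noon",
            "Please review the holiday schedule",
            "The quarterly newsletter is attached"]
HELD_OUT = [("def post_entry(entry): return ledger_engine.post(entry)", True),
            ("Friday lunch schedule is attached", False)]

model = train(POSITIVE, NEGATIVE)
print(evaluate(model, HELD_OUT))
```

With training sets this small the numbers mean little; the point is the workflow of training on both sets and then reading the false-positive, false-negative and accuracy figures before trusting the classifier.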

OCR extracts text from images — JPEG, PNG, GIF, BMP, TIFF, JPEG-2000, scanned-only PDFs and images embedded in Office docs — on a dedicated OCR server. The extracted text is then scanned by the same active policies: there is no special OCR policy attribute. Quality matters — aim for 300 DPI, or 400–600 DPI for small fonts.

Trading precision against recall

Every classifier has the same tuning levers: thresholds (how many matches), proximity (how close elements must be), and exclusion lists. Tighten them and you raise precision but risk missing real leaks; loosen them and you catch more but flood analysts. The discipline is to baseline, read the incidents, and move broad regex rules toward EDM/IDM where the real data lives.
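The precision-versus-recall trade can be seen by sweeping a minimum-match threshold over a labelled baseline. The baseline data below is invented for the sketch; in practice it would come from reading real incidents during the baseline phase.

```python
# Hypothetical labelled baseline: (classifier matches in the item, was it a real leak?)
baseline = [(1, False), (1, False), (2, False), (2, True),
            (3, False), (4, True), (5, True), (8, True)]

def precision_recall(samples: list[tuple[int, bool]], min_matches: int) -> tuple[float, float]:
    """Precision and recall if the rule fires at `min_matches` or more."""
    tp = sum(1 for n, real in samples if n >= min_matches and real)
    fp = sum(1 for n, real in samples if n >= min_matches and not real)
    fn = sum(1 for n, real in samples if n < min_matches and real)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

for threshold in (1, 2, 3, 4):
    precision, recall = precision_recall(baseline, threshold)
    print(f"min matches {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# min matches 1: precision 0.50, recall 1.00
# min matches 2: precision 0.67, recall 1.00
# min matches 3: precision 0.75, recall 0.75
# min matches 4: precision 1.00, recall 0.75
```

In this made-up baseline, raising the threshold from 1 to 2 removes noise at no cost, while going to 3 or 4 starts to miss a real leak. That is the kind of evidence to look for before tightening a rule.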

Figure 5: From noisy regex to a tuned classifier. The path from a false-positive storm to a precise policy is baseline, exclude, fingerprint, verify.
  1. Baseline: audit, read incidents
  2. Validate: checksum + exclusions
  3. Fingerprint: move to EDM / IDM
  4. Threshold: raise min matches
  5. Verify: planted test file hits

Meera Krishnan, DLP admin at Cygnet Technologies, Bengaluru

Her PAN-number policy fires thousands of alerts a day, mostly on internal invoice templates and product codes — analysts are drowning.

Likely cause

The classifier is a bare regex with no validation/checksum and an empty 'Pattern to exclude', so it matches any 10-character alphanumeric string.

Diagnosis

In Forcepoint Security Manager she opens the regex classifier and sees no checksum validation and an empty exclusion list — almost every match is a benign reference or product code, not a real PAN.

Security Manager ▸ Main ▸ Policy Management ▸ Content Classifiers ▸ Patterns & Phrases
Fix

Enable the PAN checksum/validation script, add a phrase-exclusion list for invoice and product-code prefixes, raise the minimum-match threshold, and switch high-value customer records to an EDM classifier so only real database values match.

Verify

Re-run a Network Discovery task and check Main ▸ Reporting ▸ Incidents — alert volume drops sharply while a planted real-PAN test file is still detected.

Prove tuning worked with a planted test file

Never declare a classifier 'fixed' on volume alone. Plant a known-good test file with a real (test) record and confirm the incident report still catches it after you tighten thresholds and exclusions. Lower noise plus a confirmed true positive is the proof.

Quick check · Q4 of 10 · Understand

What do you need to configure so OCR-extracted text gets inspected?

Correct: b. OCR has no dedicated policy attribute. The OCR server extracts text from images and that text is then scanned by the same active classifiers you already run.
👉 So far: ML trains on positive and negative example sets and works only on unstructured file-system data. OCR extracts image text (best at 300 DPI) for the same active policies — no special OCR attribute. Tune thresholds, proximity and exclusions to trade precision against recall.

📝 Wrap-up assessment — six more

You've answered 4 inline. Six left. 70% (7 of 10) marks the lesson complete on your profile. Tap Submit all answers at the end.

Q5 · Remember

IDM fingerprints are designed for which data type?

Correct: a. IDM (Indexed Document Matching) indexes unstructured files using rolling hashes for percentage-based matching. Structured database records are EDM's job.
Q6 · Understand

EDM fingerprints are stored as:

Correct: d. EDM extracts and normalises the values, then secures them as a non-reversible hash, so the original data is not retained while still allowing exact-value matching.
Q7 · Apply

Files below roughly 300 characters in IDM are matched:

Correct: b. The 300-character rule: short files cannot reach a percentage-based match, so IDM only matches them as an exact whole-file fingerprint.
Q8 · Understand

Machine-learning classifiers in Forcepoint DLP cannot scan:

Correct: c. ML classifiers work only on unstructured file-system data; they do not run against databases, SharePoint or Domino sources.
Q9 · Remember

OCR in Forcepoint DLP is used to:

Correct: b. OCR turns image pixels into text on the OCR server; that text is then scanned by the same active policies. There is no special OCR policy attribute.
Q10 · Evaluate

An interviewer asks the best first step to reduce false positives on a noisy regex rule. Best answer?

Correct: c. Validation plus 'Pattern to exclude' and phrase-exclusion lists, with a higher minimum-match threshold, sharpens precision; for real record values you then switch to an EDM classifier. Deleting policies or touching OCR does nothing for regex noise.

🧠 In your own words

Type one line: why is 'just write a regex' the wrong answer for protecting a customer database? Then compare with the expert version.

Expert version: Because regex only matches the format of a value, not the value itself — it will flag every string shaped like a record, flooding the SOC with look-alikes, and it can never prove a real record leaked. For a known database you use an EDM classifier, which extracts, normalises and hashes the actual values and matches them exactly, scaling to tens of millions of rows with near-zero false positives. Regex, dictionaries and key phrases are the cheap first layer; EDM and IDM fingerprints, ML and OCR are how you get precision across records, documents, fuzzy content and images.

📖 Glossary

Content classifier
A rule type — regex, dictionary, fingerprint or ML — that decides whether content is sensitive, used as a match or an exception.
Regex classifier
Detects strings matching a defined character/format pattern, paired with checksum validation and exclusion lists to control noise.
Dictionary classifier
A weighted word list with a match threshold for topical detection, such as medical or financial terms.
Key phrase
A specific exact phrase used as a sensitivity marker, e.g. 'Strictly Confidential' or a project codename.
EDM (Exact Data Match)
Exact matching of real structured record values via a non-reversible hashed index; scales to tens of millions of rows.
IDM (Indexed Document Matching)
Percentage-based rolling-hash match for whole or partial unstructured documents; under ~300 chars it matches only as an exact whole file.
Fingerprint
A non-reversible hash of data used to recognise it later without storing the original content.
Proximity
EDM logic requiring multiple matched data elements to appear near each other before a match counts.
Training set
Positive (protect) and negative (ignore) example documents that teach a machine-learning classifier.
OCR server
The Forcepoint module that converts image pixels into text so existing classifiers can inspect screenshots, photos and scans.

What's next?

Comfortable with the classifiers? Next, see how Forcepoint turns a classifier match into a scored, routed incident — the full architecture from Security Manager and Policy Engine out to every enforcement point.