Most people assume DLP detection is 'just a regex for card numbers'. That mental model produces thousands of false positives and gets your policy switched off within a week.
Forcepoint DLP uses a layered set of content classifiers: pattern (regex), key phrase and weighted dictionaries for format and topic; EDM and IDM fingerprints for exact records and real documents; machine learning for fuzzy, evolving content; and OCR to pull text out of images. Knowing which classifier fits which data shape — and how to tune thresholds, proximity and exclusion lists — is the real skill that separates a noisy deployment from a precise one.
① Why one classifier is never enough
Forcepoint DLP decides what is sensitive by running content through a layered set of content classifiers attached to a policy. The reason there are so many types is simple: sensitive data comes in different shapes, and no single matcher fits them all.
A national ID has a fixed format, so a regex fits. A medical report is topical, so a weighted dictionary fits. A customer database holds exact real values, so EDM fits. A design document gets partially copied, so IDM fits. Source code is fuzzy and evolving, so machine learning fits. And when any of these leaks as a screenshot or scan, OCR turns the pixels back into text so the same classifiers can read it.
A single policy can combine several classifier types, using each as either a match or an exception. That layering — plus careful threshold and exclusion tuning — is the main lever for balancing detection against false positives.
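The layering can be sketched in a few lines of Python. This is only an illustration: `Rule`, `ssn_like`, `confidential_phrase` and `sample_data_marker` are hypothetical names, not Forcepoint's API, and the "all matches hit, no exception hits" logic is an assumed simplification of how a policy condition might be evaluated.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# In this sketch a classifier is just a function: text -> True if it matches.
Classifier = Callable[[str], bool]

def ssn_like(text: str) -> bool:
    """Pattern classifier: fixed-format ID such as 123-45-6789."""
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) is not None

def confidential_phrase(text: str) -> bool:
    """Key-phrase classifier: exact marker phrase, case-insensitive."""
    return "strictly confidential" in text.lower()

def sample_data_marker(text: str) -> bool:
    """Exception classifier: content explicitly labelled as sample data."""
    return "sample data" in text.lower()

@dataclass
class Rule:
    # Assumed semantics: fire when every match classifier hits
    # and no exception classifier hits.
    matches: List[Classifier]
    exceptions: List[Classifier] = field(default_factory=list)

    def fires(self, text: str) -> bool:
        if any(exc(text) for exc in self.exceptions):
            return False
        return all(m(text) for m in self.matches)

rule = Rule(matches=[ssn_like, confidential_phrase],
            exceptions=[sample_data_marker])

print(rule.fires("Strictly Confidential: employee 123-45-6789"))     # True
print(rule.fires("Sample data, strictly confidential 123-45-6789"))  # False
```

The exception is checked first so that a known-benign marker suppresses the rule no matter how many match classifiers hit.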
Why does Forcepoint DLP offer several classifier types instead of one?
② Pattern, key-phrase and dictionary classifiers
The first layer matches on format and words. A regex (pattern) classifier identifies alphanumeric strings of a fixed format, such as 123-45-6789 or a PAN/Aadhaar-style ID. Built-in patterns ship with validation scripts and checksums, and support a 'Pattern to exclude' list and phrase-exclusion lists so you can strip out look-alikes that aren't really sensitive.
A key-phrase classifier flags specific exact phrases, such as 'Project Falcon' or 'Strictly Confidential', which makes it useful for short, distinctive markers. A dictionary classifier is a weighted word list with a match threshold: many are built in (medical conditions, financial terms), and admins can create custom lists to use as a classifier or as an exception. Topical sensitivity is recognised when enough related terms appear together to reach the threshold.
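To make the weighted-dictionary idea concrete, here is a minimal Python sketch. The term list, the weights and the threshold are invented for illustration, and counting each term once per document is an assumption; the product may score repeated occurrences differently.

```python
import re

# Hypothetical weighted dictionary: term -> weight.
MEDICAL_TERMS = {
    "diagnosis": 3,
    "biopsy": 4,
    "chemotherapy": 5,
    "prescription": 2,
    "patient": 1,
}

def dictionary_score(text: str, weights: dict) -> int:
    """Sum the weights of the distinct dictionary terms found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(w for term, w in weights.items() if term in words)

def dictionary_matches(text: str, weights: dict, threshold: int) -> bool:
    """The classifier fires only when the total weight reaches the threshold."""
    return dictionary_score(text, weights) >= threshold

report = "Patient diagnosis confirmed by biopsy; chemotherapy to begin."
memo = "Please remind the patient to collect the parking pass."

print(dictionary_score(report, MEDICAL_TERMS))                 # 13
print(dictionary_matches(report, MEDICAL_TERMS, threshold=8))  # True
print(dictionary_matches(memo, MEDICAL_TERMS, threshold=8))    # False
```

A single low-weight word such as "patient" never reaches the threshold on its own, which is what keeps everyday mail from triggering a medical-data rule.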
Controlling the noise
The classic mistake is a bare regex with no validation and an empty exclusion list — it matches any 10-character string and floods the SOC. The fixes are to enable the checksum/validation script, add 'Pattern to exclude' and phrase-exclusion lists, and raise the minimum number of matches required before the rule fires.
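Those three fixes can be sketched in Python. This is an illustration, not a Forcepoint script: it uses 16-digit card numbers because the Luhn checksum is public, and the Luhn check stands in for whatever validation script a built-in pattern ships with. The `9999` exclusion prefix and the function names are hypothetical.

```python
import re

CARD_PATTERN = re.compile(r"\b\d{16}\b")

# Hypothetical 'pattern to exclude': internal codes that start with 9999.
EXCLUDE_PATTERN = re.compile(r"^9999")

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def validated_matches(text: str) -> list:
    """Keep only candidates that survive the exclusion list and the checksum."""
    hits = []
    for candidate in CARD_PATTERN.findall(text):
        if EXCLUDE_PATTERN.match(candidate):
            continue
        if luhn_valid(candidate):
            hits.append(candidate)
    return hits

def rule_fires(text: str, min_matches: int = 1) -> bool:
    """Fire only when enough validated matches are present."""
    return len(validated_matches(text)) >= min_matches

sample = "Order 1234567890123456, SKU 9999000000000004, card 4111111111111111"
print(validated_matches(sample))          # ['4111111111111111']
print(rule_fires(sample, min_matches=2))  # False
```

All three strings match the bare regex. The order number fails the checksum, the SKU passes the checksum but is removed by the exclusion pattern, and only the well-known test card number survives.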
Regex (pattern): Matches formatted identifiers like PAN, Aadhaar or card numbers. Pair with checksum validation and 'Pattern to exclude' lists to cut noise.
EDM: Exact matching of real structured record values via a non-reversible hashed index. Scales to tens of millions of rows with near-zero false positives.
IDM: Percentage-based rolling-hash match for whole or partial unstructured documents. Files under ~300 chars match only as an exact whole-file fingerprint.
OCR: Converts image pixels (JPEG, PNG, TIFF, scanned PDFs) into text so existing classifiers can inspect it. No special OCR policy attribute is needed.
When a regex is too noisy, don't widen or weaken the pattern. Add the checksum/validation script, fill in 'Pattern to exclude' and a phrase-exclusion list, and raise the minimum number of matches. That keeps real detections while killing look-alikes.
A bare regex for PAN numbers is firing thousands of false positives. What is the quickest fix?
③ Fingerprinting — EDM for records, IDM for documents
Fingerprinting is where precision jumps. EDM (Exact Data Match) protects structured data: database, CSV or Salesforce records. Forcepoint extracts and normalises the values, then stores them as non-reversible hashes. Detection is exact-value, supports combining columns with proximity logic, scales to tens of millions of rows, and produces near-zero false positives because it matches against a known dataset. That means you can prove a real record leaked, not just that something looked like one.
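A rough Python sketch of the EDM idea (normalise, hash, then require several columns of the same row close together) is shown below. It is not Forcepoint's algorithm: the keyed SHA-256 hash, the token window used as "proximity", the `SECRET` key and all function names are assumptions for illustration, and the sketch only handles single-token cell values.

```python
import hashlib
import hmac
import re

SECRET = b"demo-key"  # hypothetical key, for illustration only

def normalise(value: str) -> str:
    """Lower-case and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def fingerprint(value: str) -> str:
    """Keyed one-way hash of a normalised value."""
    return hmac.new(SECRET, normalise(value).encode(), hashlib.sha256).hexdigest()

def build_index(rows):
    """Map each hashed cell to the set of row ids that contain it."""
    index = {}
    for row_id, row in enumerate(rows):
        for cell in row:
            index.setdefault(fingerprint(cell), set()).add(row_id)
    return index

def edm_hits(text, index, min_columns=2, window=10):
    """Row ids with at least min_columns distinct cells within `window` tokens."""
    tokens = re.findall(r"[\w@.\-]+", text)
    found = {}  # row_id -> list of (token position, cell hash)
    for pos, tok in enumerate(tokens):
        h = fingerprint(tok)
        for row_id in index.get(h, ()):
            found.setdefault(row_id, []).append((pos, h))
    hits = set()
    for row_id, cells in found.items():
        for start, _ in cells:
            nearby = {h for p, h in cells if start <= p < start + window}
            if len(nearby) >= min_columns:
                hits.add(row_id)
                break
    return hits

# Made-up record; only the hashes are kept in the index.
index = build_index([("ABCPK1234F", "meera@example.com", "9876543210")])

print(edm_hits("Customer ABCPK1234F can be reached at meera@example.com.", index))  # {0}
print(edm_hits("ABCPK1234F is just a product code", index))                         # set()
```

One matching value on its own is ignored, which is the point of combining columns: a lone string in the right shape is not evidence that a real record left the building.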
IDM (Indexed Document Matching) protects unstructured files — Word, PowerPoint, PDF, CAD — using rolling hashes so partial or derivative copies match. The match is percentage-based (for example, 20% of the fingerprinted content present triggers a hit). Note the 300-character rule: files under roughly 300 characters match only as an exact whole-file fingerprint, so a tiny snippet won't reach the percentage match.
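The percentage match and the 300-character rule can be illustrated with a small Python sketch. It assumes a Rabin-Karp style rolling hash over 20-character windows and a plain SHA-256 for the whole-file fingerprint; the window size, hash parameters and function names are illustrative choices, not the product's internals. The 20% threshold is the example figure from the text.

```python
import hashlib

MIN_PARTIAL_CHARS = 300  # the ~300-character rule described above

def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting changes don't matter."""
    return " ".join(text.lower().split())

def rolling_hashes(text: str, k: int = 20, base: int = 257, mod: int = (1 << 61) - 1) -> set:
    """Polynomial rolling hash of every k-character window."""
    if len(text) < k:
        return set()
    high = pow(base, k - 1, mod)
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = {h}
    for i in range(k, len(text)):
        # Drop the oldest character, shift, add the newest one.
        h = ((h - ord(text[i - k]) * high) * base + ord(text[i])) % mod
        hashes.add(h)
    return hashes

def fingerprint_document(text: str, k: int = 20) -> dict:
    norm = normalise(text)
    return {
        "whole": hashlib.sha256(norm.encode()).hexdigest(),
        # Short files get no partial fingerprint, only the whole-file hash.
        "partial": rolling_hashes(norm, k) if len(norm) >= MIN_PARTIAL_CHARS else set(),
    }

def idm_match(candidate: str, fp: dict, threshold: float = 0.20, k: int = 20) -> bool:
    norm = normalise(candidate)
    if hashlib.sha256(norm.encode()).hexdigest() == fp["whole"]:
        return True
    if not fp["partial"]:
        return False
    overlap = len(fp["partial"] & rolling_hashes(norm, k))
    return overlap / len(fp["partial"]) >= threshold

source = " ".join(f"clause {i} covers component {i * 7}" for i in range(40))
fp = fingerprint_document(source)
print(idm_match(source[: len(source) // 2], fp))  # True: a partial copy still matches

short_fp = fingerprint_document("Project Falcon launch code")
print(idm_match("Project Falcon", short_fp))      # False: short file, exact match only
```

Because the short file has no partial fingerprint, a snippet of it can never reach the percentage threshold; only an exact copy of the whole file matches.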
The interview line: regex matches the format; fingerprints match the actual content. Use EDM when you must catch known record values with almost no false positives, and IDM when partial copy-paste of a document must be caught.
Regex only matches the format of a value; it cannot tell a real customer record from a random number in the same shape. EDM matches the actual hashed values from your dataset. If the interviewer asks 'how do you prove a real record leaked', the answer is EDM, not a cleverer regex.
Which classifier protects exact values from a customer database with near-zero false positives?
④ Machine learning, OCR and false-positive tuning
Machine-learning classifiers handle content you can't express as a pattern. You train them on a positive set (examples to protect) and a negative set (examples to ignore); Forcepoint then reports the expected false positives and false negatives along with an accuracy level. One key limit: ML works only on unstructured file-system data, not on databases, SharePoint or Domino.
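The source does not say how those ratings are calculated, so the Python sketch below only shows the standard definitions of false-positive rate, false-negative rate and accuracy on a labelled validation set. The `evaluate` function and the sample verdicts are hypothetical.

```python
def evaluate(predictions, labels):
    """Both arguments are lists of booleans; True means 'should be protected'."""
    pairs = list(zip(predictions, labels))
    tp = sum(p and l for p, l in pairs)              # protected, correctly flagged
    tn = sum((not p) and (not l) for p, l in pairs)  # benign, correctly ignored
    fp = sum(p and (not l) for p, l in pairs)        # benign, wrongly flagged
    fn = sum((not p) and l for p, l in pairs)        # protected, missed
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Four documents from the positive set, six from the negative set.
labels = [True] * 4 + [False] * 6
predictions = [True, True, True, False, True, False, False, False, False, False]

metrics = evaluate(predictions, labels)
print(metrics["false_negative_rate"])  # 0.25
print(metrics["accuracy"])             # 0.8
```

Here one protected document is missed and one benign document is flagged, so the classifier looks 80% accurate while still missing a quarter of the content it was trained to protect.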
OCR extracts text from images — JPEG, PNG, GIF, BMP, TIFF, JPEG-2000, scanned-only PDFs and images embedded in Office docs — on a dedicated OCR server. The extracted text is then scanned by the same active policies: there is no special OCR policy attribute. Quality matters — aim for 300 DPI, or 400–600 DPI for small fonts.
Trading precision against recall
Every classifier has the same tuning levers: thresholds (how many matches), proximity (how close elements must be), and exclusion lists. Tighten them and you raise precision but risk missing real leaks; loosen them and you catch more but flood analysts. The discipline is to baseline, read the incidents, and move broad regex rules toward EDM/IDM where the real data lives.
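The trade-off is easy to see with numbers. The Python sketch below sweeps a minimum-match threshold over a made-up baseline of eight documents; the match counts and analyst verdicts are invented, and `precision_recall` is a hypothetical helper, not a product feature.

```python
def precision_recall(match_counts, is_sensitive, min_matches):
    """Precision and recall if the rule fires at >= min_matches per document."""
    flagged = [c >= min_matches for c in match_counts]
    pairs = list(zip(flagged, is_sensitive))
    tp = sum(f and s for f, s in pairs)
    fp = sum(f and (not s) for f, s in pairs)
    fn = sum((not f) and s for f, s in pairs)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical baseline: validated matches per document, plus analyst verdicts.
match_counts = [1, 1, 2, 3, 5, 8, 1, 4]
is_sensitive = [False, False, False, True, True, True, True, False]

for threshold in (1, 3, 5):
    p, r = precision_recall(match_counts, is_sensitive, threshold)
    print(f"min_matches={threshold}: precision={p:.2f} recall={r:.2f}")
# min_matches=1: precision=0.50 recall=1.00
# min_matches=3: precision=0.75 recall=0.75
# min_matches=5: precision=1.00 recall=0.50
```

At a threshold of 5 every alert is real, but half of the sensitive documents slip through. That gap is why the baseline and incident review come before any tightening.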
Meera Krishnan, DLP admin at Cygnet Technologies, Bengaluru
Her PAN-number policy fires thousands of alerts a day, mostly on internal invoice templates and product codes — analysts are drowning.
The classifier is a bare regex with no validation/checksum and an empty 'Pattern to exclude', so it matches any 10-character alphanumeric string.
In Forcepoint Security Manager she opens the regex classifier and sees no checksum validation and an empty exclusion list — almost every match is a benign reference or product code, not a real PAN.
Security Manager ▸ Main ▸ Policy Management ▸ Content Classifiers ▸ Patterns & Phrases
Enable the PAN checksum/validation script, add a phrase-exclusion list for invoice and product-code prefixes, raise the minimum-match threshold, and switch high-value customer records to an EDM classifier so only real database values match.
Re-run a Network Discovery task and check Main ▸ Reporting ▸ Incidents — alert volume drops sharply while a planted real-PAN test file is still detected.
Never declare a classifier 'fixed' on volume alone. Plant a known-good test file with a real (test) record and confirm the incident report still catches it after you tighten thresholds and exclusions. Lower noise plus a confirmed true positive is the proof.
What do you need to configure so OCR-extracted text gets inspected?
🧠 In your own words
In one line: why is 'just write a regex' the wrong answer for protecting a customer database?
📖 Glossary
- Content classifier: A rule type (regex, dictionary, fingerprint or ML) that decides whether content is sensitive, used as a match or an exception.
- Regex classifier: Detects strings matching a defined character/format pattern, paired with checksum validation and exclusion lists to control noise.
- Dictionary classifier: A weighted word list with a match threshold for topical detection, such as medical or financial terms.
- Key phrase: A specific exact phrase used as a sensitivity marker, e.g. 'Strictly Confidential' or a project codename.
- EDM (Exact Data Match): Exact matching of real structured record values via a non-reversible hashed index; scales to tens of millions of rows.
- IDM (Indexed Document Matching): Percentage-based rolling-hash match for whole or partial unstructured documents; under ~300 chars it matches only as an exact whole file.
- Fingerprint: A non-reversible hash of data used to recognise it later without storing the original content.
- Proximity: EDM logic requiring multiple matched data elements to appear near each other before a match counts.
- Training set: Positive (protect) and negative (ignore) example documents that teach a machine-learning classifier.
- OCR server: The Forcepoint module that converts image pixels into text so existing classifiers can inspect screenshots, photos and scans.
📚 Sources
- Forcepoint Help — Classifying content: file properties, key phrases, scripts, regex patterns and dictionaries. help.forcepoint.com
- Forcepoint Help — Patterns & Phrases: validation scripts, checksums and exclusion lists. help.forcepoint.com
- Forcepoint Help — Exact Data Match (EDM) and Indexed Document Matching (IDM) fingerprinting. help.forcepoint.com
- Forcepoint Help — Machine learning classifiers: positive/negative training sets and accuracy. help.forcepoint.com
- Forcepoint Help — Configuring the OCR server for image text extraction. help.forcepoint.com
- Forcepoint — Forcepoint DLP and Forcepoint DLP Endpoint brochure. forcepoint.com
What's next?
Comfortable with the classifiers? Next, see how Forcepoint turns a classifier match into a scored, routed incident — the full architecture from Security Manager and Policy Engine out to every enforcement point.