Most engineers think…
Most people assume DLP detection is just 'a clever regex' — you write a pattern for card numbers or Aadhaar and call it done. That model is exactly why so many DLP projects flood the SOC with false alerts.
Forcepoint's strongest classifiers are fingerprints, not patterns. Exact Data Match (EDM) indexes your real structured records; Indexed Document Matching (IDM) indexes your real documents. Both keep only irreversible partial hashes in a central repository that syncs to every policy engine and to endpoints, so detection keeps working even when an endpoint is offline. The result: a match means a real record or a real document left, not just something shaped like one. Knowing how to build these classifiers, set their thresholds and tune them is what separates a noisy deployment from a quiet, trusted one.
① Why fingerprinting beats regex — format vs real values
The core problem with pattern DLP: a regex matches a format, not a value. A rule for 12-digit Aadhaar numbers also fires on order IDs, GST references and invoice totals — anything 12 digits long. Multiply that across email, web and endpoint and your analysts spend the day closing false positives.
Fingerprinting flips this. Instead of asking 'does this look like sensitive data?', it asks 'is this an actual record or document we indexed?' EDM matches the real values from your database; IDM matches the real text of your documents. Because only genuine content triggers an incident, false positives collapse — and a fired alert is something a SOC can trust and act on.
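The contrast between format matching and value matching can be sketched in a few lines of Python. This is a toy illustration, not Forcepoint's engine: the regex, the SHA-256 hashing, the `INDEXED` set and the sample numbers are all assumptions made up for the example.

```python
import hashlib
import re

# Format-only rule: any standalone run of 12 digits matches.
TWELVE_DIGITS = re.compile(r"\b\d{12}\b")


def fingerprint(value: str) -> str:
    """One-way hash of a value (illustrative; not Forcepoint's algorithm)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical indexed values. Only their hashes are kept in memory.
INDEXED = {fingerprint(v) for v in ["123412341234", "987698769876"]}


def regex_hits(text: str) -> list[str]:
    """Everything that merely looks like a 12-digit ID."""
    return TWELVE_DIGITS.findall(text)


def fingerprint_hits(text: str) -> list[str]:
    """Only candidates whose hash is present in the indexed set."""
    return [c for c in regex_hits(text) if fingerprint(c) in INDEXED]


message = (
    "Order 555566667777 shipped. Customer ID 123412341234. "
    "Invoice ref 000011112222."
)
print(regex_hits(message))        # three format matches
print(fingerprint_hits(message))  # only the indexed value
```

The regex reports all three numbers in the message, while the hashed lookup reports only the one value that was actually indexed.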
② Inside EDM — fingerprinting structured records
Exact Data Match (EDM) is for structured, tabular data: customer tables, employee PII, PAN/Aadhaar lists, account numbers. You point the Database Fingerprinting Wizard at a source — a direct database, cloud Salesforce, or a UTF-8 CSV — and step through Data Source, Field Selection and Scheduler. Forcepoint extracts and normalizes the chosen columns, then stores them as a non-reversible hash; the original data is never saved or backed up.
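To make the "normalize, then store a non-reversible hash" step concrete, here is a minimal sketch. Forcepoint does not publish its normalization or hashing scheme in this lesson, so the NFKC folding, the salted SHA-256 and every function name below are assumptions chosen only for illustration.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Fold case and drop everything except letters and digits, so that
    '1234 5678 9012' and '1234-5678-9012' normalize to the same string."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def hash_field(value: str, salt: bytes) -> str:
    """Salted one-way hash of a normalized value (assumed scheme)."""
    return hashlib.sha256(salt + normalize(value).encode("utf-8")).hexdigest()


def fingerprint_row(row: dict[str, str], fields: list[str], salt: bytes) -> dict[str, str]:
    """Hash only the selected columns; the plaintext row is not retained."""
    return {name: hash_field(row[name], salt) for name in fields}


row = {"aadhaar": "1234 5678 9012", "name": "Meera Rao", "email": "m@example.com"}
print(fingerprint_row(row, ["aadhaar", "name"], salt=b"demo-salt"))
```

Because values are normalized before hashing, spacing and punctuation differences in outbound content do not defeat the match, and the stored digests cannot be read back as the original values.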
Fields and thresholds
You can select up to 32 fields per table. At least one primary field is required — a unique key the rules are built around — while other columns act as secondary fields combined for match logic. A condition threshold is the number of unique record matches needed to fire an incident; the same classifier can be added to a rule multiple times with different field combinations and thresholds, joined by And/Or/Customized logic. EDM detects when specific values co-occur — this exact name plus this exact ID plus this exact number, from the same row.
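The co-occurrence and threshold logic described above can be sketched as follows. This is a simplified model under stated assumptions: the whitespace tokenizer, the unsalted hash, the field names and the helper functions are hypothetical, and a real engine would extract candidate values far more carefully.

```python
import hashlib


def h(value: str) -> str:
    """One-way hash of a lightly normalized value (illustrative only)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def build_index(rows, primary, secondary):
    """Map each hashed primary value to the hashed secondary values
    of the same row, so matches are always evaluated per record."""
    return {h(r[primary]): {h(r[f]) for f in secondary} for r in rows}


def count_record_matches(text, index):
    """Count unique records whose primary AND all secondary values occur."""
    tokens = {h(t.strip(".,;:()")) for t in text.split()}
    return sum(
        1 for prim, secs in index.items()
        if prim in tokens and secs <= tokens
    )


def fires(text, index, threshold):
    """An incident fires once enough unique records have matched."""
    return count_record_matches(text, index) >= threshold
```

Keying the index on the primary field is what ties the secondary values to one specific row: a real ID next to someone else's name does not count as a record match.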
EDM needs at least one unique primary field — the key your rules are built around — so every match maps to a real record. Pair it with secondary fields (name, DOB) and a threshold across 3+ fields, and you get precision that a single-column regex can never match.
③ Inside IDM — fingerprinting whole and partial documents
Indexed Document Matching (IDM), also called file fingerprinting, is for unstructured content: contracts, source code, board decks, M&A drafts. It indexes example documents from file shares / network file systems, SharePoint (2007–2016), or IBM Domino, storing only partial hashes of the content.
The power of IDM is the percentage-based partial match: a candidate must match at least roughly 70% of an indexed document to fire, so IDM catches full copies, excerpts, and lightly edited derivatives — not just byte-identical files. Use IDM when the sensitive asset is a document and you need to flag fragments and reformatted copies. Like EDM, the engine combines partial hashes irreversibly, so the repository never holds recoverable source content.
IDM is not an exact-file checksum. It stores partial hashes and fires on roughly a 70% content match, so it catches excerpts, reformatted copies and lightly edited derivatives. If you describe IDM as 'only whole-file matching', you miss the entire point of indexed document matching.
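One common way to get percentage-based partial matching is to hash overlapping word windows (shingles) and measure how much of the indexed document reappears in the candidate. The sketch below uses that technique as a stand-in; it is not Forcepoint's actual IDM algorithm, and the shingle size, the truncated SHA-256 and the function names are assumptions. Only the roughly 70% threshold comes from the lesson.

```python
import hashlib
import re


def shingles(text: str, k: int = 4) -> set[str]:
    """Hash every run of k consecutive words; case and layout are ignored."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()[:16]
        for i in range(len(words) - k + 1)
    }


def match_ratio(candidate: str, indexed: set[str], k: int = 4) -> float:
    """Fraction of the indexed document's shingles found in the candidate."""
    if not indexed:
        return 0.0
    return len(shingles(candidate, k) & indexed) / len(indexed)


def is_match(candidate: str, indexed: set[str], threshold: float = 0.70, k: int = 4) -> bool:
    """Fire when the candidate covers at least the threshold share."""
    return match_ratio(candidate, indexed, k) >= threshold
```

In this model a reformatted copy scores 1.0 because only word order matters, a single edited word removes just the few shingles that span it, and an unrelated document scores 0.0.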
④ Repository & tuning — distribution, offline match, accuracy
All fingerprints live in the Primary Fingerprint Repository on the management server (defaults: 50,000 MB max disk, 512 MB cache). It pushes secondary repositories to protectors, Content Gateway, DLP servers and any module that runs a policy engine. Crucially, endpoints receive a synchronized copy of the fingerprint hashes, so EDM/IDM detection still works when a laptop is off-network.
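The primary-to-secondary distribution can be pictured with a toy model. The class names, the version counter and the full-snapshot copy below are hypothetical and say nothing about how Forcepoint actually transfers data; the sketch only shows why an endpoint that has synced once can keep matching without a server link.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryRepository:
    """Toy stand-in for the store on the management server."""
    hashes: set[str] = field(default_factory=set)
    version: int = 0

    def publish(self, new_hashes: set[str]) -> None:
        """Add fingerprints and bump the version so secondaries re-sync."""
        self.hashes |= new_hashes
        self.version += 1


@dataclass
class SecondaryRepository:
    """Toy stand-in for the local copy on a policy engine or endpoint."""
    hashes: frozenset[str] = frozenset()
    version: int = 0

    def sync(self, primary: PrimaryRepository) -> bool:
        """Copy a snapshot if the primary has changed; return True if updated."""
        if primary.version == self.version:
            return False
        self.hashes = frozenset(primary.hashes)
        self.version = primary.version
        return True

    def contains(self, fingerprint: str) -> bool:
        """Local lookup only; no call back to the primary is needed."""
        return fingerprint in self.hashes
```

The same model also shows the flip side: fingerprints published after the last sync are invisible to the endpoint until it reconnects.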
Tune for accuracy
Field count drives EDM accuracy. Scan 3 or more fields for the most accurate results. If only 1 field is scanned, the minimum threshold is forced to 5 — lower values are auto-raised — to avoid noise; with 2 fields, use a threshold of 3 or more. The classic mistake is to keep a broad regex and crank the action to Block. Replace it with a fingerprint, scan enough fields, set a sane threshold, and the false-positive storm disappears.
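The field-count rules above are easy to encode as a pre-deployment check. The function names are hypothetical and the logic only restates the numbers in this lesson: a single scanned field forces a minimum threshold of 5, two fields call for 3 or more, and three or more fields give the best accuracy.

```python
def effective_threshold(fields_scanned: int, requested: int) -> int:
    """Return the threshold that would actually apply.

    With one scanned field, values below 5 are raised to 5.
    """
    if fields_scanned < 1 or requested < 1:
        raise ValueError("fields_scanned and requested must be at least 1")
    if fields_scanned == 1:
        return max(requested, 5)
    return requested


def tuning_advice(fields_scanned: int, requested: int) -> list[str]:
    """Collect warnings for configurations the lesson flags as noisy."""
    advice = []
    if fields_scanned == 1 and requested < 5:
        advice.append("1 field scanned: threshold will be raised to 5")
    if fields_scanned == 2 and requested < 3:
        advice.append("2 fields scanned: use a threshold of 3 or more")
    if fields_scanned < 3:
        advice.append("scan 3 or more fields for the most accurate results")
    return advice


print(effective_threshold(1, 2))
print(tuning_advice(2, 1))
```

Running a check like this against each planned classifier before rollout makes the field-count trade-off explicit instead of leaving it to be discovered in the incident queue.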
Meera at PaySetu Technologies (Pune) faces this
Analysts get dozens of daily false-positive alerts on outbound email — every invoice and HR sheet with a 12-digit number fires the 'Aadhaar leak' rule.
The policy uses a regex pattern for 12-digit Aadhaar numbers, so any 12-digit string (order IDs, GST refs) matches the format.
Open the Security Manager ▸ Content Classifiers ▸ Patterns & Phrases — the rule is a wide regex; the incident report under Reporting confirms most hits are non-Aadhaar numbers.
Main ▸ Policy Management ▸ Content Classifiers ▸ Patterns & Phrases + Reporting ▸ Data Loss Prevention
Export the verified Aadhaar dataset to UTF-8 CSV, run the Database Fingerprinting Wizard (Aadhaar as primary field, name + DOB as secondary, threshold across 3 fields), schedule a refresh, and swap the regex condition for the EDM classifier.
After the Primary Fingerprint Repository syncs to the policy engines, re-test: a real record fires, a random 12-digit invoice does not, and false-positive volume drops sharply.
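The export step in the fix can be sketched with Python's standard `csv` module. The column names, the header row and the helper names are assumptions for illustration; check the wizard's documentation for the exact layout it expects. The only requirement taken from the lesson is that the file is UTF-8 CSV.

```python
import csv
import io
from pathlib import Path


def rows_to_utf8_csv(rows, fieldnames) -> bytes:
    """Serialize only the chosen columns as UTF-8 encoded CSV bytes."""
    buf = io.StringIO(newline="")
    # extrasaction="ignore" drops columns that are not being fingerprinted.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def write_utf8_csv(rows, fieldnames, path) -> None:
    """Write the export to disk exactly as encoded above."""
    Path(path).write_bytes(rows_to_utf8_csv(rows, fieldnames))


# Hypothetical verified records; 'email' is deliberately left out of the export.
records = [
    {"aadhaar": "123412341234", "name": "Meera Rao", "dob": "1990-04-12", "email": "m@example.com"},
    {"aadhaar": "987698769876", "name": "Zoë Iyer", "dob": "1985-11-30", "email": "z@example.com"},
]
data = rows_to_utf8_csv(records, ["aadhaar", "name", "dob"])
```

Encoding explicitly, rather than relying on the platform default, keeps names with non-ASCII characters intact so that their fingerprints match what later appears in outbound content.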
Never assume detection is live. The Primary Fingerprint Repository must push secondary copies to the policy engines and endpoints first. Re-test with one real record (should fire) and one lookalike (should not) after the sync — that single check confirms the fingerprint is actually deployed.
📖 Glossary
- EDM (Exact Data Match): Fingerprinting of structured database/CSV records to match exact field values from real rows.
- IDM (Indexed Document Matching): File fingerprinting of whole/partial documents to detect copies, excerpts and edits at roughly a 70% match.
- Fingerprint classifier: A Forcepoint content classifier built from EDM or IDM fingerprints rather than a regex pattern.
- Primary field: The required unique key in an EDM dataset that policy rules are built around, mapping each match to a real record.
- Secondary field: A non-key column combined with the primary field for match logic and confidence.
- Threshold: The number of unique record matches required before an incident is raised.
- Partial hash: An irreversible hash fragment stored instead of the original data, so the source cannot be recovered.
- Primary Fingerprint Repository: The hashed-data store on the management server (defaults 50,000 MB disk, 512 MB cache) that syncs secondary copies everywhere.
- Secondary repository: A synced copy of the fingerprint hashes pushed to protectors, gateways, DLP servers, policy engines and endpoints.
- Offline detection: Endpoint matching against locally synced fingerprint hashes with no live server link.
📚 Sources
- Forcepoint Help — Database fingerprinting (EDM): Database Fingerprinting Wizard, field selection, normalization & hashing. help.forcepoint.com/dlp
- Forcepoint Help — File fingerprinting (IDM): indexing documents from file shares, SharePoint & Domino with partial-match scoring. help.forcepoint.com
- Forcepoint Help — Fingerprint classifiers & fingerprinting specific field combinations in a database table. help.forcepoint.com
- Forcepoint Help — Configuring the Primary Fingerprint Repository and secondary repositories. help.forcepoint.com
- Proofpoint — Advanced Data Classification in DLP: EDM vs IDM. proofpoint.com
- Forcepoint — Forcepoint releases DLP at scale (data classification & fingerprinting). forcepoint.com
What's next?
Got fingerprinting? Next, see how Forcepoint scores a match, assigns it a severity and routes it into a tuned incident: the full remediation workflow from match to closed ticket.