Most people think making Splunk fast means extracting every field at index time so searches don't have to do the work. That mental model is wrong — and expensive.
Splunk is designed around search-time extraction: you store raw events and extract fields only when a search needs them. This keeps indexes lean, lets you change extractions without re-indexing, and is the only approach that remains flexible as your queries evolve. The secret weapon for speed is not index-time extraction — it is the Common Information Model (CIM) plus data model acceleration, which pre-builds tsidx summary files so the tstats command scans tiny summaries instead of billions of raw events. Understanding that split is what separates a Splunk beginner from a SIEM engineer who can actually tune a system under load.
① The Splunk indexing pipeline — from raw bytes to searchable events
Every event Splunk ingests travels through a three-stage pipeline before it lands on disk. Input: data arrives from forwarders, HEC, scripted inputs or file monitors and is tagged with its host, source and sourcetype. Parsing: Splunk applies line-breaking rules to split the stream into individual events and assigns each event a timestamp. Indexing: the event is written to an index bucket on disk together with a small set of default fields, including host, source, sourcetype, _time and _raw.
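As a rough mental model only, the three stages can be sketched in Python. Nothing here is Splunk code: the function names, the timestamp layout and the sample sourcetype `fw:traffic` are assumptions made for illustration. The point is that what lands in the bucket is the raw line plus a few default fields, with no custom fields such as `src` or `action`.

```python
import re
from datetime import datetime, timezone

# Hypothetical timestamp layout for this sketch; real sourcetypes define their own rules.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\s")


def input_stage(stream, host, source, sourcetype):
    """Input: accept a block of data together with metadata about where it came from."""
    return {"data": stream, "host": host, "source": source, "sourcetype": sourcetype}


def parsing_stage(block):
    """Parsing: break the stream into events and stamp each one with a timestamp."""
    events = []
    for line in block["data"].splitlines():  # simplest possible line-breaking rule
        if not line.strip():
            continue
        match = TIMESTAMP.match(line)
        when = None
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
        events.append({
            "_time": when,
            "_raw": line,
            "host": block["host"],
            "source": block["source"],
            "sourcetype": block["sourcetype"],
        })
    return events


def indexing_stage(events, bucket):
    """Indexing: append events to a bucket; only the default fields are stored."""
    bucket.extend(events)
    return bucket


stream = (
    "2024-05-01T10:00:00Z action=allow src=10.0.0.5\n"
    "2024-05-01T10:00:01Z action=deny src=10.0.0.9\n"
)
block = input_stage(stream, "fw01", "/var/log/fw.log", "fw:traffic")
bucket = indexing_stage(parsing_stage(block), [])
```

The key=value pairs are still present inside `_raw`; they simply have not been turned into fields yet.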
The critical insight for interviews: Splunk extracts almost no custom fields at index time. Custom field extraction happens at search time, when a query actually needs those fields. This keeps the index small, keeps indexing fast, and — most importantly — lets you fix or improve an extraction rule without touching the stored data. The only fields committed to disk at index time are the handful that Splunk needs for routing and basic filtering.
Why does Splunk extract almost no custom fields at index time?
② Indexes, buckets & field extraction — where data really lives
An index in Splunk is a named directory on disk that holds event data organised into time-based buckets. Buckets age through four stages — hot (actively written), warm (read-only, recent), cold (older, often slower storage), frozen (archived or deleted). You route different data to different indexes by sourcetype, business unit or retention requirement; data lands in the main index by default but that is rarely right in production.
Two ways to extract fields
Index-time extraction runs during the parsing pipeline and writes the key=value pair permanently alongside the event. It is fast at search time but inflexible — you cannot change the extraction without re-indexing. Use it sparingly (Splunk recommends avoiding it for custom fields). Search-time extraction, configured in props.conf and transforms.conf, applies regex or delimiter rules when a search runs. It is the standard approach: flexible, changeable without re-indexing, and the only option that works with field aliases and calculated fields used by the CIM.
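The flexibility argument is easier to see with a toy example. The sketch below is plain Python, not Splunk configuration: `RAW_EVENTS`, `search`, `rule_v1` and `rule_v2` are hypothetical names, and the regexes stand in for whatever extraction rule a sourcetype would define. It shows that improving the rule changes the search results while the stored events stay untouched.

```python
import re

# Raw events as stored in the index; nothing below modifies them.
RAW_EVENTS = [
    "2024-05-01T10:00:00Z action=allow src=10.0.0.5 dst=8.8.8.8",
    "2024-05-01T10:00:01Z action=deny src=10.0.0.9 dst=1.1.1.1",
]


def search(raw_events, extraction, wanted):
    """Apply the extraction regex at query time and return only the wanted fields."""
    results = []
    for raw in raw_events:
        match = extraction.search(raw)
        fields = match.groupdict() if match else {}
        results.append({name: fields.get(name) for name in wanted})
    return results


# Version 1 of the rule only captures the source address.
rule_v1 = re.compile(r"src=(?P<src_ip>\S+)")
# Version 2 also captures the action; no stored event has to be rewritten.
rule_v2 = re.compile(r"action=(?P<action>\S+) src=(?P<src_ip>\S+)")

before = search(RAW_EVENTS, rule_v1, ["src_ip", "action"])
after = search(RAW_EVENTS, rule_v2, ["src_ip", "action"])
```

With an index-time extraction, the equivalent of moving from `rule_v1` to `rule_v2` would only affect newly ingested events; older events would need to be re-indexed.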
Index: A named directory on disk holding time-based buckets. Route different data sources to dedicated indexes for access control and retention management.
Search-time extraction: Fields extracted by regex or delimiter rules in props.conf and transforms.conf when a query runs. This is the standard approach. Change extractions any time without re-indexing.
Field alias: A knowledge object that maps a vendor-specific field name (e.g. IpAddress, src_addr) to the canonical CIM name (e.g. src_ip) at search time, with no re-indexing needed.
tsidx file: A compact summary file built alongside index buckets for an accelerated data model. Stores only CIM-defined fields. The tstats command reads these instead of raw events for near-instant results.
In an interview, mention that indexes serve two purposes: access control (who can search what) and data lifecycle (different retention periods). Route short-lived NetFlow data to a 30-day index and compliance logs to a 365-day index. Using one default 'main' index for everything is a red flag to any experienced interviewer.
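The routing and retention idea can be written down as two small lookups. This is a sketch under stated assumptions: the index names, the `audit:trail` sourcetype and the 90-day figure for `main` are invented for illustration, and only the 30-day and 365-day periods come from the example above. Splunk also ages out whole buckets rather than individual events, so the per-event check here is a simplification.

```python
from datetime import datetime, timedelta, timezone

# Retention per index in days. Index names and the "main" value are hypothetical.
RETENTION_DAYS = {"netflow": 30, "compliance": 365, "main": 90}
# Hypothetical sourcetype-to-index routing table.
ROUTES = {"netflow": "netflow", "audit:trail": "compliance"}


def route(sourcetype):
    """Pick the destination index for a sourcetype, falling back to 'main'."""
    return ROUTES.get(sourcetype, "main")


def is_expired(index, event_time, now):
    """Return True when the event is older than the retention period of its index."""
    return now - event_time > timedelta(days=RETENTION_DAYS[index])


now = datetime(2024, 6, 15, tzinfo=timezone.utc)
event_time = now - timedelta(days=45)

netflow_expired = is_expired(route("netflow"), event_time, now)
compliance_expired = is_expired(route("audit:trail"), event_time, now)
```

A 45-day-old event has aged out of the NetFlow index but is still retained in the compliance index, which is exactly the split a single shared index cannot express.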
Which two configuration files govern search-time field extraction in Splunk?
③ The Common Information Model (CIM) — one schema, every data source
The Common Information Model (CIM) is a Splunk add-on that defines a shared vocabulary of field names grouped into data model domains — Network Traffic, Authentication, Endpoint, Alerts, Web, Email, Change, Malware and more. Instead of each technology vendor using its own field names (Cisco calls it src_addr, Palo Alto calls it src_ip, Windows calls it IpAddress), the CIM maps them all to a single canonical name like src_ip. A correlation search written against the CIM works on Cisco, Palo Alto and Windows logs without modification.
Normalization to the CIM is done at search time, not index time. You use field aliases and calculated fields in props.conf, or you use the technology-specific add-ons (Splunk Add-on for Windows, Cisco, Palo Alto, etc.) that ship their own field aliases pre-mapped to CIM. The CIM add-on itself ships the data model definitions; the TA (Technology Add-on) ships the mappings for a specific source, along with the event types and tags that the data models use to select matching events.
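To make the alias idea concrete, here is a minimal Python sketch of search-time normalisation. It is an illustration, not how Splunk implements field aliases: `FIELD_ALIASES`, `normalise` and `events_from` are hypothetical names, and `cisco:fw` is a made-up sourcetype label. The vendor field names are the ones used above.

```python
# Vendor-to-CIM alias tables, keyed by sourcetype.
# "cisco:fw" is a hypothetical sourcetype label used only in this sketch.
FIELD_ALIASES = {
    "cisco:fw": {"src_addr": "src_ip"},
    "pan:traffic": {},  # already uses src_ip in this example
    "WinEventLog:Security": {"IpAddress": "src_ip"},
}


def normalise(event):
    """Return a copy of the event with CIM field names added next to the vendor names."""
    aliases = FIELD_ALIASES.get(event["sourcetype"], {})
    out = dict(event)  # the stored event is never modified
    for vendor_name, cim_name in aliases.items():
        if vendor_name in event and cim_name not in out:
            out[cim_name] = event[vendor_name]
    return out


def events_from(events, ip):
    """A 'correlation search' written once, against the CIM name only."""
    return [e for e in map(normalise, events) if e.get("src_ip") == ip]


events = [
    {"sourcetype": "cisco:fw", "src_addr": "10.1.1.1"},
    {"sourcetype": "pan:traffic", "src_ip": "10.1.1.1"},
    {"sourcetype": "WinEventLog:Security", "IpAddress": "10.1.1.1"},
    {"sourcetype": "WinEventLog:Security", "IpAddress": "10.9.9.9"},
]
hits = events_from(events, "10.1.1.1")
```

Supporting a new vendor means adding one entry to the alias table; `events_from` and the stored events do not change.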
For SIEM work, CIM compliance is mandatory: Splunk Enterprise Security (ES) correlation searches, risk-based alerting and Adaptive Response all run against CIM-normalised data models, not raw source fields. If your firewall logs are not CIM-normalised, no ES network detection rule will fire on them.
A common mistake is writing custom index-time extractions to 'pre-normalize' data. If the CIM field names ever change, or if you add a second sourcetype with different raw field names, you must re-index everything. Always normalise at search time with field aliases in props.conf — that way a TA update fixes the mapping in seconds with no data re-ingestion.
A new firewall TA maps the vendor field src_addr to the CIM field src_ip using a field alias. When does that mapping take effect?
④ Data model acceleration — tsidx files and the tstats command
A data model without acceleration still works — but searches run against raw events, which is slow at scale. Data model acceleration tells Splunk to continuously build tsidx summary files alongside the index buckets for every index that contains events matching the data model. These files store only the fields defined in the data model, making them tiny compared to the raw journal. The tstats command queries tsidx summaries directly — typical searches that would take minutes on raw events return in seconds.
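Why a summary is so much cheaper to query can be shown with a toy aggregation. The sketch below is a conceptual stand-in, not the tsidx format or the tstats implementation: `MODEL_FIELDS`, `build_summary` and the two counting functions are hypothetical, and the "constraint" is reduced to checking that the model's fields are present.

```python
from collections import Counter

MODEL_FIELDS = ("src_ip", "action")  # fields the hypothetical data model defines


def build_summary(events):
    """Pre-aggregate counts per (hour, src_ip, action); every other field is dropped."""
    summary = Counter()
    for e in events:
        if all(f in e for f in MODEL_FIELDS):  # stand-in for the data model constraint
            hour = e["_time"] // 3600
            summary[(hour, e["src_ip"], e["action"])] += 1
    return summary


def count_by_src_raw(events, action):
    """Answer the question by scanning every raw event."""
    counts = Counter()
    for e in events:
        if e.get("action") == action and "src_ip" in e:
            counts[e["src_ip"]] += 1
    return counts


def count_by_src_summary(summary, action):
    """Answer the same question from the much smaller summary."""
    counts = Counter()
    for (_hour, src_ip, act), n in summary.items():
        if act == action:
            counts[src_ip] += n
    return counts


# Two hours of synthetic events, one per second.
events = [
    {"_time": t, "_raw": f"event {t}", "src_ip": f"10.0.0.{t % 4}",
     "action": "deny" if t % 3 == 0 else "allow"}
    for t in range(7200)
]
summary = build_summary(events)
from_raw = count_by_src_raw(events, "deny")
from_summary = count_by_src_summary(summary, "deny")
```

Both paths give the same counts, but the raw path touches 7,200 events while the summary path touches a handful of pre-aggregated rows. An event that lacks the model's fields never enters the summary at all, which is the mechanism behind the zero-results failure discussed later in this section.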
Enabling and monitoring acceleration
Enable acceleration per data model in the CIM Setup view or in datamodels.conf (set acceleration = true and choose a summary range such as 1 day, 7 days, 30 days, 90 days or all time; in datamodels.conf the range is set with acceleration.earliest_time, for example -7d). The search head then schedules background summarisation searches, and the indexers build and store the summaries next to their buckets. Monitor acceleration health with the Data Model Audit dashboard in ES, or run | rest /servicesNS/-/-/data/models to check coverage percentage. A common gotcha: if a data source is not CIM-normalised, its events do not match the data model constraint, so no tsidx is built for them, and tstats silently returns zero results for that source.
Priya at a Mumbai fintech faces this
Splunk ES correlation searches for network anomalies time out after two minutes. A tstats query over the Network Traffic data model returns zero results, even though raw searches find hundreds of thousands of firewall events per day.
The team installed a new next-gen firewall but never deployed a CIM-compliant TA for it. The firewall events land in the index raw, but none of their fields match the CIM Network Traffic data model — so the data model acceleration job never builds tsidx summaries for those events.
Check the Data Model Audit dashboard in ES: Network Traffic shows 0% coverage for the new firewall sourcetype. Confirm with a raw search: the events exist but the field src_ip is absent — only the vendor field src_address is present.
ES ▸ Data Model Audit ▸ Network Traffic coverage + raw search for sourcetype + CIM field check
Install the correct Splunk TA (or write a props.conf field alias mapping src_address to src_ip and other vendor fields to CIM names). After deployment, the next acceleration summarisation cycle picks up the normalised events and builds tsidx files for them.
Re-run the tstats query after the summarisation lag (minutes to a few hours). Network Traffic coverage climbs to near 100% for the new sourcetype; ES correlation searches complete in under five seconds.
After enabling data model acceleration, open the Data Model Audit dashboard in Splunk ES and confirm coverage is above 95% for each accelerated data model. If a sourcetype shows 0%, the events are not hitting the data model's base search, which usually means the CIM TA for that source is missing or misconfigured.
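The same check can be reasoned about offline with a few lines of Python. This is not how the Data Model Audit dashboard computes its numbers; it is a hedged sketch that treats "has the required CIM fields" as a proxy for "matches the data model". The field subset, the `newfw:traffic` sourcetype and the vendor fields `dst_address` and `verdict` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical subset of CIM fields used as a proxy for the data model constraint.
REQUIRED_CIM_FIELDS = ("src_ip", "dest", "action")


def coverage_by_sourcetype(events):
    """Percentage of events per sourcetype that carry all required CIM fields."""
    total = defaultdict(int)
    matched = defaultdict(int)
    for e in events:
        sourcetype = e["sourcetype"]
        total[sourcetype] += 1
        if all(field in e for field in REQUIRED_CIM_FIELDS):
            matched[sourcetype] += 1
    return {st: 100.0 * matched[st] / total[st] for st in total}


def flag_gaps(coverage, threshold=95.0):
    """Sourcetypes whose coverage falls below the threshold, sorted by name."""
    return sorted(st for st, pct in coverage.items() if pct < threshold)


events = [
    {"sourcetype": "pan:traffic", "src_ip": "10.0.0.5", "dest": "8.8.8.8", "action": "allowed"},
    {"sourcetype": "pan:traffic", "src_ip": "10.0.0.6", "dest": "1.1.1.1", "action": "blocked"},
    # New firewall without a TA: only vendor field names are present.
    {"sourcetype": "newfw:traffic", "src_address": "10.0.0.7", "dst_address": "1.1.1.1", "verdict": "drop"},
]
coverage = coverage_by_sourcetype(events)
gaps = flag_gaps(coverage)
```

A sourcetype that appears in `gaps` with 0% is the pattern from the scenario above: the events exist, but none of them carry the CIM names.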
A tstats query over the Network Traffic data model returns zero results for a new firewall source, even though raw searches find events. Most likely cause?
📖 Glossary
- Index: A named directory on disk holding time-based event buckets. Separate indexes by retention period and access control; never dump everything into 'main'.
- Bucket: A time-bounded directory within an index holding the raw journal and index files. Ages through hot, warm, cold and frozen stages.
- Search-time extraction: Field extraction applied when a query runs, configured in props.conf and transforms.conf. The standard approach: flexible and changeable without re-indexing.
- Common Information Model (CIM): A Splunk add-on defining a shared schema of canonical field names (src_ip, dest, action, etc.) grouped into data domains; the foundation of Splunk ES detection.
- Field alias: A knowledge object that maps vendor-specific field names to CIM canonical names at search time. Deployed via a Technology Add-on (TA).
- Data model acceleration: A Splunk feature that builds tsidx summary files alongside index buckets for events matching an accelerated data model, enabling near-instant tstats queries.
- tsidx file: Time-Series InDex file, a compact summary built by data model acceleration containing only the fields of the data model. Read by tstats to skip raw event scanning.
- tstats: An SPL command that queries tsidx summary files from accelerated data models, returning aggregated results in seconds instead of scanning raw events.
- Technology Add-on (TA): A Splunk app that provides field extractions, field aliases and sourcetype transformations for a specific data source, making it CIM-compliant.
- Sourcetype: A classification tag set at parse time telling Splunk which rules to apply for line-breaking, timestamping and field extraction, e.g. pan:traffic, WinEventLog:Security.