Splunk Indexing & Data Models — CIM, tsidx Acceleration & Normalization

When your Splunk searches take minutes to return, the cause is almost always one of three things: your data is not normalized to the CIM, your data models are not accelerated, or you are extracting fields too early. This lesson maps the full indexing pipeline: index-time vs search-time processing, indexes, the Common Information Model (CIM), tsidx data-model acceleration, and field extractions, so you know exactly which knob to turn.

📅 2026-06-20 · ⏱ 18 min · 5 infographics · live block demo · 🏷 10-Q assessment + AI Tutor inline

⚡ Quick Answer

Master Splunk indexing in 2026: index-time vs search-time processing, indexes, the Common Information Model (CIM), tsidx acceleration, and field normalization for faster SIEM searches.

Pick where you want to start

  1. The indexing pipeline: parsing, index-time vs search-time, buckets.
  2. Indexes & field extraction: indexes, props.conf, transforms.conf, regex.
  3. CIM & normalization: CIM add-on, data domains, field aliases.
  4. Data model acceleration: tsidx files, tstats, tuning, monitoring.

🧠 Warm-up — 3 questions, no score

Just notice which ones make you pause. We answer all three inside the lesson.

1. When does Splunk extract most custom fields — at index time or search time?

Answered in The indexing pipeline.

2. What does the Common Information Model (CIM) standardise in Splunk?

Answered in CIM & normalization.

3. What file format does data model acceleration store its summaries in?

Answered in Data model acceleration.

Most engineers think…

Most people think making Splunk fast means extracting every field at index time so searches don't have to do the work. That mental model is wrong — and expensive.

Splunk is designed around search-time extraction: you store raw events and extract fields only when a search needs them. This keeps indexes lean, lets you change extractions without re-indexing, and is the only approach that remains flexible as your queries evolve. The secret weapon for speed is not index-time extraction — it is the Common Information Model (CIM) plus data model acceleration, which pre-builds tsidx summary files so the tstats command scans tiny summaries instead of billions of raw events. Understanding that split is what separates a Splunk beginner from a SIEM engineer who can actually tune a system under load.

① The Splunk indexing pipeline — from raw bytes to searchable events

Every event Splunk ingests travels through a three-stage pipeline before it lands on disk. Input: data arrives from forwarders, HEC, scripted inputs or file monitors. Parsing: Splunk splits the stream into individual events, stamps a timestamp, sets sourcetype, and runs line-breaking rules. Indexing: the event is written to an index bucket on disk and a small set of default fields is stored: host, source, sourcetype, _time and _raw.

The critical insight for interviews: Splunk extracts almost no custom fields at index time. Custom field extraction happens at search time, when a query actually needs those fields. This keeps the index small, keeps indexing fast, and — most importantly — lets you fix or improve an extraction rule without touching the stored data. The only fields committed to disk at index time are the handful that Splunk needs for routing and basic filtering.
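The split can be illustrated with a minimal Python sketch. This is a conceptual model, not Splunk's actual implementation: the stored record holds only the default fields, and a regex is applied when a query runs. The sample event, the `search` helper and both regexes are hypothetical.

```python
import re

# What the index keeps: default fields plus the untouched raw text.
# (Illustrative record; real buckets store a compressed journal, not dicts.)
INDEXED_EVENTS = [
    {
        "host": "fw01",
        "source": "udp:514",
        "sourcetype": "demo:firewall",
        "_time": 1750400000,
        "_raw": "action=blocked user=priya src=10.0.0.5",
    },
]


def search(events, extraction_regex):
    """Apply a search-time extraction: run the regex over _raw on demand."""
    results = []
    for event in events:
        match = re.search(extraction_regex, event["_raw"])
        extracted = match.groupdict() if match else {}
        # Merge extracted fields into a copy; the stored event never changes.
        results.append({**event, **extracted})
    return results


# First version of the rule only pulls out the user.
v1 = search(INDEXED_EVENTS, r"user=(?P<user>\S+)")

# Improved rule also captures the action; no stored data is rewritten.
v2 = search(INDEXED_EVENTS, r"action=(?P<action>\S+)\s+user=(?P<user>\S+)")

print(v1[0]["user"])
print(v2[0]["action"], v2[0]["user"])
print("user" in INDEXED_EVENTS[0])
```

Swapping the regex changes what every later search sees, while the stored events stay exactly as they were written. That is the property that makes search-time extraction cheap to correct.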

Figure 1 — The Splunk indexing pipeline
Input (forwarder/HEC/file) → Parsing (events, timestamp, sourcetype) → Indexing (bucket on disk, default fields) → Search (extract custom fields on demand) → Result (SPL returns matched events)
Every event travels input to parsing to indexing before a search can find it.
Quick check · Q1 of 10 · Understand

Why does Splunk extract almost no custom fields at index time?

Correct: b. Search-time extraction keeps the raw index lean and flexible. You can update or fix an extraction rule in props.conf without touching stored events. Index-time extraction permanently commits the field to disk and requires re-indexing to change.
👉 So far: Splunk pipeline: input → parse (sourcetype, timestamp) → index (bucket, default fields only). Custom fields come at search time.

② Indexes, buckets & field extraction — where data really lives

An index in Splunk is a named directory on disk that holds event data organised into time-based buckets. Buckets age through four stages — hot (actively written), warm (read-only, recent), cold (older, often slower storage), frozen (archived or deleted). You route different data to different indexes by sourcetype, business unit or retention requirement; data lands in the main index by default but that is rarely right in production.

Two ways to extract fields

Index-time extraction runs during the parsing pipeline and writes the key=value pair permanently alongside the event. It is fast at search time but inflexible: you cannot change the extraction without re-indexing. Use it sparingly (Splunk recommends avoiding it for custom fields). Search-time extraction, configured in props.conf and transforms.conf, applies regex or delimiter rules when a search runs. It is the standard approach: flexible, changeable without re-indexing, and the same stage at which the field aliases and calculated fields used by the CIM are applied.
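The two-file arrangement can be pictured as a lookup followed by a regex. The Python sketch below is only an analogy under stated assumptions: `PROPS` stands in for a props.conf stanza that names a transform (in Splunk, a REPORT-<class> setting), and `TRANSFORMS` stands in for the transforms.conf stanza holding the regex. The sourcetype, transform name and field names are hypothetical.

```python
import re

# Stand-in for props.conf: sourcetype -> names of transforms to apply.
PROPS = {
    "demo:firewall": ["demo_fw_kv"],
}

# Stand-in for transforms.conf: transform name -> the regex itself.
TRANSFORMS = {
    "demo_fw_kv": re.compile(
        r"src=(?P<src_address>\S+)\s+dst=(?P<dst_address>\S+)"
    ),
}


def extract_fields(sourcetype, raw):
    """Look up the sourcetype's transforms and apply each regex to the raw text."""
    fields = {}
    for transform_name in PROPS.get(sourcetype, []):
        match = TRANSFORMS[transform_name].search(raw)
        if match:
            fields.update(match.groupdict())
    return fields


fields = extract_fields("demo:firewall", "src=10.0.0.5 dst=8.8.8.8 proto=udp")
unknown = extract_fields("demo:other", "src=10.0.0.5 dst=8.8.8.8")
print(fields)
print(unknown)
```

A sourcetype with no entry in the first table gets no custom fields at all, which mirrors what happens when a props.conf stanza is missing for a new source.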

Figure 2 — Bucket lifecycle — hot to frozen
Hot (actively written, newest events) → Warm (read-only, recent, fast disk) → Cold (older events, slower disk or NAS) → Frozen (archived or deleted per policy)
Splunk ages buckets through four stages; you map cold and frozen to cheaper storage.
🗂️ Index
A named directory on disk holding time-based buckets. Route different data sources to dedicated indexes for access control and retention management.

⏱️ Search-time extraction
Fields extracted by regex or delimiter rules in props.conf and transforms.conf when a query runs: the standard approach. Change extractions any time without re-indexing.

📐 CIM field alias
A knowledge object that maps a vendor-specific field name (e.g. IpAddress, src_addr) to the canonical CIM name (e.g. src_ip) at search time, with no re-indexing needed.

tsidx file
A compact summary file built alongside index buckets for an accelerated data model. Stores only CIM-defined fields. The tstats command reads these instead of raw events for near-instant results.

Separate indexes by retention, not just source

In an interview, mention that indexes serve two purposes: access control (who can search what) and data lifecycle (different retention periods). Route short-lived NetFlow data to a 30-day index and compliance logs to a 365-day index. Using one default 'main' index for everything is a red flag to any experienced interviewer.
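The retention arithmetic is easy to get wrong by a factor of 24 or 60, so here is a small Python sketch that converts days into the seconds value used by the indexes.conf setting frozenTimePeriodInSecs and renders a stanza. The index names `netflow` and `compliance` and the path layout are assumptions for illustration; adapt them to your own deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def index_stanza(name, retention_days):
    """Render an indexes.conf-style stanza with a time-based retention limit."""
    frozen_secs = retention_days * SECONDS_PER_DAY
    lines = [
        f"[{name}]",
        f"homePath = $SPLUNK_DB/{name}/db",
        f"coldPath = $SPLUNK_DB/{name}/colddb",
        f"thawedPath = $SPLUNK_DB/{name}/thaweddb",
        # Buckets whose newest event is older than this roll to frozen.
        f"frozenTimePeriodInSecs = {frozen_secs}",
    ]
    return "\n".join(lines)


netflow = index_stanza("netflow", 30)
compliance = index_stanza("compliance", 365)
print(netflow)
print()
print(compliance)
```

Keeping the two retention periods in separate stanzas is what lets the short-lived data age out without touching the compliance logs.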

Quick check · Q2 of 10 · Remember

Which two configuration files govern search-time field extraction in Splunk?

Correct: a. props.conf ties sourcetypes (or hosts/sources) to extraction rules; transforms.conf defines the actual regex or lookup used. Together they control all search-time field extractions.
👉 So far: Buckets: hot → warm → cold → frozen. Search-time extraction via props.conf + transforms.conf is the standard; index-time extraction is used sparingly for routing fields only.

③ The Common Information Model (CIM) — one schema, every data source

The Common Information Model (CIM) is a Splunk add-on that defines a shared vocabulary of field names grouped into data model domains — Network Traffic, Authentication, Endpoint, Alerts, Web, Email, Change, Malware and more. Instead of each technology vendor using its own field names (Cisco calls it src_addr, Palo Alto calls it src_ip, Windows calls it IpAddress), the CIM maps them all to a single canonical name like src_ip. A correlation search written against the CIM works on Cisco, Palo Alto and Windows logs without modification.

Normalization to the CIM is done at search time, not index time. You use field aliases and calculated fields in props.conf, together with the event types and tags that the CIM data model constraints search for, or you use the technology-specific add-ons (Splunk Add-on for Windows, Cisco, Palo Alto, etc.) that ship their own field aliases pre-mapped to CIM. The CIM add-on itself ships the data model definitions; the TA (Technology Add-on) ships the mappings for a specific source.
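As a rough illustration of what a field alias does, the Python sketch below adds a CIM-named copy of a vendor field when an event is read. It is a simplified model, not how Splunk evaluates knowledge objects; the `demo:*` sourcetypes and the alias tables are hypothetical, with the vendor field names taken from the example above.

```python
# Hypothetical per-sourcetype alias tables: vendor field -> CIM field.
FIELD_ALIASES = {
    "demo:cisco": {"src_addr": "src_ip"},
    "demo:windows": {"IpAddress": "src_ip"},
    "demo:paloalto": {},  # already emits src_ip in this sketch
}


def normalise(event):
    """Return a copy of the event with CIM-named aliases added at query time."""
    aliases = FIELD_ALIASES.get(event["sourcetype"], {})
    normalised = dict(event)
    for vendor_field, cim_field in aliases.items():
        if vendor_field in event:
            # An alias adds the new name and keeps the original field too.
            normalised[cim_field] = event[vendor_field]
    return normalised


events = [
    {"sourcetype": "demo:cisco", "src_addr": "10.0.0.5"},
    {"sourcetype": "demo:paloalto", "src_ip": "10.0.0.5"},
    {"sourcetype": "demo:windows", "IpAddress": "10.0.0.9"},
]

# One query written against the CIM name works for all three sources.
hits = [e for e in map(normalise, events) if e.get("src_ip") == "10.0.0.5"]
matched_sourcetypes = [e["sourcetype"] for e in hits]
print(matched_sourcetypes)
```

The filter is written once against `src_ip` and never mentions a vendor field, which is the point of normalising before detection logic runs.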

For SIEM work, CIM compliance is mandatory: Splunk Enterprise Security (ES) correlation searches, risk-based alerting and Adaptive Response all run against CIM-normalised data models, not raw source fields. If your firewall logs are not CIM-normalised, no ES network detection rule will fire on them.

Figure 3 — CIM data model domains
CIM schema (shared field names) with domains: Network Traffic, Authentication, Endpoint, Web, Email, Malware
One CIM schema unifies field names across every data source that feeds Splunk ES.
Figure 4 — Index-time vs search-time extraction
Index-time extraction: stored permanently on disk; fast at search time; inflexible, re-index to change; use only for routing/filtering
Search-time extraction: applied when a query runs; flexible, change props.conf any time; works with aliases & calculated fields; standard approach for all custom fields
Splunk recommends search-time extraction for almost all custom fields.
Normalising at index time breaks CIM flexibility

A common mistake is writing custom index-time extractions to 'pre-normalize' data. If the CIM field names ever change, or if you add a second sourcetype with different raw field names, you must re-index everything. Always normalise at search time with field aliases in props.conf — that way a TA update fixes the mapping in seconds with no data re-ingestion.

▶ Watch a firewall event get normalised and accelerated

From raw syslog to a tstats result in four steps.

① Raw event in: A Palo Alto firewall syslog arrives at a heavy forwarder. Splunk parses it, sets sourcetype=pan:traffic, and writes the raw event to the firewall index bucket.
② TA normalises: The Splunk Add-on for Palo Alto applies field aliases at search time, so vendor fields map to the CIM Network Traffic names src_ip, dest_ip, action and bytes_in.
③ tsidx built: The data model acceleration job finds that the CIM-normalised event matches the Network Traffic model constraint and writes its fields to a tsidx summary file alongside the bucket.
④ tstats returns: An ES correlation search runs tstats against the Network Traffic data model; it reads the tiny tsidx file, skips raw events, and returns results in under two seconds.
Quick check · Q3 of 10 · Apply

A new firewall TA maps the vendor field src_addr to the CIM field src_ip using a field alias. When does that mapping take effect?

Correct: c. Field aliases are a search-time knowledge object. They are applied when a query executes, not when data is ingested. No re-indexing is needed — just deploy the TA and the mapping is live immediately.
👉 So far: CIM = shared field names (src_ip, dest, action) across all sources. Field aliases in TAs normalise at search time. Without CIM compliance, ES correlation searches do not fire.

④ Data model acceleration — tsidx files and the tstats command

A data model without acceleration still works — but searches run against raw events, which is slow at scale. Data model acceleration tells Splunk to continuously build tsidx summary files alongside the index buckets for every index that contains events matching the data model. These files store only the fields defined in the data model, making them tiny compared to the raw journal. The tstats command queries tsidx summaries directly — typical searches that would take minutes on raw events return in seconds.
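To see why a pre-built summary answers faster, consider this Python sketch. It is a conceptual stand-in only: a real tsidx file is a time-series index, not a Python Counter, and the event shape and helper names here are hypothetical. The idea it shows is the same, though: aggregate a few chosen fields once in the background, then answer counting queries from the small summary.

```python
from collections import Counter

# Raw events: many records, each carrying a full _raw payload.
raw_events = [
    {
        "_time": 1000 + i,
        "action": "blocked" if i % 4 == 0 else "allowed",
        "src_ip": f"10.0.0.{i % 3}",
        "_raw": "x" * 200,
    }
    for i in range(1000)
]


def build_summary(events, fields):
    """Pre-aggregate counts per combination of the chosen fields only."""
    return Counter(tuple(e[f] for f in fields) for e in events)


def count_from_raw(events, action):
    """Slow path: scan every raw event."""
    return sum(1 for e in events if e["action"] == action)


def count_from_summary(summary, fields, action):
    """Fast path: read the small summary, never touching _raw."""
    idx = fields.index("action")
    return sum(n for key, n in summary.items() if key[idx] == action)


FIELDS = ("action", "src_ip")
summary = build_summary(raw_events, FIELDS)  # built once, in the background

raw_answer = count_from_raw(raw_events, "blocked")
summary_answer = count_from_summary(summary, FIELDS, "blocked")
print(len(raw_events), "raw events vs", len(summary), "summary rows")
print(raw_answer, summary_answer)
```

Both paths return the same count, but the summary path touches a handful of rows instead of every event. The trade-off is also visible: only the fields chosen when the summary was built can be queried this way.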

Enabling and monitoring acceleration

Enable acceleration per data model in the CIM Setup view or in datamodels.conf (set acceleration = true and choose a summary range with acceleration.earliest_time: 1 day, 7 days, 30 days, 90 days or all time). The search head then schedules background summarisation searches, and the indexers build and store the summaries alongside their buckets. Monitor acceleration health with the Data Model Audit dashboard in ES, or run | rest /servicesNS/-/-/data/models to check coverage percentage. A common gotcha: if a data source is not CIM-normalised, its events do not match the data model constraint, so no tsidx is built for them, and tstats silently returns zero results for that source.
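For the datamodels.conf route, the Python sketch below renders the two settings involved. Treat it as a hedged illustration: `acceleration` and `acceleration.earliest_time` are the setting names, but the range-to-value mapping is an assumption for this example, and in a CIM deployment the CIM Setup view may manage these values for you.

```python
# Summary ranges from the lesson, expressed as relative-time strings.
# The mapping is an illustrative assumption, not an exhaustive list.
SUMMARY_RANGES = {
    "1 day": "-1d",
    "7 days": "-7d",
    "30 days": "-30d",
    "90 days": "-90d",
}


def acceleration_stanza(model_name, summary_range):
    """Render a datamodels.conf-style stanza that enables acceleration."""
    earliest = SUMMARY_RANGES[summary_range]
    return "\n".join(
        [
            f"[{model_name}]",
            "acceleration = true",
            f"acceleration.earliest_time = {earliest}",
        ]
    )


stanza = acceleration_stanza("Network_Traffic", "7 days")
print(stanza)
```

A shorter range means smaller summaries and less background work, at the cost of tstats searches over older data falling outside the summarised window.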

Figure 5 — How tsidx acceleration speeds up tstats
Raw events (land in index buckets) → Summarise (background job builds tsidx) → tsidx files (only CIM fields stored) → tstats query (reads summaries, skips raw) → Result (seconds, not minutes)
tstats reads compact tsidx summaries, not raw events: orders of magnitude faster.

Priya at a Mumbai fintech faces this

Splunk ES correlation searches for network anomalies time out after two minutes. A tstats query over the Network Traffic data model returns zero results, even though raw searches find hundreds of thousands of firewall events per day.

Likely cause

The team installed a new next-gen firewall but never deployed a CIM-compliant TA for it. The firewall events land in the index raw, but none of their fields match the CIM Network Traffic data model — so the data model acceleration job never builds tsidx summaries for those events.

Diagnosis

Check the Data Model Audit dashboard in ES: Network Traffic shows 0% coverage for the new firewall sourcetype. Confirm with a raw search: the events exist but the field src_ip is absent — only the vendor field src_address is present.

ES ▸ Data Model Audit ▸ Network Traffic coverage + raw search for sourcetype + CIM field check
Fix

Install the correct Splunk TA (or write props.conf field aliases mapping src_address to src_ip and other vendor fields to CIM names, plus the event type and tags that the Network Traffic model's constraint looks for). After deployment, the next acceleration summarisation cycle picks up the normalised events and builds tsidx files for them.

Verify

Re-run the tstats query after the summarisation lag (minutes to a few hours). Network Traffic coverage climbs to near 100% for the new sourcetype; ES correlation searches complete in under five seconds.

Always check acceleration coverage, not just 'enabled'

After enabling data model acceleration, run the Data Model Audit dashboard in Splunk ES and confirm coverage is above 95% for each accelerated data model. If a sourcetype shows 0%, the events are not hitting the data model's base search — meaning the CIM TA for that source is missing or misconfigured.
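The coverage check boils down to comparing two counts per sourcetype. The Python sketch below shows that comparison with hypothetical numbers; in practice the raw counts would come from an ordinary search and the summarised counts from a tstats search restricted to summaries (for example with summariesonly=true). The 95% threshold follows the guidance above.

```python
def coverage_report(raw_counts, summarised_counts, threshold=95.0):
    """Compare summarised vs raw event counts per sourcetype."""
    report = {}
    for sourcetype, raw in raw_counts.items():
        summarised = summarised_counts.get(sourcetype, 0)
        pct = 100.0 * summarised / raw if raw else 0.0
        report[sourcetype] = {
            "coverage_pct": round(pct, 1),
            "ok": pct >= threshold,
        }
    return report


# Hypothetical daily counts for an established and a newly added firewall.
raw_counts = {"demo:paloalto": 200_000, "demo:newfw": 350_000}
summarised_counts = {"demo:paloalto": 198_000}  # nothing for the new firewall

report = coverage_report(raw_counts, summarised_counts)
for sourcetype, row in report.items():
    print(sourcetype, row)
```

A sourcetype that is present in the raw counts but absent from the summarised counts is the signature of a missing or misconfigured TA.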

Quick check · Q4 of 10 · Analyze

A tstats query over the Network Traffic data model returns zero results for a new firewall source, even though raw searches find events. Most likely cause?

Correct: b. Data model acceleration only builds tsidx summaries for events that match the data model's base search (its constraint on CIM tags and indexes). If the firewall TA is missing or misconfigured, events are not tagged and normalised for the CIM, so no summaries exist for tstats to read.
👉 So far: Data model acceleration builds tsidx files alongside buckets for CIM-normalised events. tstats reads tsidx — seconds, not minutes. Check coverage with the Data Model Audit dashboard.

📝 Wrap-up assessment — six more

You've answered four inline; six remain. 70% (7 of 10) marks the lesson complete.

Q5 · Remember

Which fields does Splunk commit to disk at index time by default?

Correct: b. Splunk stores only a small set of default fields at index time: host, source, sourcetype, _time and the raw event (_raw). Custom fields are extracted at search time via props.conf and transforms.conf — keeping the index lean and flexible.
Q6 · Understand

What is the primary reason Splunk recommends search-time over index-time field extraction?

Correct: c. Search-time extraction keeps stored data lean (only raw events on disk) and keeps extraction rules flexible — you update props.conf and the change is live immediately with no re-indexing. Index-time extraction permanently bakes the field into the index, requiring a full re-index to correct.
Q7 · Apply

You deploy a new cloud proxy TA that adds field aliases mapping vendor fields to CIM Web data model names. What must you do next to make tstats queries over the Web data model work for this new source?

Correct: c. After deploying the TA (which normalises events at search time), you must wait for the background acceleration summarisation job to run and build tsidx files for the newly normalised events. Only after tsidx files exist for those events will tstats return results for that source.
Q8 · Analyze

An ES correlation search for failed authentications fires for Windows and Linux events but never for a new VPN concentrator. The VPN events appear in raw searches. What is the most likely root cause?

Correct: c. ES correlation searches query CIM-normalised data models via tstats. If the VPN TA is absent or has wrong field mappings, VPN events do not match the Authentication data model's base search and no tsidx is built — so tstats (and thus the correlation search) returns zero for that source.
Q9 · Evaluate

An interviewer asks how you would speed up a slow Splunk ES deployment. What is the most impactful single step?

Correct: b. Ensuring CIM normalisation is complete and data model acceleration is enabled — so tstats can read compact tsidx summaries instead of scanning raw events — is the single biggest lever for ES search performance. Index-time extraction creates inflexibility with minimal benefit; adding search heads helps concurrency but not per-query speed.
Q10 · Evaluate

Which statement best describes the relationship between a Technology Add-on (TA) and the CIM?

Correct: d. The CIM add-on ships the data model definitions (the schema and domain structure — Network Traffic, Authentication, etc.). Technology Add-ons (TAs) ship the source-specific field aliases and extractions that map vendor-specific field names to the CIM canonical names. Together they make a source CIM-compliant.

🧠 In your own words

Type one line: why is search-time extraction better than index-time extraction for most custom fields? Then compare with the expert version.

Expert version: Search-time extraction keeps the on-disk index small (only raw events and default fields are stored) and, critically, lets you fix or improve an extraction rule in props.conf at any time without touching the stored data. Index-time extraction permanently bakes the field into the bucket: change your mind about the field name or regex, and you must re-index everything. In a multi-TB Splunk deployment that is an expensive, disruptive operation. Search time is also where field aliases and calculated fields are applied, which is how the CIM works, making it the foundation of every ES correlation search and tstats query.

📖 Glossary

Index
A named directory on disk holding time-based event buckets. Separate indexes by retention period and access control — never dump everything into 'main'.
Bucket
A time-bounded directory within an index holding the raw journal and index files. Ages through hot, warm, cold and frozen stages.
Search-time extraction
Field extraction applied when a query runs, configured in props.conf and transforms.conf. The standard approach — flexible and changeable without re-indexing.
Common Information Model (CIM)
A Splunk add-on defining a shared schema of canonical field names (src_ip, dest, action, etc.) grouped into data domains — the foundation of Splunk ES detection.
Field alias
A knowledge object that maps vendor-specific field names to CIM canonical names at search time. Deployed via a Technology Add-on (TA).
Data model acceleration
A Splunk feature that builds tsidx summary files alongside index buckets for events matching an accelerated data model, enabling near-instant tstats queries.
tsidx file
Time-series index file. Every bucket has tsidx files as its regular index; data model acceleration builds additional compact tsidx summaries containing only the fields of the data model. tstats reads these to skip raw event scanning.
tstats
An SPL command that queries tsidx summary files from accelerated data models, returning aggregated results in seconds instead of scanning raw events.
Technology Add-on (TA)
A Splunk app that provides field extractions, field aliases and sourcetype transformations for a specific data source, making it CIM-compliant.
Sourcetype
A classification tag set at parse time telling Splunk which rules to apply for line-breaking, timestamping and field extraction — e.g. pan:traffic, WinEventLog:Security.

What's next?

Got the indexing fundamentals? Next, go deep on Splunk ES correlation searches — how rules fire, notable events, risk scores and the adaptive response framework.