Splunk Architecture — Forwarders, Indexers, Search Heads & the Data Pipeline

Splunk is not one server — it is a pipeline. Data is collected by forwarders, parsed and stored by indexers in time-based buckets, and queried by search heads. This lesson maps every component, traces a log from machine to dashboard, and nails the one idea interviewers love: the index-time vs search-time split.

📅 2026-06-19 · ⏱ 16 min · 5 infographics · live pipeline demo · 🏷 10-Q assessment + AI Tutor inline

⚡ Quick Answer

A clear, interactive guide to Splunk architecture (2026): the data pipeline from forwarder to indexer to search head, universal vs heavy forwarders, the indexing pipeline and hot/warm/cold/frozen buckets, the index-time vs search-time split, distributed search, indexer and search head clustering for scale and HA, the deployment server, and the ingest licensing model.

Pick where you want to start

  1. The pipeline. Three tiers: forwarder, indexer, search head.
  2. Forwarders & indexers. UF vs HF, parsing, indexing, buckets.
  3. Index vs search time. What happens when, and schema-on-read.
  4. Scale & HA. Clusters, deployment server, licensing.

🧠 Warm-up — 3 questions, no score

Just notice which ones make you pause. We answer all three inside the lesson.

1. Is Splunk a single server that does everything?

Answered in The pipeline.

2. Which forwarder does NOT parse data before sending it?

Answered in Forwarders & indexers.

3. When are most fields extracted in Splunk?

Answered in Index vs search time.

Most engineers think…

Most people picture Splunk as 'a log server you point your logs at and search'. That single-box mental model falls apart the moment an interviewer asks how Splunk scales, or why your custom field is missing.

Splunk is a distributed pipeline: forwarders collect data from machines, indexers parse it into timestamped events and store it in time-based buckets, and search heads run queries by fanning out to every indexer. The other idea that trips people up is index-time vs search-time: Splunk does only light processing as data lands and extracts most fields when you search — it is schema-on-read. Understand the three tiers and that split, and the rest (clustering, buckets, licensing) clicks into place.

① The big picture — Splunk is a three-tier pipeline

The single most important idea: Splunk is not one server, it is a pipeline with three roles. Data flows forwarder ▸ indexer ▸ search head. A forwarder on each machine collects logs and ships them to an indexer, which turns the raw stream into searchable, timestamped events and stores them. A search head is where people run queries and build dashboards — it asks the indexers for results.

In a tiny lab all three roles can live on one instance, but in any real deployment they are separate so each tier can scale on its own. The interview line: collection, storage and search are three different jobs done by three different components. Get that flow on a whiteboard and you have already passed half the question.
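
To make the three roles concrete, here is a minimal Python sketch of the flow. It is a toy model, not Splunk code: the class names, the round-robin spreading of events and the "first token is the timestamp" parsing are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Indexer:
    """Storage tier: turns raw lines into events and keeps them."""
    name: str
    events: list = field(default_factory=list)

    def index(self, host: str, raw: str) -> None:
        # Toy parsing: treat the first whitespace-separated token as the timestamp.
        timestamp, _, _ = raw.partition(" ")
        self.events.append({"_time": timestamp, "host": host, "_raw": raw})

    def search(self, term: str) -> list:
        # Each indexer only scans the events it stored itself.
        return [e for e in self.events if term in e["_raw"]]


@dataclass
class Forwarder:
    """Collection tier: reads lines on a host and ships them to the indexers."""
    host: str
    indexers: list

    def ship(self, lines: list) -> None:
        # Round-robin is an illustrative way to spread events across indexers.
        for i, line in enumerate(lines):
            self.indexers[i % len(self.indexers)].index(self.host, line)


@dataclass
class SearchHead:
    """Search tier: stores nothing, asks every indexer and merges the answers."""
    peers: list

    def search(self, term: str) -> list:
        results = [e for peer in self.peers for e in peer.search(term)]
        return sorted(results, key=lambda e: e["_time"])


indexers = [Indexer("idx1"), Indexer("idx2")]
Forwarder("web01", indexers).ship([
    "2026-06-19T10:00:01 sshd Failed password for root",
    "2026-06-19T10:00:02 sshd Accepted password for vikram",
    "2026-06-19T10:00:03 sshd Failed password for admin",
])
hits = SearchHead(indexers).search("Failed")
print([h["_raw"] for h in hits])
```

Notice that the search head holds no events of its own. It only knows which peers to ask, which is why each tier can be scaled separately.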

Figure 1 — The Splunk data pipeline, end to end
Flow: Source (logs on a machine) ▸ Forwarder (collect & ship) ▸ Indexer (parse & store) ▸ Search head (query & dashboard) ▸ Analyst (answers & alerts)
Every log takes the same path: collected by a forwarder, made searchable by an indexer, queried by a search head.
Quick check · Q1 of 10 · Understand

Splunk architecture is best described as…

Correct: b. Splunk separates three jobs: forwarders collect data, indexers parse and store it, and search heads run queries. In a real deployment these are distinct components so each tier can scale on its own.
👉 So far: Splunk is a three-tier pipeline: forwarders collect, indexers parse and store in time-based buckets, search heads query. Collection, storage and search are three different jobs.

② Forwarders and indexers — collection and the bucket lifecycle

There are two kinds of forwarder. The universal forwarder (UF) is a small, separate agent that ships raw, mostly unparsed data; it is the default on every endpoint and server because it is light on CPU and memory. The heavy forwarder (HF) is a full Splunk Enterprise instance with some features disabled; it can parse, filter and route data before sending it, which is useful when you must drop noise or split traffic by content.
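
The difference in behaviour can be sketched in a few lines of Python. This is only an analogy: `universal_forward`, `heavy_forward`, `drop_pattern` and `routes` are hypothetical names invented for this example, not Splunk settings or APIs.

```python
import re


def universal_forward(lines):
    """UF-style behaviour in this sketch: pass every line through untouched."""
    return [("default", line) for line in lines]


def heavy_forward(lines, drop_pattern, routes):
    """HF-style behaviour in this sketch: drop noise, then route by content.

    drop_pattern and routes are hypothetical parameters for illustration.
    """
    out = []
    for line in lines:
        if re.search(drop_pattern, line):
            continue  # filtered out before it ever reaches an indexer
        # First matching route wins; anything unmatched goes to "default".
        target = next(
            (dest for pattern, dest in routes if re.search(pattern, line)),
            "default",
        )
        out.append((target, line))
    return out


lines = [
    "DEBUG heartbeat ok",
    "ERROR payment failed id=42",
    "INFO user login vikram",
]
uf = universal_forward(lines)
hf = heavy_forward(
    lines,
    drop_pattern=r"^DEBUG",
    routes=[(r"^ERROR", "security_indexers")],
)
print(len(uf), "lines sent by the UF sketch")
print(hf)
```

The heavy path does more work per line, which is the trade-off the lesson describes: use it only where filtering or routing before sending is worth the extra load.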

What the indexer does

The indexer runs the pipeline that makes data searchable: it identifies events, finds timestamps, does light parsing, then writes events to index buckets. Buckets age through stages: hot (being written), warm (closed but still on fast disk), cold (older, moved to cheaper, slower storage), then frozen (rolled off and deleted, or archived). Archived frozen data can be thawed back in if you ever need it. This lifecycle is how Splunk controls cost versus retention.
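
The lifecycle can be pictured as a simple state function. The sketch below is a toy model that ages buckets purely by days, with made-up thresholds. Real deployments roll buckets based on configured size and time limits, so treat every number here as an assumption.

```python
from enum import Enum


class Stage(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"


# Hypothetical age thresholds in days, chosen only for illustration.
WARM_AFTER_DAYS = 1
COLD_AFTER_DAYS = 30
FROZEN_AFTER_DAYS = 90


def stage_for(age_days: int) -> Stage:
    """Map a bucket's age to its lifecycle stage in this toy model."""
    if age_days >= FROZEN_AFTER_DAYS:
        return Stage.FROZEN
    if age_days >= COLD_AFTER_DAYS:
        return Stage.COLD
    if age_days >= WARM_AFTER_DAYS:
        return Stage.WARM
    return Stage.HOT


def is_searchable(stage: Stage) -> bool:
    """Hot, warm and cold buckets stay searchable; frozen ones have left the index."""
    return stage is not Stage.FROZEN


for age in (0, 7, 45, 120):
    stage = stage_for(age)
    print(f"{age:>3} days -> {stage.value:<6} searchable={is_searchable(stage)}")
```

Raising the frozen threshold keeps data searchable for longer at the price of more disk, which is the cost versus retention trade-off in miniature.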

Figure 2 — Universal forwarder vs heavy forwarder
Universal forwarder (UF): tiny separate agent; sends raw, unparsed data; light on CPU and memory; default on every endpoint.
Heavy forwarder (HF): full Splunk instance, trimmed; parses before forwarding; can filter and route by content; heavier, so use where needed.
Use the light universal forwarder by default; reach for a heavy forwarder only when you must parse, filter or route before sending.
Figure 3 — Bucket lifecycle — hot to frozen
Hot (being written, fast disk) ▸ Warm (closed, still fast) ▸ Cold (older, cheaper disk) ▸ Frozen (deleted or archived)
Index buckets age by time and size, moving to cheaper storage and finally rolling off (or being archived).
📡 Universal forwarder

A tiny separate agent on each machine that collects logs and ships them raw and unparsed. The default collector, light on CPU and memory.

🏗️ Indexer

Parses the stream into timestamped events and stores them in time-based buckets. The storage and search-peer tier of Splunk.

🔍 Search head

Where you run SPL and build dashboards. It does not store the data; it fans the query out to all indexers and merges their results.

🪣 Bucket lifecycle

Hot (being written) ▸ warm (closed, fast) ▸ cold (cheaper disk) ▸ frozen (deleted or archived). This is how Splunk tiers cost vs retention.

Name the three tiers, then the forwarder split

In an interview, draw forwarder ▸ indexer ▸ search head first, then add the detail: universal forwarder for light raw collection (the default), heavy forwarder only when you must parse, filter or route before sending. Indexers store data in time-based buckets; search heads never store data, they query the indexers.

Quick check · Q2 of 10 · Remember

Which forwarder sends raw, mostly unparsed data and is the lightweight default?

Correct: c. The universal forwarder is a tiny separate agent that ships raw data with minimal processing. The heavy forwarder is a full Splunk instance that can parse, filter and route before sending.
👉 So far: Universal forwarder = light, raw, the default collector; heavy forwarder = full instance that parses/filters/routes. Indexers age buckets hot ▸ warm ▸ cold ▸ frozen to balance cost vs retention.

③ Index-time vs search-time — the split that decides everything

This is the question that separates juniors from seniors. Index time is what happens once, on the indexer, as data is written to disk: breaking the stream into events, assigning the timestamp, host, source and sourcetype, and writing the raw event plus an index of its keywords. Splunk keeps index-time processing deliberately light — heavy index-time field extraction slows ingest and is hard to change later.

Search time is what happens every time you run a query: most field extraction, lookups, calculated fields and aliases are applied to the raw events as they are read. The search head drives this work; in a distributed deployment the indexers, acting as search peers, apply much of it to their own data using the knowledge objects the search head sends them. Because the schema is applied when you read, Splunk is called schema-on-read: you do not have to define columns up front, and you can add or change field extractions later without re-indexing. Splunk's own guidance is to do most knowledge work at search time. The practical payoff: if a field is missing, it is almost always a search-time extraction to fix, not a re-ingest.
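
Schema-on-read is easy to demonstrate with plain Python. In this sketch the stored data is just a list of raw strings, and the "schema" is a list of regular expressions applied when you search. The event text, field names and the `search` helper are invented for illustration and are not Splunk's implementation.

```python
import re

# The stored data: raw events only, written once and never modified.
RAW_EVENTS = [
    "2026-06-19T10:00:01 sshd Failed password for root from 10.0.0.5",
    "2026-06-19T10:00:02 sshd Accepted password for vikram from 10.0.0.9",
]


def search(raw_events, extractions):
    """Apply field extractions at read time and return one dict per event."""
    results = []
    for raw in raw_events:
        fields = {"_raw": raw}
        for pattern in extractions:
            match = re.search(pattern, raw)
            if match:
                fields.update(match.groupdict())
        results.append(fields)
    return results


# Version 1 of the schema only knows about the user field.
v1 = [r"for (?P<user>\w+)"]
first = search(RAW_EVENTS, v1)
print(first[0].get("user"), first[0].get("src_ip"))

# Version 2 adds src_ip. The stored events are untouched; only the read changes.
v2 = v1 + [r"from (?P<src_ip>\d+\.\d+\.\d+\.\d+)"]
second = search(RAW_EVENTS, v2)
print(second[0].get("user"), second[0].get("src_ip"))
```

The second search returns the new field for events that were stored before the extraction existed, which is the "no re-indexing" property in miniature.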

Figure 4 — Index time vs search time
Index time (once, indexer): break stream into events; assign timestamp, host, source; set sourcetype; light, keep it lean.
Search time (per query, search head): most field extraction; lookups & calculated fields; aliases & eval; change without re-indexing.
Light work once on the indexer; most field work every time you search on the search head. That is schema-on-read.
Don't extract everything at index time

A classic mistake is forcing lots of custom field extraction at index time. It slows ingest, bloats the index, and is hard to change because the data is already written. Keep index time lean (events, timestamp, sourcetype) and do most field work at search time — that is the whole point of schema-on-read.

Quick check · Q3 of 10 · Apply

A custom field is missing from your search results. Where do you usually fix it?

Correct: a. Splunk is schema-on-read: most fields are extracted at search time. A missing field is almost always a search-time extraction to add or fix, which does not require re-indexing the data.
👉 So far: Index time = light, once, on the indexer (events, timestamp, sourcetype). Search time = most field extraction, per query, on the search head. Splunk is schema-on-read — add fields without re-indexing.

④ Scaling out — clusters, the deployment server and licensing

One indexer and one search head do not survive a busy SOC. For throughput you add more indexers and let the search head run distributed search across all of them. For high availability you use an indexer cluster: a cluster manager maintains multiple copies of every bucket (the replication factor, default 3), of which some are kept searchable (the search factor, default 2), so a dead indexer loses no data. A search head cluster adds the same resilience to the search tier: members share knowledge objects and search jobs, so searching continues if one member fails.
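
Distributed search follows a fan-out and merge pattern, which the sketch below imitates with a thread pool. The peer names, the sample events and both function names are hypothetical. The point is only that each peer computes a partial result over its own data and the caller merges them.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical peers, each holding only its own slice of the events.
PEERS = {
    "idx1": ["web01 ERROR", "web01 INFO", "db01 ERROR"],
    "idx2": ["web01 ERROR", "db01 INFO"],
    "idx3": ["db01 ERROR", "db01 ERROR"],
}


def peer_count_errors(events):
    """Runs on one peer: count ERROR events per host over local data only."""
    return Counter(e.split()[0] for e in events if "ERROR" in e)


def distributed_count(peers):
    """Plays the search head: fan out to every peer, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(peer_count_errors, peers.values()))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


result = distributed_count(PEERS)
print(result)
```

Adding a fourth peer means each peer scans a smaller slice of the data, which is why adding indexers helps search throughput as well as ingest.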

Management and cost

The deployment server is the config push tool — it sends apps and settings out to your fleet of forwarders so you are not editing each one by hand. On cost, classic Splunk licensing is ingest-based (you pay by GB of data indexed per day), while Splunk Cloud also offers workload pricing measured in compute units (SVCs). Either way, the architecture lesson is the same: ingest and search both cost money, so you tier storage with buckets and filter noise early.
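
For ingest-based sizing, a back-of-the-envelope estimate is often enough to start a conversation. The helper below is a rough sketch with invented inputs (event rate, average event size and the share of events dropped before indexing). It uses decimal gigabytes and does not claim to match how a license meter counts volume.

```python
SECONDS_PER_DAY = 86_400


def daily_ingest_gb(events_per_second, avg_event_bytes, dropped_fraction=0.0):
    """Estimate GB indexed per day (decimal GB) after dropping a share of events."""
    if not 0.0 <= dropped_fraction <= 1.0:
        raise ValueError("dropped_fraction must be between 0 and 1")
    kept_per_second = events_per_second * (1.0 - dropped_fraction)
    return kept_per_second * avg_event_bytes * SECONDS_PER_DAY / 1e9


# Hypothetical fleet: 2,000 events per second at about 500 bytes each.
baseline = daily_ingest_gb(2_000, 500)
# Same fleet after filtering out a quarter of the events as noise.
filtered = daily_ingest_gb(2_000, 500, dropped_fraction=0.25)
print(f"baseline: {baseline:.1f} GB/day, after filtering: {filtered:.1f} GB/day")
```

With these made-up numbers, filtering a quarter of the events takes the estimate from about 86.4 to about 64.8 GB per day, which shows why filtering noise early matters under ingest pricing.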

Figure 5 — Scaling Splunk — clusters and management
Components shown: Search head (distributed search); Indexer 1, Indexer 2, Indexer 3 (peers); Cluster manager; Deployment server; Forwarder fleet.
A search head fans queries across all indexers; clusters add HA; the deployment server pushes config to forwarders.

Vikram at a Pune fintech SOC faces this

Dashboards suddenly slow to a crawl and some searches time out during peak hours. A single indexer is doing all the work.

Likely cause

Everything runs on a single all-in-one Splunk box — one indexer is both storing all data and serving every search, with no distribution or HA.

Diagnosis

Check the monitoring console: the lone indexer is CPU-bound at ingest and search at once, and there is no second peer to share load or survive a failure.

Settings ▸ Distributed Environment ▸ Indexer Clustering + Search Head Clustering
Fix

Split the tiers: add indexer peers behind a cluster manager (set a replication and search factor), put searches on a dedicated search head doing distributed search, and push forwarder config from a deployment server.

Verify

Re-run the heavy dashboard: the query now fans across multiple peers, response times drop, and killing one indexer in a test loses no data.

Prove HA by killing a peer

Never trust 'the cluster should be fine'. With a healthy replication and search factor you can take one indexer offline and searches still return complete results from the replica copies. Test it in a maintenance window — that single check proves your HA design actually works.
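
The idea behind the kill-a-peer test can be rehearsed on paper first. The sketch below places bucket copies on peers with a simple rotating scheme and then checks which buckets would have no surviving copy. The placement logic and all names are assumptions for illustration. A real cluster manager is more sophisticated, and the search factor is not modelled here.

```python
def place_copies(bucket_ids, peers, replication_factor):
    """Assign each bucket to `replication_factor` distinct peers (toy placement)."""
    if replication_factor > len(peers):
        raise ValueError("replication factor cannot exceed the number of peers")
    placement = {}
    for i, bucket in enumerate(bucket_ids):
        # Rotate through the peer list so copies land on distinct peers.
        placement[bucket] = {
            peers[(i + k) % len(peers)] for k in range(replication_factor)
        }
    return placement


def lost_buckets(placement, dead_peers):
    """Return the buckets whose every copy lived on a dead peer."""
    dead = set(dead_peers)
    return [bucket for bucket, holders in placement.items() if holders <= dead]


peers = ["idx1", "idx2", "idx3"]
buckets = [f"b{i}" for i in range(6)]

with_replicas = place_copies(buckets, peers, replication_factor=2)
no_replicas = place_copies(buckets, peers, replication_factor=1)

print("RF=2, idx2 down:", lost_buckets(with_replicas, ["idx2"]))
print("RF=1, idx2 down:", lost_buckets(no_replicas, ["idx2"]))
print("RF=2, idx1+idx2 down:", lost_buckets(with_replicas, ["idx1", "idx2"]))
```

In this toy model a replication factor of 2 survives any single peer failure but not every double failure, which is the kind of limit the maintenance-window test is meant to confirm on the real cluster.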

▶ Watch a login log travel from a server to a SOC dashboard

How one log line becomes a searchable event end-to-end.

① Collect: A universal forwarder on a Linux server reads a new line in auth.log and ships it to the indexer tier.
② Parse & index: An indexer breaks the stream into an event, assigns the timestamp, host and sourcetype, and writes it into a hot bucket.
③ Distributed search: An analyst runs a search; the search head fans the query out to every indexer peer, which each scan their own buckets.
④ Merge & show: The search head merges the peers' results, applies search-time field extraction, and renders the dashboard panel.
Quick check · Q4 of 10 · Analyze

You need both more search throughput and no data loss if an indexer dies. What do you use?

Correct: d. An indexer cluster keeps multiple replicated and searchable copies of each bucket (replication and search factor), so a dead indexer loses nothing, while distributed search spreads query load across all peers.
👉 So far: Scale with more indexers + distributed search; get HA with indexer and search head clusters (replication + search factor). The deployment server pushes config to forwarders; licensing is by daily ingest (GB) or compute (SVC).

📝 Wrap-up assessment — six more

You've answered 4 inline. Six left. 70% (7 of 10) marks the lesson complete on your profile. Tap Submit all answers at the end.

Q5 · Remember

Which tier actually parses data into events and stores it on disk?

Correct: c. The indexer runs the parsing and indexing pipeline — it breaks the stream into timestamped events and writes them to index buckets. Forwarders collect; search heads query; the deployment server pushes config.
Q6 · Understand

Why is the universal forwarder preferred over the heavy forwarder for most endpoints?

Correct: a. The universal forwarder is purpose-built to be lightweight — it collects and ships raw data without heavy processing, so it is safe to run on every server and endpoint. Use a heavy forwarder only when you must parse, filter or route first.
Q7 · Apply

A bucket has just rolled off and been deleted from the index (optionally archived). What stage is that?

Correct: b. Frozen is the end of the lifecycle: the bucket is removed from the searchable index and either deleted or archived. Archived frozen data can later be thawed back in. Hot is being written; warm and cold are still searchable.
Q8 · Analyze

Why can you add a new field extraction in Splunk without re-indexing your data?

Correct: c. Splunk applies most of the schema when you read (search), not when you write (index). So a new or changed field extraction takes effect on the next search against the already-stored raw events — no re-indexing required.
Q9 · Evaluate

An interviewer asks how to scale Splunk for more data and queries. Best answer?

Correct: b. You scale horizontally: add indexers and let the search head run distributed search across them, then cluster the indexer and search head tiers for HA. A single bigger box is the wrong mental model and gives you no resilience.
Q10 · Evaluate

What is the strongest reason to keep index-time processing light?

Correct: c. Index-time work runs on every event as it is written and is baked into stored data. Overloading it hurts ingest performance and storage, and you can't easily revise it later. Doing field work at search time keeps ingest lean and flexible.

🧠 In your own words

Type one line: why is Splunk called a 'pipeline' rather than 'a log server', and what is schema-on-read? Then compare with the expert version.

Expert version: Splunk is a pipeline because three different components do three different jobs: forwarders collect logs from machines, indexers parse them into timestamped events and store them in time-based buckets (hot ▸ warm ▸ cold ▸ frozen), and search heads run queries by fanning out to every indexer (distributed search) and merging the results. Schema-on-read means Splunk does only light processing at index time (events, timestamp, sourcetype) and applies most field extraction at search time, when you read the data — so you can add or change fields later without re-indexing. That same split is why you scale by adding indexers and clusters, not by buying a bigger box.

📖 Glossary

Forwarder
A Splunk agent that collects data on a source machine and sends it to indexers. Universal (light, raw) or heavy (full instance that parses and routes).
Universal forwarder (UF)
A tiny, separate agent that ships raw, mostly unparsed data with minimal CPU and memory — the default collector on endpoints and servers.
Heavy forwarder (HF)
A full Splunk Enterprise instance with features trimmed; it parses, filters and routes data before forwarding.
Indexer
The component that parses incoming data into timestamped events and stores them in index buckets; also acts as a search peer.
Search head
Where users run SPL queries and dashboards. It does not store data — it fans searches out to indexers and merges the results.
Bucket
A time-based directory of indexed data on an indexer. Buckets age hot ▸ warm ▸ cold ▸ frozen (deleted or archived; archived data can be thawed).
Index time vs search time
Index time = light processing once on the indexer (events, timestamp, sourcetype). Search time = most field extraction per query on the search head — schema-on-read.
Distributed search
A search head sends a query to every indexer peer at once; each searches its own data and returns partial results that the search head merges.
Indexer cluster (replication / search factor)
A group of indexers that keep multiple copies of each bucket (replication factor) and multiple searchable copies (search factor) for high availability.
Deployment server
The management tool that pushes apps and configuration out to a fleet of forwarders so they are not configured one by one.

What's next?

Got the architecture? Next, learn SPL — the Search Processing Language — so you can actually pull answers out of all that indexed data: search, stats, eval, transforms and the pipe model.