Codmaker Studio logo
CybersecurityAILLMArchitecture

Building an AI Security Pipeline with LLMs in 2026

Practical AI security pipeline patterns for threat detection in 2026 — LLM triage, log analysis, and the architecture decisions that actually work.

·13 min read
Building an AI Security Pipeline with LLMs in 2026

What an AI Security Pipeline Actually Does

An AI security pipeline uses large language models to triage, enrich, and prioritize security events that traditional rule-based systems either miss or flood with false positives. The pipeline sits between your log sources (CloudTrail, application logs, EDR alerts) and your incident response process, replacing the analyst-tier-1 role of 'is this real' with an LLM that reads the context and decides.

The promise is alert quality, not alert volume. A traditional SIEM produces 10,000 alerts per day across a mid-sized organization; analysts triage maybe 100 of them. An AI pipeline can reduce that to 50-200 high-quality alerts per day with explanations attached, which a human team can actually action.

This guide is the architecture patterns we recommend after building security-adjacent pipelines for AdMetric Pro and reviewing the AI security tooling space going into 2026. It is not a vendor pitch — most of the components below are open source or BYO models.

Why Traditional SIEMs Hit a Wall

Traditional SIEMs (Splunk, Elastic Security, Sumo Logic, the open-source Wazuh) work via rules: 'if user_agent contains X and login_country differs from previous, alert.' Rules are precise but brittle. They miss novel attacks, drown analysts in false positives, and require constant tuning.

The fundamental limitation is that rules cannot reason about context. A rule sees 'login from new country' and alerts. An analyst sees 'login from new country, but the user is on a known business trip mentioned in this morning's Slack channel, and the device fingerprint matches' and dismisses. The reasoning step is what LLMs add.

The other limitation is that rules require a human to author them. New threat patterns require new rules; analysts spend days each quarter updating detection content. LLM-augmented pipelines can identify novel patterns by reading raw events and applying broad reasoning, without explicit rules for each scenario.

What rules do better: they are deterministic, fast, and cheap. The right architecture is layered — rules as the first-pass filter, LLMs as the second-pass enrichment. Replacing rules entirely with LLMs would be slow and expensive. The pipeline below is the layered approach.

The Three-Layer Architecture

Three layers, each with a distinct job. The separation is what makes the pipeline economically viable at production event volumes.

Layer 1: Ingestion plus Normalization. Pull events from all sources (cloud audit logs, EDR, application logs, network telemetry). Normalize to a common schema (OCSF is the emerging standard). Filter out the 95% of events that are clearly benign with simple rules. Output: a stream of pre-filtered events at maybe 5% of original volume.

Layer 2: LLM Triage. Each pre-filtered event gets enriched with context (user profile, recent activity, related events, threat intel lookups) and submitted to an LLM with a structured prompt: 'Given this event and this context, classify as benign/suspicious/malicious, explain reasoning, suggest priority 1-5.' Output: a structured assessment per event.

Layer 3: Response. High-priority malicious events trigger automated containment (disable user, isolate device, revoke tokens). Medium-priority events route to a human analyst queue with the LLM's reasoning attached. Low-priority events are logged for trend analysis.

This separation matters. Layer 1 is cheap and fast — every event passes through. Layer 2 is expensive but only sees pre-filtered events. Layer 3 only fires on a small subset. The cost scales gracefully even at high event volume.

Layer 1: Log Ingestion and Normalization

Choose a log shipper that fits your sources. Vector (open source, very fast) is our default — handles transformation, normalization, and routing in one tool. Alternatives: Fluent Bit, Logstash, or commercial options like Cribl.

Schema choice matters. Open Cyber Security Schema Framework (OCSF) is being adopted by AWS, Splunk, and most modern security tooling. Normalizing to OCSF early means your LLM prompts work consistently regardless of source. The investment pays back in week two.

Pre-filtering with rules removes noise. Common filters: known service accounts performing routine operations, scheduled jobs from CI/CD systems, healthchecks, monitoring probes. A well-tuned filter set reduces event volume by 90-95% without losing signal.

Storage: index events in a queryable store for context retrieval (Elasticsearch, OpenSearch, or a columnar store like ClickHouse). The LLM in Layer 2 needs to query 'what other events did this user generate in the last hour' — that requires a fast index.

Layer 2: LLM Triage and Enrichment

This is where the AI lives. The pattern: for each pre-filtered event, run an enrichment pipeline that adds context, then submit to an LLM with a structured prompt.

Enrichment sources: user profile (department, role, manager, recent geo), device profile (known fingerprints, OS, recent activity), threat intel (IP reputation, known-bad domains, recent CVE matches), related events (other events from this user or device in the last hour).

The prompt template is simple: feed the LLM the normalized event, the user context, the recent activity, and any threat intel matches, then ask it to classify the event as benign, suspicious, or malicious, explain its reasoning in two or three sentences, and assign a numeric priority. Use structured output (JSON mode) to force a parseable response.

Model choice: a moderately powerful model (Claude Sonnet, GPT-4o-mini, or Llama 3.3 70B self-hosted) is the right balance. Frontier models are overkill at this volume. Cache enrichment lookups aggressively — the same user profile and threat intel queries happen across hundreds of events per hour; caching cuts enrichment cost by 80%.

Layer 3: Automated Response and Human Loop

Automated response is the most consequential part of the pipeline. Get it wrong and you lock out the CEO during a board meeting. Get it right and you contain breaches in seconds.

The human loop is non-negotiable for any irreversible action. Even for Priority 1 events, build a reversible default — disabled accounts get auto-re-enabled in 4 hours unless a human confirms; isolated devices show a notification with a help URL.

This 'reversible default' pattern is what makes automated response safe enough to deploy. Without it, one false positive that locks out the wrong person leads to the system being permanently downgraded to 'alert only' — which defeats the purpose.

  • Priority 1 (confirmed malicious): immediate automated containment — disable account, isolate device, revoke tokens — and human notification
  • Priority 2 (high confidence suspicious): automated soft response — require re-auth, throttle access, alert security team
  • Priority 3 (medium confidence): human review queue with full LLM reasoning attached
  • Priority 4-5 (low): logged for trend analysis, no immediate action
  • All automated actions: reversible by default with auto-rollback if not confirmed

Choosing Models: Speed vs Accuracy Trade-offs

Three model tiers cover almost any pipeline. Route between them dynamically based on event characteristics.

Fast tier (Haiku, GPT-4o-mini, Llama 3.3 8B): under 500ms latency, $0.0001-0.001 per event. Use for high-volume pre-classification — does this event look interesting enough to warrant deeper analysis? Accuracy is good enough to catch obvious patterns.

Medium tier (Claude Sonnet, GPT-4o, Llama 3.3 70B): 1-3 seconds, $0.001-0.01 per event. The workhorse tier. Use for the main triage step where reasoning matters.

Slow tier (Claude Opus, GPT-5, frontier models): 3-10 seconds, $0.01-0.1 per event. Use only for high-priority deep investigation — events flagged as 'maybe malicious, need careful analysis.' Volume should be 1-5% of total.

Most events go through fast and medium. A small subset escalates to slow. Total spend stays manageable while quality on hard cases stays high.

False Positives: The Constraint That Kills Most Pipelines

The single biggest failure mode of AI security pipelines is false positive volume. A pipeline that produces 500 'suspicious' alerts per day for a team of 3 analysts is worse than no pipeline at all — analysts stop reading them.

Pattern 1: explicit benign-pattern teaching. Maintain a corpus of known-benign event patterns (CI/CD service accounts, scheduled jobs, security scan tools). Reference this corpus in the LLM prompt: 'These patterns are known benign and should not be flagged.' Cuts false positives dramatically in week one.

Pattern 2: confidence thresholding. Have the LLM emit a numeric confidence (0-100) alongside its classification. Only fire alerts above a threshold (e.g., 70). Tune the threshold based on analyst feedback — start high, lower if real events are slipping through.

Pattern 3: feedback loop. When an analyst marks an alert as a false positive, capture the event and context and use it as a negative example in future prompts. This requires a structured feedback UI but pays off in declining false positive rates over weeks and months.

Cost Management at Production Scale

At enterprise event volumes (millions per day), naive pipelines cost tens of thousands per month in LLM API spend. Three rules keep costs sustainable.

Rule 1: pre-filter aggressively. Layer 1 should remove 90-95% of events. Every event that reaches the LLM costs money; every event the LLM never sees is free.

Rule 2: use the smallest model that works. Most triage decisions do not need a frontier model. A fine-tuned Llama 3.3 70B running self-hosted on a single GPU server handles the volume that would cost tens of thousands per month on hosted GPT-4o. Operational cost is one engineer's quarter of time.

Rule 3: cache repeated patterns. Many security events have repeated structure (same user, same source IP, same event type). Cache the LLM's assessment for identical event signatures with a 1-hour TTL. Cache hit rates of 30-50% are achievable.

For a mid-sized organization processing ~5M events/day through this pipeline, our reference architecture (self-hosted Llama for triage, hosted Sonnet for escalations) costs roughly $3,000-5,000/month — vs $30,000+/month with naive hosted-only approaches [verify against your model pricing].

Real Pattern: Suspicious Login Detection End-to-End

A worked example: detecting account takeover via suspicious login patterns. Event sources: AWS CloudTrail Console login events plus identity provider audit logs (Okta, Google Workspace).

Layer 1 filtering drops events from known service principals, healthchecks, and routine SSO flows. Keep raw human login events. Roughly 95% volume reduction.

Layer 2 enrichment pulls user profile (department, manager, normal login geo, device fingerprint history), recent activity (other logins in last hour, recent privilege changes), and threat intel (IP reputation, ASN reputation, recent compromise feeds). The LLM then classifies the login as benign, suspicious, or malicious given the context.

Reasoning examples the model handles well: 'login from new country but device matches and user mentioned travel in recent calendar event' → benign; 'login at 3am from never-seen country on never-seen device' → suspicious; 'login followed immediately by privilege escalation request from unfamiliar IP' → malicious.

Layer 3 response: malicious → disable account, force MFA re-enroll, notify security team; suspicious → require re-auth on next action, log to review queue; benign → no action, log for trend. In our benchmark testing on synthetic enterprise login data, this pipeline catches roughly 95% of account takeover patterns at a false positive rate below 2% — well above rule-based baselines [verify with your environment].

Frequently Asked Questions

Common questions from security and engineering teams evaluating whether to build an AI security pipeline in-house.

  • Can I build this without machine learning expertise? Yes. The pipeline uses LLMs via API or self-hosted runtime, not custom-trained ML. Standard backend engineering skills are enough.
  • Which open-source SIEM should I pair this with? Wazuh and Elastic Security are the strongest open-source options. Both integrate with Vector for log shipping and can serve as Layer 1.
  • How do I evaluate pipeline accuracy? Build a synthetic test set of known benign and known malicious events. Run the pipeline against it weekly. Track precision and recall over time.
  • What about LLM-injection attacks on security tooling? Real concern. Treat all log data as untrusted in prompts; use system prompts to instruct the model to ignore injected instructions; validate structured outputs.
  • Should I use a vendor (Vectra, Darktrace) or build this? Vendors are faster to deploy and well-engineered. Building lets you customize and own the data. Build if you have engineering capacity and customization needs; buy if you do not.

Related Reading

Adjacent material on the AI infrastructure and automation pieces that connect to a security pipeline.

  • AI Cybersecurity: Automated Threat Defense 2026 — the broader landscape: /blog/ai-cybersecurity-automated-threat-defense-2026
  • Self-Hosted AI: Running Llama, Mistral, DeepSeek — the models that power the pipeline: /blog/self-hosted-ai-llama-mistral-deepseek-2026
  • n8n Workflow Automation: Build, Scale, Self-Host — orchestrating response workflows: /blog/n8n-workflow-automation-complete-guide
  • AI Models Compared: GPT vs Gemini vs Claude vs Llama — choosing your triage model: /blog/ai-models-compared-gpt-gemini-claude-llama
  • AdMetric Pro — built with the same engineering principles: /portfolio/admetric-pro

More articles

View all →