Turning Weeks of LLM Evaluation into Minutes in Security Operations

The Challenge
It's 2 AM. An alert fires. A SOC analyst stares at a screen full of indicators: IP addresses, file hashes, behavioral patterns, user activity logs. Is this a real threat or another false positive? They have minutes to decide, and the backlog isn't getting any shorter.
This is the reality of modern security operations. The average SOC handles thousands of alerts daily, and the vast majority turn out to be benign. Analysts are drowning in data, making split-second decisions that could mean the difference between catching a breach early and missing it entirely.
Large Language Models promised to change this. Feed an LLM the alert context, relevant policies, and historical cases, and let it help analysts triage faster. And it works. But there's a catch.
Security verdict evaluation is uniquely brutal. Unlike most LLM applications, success isn't about fluency or factual recall. It's about applying complex, often ambiguous organizational policies to context-heavy scenarios. Edge cases are rare but critical. You can't just test on the happy path. And ground truth is often delayed: you might not know if a verdict was right until weeks later when an incident either escalates or doesn't. The cost asymmetry is real. Miss a true threat, and you have a breach. Overcall benign activity, and you burn out your analysts with false positives.
So when an LLM gets something wrong, what do you do about it?
Most organizations test changes manually and hope for the best. Some have evaluation mechanisms, but they take weeks to accumulate enough production data to be confident. Very few can actually show, at scale, what impact a change will have before shipping it.
We had A/B tests and safe rollout mechanisms. Deploying changes wasn't the hard part. The bottleneck was evaluation. To know whether a change actually worked, you needed enough real production cases to accumulate. Days. Weeks. Depending on volume and the specific scenario you were trying to improve.
Every iteration meant another waiting period. You couldn't validate your next idea until you'd validated your current one. At scale, this becomes a nightmare: a queue of changes waiting for evaluation, each one blocking the next.
And this wasn't just the AI team's problem. Analysts themselves own many LLM-related changes. They had the ideas. They had the intuitions. But they couldn't move fast because evaluation was the bottleneck.
When feedback loops are this slow, people stop giving feedback. The LLM becomes a black box that "does what it does," and everyone works around its quirks instead of fixing them.
We refused to accept that.
The Bet We Made
For an AI-native company, there's a fundamental question: does AI expertise stay concentrated in a few specialists, or does it spread across everyone who needs it?
As a startup less than a year old, we made a strategic bet. We dedicated significant time to building AI Lab, not because we had resources to spare, but because democratizing access to complex LLM workflows was core to how we wanted to operate. The goal was never just "evaluate LLMs better." It was to empower everyone, developers and analysts alike, to work with this technology without needing years of specialized expertise.
The core technical innovation is mutation testing for LLMs: systematically modify inputs and see how outputs change. Remove a policy from the context and see if the verdict changes. Swap one model for another and compare results. Change how a previous step reasoned and watch the ripple effects. You're mapping the sensitivity of the system, understanding what actually matters versus what's noise.
But the technical innovation was always in service of a bigger goal: turning feedback loops that took weeks into minutes. Giving ownership to the people who actually use these systems every day.
How It Works
For the technically curious, here's what's under the hood.
Recording Verdicts
Every verdict run gets captured with complete provenance:
- Raw input event
- Organizational policies in scope
- Historical cases referenced
- Intermediate stage outputs
- Final verdict with confidence
- Execution metadata: model per stage, token counts, cache hits, execution time
This isn't just logging. It's a queryable corpus. Filter by verdict type, confidence ranges, policies used, time windows, or specific stages. This corpus becomes the foundation for everything else.
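As a rough sketch of what such a record and corpus query might look like (the schema and field names here are illustrative, not our actual implementation):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VerdictRun:
    """One recorded verdict run with full provenance (illustrative schema)."""
    run_id: str
    raw_event: dict                 # the raw input event
    policies_in_scope: list[str]    # organizational policies considered
    historical_cases: list[str]     # precedent cases referenced
    stage_outputs: dict[str, str]   # stage name -> intermediate output
    verdict: str                    # e.g. "malicious" | "benign"
    confidence: float               # 0.0 - 1.0
    metadata: dict = field(default_factory=dict)  # model per stage, tokens, timing


def query_corpus(corpus: list[VerdictRun], *, verdict: Optional[str] = None,
                 min_confidence: float = 0.0,
                 policy: Optional[str] = None) -> list[VerdictRun]:
    """Filter recorded runs by verdict type, confidence floor, or policy used."""
    return [r for r in corpus
            if (verdict is None or r.verdict == verdict)
            and r.confidence >= min_confidence
            and (policy is None or policy in r.policies_in_scope)]
```

The point of the structure is exactly what the text says: once every run is a record like this, "all malicious verdicts above 80% confidence that used policy P" is a one-line query, not a log-grepping exercise.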
Multi-Stage Pipeline
Our verdict generation breaks into independent stages. Each stage has its own model configuration, system prompts, user prompts, and structured outputs.
Why this matters: you can't test what you can't isolate.
When something goes wrong, you pinpoint exactly which stage caused it. More importantly, you can mutate individual stages and observe downstream effects. Change how one stage reasons, watch how it cascades through the rest.
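A minimal sketch of the idea, with each stage carrying its own model and prompt configuration and seeing the outputs of every stage before it (names and shapes here are hypothetical):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    """One pipeline stage with its own model config and prompts."""
    name: str
    model: str            # per-stage model configuration
    system_prompt: str
    run: Callable[[dict], dict]   # context in, structured output out


def run_pipeline(stages: list[Stage], event: dict) -> dict:
    """Run stages in order; each stage sees the event plus all prior outputs."""
    context = {"event": event}
    for stage in stages:
        context[stage.name] = stage.run(context)
    return context
```

Because each stage's output lands in the shared context under its own name, overriding one stage (the "stage outputs" mutation below) is just writing a different value under that key and re-running only the downstream stages.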
Mutation Testing
The key insight: you test against existing, already-recorded verdict runs. No waiting for new production data. This is the speed multiplier.
Four mutation categories:
Organizational context — Add, remove, or edit policies. Does the model incorporate new context correctly? Does removing a policy break a dependency you didn't know existed?
Stage outputs — Override intermediate step outputs. What if earlier analysis had concluded differently? How sensitive is the final verdict to each stage?
Inputs — Swap historical precedent, change input prompts, modify system prompts per stage. Test instruction-following and context sensitivity.
Models — Switch LLM providers or configurations per stage. Compare quality vs. cost tradeoffs.
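Conceptually, each mutation is just a pure function from a recorded run to a modified copy of its inputs, which you then replay. A sketch of three of the four categories (field names are illustrative):

```python
import copy


def remove_policy(run: dict, policy: str) -> dict:
    """Organizational-context mutation: drop one policy from scope."""
    mutated = copy.deepcopy(run)  # never modify the recorded run itself
    mutated["policies"] = [p for p in mutated["policies"] if p != policy]
    return mutated


def override_stage_output(run: dict, stage: str, output: str) -> dict:
    """Stage-output mutation: pretend an earlier stage concluded differently."""
    mutated = copy.deepcopy(run)
    mutated["stage_outputs"][stage] = output
    return mutated


def swap_model(run: dict, stage: str, model: str) -> dict:
    """Model mutation: change the provider/config for one stage."""
    mutated = copy.deepcopy(run)
    mutated["models"][stage] = model
    return mutated
```

The deep copy matters: the recorded corpus is immutable ground truth, and every mutation produces a fresh variant to replay against it.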
LLM-as-Evaluator
Mutation testing generates pairwise comparisons: original verdict vs. mutated verdict. Reviewing these manually doesn't scale.
We use an LLM to evaluate other LLMs. It looks for:
- Verdict classification changes (most critical)
- Confidence shifts above 5%
- Reasoning divergence in chain-of-thought
- Differences in evidence interpretation
Evaluation prompts are customizable. The workflow runs async in batch mode. Mutate 100 verdicts, apply 5 mutations each, get 500 automated comparisons. No manual review required.
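The first two checks in that list are cheap and deterministic, so a comparison function can flag them directly and reserve the LLM judge for the reasoning-level differences. A simplified sketch (the judge is stubbed out as an injected callable here; the real evaluator is prompt-driven):

```python
from typing import Callable, Optional


def compare_verdicts(original: dict, mutated: dict,
                     judge: Optional[Callable[[str, str], list[str]]] = None) -> dict:
    """Flag meaningful differences between an original and a mutated verdict run."""
    flags = []
    if original["verdict"] != mutated["verdict"]:
        flags.append("classification_changed")       # most critical
    if abs(original["confidence"] - mutated["confidence"]) > 0.05:
        flags.append("confidence_shift")             # shift above 5%
    if judge is not None:
        # LLM judge compares chain-of-thought for reasoning divergence
        # and differences in evidence interpretation (stubbed here).
        flags.extend(judge(original["reasoning"], mutated["reasoning"]))
    return {"flags": flags, "changed": bool(flags)}
```

Running this over 500 original/mutated pairs in a batch gives you a triaged list: only the pairs with flags need a human look.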
Structured Feedback
Free-text feedback disappears into noise. We use predefined categories mapped to failure modes:
- Context issues: wrong entity, policy misinterpreted, relevant context ignored
- Analysis issues: malformed output, logical errors
- Confidence issues: over/underconfident relative to evidence
- Classification issues: false positive patterns, missed threats
This taxonomy creates shared vocabulary and enables pattern detection. Fifteen "policy misinterpreted" instances for the same policy? You know exactly where to look.
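Because the taxonomy is a fixed set of categories rather than free text, pattern detection reduces to counting. A sketch of how that might look (the enum values mirror the taxonomy above; everything else is illustrative):

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Predefined feedback categories mapped to failure modes."""
    WRONG_ENTITY = "context: wrong entity"
    POLICY_MISINTERPRETED = "context: policy misinterpreted"
    CONTEXT_IGNORED = "context: relevant context ignored"
    MALFORMED_OUTPUT = "analysis: malformed output"
    LOGICAL_ERROR = "analysis: logical error"
    OVERCONFIDENT = "confidence: overconfident"
    UNDERCONFIDENT = "confidence: underconfident"
    FALSE_POSITIVE = "classification: false positive pattern"
    MISSED_THREAT = "classification: missed threat"


def top_failure_patterns(feedback: list[tuple[FailureMode, str]],
                         n: int = 3) -> list[tuple[tuple[FailureMode, str], int]]:
    """Count (failure mode, policy/entity) pairs to surface recurring issues."""
    return Counter(feedback).most_common(n)
```

Fifteen `(POLICY_MISINTERPRETED, "policy-7")` entries bubble straight to the top of this count, which is exactly the "you know where to look" signal free-text feedback never gives you.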
Dry-Run Mode
Preview your mutated inputs before running them through the LLM. See exactly what you're about to test, verify your mutations are configured correctly, and check how many runs will be affected. Catch mistakes before spending tokens.
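In sketch form, a dry run is the same scoping and mutation logic with the LLM call left out: you get the affected count and a sample of mutated inputs, and spend zero tokens (function shape is illustrative):

```python
from typing import Callable


def dry_run(corpus: list[dict],
            mutation: Callable[[dict], dict],
            predicate: Callable[[dict], bool],
            sample_size: int = 5) -> dict:
    """Preview a mutation: how many runs it touches and what the inputs become.

    No LLM calls are made; this only transforms recorded inputs.
    """
    affected = [r for r in corpus if predicate(r)]
    previews = [mutation(r) for r in affected[:sample_size]]
    return {"affected_count": len(affected), "sample_previews": previews}
```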
The technical architecture enables something simple: anyone can answer "what would happen if we changed X?" in minutes instead of weeks.
What Changed
When we released AI Lab, something happened that validated the entire bet.
Analysts started using it immediately. Not because we told them to, but because they finally had power over a system that had previously felt like someone else's problem. They ran tests. They identified issues. They changed actual production flows. They drove outcomes.
The transformation isn't "we can deploy faster." It's "we can know faster."
Speed. A change that would have taken weeks to validate can now be tested in minutes. Issues that used to sit in evaluation queues get diagnosed and fixed the same day.
Ownership. "The AI team will look into it" became "let me check what's happening." When people can investigate and fix issues themselves, they stop treating AI as someone else's responsibility.
Confidence. Before, deploying LLM changes felt like rolling dice. Now we validate changes against real cases before they go live. We know what we're shipping.
Culture. This is the big one. The conversation shifted from "the LLM does weird things sometimes" to "we have the power to make this better." People stopped working around problems and started solving them.
Watching analysts use AI Lab from day one, running tests, identifying issues, driving real improvements to production systems: that's the payoff. Not the architecture. Not the mutation testing framework. The fact that people who work with these systems every day finally have the tools to shape them.
Conclusion
That bet paid off.
Any organization using LLMs for consequential decisions will eventually face the question: who gets to improve this, and how fast can they do it?
Our answer is: everyone who works with it, and as fast as they can think.
Testing the untestable isn't magic. It's giving people the right tools and getting out of their way. That's what AI Lab is for us at Daylight Security, and we're just getting started.




