Building a Real-Time SOC Command Center: From Slack Alerts to Intelligent Security Operations

Bright curved horizon of a planet glowing against the dark backdrop of space.

At Daylight, we provide Managed Agentic Security Services, combining AI-driven automation with elite human security analysts to protect organizations from cyber threats in real-time. This is the story of how we built the internal platform that powers our security operations.

Part I: The Breaking Point

It started innocuously enough. A simple Slack channel named #activity-prod. Every time our security system detected a potential threat—a suspicious file share, an unusual login, a risky permission change—a message would appear. For our SOC team, this channel was their lifeline.

Each of these messages represented a case that our team needed to investigate as soon as possible.

Then we scaled.

What worked beautifully at 20-30 cases per day became a nightmare at 200+. Analysts would arrive at their desks each morning to find hundreds of unread messages. Critical high-severity incidents sat buried beneath routine alerts. There was no way to track who was investigating what. The team had developed an informal protocol of emoji reactions to claim cases, a creative workaround that highlighted just how far we'd stretched our tools beyond their limits.

"I don’t have good enough visibility," our SOC manager said during a particularly tense standup. "I don't know who's working on what. And I’m staring on slack and find it hard to understand the status of the work being done"

That's when we knew: Slack wasn't the problem. We'd outgrown the concept of notifications-as-workflow.

The Real Problems

As we dug deeper, we identified five fundamental issues:

1. No Prioritization

Every case appeared in chronological order, regardless of severity. A critical credential theft incident could easily be pushed off-screen by a dozen low-severity alerts. Analysts spent mental energy constantly re-evaluating priorities instead of investigating threats.

2. Zero Status Visibility

There was no way to see which cases were being actively investigated, which were pending, and which had been resolved. Teams developed ad-hoc systems—private threads, separate tracking spreadsheets—that fragmented information and increased cognitive load.

3. No Time Tracking or SLA Metrics

Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR) are critical SOC performance indicators. But in Slack, tracking live SLA timers and surfacing exceptions is far more complicated than it should be. We had to cobble together external tools to get any visibility into how long cases sat unassigned, which ones were approaching SLA breaches, and whether our commitments were being met. It worked, but it was inefficient and fragile.

4. The Constant Vigilance Problem

Analysts felt compelled to keep Slack open at all times, constantly checking for the agentic investigation to finish in order to know if they should take action or not. The lack of intelligent notifications meant either being overwhelmed with pings or missing critical alerts. There was no middle ground.

5. Manager Blindness

SOC managers had no dashboard, no overview, no way to see the state of operations without scrolling through hundreds of Slack messages. Capacity planning, workload distribution, and performance analysis were nearly impossible.

The team needed more than a better notification system. They needed an operations platform built specifically for security case management.

Why Not Off-the-Shelf?

The obvious question: why build this ourselves? We evaluated the usual suspects - Jira Service Management, PagerDuty, ServiceNow, and several SOC-specific ticketing platforms. Each one failed the same test.

Our architecture requires strict data residency: EU customer data lives in EU infrastructure, US data in US infrastructure. No exceptions. Most off-the-shelf tools assume a single centralized database.

We also needed sub-second case visibility. Security operations can't tolerate the eventual consistency model that most SaaS platforms rely on. When a high-severity case appears in any region, every analyst needs to see it immediately—not after a sync cycle, not after a webhook fires, not after a background job runs.

Finally, our workflow is unique. We're not managing IT tickets. We're orchestrating a human-in-the-loop AI investigation pipeline where analysts need to see the AI's work in real-time and intervene at precisely the right moment. No existing tool was designed for this.

The combination of multi-region data residency, real-time aggregation, and AI-native workflow meant we were looking at a genuinely hard distributed systems problem with no good off-the-shelf solution.

Part II: The Architecture Challenge

Design Principles

Before writing a single line of code, we established four non-negotiable principles:

1. Privacy and Compliance First

Our customers span multiple regions with varying data residency requirements. EU customer data must stay in EU infrastructure; US data in US infrastructure. We couldn't simply build a centralized database that aggregates everything.

2. Real-Time, Not Eventually Consistent

Security operations can't wait for background jobs or cache refreshes. When a high-severity case appears, analysts need to see it immediately.

3. Mobile-First Experience

Security doesn't stop when analysts leave their desks. The platform needed to work as well on a phone as on a 27-inch monitor.

4. Smart Notifications Only

Push notifications are powerful but easily become noise. The system should only notify analysts when they specifically need to take action.

The Core Technical Bet: Database-Free Aggregation

The most critical architectural decision was this: The platform holds no database.

Every customer's case data remains in its regional cluster (EU or US), stored in the databases that already power our security platform. The command center acts purely as an aggregation and presentation layer, and shows no PII or sensetive data.

Here's the high-level architecture:

When an analyst opens the platform, it simultaneously queries both EU and US regions via secure connections. Each region reads from its local database. The platform anonymizes customer-identifying data before merging results in-memory, ensuring that analysts see only what they need to investigate and presents a unified view.

Benefits:

Data sovereignty: Customer data never leaves its designated region
Scalability: Adding a new region means updating configuration, not complex migrations
Simplicity: No database replication or synchronization logic
Resilience: Regional outages don't corrupt central state because there is no central state

Challenges:

Query performance: Filtering and sorting happen at multiple levels
Consistency trade-offs: We prioritize availability over strong consistency

Part III: Key Technical Decisions

1. Multi-Region Aggregation Strategy

The heart of the system is the aggregation layer, which implements parallel request fanout with resilience:

Key Design Points:

Configurable timeout per region (we use 5 seconds)
Automatic retry with exponential backoff
Returns both successful results and detailed error metadata per region
Tracks latency for observability

How it works:

When a user requests data, the platform fans out requests to all configured regions in parallel. Each region processes the request independently. The platform waits for all responses (or timeouts), merges the successful results, and returns them along with metadata about any failures.

If one region were to fail, the system would continue serving results from available regions. The aggregation layer tracks status for each region (success, error, or timeout) along with latency metrics, and failures are logged for alerting.

2. Real-Time Updates: Hybrid Approach

We use a two-pronged approach for keeping the UI current:

Polling for Active View:

The active cases table polls every 3 seconds. This is "real-time enough" for case management—much simpler to implement than WebSockets, easier to debug, and works reliably across different network conditions.

Smart New Case Detection:

When polling detects new cases, they appear at the top of the list with a subtle highlight animation. The UI smoothly integrates new cases without jarring page refreshes, keeping analysts oriented in their workflow.

Push Notifications for Off-Screen:

For mobile and desktop notifications, we use the Web Push API with a message queue system:

New Case Created ↓ Message Queue ↓ Notification Service (Polls queue) ↓ Check User Preferences ↓ Respect Quiet Hours (Timezone-aware) ↓ Send Push to All User Devices ↓ Browser/Mobile OS Displays Notification

Smart Filtering:

Configurable notification preferences based on case severity, so analysts choose which severity levels trigger push notifications
Respects user-configured "quiet hours" with timezone support
Automatically removes invalid device subscriptions
Honors user notification preferences

3. Progressive Web App (PWA) for Mobile

Rather than building separate native apps, we built a Progressive Web App:

Why PWA:

90% of native app benefits with 10% of the development effort
No App Store approvals or review processes
Instant updates—no waiting for users to upgrade
Single codebase for desktop, iOS, and Android
Offline capability with graceful degradation

Key Features:

Installable on home screen (iOS, Android, desktop)
Push notifications that work like native apps
Service worker handles push notifications
Touch-optimized mobile UI
Vibration feedback on notifications
Standalone display mode (no browser chrome)

4. Graceful Degradation Philosophy

We made a critical decision early: never show stale data.

When a region is unavailable, the system returns partial results from healthy regions with a clear indication that we have a problem. No cached data, no "last known state," no silent failures—just the available data.

Our degradation strategy:

Each regional request has a 5-second timeout
Failed requests trigger one automatic retry with exponential backoff
Partial results from available regions are returned seamlessly
Regional status metadata is tracked for observability
Alerts would fire if regions become unavailable

5. Observability from the Start

We instrumented everything from day one using distributed tracing, and it paid off almost immediately.

What we track:

End-to-end request tracing across regions, including per-region latency breakdowns
Error rates, failure patterns, and timeout frequency per region
Aggregation layer overhead (how long merging adds on top of raw query time)

What we learned:

During our first weeks in production, tracing revealed that one of our regional clusters consistently added 300-400ms more latency than the other. The traces pinpointed the issue to a misconfigured connection pool that was limiting concurrent database connections in that cluster. Without per-region latency tracking, this would have silently degraded the experience for all users—since the aggregation layer waits for the slowest region to respond.

We also discovered that certain case types triggered heavier database queries in specific regions (customers with larger event volumes naturally produce more complex investigations). Tracing let us identify these patterns and optimize the queries before they became a problem at scale.

Why it matters:

With data split across regions and multiple services involved in every request, understanding system behavior requires tracing. In a distributed aggregation system, a slowdown in any single region becomes a slowdown for everyone. Observability isn't optional—it's what turns a black box into a debuggable system.

Part IV: The Results

Operational Improvements

Before:

200+ Slack messages per day in a single channel
No prioritization or sorting
No assignment visibility
Hard to track SLA breach
Constant screen watching required

After:

Smart queue with priority-based tabs
Real-time SLA indicators showing time remaining
Assignment tracking with visual indicators
Historical view with search and filtering
Push notifications only for actionable items during work hours

Performance Metrics

Scalability: Handles hundreds of cases across regions seamlessly
Real-time: keeping the data fresh without overwhelming the backend
Mobile: Full feature parity on iOS and Android
Availability: Graceful degradation means partial regional outages don't stop operations

Team Impact

For Analysts:

Cases automatically sorted by severity and age
Clear visual indicators for investigation status
Mobile access enables on-call response from anywhere
Smart notifications reduce alert fatigue

For SOC Managers:

Dashboard view of all active cases and assignments
Real-time SLA compliance tracking
Workload distribution visibility
Historical data for performance analysis

For the Business:

Maintained data residency compliance across all regions
Reduced MTTA by surfacing high-severity cases immediately
Improved MTTR through better coordination
Scalable architecture that grows with customer base

Part V: Lessons Learned

1. Privacy-First Architecture is Possible

We proved that you can build a unified operations platform without centralizing sensitive data. The database-free aggregation approach added complexity but delivered real business value: compliance, customer trust, and architectural flexibility. Beyond keeping data in its region, we also strip PII and filter sensitive data at the aggregation layer, so the command center only ever holds anonymized case information.Key insight: By keeping data in its original location, aggregating at query time, and anonymizing what surfaces to analysts, we solved data residency without compromising user experience.

2. Push Notifications Are a Double-Edged Sword

The key to successful push notifications isn't sending more—it's sending the right ones. Quiet hours, severity filtering, and user preferences transformed push from annoying to essential.

Key insight: Respect your users' time and attention. One well-targeted notification is worth more than a hundred ignored ones.

3. Progressive Web Apps Are Underrated

PWAs gave us 90% of native app benefits with 10% of the development effort. No App Store approvals, no separate mobile codebase, instant updates.

Key insight: For enterprise internal tools, PWAs are often the right choice. They're especially powerful when you control the deployment environment.

4. Graceful Degradation by Design

We designed the system to handle partial regional failures gracefully—continuing to serve available data rather than failing entirely. While we haven't experienced a regional outage in production, the architecture ensures analysts wouldn't lose access to data from healthy regions if one became unavailable.

Key insight: Design for graceful degradation from the start. It's much harder to add resilience patterns after the fact.

Looking Forward

We shipped the read-only view in Phase 1. The next phases will introduce:

Phase 2 - Write Operations:

Case assignment and reassignment
Status management through the workflow
Investigation notes and annotations
Team collaboration features

Phase 3 - Intelligence:

Complete audit trail and history
Advanced search across all case data
ML-powered case prioritization and routing

Conclusion

What started as a Slack channel overwhelmed by scale became an opportunity to rethink security operations from first principles. By choosing privacy-first architecture, real-time aggregation, and progressive web technologies, we built a platform that scales across regions, respects data sovereignty, and makes SOC teams more effective.

The key insight wasn't technological. It was understanding that tools shape workflows, and workflows shape outcomes. By giving our team the right tool, we didn't just reduce alert fatigue. We enabled a new way of working: proactive, organized, and measurable.

This is what it means to be an AI-enabled services company. We don't just build the best AI SOC technology; we build the operational infrastructure to run it effectively. Our platform is designed from the ground up to support a human-in-the-loop model, where elite analysts work alongside AI-driven automation in real-time.

This focus on operational effectiveness is where legacy service providers hit a glass ceiling. And it's where pure-play technology vendors fail in adoption, because they don't optimize for real-world workflow effectiveness. Building great AI is only half the battle. The other half is building the tools that let humans manage and orchestrate it at scale.

That's what we've done here. And that's worth more than any Slack integration.

‍