CrowdStrike Chaos: A Wake-Up Call for Software Quality

Testinium Engineering Team • 21 July 2024

By now, most of you have heard about the recent CrowdStrike issue on Microsoft Windows devices, which brought a significant part of the software world to a standstill. The incident exposed weaknesses in critical update mechanisms and underscored the importance of robust software quality processes.

Drawing on CrowdStrike’s Preliminary Post Incident Review, let’s briefly look at what happened and then reflect on how we can apply these lessons to our own practices. Examining the causes and consequences of this event points to key strategies for fortifying our systems against similar disruptions in the future.

What Happened?

Preliminary Post Incident Review: Configuration Update Impacting Falcon Sensor and Windows OS

Incident Overview

CrowdStrike released a content configuration update for the Windows sensor on July 19, 2024, aimed at gathering telemetry on new threat techniques. The update instead caused a system crash (Blue Screen of Death, BSOD) on Windows hosts running sensor version 7.11 and above that were online between 04:09 and 05:27 UTC and received the update. CrowdStrike reverted the update at 05:27 UTC, 78 minutes after release.

Key Details

Date & Time: July 19, 2024, 04:09 – 05:27 UTC

Affected Systems: Windows hosts running sensor version 7.11+ (Mac and Linux hosts were unaffected)

Resolution: Update reverted at 05:27 UTC

Root Cause

The problem stemmed from a Rapid Response Content update, a type of dynamic update that allows quick adaptation to emerging threats. An undetected error in this update led to the system crash.

Update Delivery Mechanisms

Sensor Content: Long-term capabilities delivered with sensor releases, including on-sensor AI and machine learning models. These updates undergo extensive testing.
Rapid Response Content: Behavioral pattern-matching updates delivered dynamically. These updates include Template Instances, which configure the sensor to detect specific behaviors.

Testing and Deployment Process

Sensor Content: Includes thorough automated and manual testing, staged rollout, and customer-controlled deployment.
Rapid Response Content: Deployed through Channel Files, interpreted by the sensor’s Content Interpreter and Detection Engine. While newly released Template Types are stress tested, an error in the Content Validator allowed problematic content to pass undetected.

Specifics of the Incident

Trigger: Deployment of two new IPC (Inter-Process Communication) Template Instances on July 19, 2024. A bug in the Content Validator allowed one of these instances to pass validation even though it contained problematic content data.

Effect: When the sensor loaded the problematic content into the Content Interpreter, it caused an out-of-bounds memory read. The resulting exception could not be handled gracefully, and the operating system crashed.
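To make the failure mode concrete, here is a minimal sketch of the two safeguards involved: a validator that rejects content with the wrong shape before it ships, and an interpreter-side read that checks bounds instead of trusting the content. Everything in it is hypothetical (the `TemplateInstance` class, the field names, the field count of 3); it is not CrowdStrike's code or data format. Python also raises an `IndexError` where kernel-mode native code would read arbitrary memory, so the sketch only illustrates the logic, not the severity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    """Hypothetical stand-in for a piece of dynamically delivered content."""
    name: str
    fields: tuple[str, ...]  # values the interpreter will read by index


def validate(instance: TemplateInstance, expected_fields: int) -> list[str]:
    """Return a list of problems; an empty list means the instance may ship."""
    problems = []
    if len(instance.fields) != expected_fields:
        problems.append(
            f"{instance.name}: expected {expected_fields} fields, "
            f"got {len(instance.fields)}"
        )
    return problems


def read_field(instance: TemplateInstance, index: int) -> str | None:
    """Bounds-checked read: refuse a bad index instead of reading past the end."""
    if 0 <= index < len(instance.fields):
        return instance.fields[index]
    return None  # caller can skip this instance rather than crash


good = TemplateInstance("instance-a", ("pipe", "create", "*"))
bad = TemplateInstance("instance-b", ("pipe", "create"))  # one field short

print(validate(good, expected_fields=3))  # []
print(validate(bad, expected_fields=3))   # one problem reported
print(read_field(bad, 2))                 # None
```

The two checks are deliberately redundant. If the validator has a bug, as it did in this incident, the bounds check in the reader is what keeps bad content from becoming a crash.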

Preventative Measures

To prevent similar incidents, CrowdStrike will:

Enhance Testing: Introduce more comprehensive testing methods, including stress testing, fuzzing, and fault injection.
Improve Content Validator: Add checks to prevent problematic content from being deployed.
Strengthen Error Handling: Enhance the Content Interpreter’s ability to handle exceptions.
Deployment Strategy: Implement staggered deployments for Rapid Response Content, starting with canary deployments and collecting feedback before broader rollout.
Customer Control: Provide more control over update deployments and detailed release notes.
Third-Party Reviews: Conduct independent security code reviews and quality process evaluations.
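The staggered deployment idea in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the `staged_rollout` function, the stage sizes (1%, 10%, 100%), and the zero-tolerance failure threshold are all invented for the example, and a real rollout would also wait and collect telemetry between stages.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 100),
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to growing slices of hosts; halt when a stage is unhealthy.

    Returns (hosts the update reached, whether the rollout completed).
    """
    reached: list[str] = []
    for percent in stage_percents:
        # Cumulative number of hosts that should have the update after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(reached):target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        reached.extend(batch)
        if failures / len(batch) > max_failure_rate:
            return reached, False  # stop before the next, larger stage
    return reached, True


hosts = [f"host-{i}" for i in range(1000)]
# Simulated update that crashes every host it reaches.
reached, completed = staged_rollout(hosts, deploy=lambda host: False)
print(len(reached), completed)  # 10 False
```

With a canary stage in place, the simulated bad update reaches 10 of the 1,000 hosts before the rollout halts, rather than all of them at once.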
