Building a Robust Firewall: Lessons from Protecting Millions of Websites

While building MalCare's firewall, we have seen firsthand the challenges of developing and maintaining a critical security component that runs on over half a million websites. The recent incident in which CrowdStrike's automatic channel file rollout caused widespread outages has reinforced the importance of robust, fail-safe code in high-impact systems. Today, I'd like to share our approach to writing firewall code that prevents such mistakes and ensures the safety of millions of websites.


The Stakes Are High

Our firewall runs on every request, before any part of the website is loaded. This is similar to how CrowdStrike's driver operates with high privileges. One mistake could bring down millions of websites, affecting countless businesses. This responsibility drives our paranoia-level attention to detail and robust engineering practices.

Understanding the Environment

Before diving into our solutions, it's crucial to understand our operating environment:

We run on WordPress websites using PHP
PHP is dynamically typed and prone to fatal errors
Error handling in PHP is less reliable than in languages like Ruby
Simple operations like array access without key checking can raise warnings or, in some cases, fatal errors
Arrays in PHP can act as both arrays and hashmaps
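
The array-access point can be illustrated with a minimal sketch. The helper name getNested is hypothetical and not taken from the plugin; it walks a nested array one key at a time and returns a default as soon as a level is missing or is not an array.

```php
<?php
// Hypothetical helper (not from the plugin): walk a nested array one key at a
// time and fall back to a default instead of touching a missing key.
function getNested($data, array $path, $default = null) {
    foreach ($path as $key) {
        if (!is_array($data) || !array_key_exists($key, $data)) {
            return $default;
        }
        $data = $data[$key];
    }
    return $data;
}

$params = array('x' => array('y' => array('z' => '<script>')));

$found    = getNested($params, array('x', 'y', 'z'));        // value present
$missing  = getNested($params, array('x', 'nope', 'z'), ''); // key absent
$notArray = getNested(array('x' => 'scalar'), array('x', 'y'), ''); // wrong shape
```

Checking is_array() at every level matters because request data can have any shape: a parameter expected to be a nested array may arrive as a plain string.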

Why Firewalls Are Critical and Prone to Failures

Firewalls need to stay up-to-date with the latest attacks, often requiring custom handling for unique threats. These updates need to be released rapidly to all websites, sometimes within hours of discovering a new vulnerability. Any mistake in this process can have catastrophic consequences.

Example Firewall Rule:

if (regex_match(param["x"]["y"]["z"], xss_regex) && !is_current_user_admin && equals(path, "/wp-admin/admin-ajax.php"))
    BLOCK;
end

This rule checks for XSS attempts in a specific parameter, ensures the user is not an admin, and only applies to a specific WordPress path. Let's use this as a reference as we explore our engineering principles.

Engineering Principles for Robust Code

1. Handle Incorrect Configurations Gracefully

Mistake Prevented: Crashing due to malformed configuration files.

Principle: Always assume that input data, including configurations, can be incorrect or malformed.

Example: Parsing configuration files safely

public static function parseFile($fname) {
    $result = array();
    if (file_exists($fname)) {
        $content = file_get_contents($fname);
        if (($content !== false) && is_string($content)) {
            $result = json_decode($content, true);
            if (!is_array($result)) {
                $result = array();
            }
        }
    }
    return $result;
}

Explanation: This code ensures that even if the config file is missing, empty, or malformed, we return an empty array instead of crashing. In the context of our example rule, if the configuration containing the XSS regex patterns is corrupted, this function would return an empty array, preventing a fatal error.
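
To see the three failure cases side by side, here is a small runnable sketch. The wrapper class DemoConfigReader and the temporary file names are assumptions made only so the method can run on its own; the method body is the one shown above.

```php
<?php
// Hypothetical wrapper class so the method can run outside the plugin.
class DemoConfigReader {
    public static function parseFile($fname) {
        $result = array();
        if (file_exists($fname)) {
            $content = file_get_contents($fname);
            if (($content !== false) && is_string($content)) {
                // json_decode() returns null for malformed JSON.
                $result = json_decode($content, true);
                if (!is_array($result)) {
                    $result = array();
                }
            }
        }
        return $result;
    }
}

$dir  = sys_get_temp_dir();
$good = tempnam($dir, 'cfg');
$bad  = tempnam($dir, 'cfg');
file_put_contents($good, '{"admincookiemode": 1}');
file_put_contents($bad, '{"admincookiemode": '); // truncated JSON

$fromGood    = DemoConfigReader::parseFile($good);
$fromBad     = DemoConfigReader::parseFile($bad);
$fromMissing = DemoConfigReader::parseFile($dir . '/no-such-config-file.json');

unlink($good);
unlink($bad);
```

Only the well-formed file produces a populated array; the truncated and missing files both come back as an empty array, so callers never need a separate error path.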

2. Strict Type Checking and Input Validation

Mistake Prevented: Runtime errors due to incorrect data types.

Principle: Never trust input data. Always validate and sanitize.

Example: Enforcing strict type checking in function parameters

private function processRuleFunctionParams($func_name, $args_cnt, $args, $required_params = 0, $param_types = array()) {
    // Reject calls that supply fewer arguments than the rule function requires.
    if ($args_cnt < $required_params) {
        throw new BVProtectRuleError_V549(
            $this->addExState("ArgumentCountError: Too few arguments for " . $func_name)
        );
    }
    foreach ($param_types as $pos => $type) {
        // Type checking logic here
    }
    // Hand the validated arguments back; callers read them by position.
    return $args;
}

Explanation: This wrapper function ensures that all function calls in our rule engine have the correct number and types of arguments. For our example rule, it would validate that param["x"]["y"]["z"] is actually a string before passing it to the regex_match function, preventing type-related errors.
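
The type check itself is elided in the snippet above. One way it could look is sketched below; this is an assumption, not the plugin's implementation. The function name checkParamTypes, the use of gettype() names as the type vocabulary, and InvalidArgumentException as the error type are all hypothetical.

```php
<?php
// Hypothetical stand-in for the elided type check: $paramTypes maps an
// argument position to the gettype() name expected at that position.
function checkParamTypes($funcName, array $args, array $paramTypes) {
    foreach ($paramTypes as $pos => $type) {
        if (!array_key_exists($pos, $args)) {
            continue; // optional argument not supplied
        }
        if (gettype($args[$pos]) !== $type) {
            throw new InvalidArgumentException(
                $funcName . ': argument ' . $pos . ' must be ' . $type
            );
        }
    }
    return $args;
}

$ok = checkParamTypes('regex_match', array('abc', '/b/'), array('string', 'string'));

$rejected = false;
try {
    // A request such as x[y][z][]=1 turns the expected string into an array.
    checkParamTypes('regex_match', array(array('1'), '/b/'), array('string', 'string'));
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Rejecting the array before it reaches a regex function turns a potential type error deep inside the rule into a controlled, catchable rule error.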

3. Sanitize External Data

Mistake Prevented: Security vulnerabilities from unsafe input data.

Principle: Always convert input data to known, safe types.

Example: Sanitization function for external data

private static function toAllowedType($value, $depth = 1) {
    if ($depth > self::MAX_DEPTH_TO_ALLOWED_TYPE_FUNC) {
        return null;
    }
    switch (gettype($value)) {
        case 'NULL': // gettype() reports null as the uppercase string 'NULL'
        case 'boolean':
        case 'integer':
        case 'double':
        case 'string':
            return $value;
        case 'array':
            $array_value = [];
            foreach ($value as $key => $val) {
                $array_value[$key] = self::toAllowedType($val, $depth + 1);
            }
            return $array_value;
        // ... other cases
    }
}

Explanation: This function recursively sanitizes input data, ensuring only allowed types are used in our system. In our example rule, it would ensure that param["x"]["y"]["z"] is converted to a safe type (likely a string) before being processed, preventing potential exploits through unexpected data types.
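
A runnable sketch of the same idea follows. Two details are assumptions, since the original elides them: the depth limit of 3, and the default branch that maps every other type (objects, resources) to null. The class name DemoSanitizer is hypothetical.

```php
<?php
// Assumptions for this sketch: a depth limit of 3, and every type not listed
// explicitly (objects, resources) is replaced by null.
class DemoSanitizer {
    const MAX_DEPTH = 3;

    public static function toAllowedType($value, $depth = 1) {
        if ($depth > self::MAX_DEPTH) {
            return null;
        }
        switch (gettype($value)) {
            case 'NULL': // gettype() returns 'NULL' in uppercase
            case 'boolean':
            case 'integer':
            case 'double':
            case 'string':
                return $value;
            case 'array':
                $clean = array();
                foreach ($value as $key => $val) {
                    $clean[$key] = self::toAllowedType($val, $depth + 1);
                }
                return $clean;
            default:
                return null;
        }
    }
}

$input = array(
    'a' => 1,
    'b' => new stdClass(),
    'c' => array('d' => array('e' => array('too deep'))),
);
$clean = DemoSanitizer::toAllowedType($input);
```

In this run the scalar under 'a' survives, the object under 'b' becomes null, and the value under 'e' is cut off because it sits one level past the limit. The depth cap also bounds recursion on deeply nested attacker-controlled input.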

4. Prevent Premature WordPress Function Execution

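Mistake Prevented: Load-order problems, where a WordPress function is called before it is defined or ready to use.

Principle: Respect the WordPress loading sequence, and call a WordPress function only once the part of WordPress it depends on has loaded.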

Example: Checking for WordPress readiness before executing functions

private function _rf_currentUserCan() {
    $args = $this->processRuleFunctionParams(
        'currentUserCan',
        func_num_args(),
        func_get_args(),
        1,
        ['string']
    );
    $capability = $args[0];

    if (!function_exists('current_user_can') || !BVProtectUtils_V549::havePluginsLoaded()) {
        throw new BVProtectRuleError_V549(
            $this->addExState("currentUserCan: Required funcs doesn't exist.")
        );
    }

    // Execute current_user_can() only if it's safe to do so
    return current_user_can($capability);
}

public static function havePluginsLoaded() {
    return (function_exists('did_action') && (did_action('plugins_loaded') > 0));
}

Explanation: This code ensures that WordPress functions like current_user_can() are only called when it's safe to do so. In our example rule, the !is_current_user_admin check would use this function, preventing errors if the firewall runs before WordPress is fully loaded.

5. Version-Specific Namespacing

Mistake Prevented: Conflicts between different plugin versions.

Principle: Give every release its own class names, so that code from two versions loaded in the same request never collides.

Example: Namespaced class names

class BVProtectRuleError_V549 extends Exception {
    // Class implementation
}

Explanation: By including the version number in class names, we ensure that different versions of the plugin can coexist without interfering with each other. This is crucial for maintaining backwards compatibility and allowing smooth updates.
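
A minimal sketch of the naming scheme, with hypothetical class names and a class_exists() guard added as an assumption (the guard protects against the same file being included twice):

```php
<?php
// Hypothetical class names. Each release declares its classes under its own
// suffix, so declaring both in one request does not trigger a
// "Cannot declare class" fatal error.
if (!class_exists('DemoRuleError_V1')) {
    class DemoRuleError_V1 extends Exception {}
}
if (!class_exists('DemoRuleError_V2')) {
    class DemoRuleError_V2 extends Exception {
        public function getRuleVersion() {
            return 2;
        }
    }
}

$old = new DemoRuleError_V1('raised by the old version');
$new = new DemoRuleError_V2('raised by the new version');
```

Because the two classes are unrelated types, a catch block written for one version never accidentally handles errors raised by the other.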

6. Whitelist Required Functions, Blacklist Everything Else

Mistake Prevented: Execution of unintended or unsafe functions.

Principle: Limit the capabilities of the rule engine to only what's necessary.

Example: Function whitelisting in the rule engine

private $allowed_functions = [
    'regex_match',
    'is_current_user_admin',
    'equals'
    // ... other allowed functions
];

private function executeFunction($func_name, $args) {
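    // Strict comparison: without it, older PHP versions treat 0 as equal to any non-numeric string.
    if (!in_array($func_name, $this->allowed_functions, true)) {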
        throw new BVProtectRuleError_V549("Function not allowed: " . $func_name);
    }
    // Execute the function
}

Explanation: This ensures that only explicitly allowed functions can be used in firewall rules. In our example rule, regex_match, is_current_user_admin, and equals would be whitelisted, preventing the introduction of potentially dangerous functions through rule updates.
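
One way to wire the allowlist to actual implementations is a lookup table of closures, sketched below. This is an illustration under assumptions rather than the plugin's dispatch code: the class DemoRuleEngine, its call() method, and the two closures are hypothetical.

```php
<?php
// Hypothetical rule-function table: names map to closures, and anything not
// in the table is rejected before it can be called.
class DemoRuleEngine {
    private $functions;

    public function __construct() {
        $this->functions = array(
            'equals' => function ($a, $b) {
                return $a === $b;
            },
            'regex_match' => function ($value, $pattern) {
                return is_string($value) && preg_match($pattern, $value) === 1;
            },
        );
    }

    public function call($name, array $args) {
        if (!is_string($name) || !array_key_exists($name, $this->functions)) {
            throw new RuntimeException('Function not allowed');
        }
        return call_user_func_array($this->functions[$name], $args);
    }
}

$engine  = new DemoRuleEngine();
$matched = $engine->call('equals', array('/wp-admin/admin-ajax.php', '/wp-admin/admin-ajax.php'));

$blocked = false;
try {
    // system() is a real PHP function, but it is not in the table.
    $engine->call('system', array('id'));
} catch (RuntimeException $e) {
    $blocked = true;
}
```

Because the name is only ever used as an array key, a rule can never reach a PHP function that was not deliberately added to the table.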

7. Graceful Degradation

Mistake Prevented: Total system failure due to partial issues.

Principle: Prefer reduced functionality over complete failure.

Example: Error handling in the main firewall execution

try {
    $result = $this->executeFirewallRules();
    if ($result === 'BLOCK') {
        $this->blockRequest();
    }
} catch (BVProtectRuleError_V549 $e) {
    $this->logError($e);
    // Allow the request to pass through instead of blocking the site
}

Explanation: If an error occurs during rule execution, we log it and allow the request to pass through instead of potentially blocking legitimate traffic or breaking the site. In the context of our example rule, if any part of the rule evaluation fails (e.g., an invalid regex), the firewall would log the error and allow the request rather than incorrectly blocking it.
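
The fail-open behavior can be sketched end to end as follows. Everything here is hypothetical: the function demoEvaluate, the two sample rules, and the $errors array standing in for real logging.

```php
<?php
// Hypothetical rule runner. Each rule is a callable returning true to block.
// Any exception during evaluation fails open: the request is allowed.
function demoEvaluate(array $rules, array $request, array &$errors) {
    try {
        foreach ($rules as $rule) {
            if (call_user_func($rule, $request) === true) {
                return 'BLOCK';
            }
        }
    } catch (Exception $e) {
        $errors[] = $e->getMessage(); // stand-in for real logging
    }
    return 'ALLOW';
}

$xssRule = function ($request) {
    return isset($request['q']) && is_string($request['q'])
        && preg_match('/<script/i', $request['q']) === 1;
};
$brokenRule = function ($request) {
    // preg_match() returns false for an invalid pattern; surface it as an error.
    if (@preg_match('/(unclosed/', 'x') === false) {
        throw new RuntimeException('invalid regex in rule');
    }
    return false;
};

$errors   = array();
$attack   = demoEvaluate(array($xssRule), array('q' => '<script>alert(1)</script>'), $errors);
$degraded = demoEvaluate(array($brokenRule, $xssRule), array('q' => 'hello'), $errors);
```

Note the trade-off in this sketch: once a rule throws, the remaining rules are skipped for that request. On PHP 7 and later, catching Throwable instead of Exception would also cover engine-level errors such as TypeError.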

Systems & Processes as Another Layer of Safety

Our Testing Process: 10,000+ Test Cases and Automation

Mistake Prevented: Undetected bugs and logic errors.

It is crucial to test that a rule blocks only requests carrying the bad parameters and leaves admin and other unrelated traffic alone. Missed corner cases lead to false positives. Automated tests make rules easier to review and let us run the same cases on QA environments, across different user roles, and across PHP versions.
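
As an illustration of what such cases can look like, here is a small table-driven sketch for the example rule from earlier. The PHP rendering of the rule (demoRule), the request array shape, and the regex are assumptions made for the sake of a runnable example.

```php
<?php
// Hypothetical PHP rendering of the example rule.
function demoRule(array $request) {
    $value = isset($request['params']['x']['y']['z'])
        ? $request['params']['x']['y']['z']
        : null;
    return is_string($value)
        && preg_match('/<script\b/i', $value) === 1
        && empty($request['is_admin'])
        && isset($request['path'])
        && $request['path'] === '/wp-admin/admin-ajax.php';
}

$payload = array('x' => array('y' => array('z' => '<script>alert(1)</script>')));
$benign  = array('x' => array('y' => array('z' => 'hello')));
$asArray = array('x' => array('y' => array('z' => array('<script>'))));
$ajax    = '/wp-admin/admin-ajax.php';

// Each case: request, then whether the rule is expected to block it.
$cases = array(
    'attack from visitor'     => array(array('params' => $payload, 'path' => $ajax, 'is_admin' => false), true),
    'same payload as admin'   => array(array('params' => $payload, 'path' => $ajax, 'is_admin' => true), false),
    'same payload elsewhere'  => array(array('params' => $payload, 'path' => '/', 'is_admin' => false), false),
    'benign value'            => array(array('params' => $benign, 'path' => $ajax), false),
    'array instead of string' => array(array('params' => $asArray, 'path' => $ajax), false),
    'parameter missing'       => array(array('params' => array(), 'path' => $ajax), false),
);

$failures = array();
foreach ($cases as $name => $case) {
    if (demoRule($case[0]) !== $case[1]) {
        $failures[] = $name;
    }
}
```

Most of the table consists of requests that must not be blocked, which is where false positives hide.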

Phased and Controlled Rollouts of New Features

Mistake Prevented: Rolling out untested or problematic updates.

All new features are disabled by default. We only initialize values from config if the correct data type is present.

private $admin_cookie_mode = BVProtectFW_V549::ADMIN_COOKIE_MODE_DISABLED;

if (array_key_exists('admincookiemode', $config) && is_int($config['admincookiemode'])) {
    $this->admin_cookie_mode = $config['admincookiemode'];
}

Creating a Robust Testing Process

Mistake Prevented: Compatibility issues across different environments.

Our automated testing tool allows us to run tests on multiple environments with minimal overhead. This process is mandatory, and we track these tests with Google Sheets to ensure compatibility with older environments.

Phased Rollouts of New Rules

Mistake Prevented: Catastrophic failures on all production sites simultaneously.

We enable new rules on selected sites in production:

Enable on our testing sites
Enable on 100 sites
Enable on another 1000 sites, and so on.
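
The staged sequence above selects specific sites. A related technique, shown here only as a hedged sketch and not as a description of how MalCare picks sites, is deterministic bucketing: hash a site identifier into a fixed number of buckets and enable a rule for buckets below a threshold that is raised over time. The function name, the rule and site identifiers, and the bucket count of 10,000 are all hypothetical.

```php
<?php
// Hypothetical rollout gate: hash rule and site identifiers into one of
// 10,000 buckets and enable the rule only for buckets below the threshold.
function demoRuleEnabled($siteId, $ruleId, $enabledBuckets) {
    // Mask to 31 bits so the value is non-negative on 32-bit builds too.
    $bucket = (crc32($ruleId . ':' . $siteId) & 0x7fffffff) % 10000;
    return $bucket < $enabledBuckets;
}

$sites = array('site-a.example', 'site-b.example', 'site-c.example');

$noneYet = array_filter($sites, function ($site) {
    return demoRuleEnabled($site, 'rule-42', 0);
});
$everyone = array_filter($sites, function ($site) {
    return demoRuleEnabled($site, 'rule-42', 10000);
});
```

Because the bucket depends only on the identifiers, a site that received the rule at one stage keeps it at every later, larger stage, and including the rule identifier in the hash means different rules start with different sites.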

Coding Standards and Conventions

Mistake Prevented: Human error and inconsistent implementations.

We enforce restrictions by default in the architecture to minimize the chances of missing checks.

Multi-Level Code Reviews

Mistake Prevented: Oversights and errors slipping through.

Any code that gets pushed into the firewall needs to pass multiple rounds of review by selected individuals.

Configuring VSCode for Compatibility

Mistake Prevented: Compatibility issues with PHP/WordPress functions.

Configuring VSCode to highlight compatibility issues helps catch problems early.

Error Handling for Older Versions

Mistake Prevented: Crashing due to config mismatches.

Our code is designed to read and initialize config only if it's in the correct format, validating nested JSON values at each point.

Logging, Monitoring & Alerts

Logging Violations

Mistake Prevented: Ignoring or missing critical errors.

Our code logs violations with metadata, sending them to our systems for analysis.

Monitoring Stats and Alerts

Mistake Prevented: Failure to detect and respond to issues promptly.

We collect logs and errors, and review them through an admin dashboard, ensuring that we are the first to know if something goes wrong.

Systems to Recover Hosts if Something Goes Wrong

Remote Firewall Disabling

Mistake Prevented: Inability to recover from critical errors.

The firewall will skip rule execution for requests sent through our systems, allowing us to remotely disable components and recover a site.

NOTE: Feel free to check out the MalCare plugin's source to see these principles in action.

Conclusion

I hope these principles help you build robust systems of your own.
