ChaosKit

A modular Go framework for chaos engineering, fault injection, and reliability testing of distributed systems, libraries, and services.

Overview

ChaosKit enables systematic testing of system reliability and resilience through controlled fault injection and invariant validation.

Important:

ChaosKit is designed for proactive chaos engineering: it works best when integrated into your code from the start. Most injection methods require adding chaos hooks to your code. The exception is ToxiProxy, which injects network chaos without code changes.

Philosophy

Build resilient systems by designing them to be testable with chaos from day one, rather than retrofitting chaos testing into existing code.

When to Use

  • New libraries and frameworks
  • Workflow engine testing
  • Rollback mechanism validation
  • Resource leak detection
  • Network resilience testing

Key Capabilities

Powerful tools for testing the reliability of your systems

Controlled Chaos Injection

Introduce delays, panics, resource pressure, and network faults to test system resilience.

Multiple Injection Methods

Context-based, monkey patching, failpoints, and network proxies for various testing scenarios.

Invariant Validation

Verify system properties: bounded recursion, absence of infinite loops, absence of resource leaks.

Continuous Testing

Run long-duration stress tests to discover edge cases and hidden issues.

Comprehensive Metrics

Automatic collection of execution statistics and performance data for analysis.

Flexible Configuration

Customize behavior through the functional options pattern, and assemble scenarios with a builder for convenient usage.

Injection Methods

Choose the right method for your use case

1. Context-Based Injection (Recommended)

Recommended for new code

Call chaos functions explicitly in your code. This gives fine-grained control with low overhead, and the hooks can stay in production code.

Requires code changes

2. Failpoint Injection (Recommended)

Recommended for production

Compile-time injection points. Production-safe: without the build tag, each failpoint compiles to a no-op.

Requires code changes

3. ToxiProxy (Recommended)

Recommended for network chaos

Proxy network connections. Works without code changes and exercises real network conditions.

No code changes

4. Monkey Patching

Limited use cases

Runtime function replacement. Only works with package-level function variables.

Requires building with -gcflags=all=-l (disables inlining)

Comparison Table

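Method        | Code Changes | Works with Existing Code | Production-Safe | Performance | Recommendation
--------------|--------------|--------------------------|-----------------|-------------|---------------
Context-Based | Required     | No                       | Yes             | Excellent   | New projects
Failpoints    | Required     | No                       | Yes (no-op)     | Excellent   | Production
ToxiProxy     | None         | Yes                      | Yes             | Good        | Network
Monkey Patch  | Specific     | Rarely                   | Never           | Poor        | Avoid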

Core Components

Modular architecture for flexible testing

Chaos Injectors

  • DelayInjector - random delays
  • PanicInjector - random panics
  • CPUInjector - CPU stress
  • MemoryInjector - memory pressure
  • ToxiProxy Injectors - network chaos
  • CompositeInjector - combine injectors

Validators

  • PanicRecoveryValidator - recovery validation
  • RecursionDepthValidator - recursion depth
  • GoroutineLeakValidator - goroutine leaks
  • SlowIterationValidator - slow iterations
  • ExecutionTimeValidator - execution time
  • MemoryLimitValidator - memory limits

Quick Start

Get started with ChaosKit in minutes

package main

import (
    "context"
    "log"
    "time"

    "github.com/rom8726/chaoskit"
    "github.com/rom8726/chaoskit/injectors"
    "github.com/rom8726/chaoskit/validators"
)

// Define your system under test
type WorkflowEngine struct{}

func (w *WorkflowEngine) Name() string { return "workflow-engine" }
func (w *WorkflowEngine) Setup(ctx context.Context) error { return nil }
func (w *WorkflowEngine) Teardown(ctx context.Context) error { return nil }

// Define execution step with context-based chaos
func ExecuteWorkflow(ctx context.Context, target chaoskit.Target) error {
    // Inject chaos directly in your code
    if chaoskit.MaybePanic(ctx) {
        panic("chaos: intentional panic")
    }
    
    chaoskit.MaybeDelay(ctx) // May add delay here
    
    // Your workflow execution logic
    time.Sleep(10 * time.Millisecond)
    return nil
}

func main() {
    engine := &WorkflowEngine{}

    scenario := chaoskit.NewScenario("reliability-test").
        WithTarget(engine).
        Step("execute-workflow", ExecuteWorkflow).
        Inject("delay", injectors.RandomDelay(5*time.Millisecond, 25*time.Millisecond)).
        Inject("panic", injectors.PanicProbability(0.01)).
        Assert("goroutine-limit", validators.GoroutineLimit(200)).
        Assert("recursion-depth", validators.RecursionDepthLimit(100)).
        Assert("no-infinite-loop", validators.NoInfiniteLoop(5*time.Second)).
        Repeat(100).
        Build()

    if err := chaoskit.Run(context.Background(), scenario); err != nil {
        log.Fatalf("Scenario execution failed: %v", err)
    }
}

Installation

go get github.com/rom8726/chaoskit

Requires Go 1.25 or later.

Examples

  • simple - basic reliability testing
  • continuous - continuous testing
  • chaos_context - context-based injection
  • monkey_patch - monkey patching
  • toxiproxy - network chaos
  • floxy_stress_test - comprehensive stress test

Usage Patterns

Different ways to use ChaosKit for reliability testing

Basic Reliability Testing

scenario := chaoskit.NewScenario("basic-test").
    WithTarget(workflowEngine).
    Step("execute", ExecuteWorkflow).
    Assert("no-leaks", validators.GoroutineLimit(100)).
    Repeat(1000).
    Build()

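// Check the result instead of discarding it, as in the Quick Start example.
if err := chaoskit.Run(ctx, scenario); err != nil {
    log.Fatalf("Scenario execution failed: %v", err)
}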

Continuous Testing

scenario := chaoskit.NewScenario("continuous-test").
    WithTarget(engine).
    Step("execute", ExecuteWorkflow).
    Inject("chaos", injectors.CompositeInjector(...)).
    Assert("stability", validators.GoroutineLimit(500)).
    RunFor(24*time.Hour).
    Build()

Load Testing with Validation

scenario := chaoskit.NewScenario("load-test").
    WithTarget(engine).
    Step("execute", ExecuteWorkflow).
    Inject("cpu-stress", injectors.NewCPUInjector(4, 500*time.Millisecond)).
    Assert("performance", validators.NewExecutionTimeValidator(100*time.Millisecond)).
    Repeat(10000).
    Build()