✨ Hands-On Tutorial

Polly Resilience (Polly v8): Timeouts, Retries, Circuits, Bulkheads, Hedging

January 17, 2026 24 min read Advanced

Your payment service is down. Without resilience, 10,000 requests pile up, threads block, and your entire API falls over. With Polly, circuits open in 50ms, fallbacks return cached data, and bulkheads isolate the blast radius. Your API stays up. This isn't defensive coding. It's production reality.

Most Polly tutorials show you WaitAndRetry and stop. Production systems need coordinated timeouts, jittered backoffs, circuit breakers with health checks, bulkhead isolation, and hedged requests. You could patch policies together and hope. Or you could learn the patterns that survive outages. This tutorial shows you the battle-tested patterns.

What You'll Build

You'll build a production-ready Orders API that calls Inventory and Payments services with complete resilience:

Timeouts with per-request and per-try budgets, cooperative cancellation across layers
Jittered retries with decorrelated backoff, idempotency guards, and safe verb checks
Circuit breakers with failure-rate thresholds, half-open probes, and health endpoints
Fallbacks with cache-based responses, partial degradation, and error transparency
Bulkhead isolation with separate pools for third-party vs internal calls, queue metrics
Hedging for tail-latency reduction with parallel requests and winner cancellation
Policy composition with correct ordering (Timeout → Retry → Circuit → Fallback)
Per-endpoint policies using Minimal API metadata and endpoint conventions
OpenTelemetry with policy events, custom counters, and distributed tracing
Chaos testing with failure injection, variable latency, and SLO validation

Why Resilience in Cloud-Native .NET

Cloud services fail. Networks hiccup. Dependencies restart. Without resilience, transient failures cascade into full outages. Your API becomes the weakest link. Resilience patterns keep you up when everything around you is falling down.

Transient Faults Are the Norm

Cloud providers guarantee 99.9% uptime. That's 43 minutes of downtime per month. Your app makes 100 dependency calls per request. One failing dependency kills availability. Timeouts prevent hanging. Retries smooth over blips. Circuit breakers stop the bleeding.

SLAs and Error Budgets

If your service promises 99.95% uptime (21 minutes/month), and each dependency is 99.9%, you burn through your error budget fast. Resilience patterns (retries, fallbacks) help you meet SLAs even when dependencies don't.

Polly v8 Changes Everything

Polly v8 is a full rewrite. The old ISyncPolicy and IAsyncPolicy are gone. Pipelines replace PolicyWrap. Telemetry is built-in. The API is cleaner, faster, and works seamlessly with HttpClientFactory and OpenTelemetry.

Setup & First Policy

Start with a minimal API that calls a downstream service. You'll add a basic timeout policy first, then layer on retries, circuits, and fallbacks.

Install Polly v8

Polly v8 is modular. Core policies live in Polly.Core. HTTP integration is in Polly.Extensions.Http. Telemetry hooks into OpenTelemetry via Polly.Extensions.

Terminal

dotnet new webapi -n OrdersApi -minimal
cd OrdersApi
dotnet add package Polly.Core --version 8.0.0
dotnet add package Polly.Extensions.Http --version 8.0.0
dotnet add package Microsoft.Extensions.Http.Polly --version 8.0.0
dotnet add package Microsoft.Extensions.Http.Resilience --version 8.0.0

Register HttpClient with a Typed Client

Use a typed client for clean dependency injection. Register policies via AddResilienceHandler from the new resilience extensions.

Program.cs - Typed Client Setup

using Polly;
using Polly.Timeout;

var builder = WebApplication.CreateBuilder(args);

// Register typed HttpClient with timeout policy
builder.Services.AddHttpClient(client =>
{
    client.BaseAddress = new Uri("https://inventory-api.example.com");
    client.Timeout = TimeSpan.FromSeconds(30); // overall timeout
})
.AddResilienceHandler("inventory-pipeline", (builder, context) =>
{
    // Per-request timeout (inside retries)
    builder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(3),
        Name = "inventory-timeout"
    });
});

var app = builder.Build();

app.MapGet("/orders/{id}", async (int id, InventoryClient inventory) =>
{
    var stock = await inventory.GetStockAsync(id);
    return Results.Ok(new { OrderId = id, Stock = stock });
});

app.Run();

// Typed client
public class InventoryClient(HttpClient http)
{
    public async Task GetStockAsync(int productId, CancellationToken ct = default)
    {
        var response = await http.GetAsync($"/api/stock/{productId}", ct);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadFromJsonAsync(ct);
    }
}

Per-Try vs Overall Timeout

HttpClient.Timeout is the overall budget. The Polly timeout inside retries is per-try. Set HttpClient.Timeout higher than (retry_count * per_try_timeout) to avoid premature cancellation.

Timeouts & Cancellation

Timeouts are your first line of defense. Without them, slow dependencies block threads indefinitely. With timeouts, you fail fast and free resources. Polly's timeout strategy integrates with CancellationToken propagation.

Per-Request Timeout

Use a timeout outside retries to cap total time spent. This prevents infinite retry loops from consuming all request budget.

Program.cs - Per-Request Timeout

builder.Services.AddHttpClient(client =>
{
    client.BaseAddress = new Uri("https://payments-api.example.com");
})
.AddResilienceHandler("payments-pipeline", (pipelineBuilder, context) =>
{
    // Outer timeout: total request budget
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(5),
        Name = "payments-overall-timeout"
    });

    // Retry with per-try timeout
    pipelineBuilder.AddRetry(new Polly.Retry.RetryStrategyOptions
    {
        MaxRetryAttempts = 2,
        Delay = TimeSpan.FromMilliseconds(500),
        BackoffType = Polly.DelayBackoffType.Exponential,
        UseJitter = true,
        Name = "payments-retry"
    });

    // Inner timeout: per-try budget
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(2),
        Name = "payments-per-try-timeout"
    });
});

Cooperative Cancellation

Pass CancellationToken through all layers. When a timeout fires, Polly cancels the token. Your code must check ct.IsCancellationRequested and honor it.

PaymentsClient.cs - Cancellation

public class PaymentsClient(HttpClient http)
{
    public async Task ChargeAsync(decimal amount, CancellationToken ct)
    {
        // Pass ct to every async call
        var response = await http.PostAsJsonAsync("/api/charge", new { amount }, ct);

        if (!response.IsSuccessStatusCode)
        {
            throw new PaymentException($"Payment failed: {response.StatusCode}");
        }

        return await response.Content.ReadFromJsonAsync(ct)
            ?? throw new PaymentException("Empty response");
    }
}

public record PaymentResult(string TransactionId, string Status);

TaskCanceledException vs TimeoutRejectedException

Polly v8 throws TimeoutRejectedException on timeout. HttpClient throws TaskCanceledException. Catch both if you need to distinguish timeout from user cancellation.

Retries that Don't Hurt

Retries smooth over transient failures. But naive retries hammer failing services and cause thundering herds. Use decorrelated jitter, bound attempts, and only retry idempotent operations.

Decorrelated Jitter Backoff

Exponential backoff without jitter causes retry storms. All clients retry at the same intervals. Decorrelated jitter spreads retries randomly across time.

Program.cs - Jittered Retry

using Polly.Retry;

builder.Services.AddHttpClient(client =>
{
    client.BaseAddress = new Uri("https://inventory-api.example.com");
})
.AddResilienceHandler("inventory-retry", (pipelineBuilder, context) =>
{
    pipelineBuilder.AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(200),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true, // Decorrelated jitter
        Name = "inventory-retry",
        ShouldHandle = new PredicateBuilder().Handle(),
        OnRetry = args =>
        {
            Console.WriteLine($"Retry {args.AttemptNumber} after {args.RetryDelay}");
            return ValueTask.CompletedTask;
        }
    });
});

Idempotency Guards

Only retry safe HTTP methods (GET, PUT, DELETE) or requests with idempotency keys. POST without idempotency can create duplicate orders.

Extensions/PollyExtensions.cs - Safe Retry

public static class PollyExtensions
{
    public static bool IsSafeToRetry(this HttpRequestMessage request)
    {
        // Safe verbs
        if (request.Method == HttpMethod.Get ||
            request.Method == HttpMethod.Head ||
            request.Method == HttpMethod.Options)
            return true;

        // PUT/DELETE with idempotency
        if (request.Method == HttpMethod.Put ||
            request.Method == HttpMethod.Delete)
            return true;

        // POST with idempotency key
        if (request.Method == HttpMethod.Post &&
            request.Headers.Contains("Idempotency-Key"))
            return true;

        return false;
    }
}

// Use in retry predicate
builder.Services.AddHttpClient()
    .AddResilienceHandler("orders-retry", (pb, ctx) =>
    {
        pb.AddRetry(new RetryStrategyOptions
        {
            ShouldHandle = args =>
    {
    // For HTTP error responses, check if the request is safe to retry
    if (args.Outcome.Result is HttpResponseMessage response &&
        !response.IsSuccessStatusCode)
    {
        var request = response.RequestMessage;
        return ValueTask.FromResult(request?.IsSafeToRetry() ?? false);
    }

    // For network exceptions, retry (typically transient)
    if (args.Outcome.Exception is HttpRequestException)
    {
        return ValueTask.FromResult(true);
    }

    return ValueTask.FromResult(false);
    }
        });
    });

Anti-Pattern: Retry on 500

Don't retry all 5xx errors. 500 Internal Server Error often means a bug, not a transient issue. Retry 503 Service Unavailable and network failures. Skip 500, 501, 505.

Circuit Breakers

Circuit breakers fail fast when a service is persistently down. Instead of waiting for timeouts, the breaker opens and rejects requests immediately. This protects your app and gives the downstream service time to recover.

Failure-Rate Breaker

Track failure rate over a rolling window. If failures exceed a threshold, open the circuit. After a cooldown, enter half-open state and test with a single probe request.

Program.cs - Circuit Breaker

using Polly.CircuitBreaker;

builder.Services.AddHttpClient()
    .AddResilienceHandler("payments-circuit", (pb, ctx) =>
    {
        pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,           // Open at 50% failure
            MinimumThroughput = 10,       // Need 10 requests in window
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(15),
            Name = "payments-circuit",
            ShouldHandle = new PredicateBuilder()
                .Handle()
                .HandleResult(r =>
                    r.StatusCode == System.Net.HttpStatusCode.ServiceUnavailable),
            OnOpened = args =>
            {
                Console.WriteLine($"Circuit opened for {args.BreakDuration}");
                return ValueTask.CompletedTask;
            },
            OnClosed = args =>
            {
                Console.WriteLine("Circuit closed");
                return ValueTask.CompletedTask;
            },
            OnHalfOpened = args =>
            {
                Console.WriteLine("Circuit half-open, testing...");
                return ValueTask.CompletedTask;
            }
        });
    });

Health Endpoint Integration

Expose circuit breaker state via ASP.NET Core health checks. Monitoring tools can alert when circuits open.

Program.cs - Health Check

using Microsoft.Extensions.Diagnostics.HealthChecks;

builder.Services.AddHealthChecks()
    .AddCheck("payments-circuit", () =>
    {
        // Access circuit state via DI (requires custom tracking)
        // For now, return healthy - production should check actual state
        return HealthCheckResult.Healthy("Circuit closed");
    });

app.MapHealthChecks("/health");

Circuit State Tracking

Store breaker state in a singleton service. Update it in OnOpened/OnClosed callbacks. Query state in health checks and expose via /health endpoint.

Fallbacks & Graceful Degradation

When all retries fail and circuits open, fallbacks provide degraded responses. Return cached data, default values, or partial results. Never hide persistent failures from users—make degradation visible.

Cache-Based Fallback for GETs

For read-only requests, serve stale cache when the service is down. Mark responses with X-Fallback: true header.

Program.cs - Fallback Policy

using Polly.Fallback;
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions());

builder.Services.AddHttpClient()
    .AddResilienceHandler("inventory-fallback", (pb, ctx) =>
    {
                pb.AddFallback(new FallbackStrategyOptions
        {
            ShouldHandle = new PredicateBuilder()
                .Handle()
                .Handle()
                .HandleResult(r => !r.IsSuccessStatusCode),
            FallbackAction = async args =>
            {
                var cacheKey = args.Outcome.Result?.RequestMessage?.RequestUri?.ToString() ?? "unknown";

                if (cache.TryGetValue(cacheKey, out HttpResponseMessage? cached))
                {
                    Console.WriteLine($"Serving cached response for {cacheKey}");
                    cached!.Headers.Add("X-Fallback", "true");
                    return Outcome.FromResult(cached);
                }

                // Return empty result with 503
                var fallback = new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable)
                {
                    Content = JsonContent.Create(new { error = "Service unavailable, no cache" })
                };
                fallback.Headers.Add("X-Fallback", "true");
                return Outcome.FromResult(fallback);
            }
        });
    });

Partial Response Degradation

For composite endpoints that call multiple services, return partial data when some services fail. Users get what's available instead of total failure.

Endpoints/OrdersEndpoint.cs - Partial Response

app.MapGet("/orders/{id}/details", async (int id,
    InventoryClient inventory,
    PaymentsClient payments) =>
{
    var details = new OrderDetails { OrderId = id };

    // Try to fetch inventory (with fallback)
    try
    {
        details.Stock = await inventory.GetStockAsync(id);
    }
    catch (Exception ex)
    {
        details.Stock = null;
        details.Warnings.Add($"Inventory unavailable: {ex.Message}");
    }

    // Try to fetch payment status (with fallback)
    try
    {
        details.PaymentStatus = await payments.GetStatusAsync(id);
    }
    catch (Exception ex)
    {
        details.PaymentStatus = "Unknown";
        details.Warnings.Add($"Payments unavailable: {ex.Message}");
    }

    return Results.Ok(details);
});

public class OrderDetails
{
    public int OrderId { get; set; }
    public int? Stock { get; set; }
    public string PaymentStatus { get; set; } = "Unknown";
    public List Warnings { get; set; } = new();
}

Don't Hide Persistent Outages

Fallbacks keep your app alive during brief outages. But if a service is down for hours, users deserve to know. Add timestamps to fallback data and alert thresholds for stale caches.

Bulkhead Isolation

Bulkheads limit concurrent requests to a service. If one dependency is slow, it can't exhaust your thread pool and starve other services. Use separate bulkheads for third-party vs internal calls.

Rate Limiter Bulkhead

Polly v8 uses rate limiter strategies for bulkheads. Configure max concurrent requests and queue depth.

Program.cs - Bulkhead

using Polly.RateLimiting;

builder.Services.AddHttpClient()
    .AddResilienceHandler("payments-bulkhead", (pb, ctx) =>
    {
        pb.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 10,          // Max 10 concurrent requests
            QueueLimit = 5,            // Queue up to 5 waiting requests
            Name = "payments-bulkhead",
            OnRejected = args =>
            {
                Console.WriteLine($"Request rejected by bulkhead, queue full");
                return ValueTask.CompletedTask;
            }
        });
    });

Separate Pools for Third-Party Services

External APIs are less reliable than internal services. Give them separate bulkheads so slow third-party calls don't block internal traffic.

Program.cs - Separate Bulkheads

// Internal service: generous limits
builder.Services.AddHttpClient()
    .AddResilienceHandler("inventory-bulkhead", (pb, ctx) =>
    {
        pb.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 50,
            QueueLimit = 20,
            Name = "inventory-bulkhead-internal"
        });
    });

// Third-party API: tight limits
builder.Services.AddHttpClient()
    .AddResilienceHandler("shipping-bulkhead", (pb, ctx) =>
    {
        pb.AddConcurrencyLimiter(new ConcurrencyLimiterOptions
        {
            PermitLimit = 5,
            QueueLimit = 2,
            Name = "shipping-bulkhead-external"
        });
    });

Bulkhead Metrics

Track queued vs rejected requests. High rejection rates mean you need higher limits or the service is too slow. High queue depth means the service can't keep up with load.

Hedging & Parallel Requests

Hedging fires parallel requests to reduce tail latency. Wait for p95-p99 latency, then send a second request. The first response wins, and the slower one is cancelled. Use only for idempotent reads.

Hedge Strategy

Configure hedge delay based on observed latency percentiles. Too short and you double load. Too long and hedging doesn't help.

Program.cs - Hedging

using Polly.Hedging;

builder.Services.AddHttpClient()
    .AddResilienceHandler("catalog-hedging", (pb, ctx) =>
    {
        pb.AddHedging(new HedgingStrategyOptions
        {
            MaxHedgedAttempts = 2,
            Delay = TimeSpan.FromMilliseconds(500), // Hedge after p95
            Name = "catalog-hedging",
            ShouldHandle = new PredicateBuilder()
                .Handle()
                .Handle(),
            ActionGenerator = args =>
            {
                // Return the action to execute for hedged attempt
                return () => args.ActionProvider();
            },
            OnHedging = args =>
            {
                Console.WriteLine($"Hedging request {args.AttemptNumber}");
                return ValueTask.CompletedTask;
            }
        });
    });

Guard Against Thundering Herds

Hedging doubles request volume. Only hedge read-only endpoints. Never hedge writes or non-idempotent operations. Monitor hedge rates and adjust delay if load becomes a problem.

Hedging Increases Load

Hedging helps latency at the cost of throughput. If hedge delay is too aggressive, you can double the load on a service. Start conservative (p99 delay), monitor backend load, and tune based on capacity.

Policy Composition & Ordering

Policies execute in the order they're added to the pipeline. Order matters. The standard pattern is: Timeout → Retry → Circuit → Fallback. This ensures timeouts apply to retries, circuits track retry failures, and fallbacks catch everything.

Standard Pipeline Order

Build pipelines inside-out. Inner policies execute first per request. Outer policies wrap inner ones.

Program.cs - Full Pipeline

builder.Services.AddHttpClient()
    .AddResilienceHandler("payments-full-pipeline", (pb, ctx) =>
    {
        // 1. Outer timeout (caps total time)
        pb.AddTimeout(TimeSpan.FromSeconds(10));

        // 2. Fallback (catches all failures)
        pb.AddFallback(new FallbackStrategyOptions
        {
            ShouldHandle = _ => ValueTask.FromResult(true),
            FallbackAction = _ => new ValueTask>(
                Outcome.FromResult(
                    new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable)))
        });

        // 3. Circuit breaker (stops hammering failing service)
        pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });

        // 4. Retry (smooths transient failures)
        pb.AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromMilliseconds(200),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true
        });

        // 5. Inner timeout (per-try budget)
        pb.AddTimeout(TimeSpan.FromSeconds(3));
    });

Why This Order?

Timeout → Retry: Per-try timeout prevents one slow attempt from blocking retries.
Retry → Circuit: Circuit tracks failures after retries exhaust. Don't open circuit on first failure if retries would succeed.
Circuit → Fallback: Fallback catches circuit-open exceptions and serves degraded responses.
Outer Timeout: Caps total time including all retries and fallback execution.

Pipeline Debugging

Use the OnRetry, OnCircuitOpened, OnFallback callbacks to log pipeline execution. This shows you which policy handled each request and helps tune thresholds.

Per-Endpoint Policies

Not all endpoints need the same resilience. Critical writes need tight timeouts and no retries. Background jobs tolerate longer delays. Use endpoint metadata to select policies dynamically.

Metadata-Based Policy Selection

Tag endpoints with metadata, then resolve the policy pipeline based on that metadata.

Program.cs - Metadata Policies

// Define policy keys as metadata
public record ResiliencePolicyMetadata(string PolicyKey);

// Register named pipelines
builder.Services.AddResiliencePipeline("critical", pb =>
{
    pb.AddTimeout(TimeSpan.FromSeconds(2));
    // No retries for critical writes
});

builder.Services.AddResiliencePipeline("standard", pb =>
{
    pb.AddTimeout(TimeSpan.FromSeconds(5));
    pb.AddRetry(new RetryStrategyOptions { MaxRetryAttempts = 3 });
    pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(15)
    });
});

// Apply to endpoints
app.MapPost("/orders", async (CreateOrderRequest req) =>
{
    // Critical path
    return Results.Created($"/orders/123", new { Id = 123 });
})
.WithMetadata(new ResiliencePolicyMetadata("critical"));

app.MapGet("/orders/{id}", async (int id) =>
{
    // Standard path
    return Results.Ok(new { Id = id });
})
.WithMetadata(new ResiliencePolicyMetadata("standard"));

Endpoint Convention Extensions

Create extension methods for clean policy wiring.

Extensions/ResilienceExtensions.cs

public static class ResilienceExtensions
{
    public static RouteHandlerBuilder WithCriticalResilience(this RouteHandlerBuilder builder)
    {
        return builder.WithMetadata(new ResiliencePolicyMetadata("critical"));
    }

    public static RouteHandlerBuilder WithStandardResilience(this RouteHandlerBuilder builder)
    {
        return builder.WithMetadata(new ResiliencePolicyMetadata("standard"));
    }
}

// Usage
app.MapPost("/orders", handler).WithCriticalResilience();
app.MapGet("/orders/{id}", handler).WithStandardResilience();

Observability & Tuning

Resilience without metrics is blind guessing. Use OpenTelemetry to track policy executions, retry counts, circuit state, and hedge rates. Set up dashboards and alerts. Tune thresholds based on error budgets and SLOs.

OpenTelemetry Integration

Polly v8 emits telemetry events automatically. Wire it into OpenTelemetry for distributed tracing.

Program.cs - OpenTelemetry

using OpenTelemetry.Trace;
using OpenTelemetry.Metrics;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracerBuilder =>
    {
        tracerBuilder
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddSource("Polly")
            .AddOtlpExporter();
    })
    .WithMetrics(meterBuilder =>
    {
        meterBuilder
            .AddAspNetCoreInstrumentation()
            .AddHttpClientInstrumentation()
            .AddMeter("Polly")
            .AddOtlpExporter();
    });

Custom Counters and Events

Track retry attempts, circuit opens, and fallback usage with custom metrics.

Telemetry/ResilienceMetrics.cs

using System.Diagnostics.Metrics;

public class ResilienceMetrics
{
    private readonly Counter _retryCounter;
    private readonly Counter _circuitOpenCounter;
    private readonly Counter _hedgeCounter;

    public ResilienceMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("OrdersApi.Resilience");
        _retryCounter = meter.CreateCounter("retry_attempts");
        _circuitOpenCounter = meter.CreateCounter("circuit_breaker_opens");
        _hedgeCounter = meter.CreateCounter("hedged_requests");
    }

    public void RecordRetry(string policyName) =>
        _retryCounter.Add(1, new KeyValuePair("policy", policyName));

    public void RecordCircuitOpen(string policyName) =>
        _circuitOpenCounter.Add(1, new KeyValuePair("policy", policyName));

    public void RecordHedge(string policyName) =>
        _hedgeCounter.Add(1, new KeyValuePair("policy", policyName));
}

// Wire into policy callbacks
pb.AddRetry(new RetryStrategyOptions
{
    OnRetry = args =>
    {
        var metrics = args.Context.ServiceProvider?.GetService();
        metrics?.RecordRetry("payments-retry");
        return ValueTask.CompletedTask;
    }
});

KPIs to Track

Monitor: retry rate, circuit breaker state changes, fallback usage, request latency (p50/p95/p99), error rate by policy. Set alerts on circuit opens and fallback spikes.

Chaos & Load Testing

Resilience policies are worthless if they're never tested under failure. Inject faults in staging. Add variable latency. Simulate service restarts. Validate that your SLOs hold up.

Chaos Middleware

Build middleware that randomly injects failures and latency. Enable it via feature flag in non-production environments.

Middleware/ChaosMiddleware.cs

public class ChaosMiddleware
{
    private readonly RequestDelegate _next;
    private readonly Random _random = new();

    public ChaosMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // 10% failure rate
        if (_random.NextDouble() < 0.10)
        {
            context.Response.StatusCode = 503;
            await context.Response.WriteAsync("Chaos: Service Unavailable");
            return;
        }

        // 20% slow requests (200-800ms)
        if (_random.NextDouble() < 0.20)
        {
            var delay = _random.Next(200, 800);
            await Task.Delay(delay);
        }

        await _next(context);
    }
}

// Enable in staging only
if (builder.Environment.IsStaging())
{
    app.UseMiddleware();
}

Load Testing with k6

Use k6 or Bombardier to generate realistic load. Validate that retries don't cause cascading failures.

load-test.js - k6 Script

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 50 },  // Ramp up
    { duration: '2m', target: 100 },  // Sustained load
    { duration: '30s', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1000'],
    http_req_failed: ['rate<0.05'], // 95% success
  },
};

export default function () {
  const res = http.get('https://orders-api.example.com/orders/123');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}

Chaos in Production?

Some teams run chaos in production (Netflix, AWS). Start in staging. Once you trust your resilience, consider controlled production chaos during low-traffic windows. Always have kill switches.

Deployment & Production Practices

Resilience policies are code. Treat them like production config. Use feature flags for policy changes. Roll out new thresholds with canary deployments. Monitor policy effectiveness and iterate.

Configuration as Code

Store policy settings in appsettings.json. Use the Options pattern for reloadable config.

appsettings.json

{
  "Resilience": {
    "Payments": {
      "TimeoutSeconds": 3,
      "RetryCount": 3,
      "CircuitBreakerFailureRatio": 0.5,
      "CircuitBreakerSamplingSeconds": 30,
      "CircuitBreakerBreakSeconds": 15
    },
    "Inventory": {
      "TimeoutSeconds": 2,
      "RetryCount": 2,
      "CircuitBreakerFailureRatio": 0.3,
      "CircuitBreakerSamplingSeconds": 20,
      "CircuitBreakerBreakSeconds": 10
    }
  }
}

Options/ResilienceOptions.cs

public class ResilienceOptions
{
    public int TimeoutSeconds { get; set; }
    public int RetryCount { get; set; }
    public double CircuitBreakerFailureRatio { get; set; }
    public int CircuitBreakerSamplingSeconds { get; set; }
    public int CircuitBreakerBreakSeconds { get; set; }
}

// Bind in Program.cs
builder.Services.Configure(
    "Payments",
    builder.Configuration.GetSection("Resilience:Payments"));

// Use in pipeline
builder.Services.AddHttpClient()
    .AddResilienceHandler("payments", (pb, ctx) =>
    {
        var opts = ctx.ServiceProvider
            .GetRequiredService>()
            .Get("Payments");

        pb.AddTimeout(TimeSpan.FromSeconds(opts.TimeoutSeconds));
        pb.AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = opts.RetryCount
        });
    });

Safe Rollout Checklist

Test policy changes in staging with chaos injection
Use feature flags to enable/disable policies without redeployment
Roll out to canary instances (5-10% traffic) first
Monitor retry rates, circuit opens, latency, error rate
Have a rollback plan (feature flag toggle or config revert)
Document policy rationale and tuning history

Feature Flags for Policies

Wrap policy registration in feature flag checks. If a policy causes issues, toggle it off without redeploying. Use LaunchDarkly, Azure App Config, or simple environment variables.

Frequently Asked Questions

Should I use Polly for all HTTP calls?

Use Polly for external HTTP calls and inter-service communication where transient failures are expected. Internal loopback calls or trusted services with stable SLAs may not need full resilience. Start with circuit breakers for external dependencies, then add retries and timeouts as needed. Avoid over-engineering stable, low-latency services.

What's the right retry count for production?

Start with 2–3 retries using decorrelated jitter backoff. More retries increase latency under load and can overwhelm failing services. Use retries only for idempotent operations (GET, PUT with idempotency keys). POST without idempotency should fail fast or use hedging instead.

How do circuit breakers affect user experience?

Circuit breakers fail fast when a service is down, preventing cascading failures and long timeouts. Users see immediate errors instead of hanging requests. Pair circuits with fallbacks (cached data, degraded responses) to maintain partial functionality. Monitor breaker state and alert on transitions to open state.

When should I use hedging vs retries?

Use hedging for read-only, idempotent requests where tail latency matters more than total request volume. Hedging fires parallel requests after p95-p99 delay, canceling slower ones. Use retries for write operations, non-idempotent requests, or when you want to avoid duplicate work. Hedging increases load; retries don't.

How do I tune policies without breaking production?

Start with conservative settings (high thresholds, low retry counts). Use feature flags to toggle policies dynamically. Monitor policy events (retries, breaker opens) with OpenTelemetry. Run chaos tests in staging to validate settings. Adjust thresholds based on error budgets and SLOs, not gut feel. Roll out changes gradually with canary deployments.

Back to Tutorials