Your payment service is down. Without resilience, 10,000 requests pile up, threads block, and your entire API falls over. With Polly, circuits open in 50ms, fallbacks return cached data, and bulkheads isolate the blast radius. Your API stays up. This isn't defensive coding. It's production reality.
Most Polly tutorials show you WaitAndRetry and stop. Production systems need coordinated timeouts, jittered backoffs, circuit breakers with health checks, bulkhead isolation, and hedged requests. You could patch policies together and hope. Or you could learn the patterns that survive outages. This tutorial shows you the battle-tested patterns.
What You'll Build
You'll build a production-ready Orders API that calls Inventory and Payments services with complete resilience:
Timeouts with per-request and per-try budgets, cooperative cancellation across layers
Jittered retries with decorrelated backoff, idempotency guards, and safe verb checks
Circuit breakers with failure-rate thresholds, half-open probes, and health endpoints
Fallbacks with cache-based responses, partial degradation, and error transparency
Bulkhead isolation with separate pools for third-party vs internal calls, queue metrics
Hedging for tail-latency reduction with parallel requests and winner cancellation
Per-endpoint policies using Minimal API metadata and endpoint conventions
OpenTelemetry with policy events, custom counters, and distributed tracing
Chaos testing with failure injection, variable latency, and SLO validation
Why Resilience in Cloud-Native .NET
Cloud services fail. Networks hiccup. Dependencies restart. Without resilience, transient failures cascade into full outages. Your API becomes the weakest link. Resilience patterns keep you up when everything around you is falling down.
Transient Faults Are the Norm
Cloud providers guarantee 99.9% uptime. That still allows roughly 43 minutes of downtime per month. If your app fans out to 100 dependency calls per request, a single flaky dependency drags down overall availability. Timeouts prevent hanging. Retries smooth over blips. Circuit breakers stop the bleeding.
SLAs and Error Budgets
If your service promises 99.95% uptime (21 minutes/month), and each dependency is 99.9%, you burn through your error budget fast. Resilience patterns (retries, fallbacks) help you meet SLAs even when dependencies don't.
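To make the arithmetic concrete, here is a back-of-the-envelope sketch using the figures above (the 100-call fan-out is the assumption from the previous paragraph):
// Compound availability and error budget, using the numbers quoted above
double dependencyAvailability = 0.999; // each dependency: 99.9%
int callsPerRequest = 100; // fan-out per incoming request
// Every call must succeed: 0.999^100 ≈ 0.905, i.e. roughly 90.5% without resilience
double naiveAvailability = Math.Pow(dependencyAvailability, callsPerRequest);
// Error budget for a 99.95% SLA over a 30-day month
double errorBudgetMinutes = (1 - 0.9995) * 30 * 24 * 60; // ≈ 21.6 minutes
Console.WriteLine($"Compound availability: {naiveAvailability:P2}");
Console.WriteLine($"Monthly error budget: {errorBudgetMinutes:F1} minutes");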
Polly v8 Changes Everything
Polly v8 is a ground-up rewrite. ResiliencePipeline supersedes the v7 ISyncPolicy and IAsyncPolicy model, pipelines replace PolicyWrap, and telemetry is built in. The API is cleaner, faster, and works seamlessly with HttpClientFactory and OpenTelemetry.
Setup & First Policy
Start with a minimal API that calls a downstream service. You'll add a basic timeout policy first, then layer on retries, circuits, and fallbacks.
Install Polly v8
Polly v8 is modular. Core strategies live in Polly.Core, rate limiting in Polly.RateLimiting, and telemetry and dependency-injection extensions in Polly.Extensions. HttpClientFactory integration (AddResilienceHandler) ships in Microsoft.Extensions.Http.Resilience.
Use a typed client for clean dependency injection. Register policies via AddResilienceHandler from the new resilience extensions.
Program.cs - Typed Client Setup
using Polly;
using Polly.Timeout;
var builder = WebApplication.CreateBuilder(args);
// Register typed HttpClient with timeout policy
builder.Services.AddHttpClient<InventoryClient>(client =>
{
client.BaseAddress = new Uri("https://inventory-api.example.com");
client.Timeout = TimeSpan.FromSeconds(30); // overall timeout
})
.AddResilienceHandler("inventory-pipeline", (builder, context) =>
{
// Per-request timeout (inside retries)
pb.AddTimeout(new TimeoutStrategyOptions
{
Timeout = TimeSpan.FromSeconds(3),
Name = "inventory-timeout"
});
});
var app = builder.Build();
app.MapGet("/orders/{id}", async (int id, InventoryClient inventory) =>
{
var stock = await inventory.GetStockAsync(id);
return Results.Ok(new { OrderId = id, Stock = stock });
});
app.Run();
// Typed client
public class InventoryClient(HttpClient http)
{
public async Task<int> GetStockAsync(int productId, CancellationToken ct = default)
{
var response = await http.GetAsync($"/api/stock/{productId}", ct);
response.EnsureSuccessStatusCode();
return await response.Content.ReadFromJsonAsync<int>(ct);
}
}
Per-Try vs Overall Timeout
HttpClient.Timeout is the overall budget. The Polly timeout inside retries is per-try. Set HttpClient.Timeout higher than (max attempts × per-try timeout) plus backoff delays to avoid premature cancellation. For example, 3 attempts at 3 seconds each plus roughly 2 seconds of backoff needs at least an 11-second overall budget, so the 30-second value above leaves plenty of headroom.
Timeouts & Cancellation
Timeouts are your first line of defense. Without them, slow dependencies block threads indefinitely. With timeouts, you fail fast and free resources. Polly's timeout strategy integrates with CancellationToken propagation.
Per-Request Timeout
Use a timeout outside the retry strategy to cap total time spent, as sketched below. This prevents stacked retries from consuming the entire request budget.
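A sketch of that outer/inner split, assuming the PaymentsClient from the next listing and illustrative budgets (retry options are covered in the next section):
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments-timeouts", (pb, ctx) =>
{
pb.AddTimeout(TimeSpan.FromSeconds(10)); // outer: total budget across all tries
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage> { MaxRetryAttempts = 2 });
pb.AddTimeout(TimeSpan.FromSeconds(3)); // inner: per-try budget
});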
Pass CancellationToken through all layers. When a timeout fires, Polly cancels the token. Your code must check ct.IsCancellationRequested and honor it.
PaymentsClient.cs - Cancellation
public class PaymentsClient(HttpClient http)
{
public async Task<PaymentResult> ChargeAsync(decimal amount, CancellationToken ct)
{
// Pass ct to every async call
var response = await http.PostAsJsonAsync("/api/charge", new { amount }, ct);
if (!response.IsSuccessStatusCode)
{
throw new PaymentException($"Payment failed: {response.StatusCode}");
}
return await response.Content.ReadFromJsonAsync<PaymentResult>(ct)
?? throw new PaymentException("Empty response");
}
}
public record PaymentResult(string TransactionId, string Status);
TaskCanceledException vs TimeoutRejectedException
Polly v8 throws TimeoutRejectedException on timeout. HttpClient throws TaskCanceledException. Catch both if you need to distinguish timeout from user cancellation.
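A minimal sketch that separates the three cases, assuming the PaymentsClient call above and a caller-supplied ct:
try
{
var result = await payments.ChargeAsync(amount, ct);
}
catch (TimeoutRejectedException)
{
// A Polly timeout strategy fired (per-try or overall)
}
catch (TaskCanceledException) when (!ct.IsCancellationRequested)
{
// HttpClient.Timeout elapsed; not a caller-initiated cancellation
}
catch (OperationCanceledException) when (ct.IsCancellationRequested)
{
// The caller cancelled; don't count this as a dependency failure
}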
Retries that Don't Hurt
Retries smooth over transient failures. But naive retries hammer failing services and cause thundering herds. Use decorrelated jitter, bound attempts, and only retry idempotent operations.
Decorrelated Jitter Backoff
Exponential backoff without jitter causes retry storms. All clients retry at the same intervals. Decorrelated jitter spreads retries randomly across time.
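In Polly v8 the retry strategy has jitter built in. A sketch of the relevant options (attempt counts and delays are illustrative):
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = 3,
BackoffType = DelayBackoffType.Exponential,
UseJitter = true, // randomizes delays so clients don't retry in lockstep
Delay = TimeSpan.FromMilliseconds(200), // base delay before backoff grows
MaxDelay = TimeSpan.FromSeconds(5) // cap the worst-case wait
});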
Only retry idempotent HTTP methods (GET, HEAD, PUT, DELETE) or requests that carry idempotency keys. A POST without an idempotency key can create duplicate orders.
Extensions/PollyExtensions.cs - Safe Retry
public static class PollyExtensions
{
public static bool IsSafeToRetry(this HttpRequestMessage request)
{
// Safe verbs
if (request.Method == HttpMethod.Get ||
request.Method == HttpMethod.Head ||
request.Method == HttpMethod.Options)
return true;
// PUT/DELETE with idempotency
if (request.Method == HttpMethod.Put ||
request.Method == HttpMethod.Delete)
return true;
// POST with idempotency key
if (request.Method == HttpMethod.Post &&
request.Headers.Contains("Idempotency-Key"))
return true;
return false;
}
}
// Use in retry predicate
builder.Services.AddHttpClient("orders")
.AddResilienceHandler("orders-retry", (pb, ctx) =>
{
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
ShouldHandle = args =>
{
// Retry only transient failures (network errors or 503s)...
var transient = args.Outcome.Exception is HttpRequestException ||
args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.ServiceUnavailable;
if (!transient)
return ValueTask.FromResult(false);
// ...and only if the request is safe to replay (unavailable when only an exception is present)
var request = args.Outcome.Result?.RequestMessage;
return ValueTask.FromResult(request?.IsSafeToRetry() ?? false);
}
});
});
Anti-Pattern: Retry on 500
Don't retry all 5xx errors. 500 Internal Server Error often means a bug, not a transient issue. Retry 503 Service Unavailable and network failures. Skip 500, 501, 505.
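A sketch of a ShouldHandle predicate that encodes this rule:
ShouldHandle = args => ValueTask.FromResult(args.Outcome switch
{
{ Exception: HttpRequestException } => true, // network failures: retry
{ Result.StatusCode: System.Net.HttpStatusCode.ServiceUnavailable } => true, // 503: retry
_ => false // 500, 501, 505 and anything else: don't retry
})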
Circuit Breakers
Circuit breakers fail fast when a service is persistently down. Instead of waiting for timeouts, the breaker opens and rejects requests immediately. This protects your app and gives the downstream service time to recover.
Failure-Rate Breaker
Track failure rate over a rolling window. If failures exceed a threshold, open the circuit. After a cooldown, enter half-open state and test with a single probe request.
Program.cs - Circuit Breaker
using Polly.CircuitBreaker;
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments-circuit", (pb, ctx) =>
{
pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
FailureRatio = 0.5, // Open at 50% failure
MinimumThroughput = 10, // Need 10 requests in window
SamplingDuration = TimeSpan.FromSeconds(30),
BreakDuration = TimeSpan.FromSeconds(15),
Name = "payments-circuit",
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r =>
r.StatusCode == System.Net.HttpStatusCode.ServiceUnavailable),
OnOpened = args =>
{
Console.WriteLine($"Circuit opened for {args.BreakDuration}");
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
Console.WriteLine("Circuit closed");
return ValueTask.CompletedTask;
},
OnHalfOpened = args =>
{
Console.WriteLine("Circuit half-open, testing...");
return ValueTask.CompletedTask;
}
});
});
Health Endpoint Integration
Expose circuit breaker state via ASP.NET Core health checks. Monitoring tools can alert when circuits open.
Program.cs - Health Check
using Microsoft.Extensions.Diagnostics.HealthChecks;
builder.Services.AddHealthChecks()
.AddCheck("payments-circuit", () =>
{
// Access circuit state via DI (requires custom tracking)
// For now, return healthy - production should check actual state
return HealthCheckResult.Healthy("Circuit closed");
});
app.MapHealthChecks("/health");
Circuit State Tracking
Store breaker state in a singleton service. Update it in OnOpened/OnClosed callbacks. Query state in health checks and expose via /health endpoint.
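A minimal sketch of that tracker and a health check that reads it (class names are illustrative; call MarkOpened/MarkClosed from the OnOpened/OnClosed callbacks shown earlier):
public class CircuitStateTracker
{
private int _isOpen; // 0 = closed, 1 = open
public bool IsOpen => Volatile.Read(ref _isOpen) == 1;
public void MarkOpened() => Interlocked.Exchange(ref _isOpen, 1);
public void MarkClosed() => Interlocked.Exchange(ref _isOpen, 0);
}
public class PaymentsCircuitHealthCheck(CircuitStateTracker tracker) : IHealthCheck
{
public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken ct = default) =>
Task.FromResult(tracker.IsOpen
? HealthCheckResult.Degraded("Payments circuit is open")
: HealthCheckResult.Healthy("Payments circuit is closed"));
}
// Registration
builder.Services.AddSingleton<CircuitStateTracker>();
builder.Services.AddHealthChecks().AddCheck<PaymentsCircuitHealthCheck>("payments-circuit");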
Fallbacks & Graceful Degradation
When all retries fail and circuits open, fallbacks provide degraded responses. Return cached data, default values, or partial results. Never hide persistent failures from users—make degradation visible.
Cache-Based Fallback for GETs
For read-only requests, serve stale cache when the service is down. Mark responses with X-Fallback: true header.
Program.cs - Fallback Policy
using Polly.CircuitBreaker;
using Polly.Fallback;
using Microsoft.Extensions.Caching.Memory;
var cache = new MemoryCache(new MemoryCacheOptions());
builder.Services.AddHttpClient<InventoryClient>()
.AddResilienceHandler("inventory-fallback", (pb, ctx) =>
{
pb.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
{
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<BrokenCircuitException>()
.Handle<HttpRequestException>()
.HandleResult(r => !r.IsSuccessStatusCode),
FallbackAction = args =>
{
var cacheKey = args.Outcome.Result?.RequestMessage?.RequestUri?.ToString() ?? "unknown";
if (cache.TryGetValue(cacheKey, out HttpResponseMessage? cached))
{
Console.WriteLine($"Serving cached response for {cacheKey}");
cached!.Headers.Add("X-Fallback", "true");
return Outcome.FromResultAsValueTask(cached);
}
// Return empty result with 503
var fallback = new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable)
{
Content = JsonContent.Create(new { error = "Service unavailable, no cache" })
};
fallback.Headers.Add("X-Fallback", "true");
return Outcome.FromResultAsValueTask(fallback);
}
});
});
Partial Response Degradation
For composite endpoints that call multiple services, return partial data when some services fail. Users get what's available instead of total failure.
Endpoints/OrdersEndpoint.cs - Partial Response
app.MapGet("/orders/{id}/details", async (int id,
InventoryClient inventory,
PaymentsClient payments) =>
{
var details = new OrderDetails { OrderId = id };
// Try to fetch inventory (with fallback)
try
{
details.Stock = await inventory.GetStockAsync(id);
}
catch (Exception ex)
{
details.Stock = null;
details.Warnings.Add($"Inventory unavailable: {ex.Message}");
}
// Try to fetch payment status (with fallback)
try
{
details.PaymentStatus = await payments.GetStatusAsync(id);
}
catch (Exception ex)
{
details.PaymentStatus = "Unknown";
details.Warnings.Add($"Payments unavailable: {ex.Message}");
}
return Results.Ok(details);
});
public class OrderDetails
{
public int OrderId { get; set; }
public int? Stock { get; set; }
public string PaymentStatus { get; set; } = "Unknown";
public List<string> Warnings { get; set; } = new();
}
Don't Hide Persistent Outages
Fallbacks keep your app alive during brief outages. But if a service is down for hours, users deserve to know. Add timestamps to fallback data and alert thresholds for stale caches.
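One way to make staleness visible is to store a timestamp next to each cached value; a sketch (record name and threshold are illustrative):
public record CachedStock(int Stock, DateTimeOffset CachedAt)
{
public bool IsStale(TimeSpan maxAge) => DateTimeOffset.UtcNow - CachedAt > maxAge;
}
// When serving the fallback, surface the age instead of pretending the data is fresh
var cachedStock = new CachedStock(42, DateTimeOffset.UtcNow.AddHours(-3)); // example entry
if (cachedStock.IsStale(TimeSpan.FromHours(1)))
{
// e.g. add an X-Fallback-Age header and emit a metric so alerts can fire on stale serves
Console.WriteLine($"Serving stale stock cached at {cachedStock.CachedAt:u}");
}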
Bulkhead Isolation
Bulkheads limit concurrent requests to a service. If one dependency is slow, it can't exhaust your thread pool and starve other services. Use separate bulkheads for third-party vs internal calls.
Rate Limiter Bulkhead
Polly v8 uses rate limiter strategies for bulkheads. Configure max concurrent requests and queue depth.
Program.cs - Bulkhead
using System.Threading.RateLimiting;
using Polly.RateLimiting;
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments-bulkhead", (pb, ctx) =>
{
pb.AddRateLimiter(new RateLimiterStrategyOptions
{
Name = "payments-bulkhead",
DefaultRateLimiterOptions = new ConcurrencyLimiterOptions
{
PermitLimit = 10, // Max 10 concurrent requests
QueueLimit = 5 // Queue up to 5 waiting requests
},
OnRejected = args =>
{
Console.WriteLine("Request rejected by bulkhead, queue full");
return ValueTask.CompletedTask;
}
});
});
Separate Pools for Third-Party Services
External APIs are less reliable than internal services. Give them separate bulkheads so slow third-party calls don't block internal traffic.
Track queued vs rejected requests. High rejection rates mean you need higher limits or the service is too slow. High queue depth means the service can't keep up with load.
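A sketch of separate pools, using the internal PaymentsClient and an illustrative named client for a third-party API (limits are placeholders to tune per service):
// Generous pool for internal traffic
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("internal-bulkhead", (pb, ctx) =>
pb.AddConcurrencyLimiter(50, 20)); // permitLimit: 50, queueLimit: 20
// Tighter, isolated pool for a third-party API
builder.Services.AddHttpClient("shipping-partner")
.AddResilienceHandler("third-party-bulkhead", (pb, ctx) =>
pb.AddConcurrencyLimiter(10, 0)); // permitLimit: 10, no queue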
Hedging & Parallel Requests
Hedging fires parallel requests to reduce tail latency. Wait for p95-p99 latency, then send a second request. The first response wins, and the slower one is cancelled. Use only for idempotent reads.
Hedge Strategy
Configure hedge delay based on observed latency percentiles. Too short and you double load. Too long and hedging doesn't help.
Program.cs - Hedging
using Polly.Hedging;
builder.Services.AddHttpClient("catalog")
.AddResilienceHandler("catalog-hedging", (pb, ctx) =>
{
pb.AddHedging(new HedgingStrategyOptions<HttpResponseMessage>
{
MaxHedgedAttempts = 2,
Delay = TimeSpan.FromMilliseconds(500), // Hedge after p95
Name = "catalog-hedging",
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.Handle<TimeoutRejectedException>(),
ActionGenerator = args =>
{
// Replay the original callback for the hedged attempt
return () => args.Callback(args.ActionContext);
},
OnHedging = args =>
{
Console.WriteLine($"Hedging request {args.AttemptNumber}");
return ValueTask.CompletedTask;
}
});
});
Guard Against Thundering Herds
Hedging doubles request volume. Only hedge read-only endpoints. Never hedge writes or non-idempotent operations. Monitor hedge rates and adjust delay if load becomes a problem.
Hedging Increases Load
Hedging helps latency at the cost of throughput. If hedge delay is too aggressive, you can double the load on a service. Start conservative (p99 delay), monitor backend load, and tune based on capacity.
Policy Composition & Ordering
Strategies execute in the order they're added to the pipeline, and order matters: the first strategy added is the outermost wrapper. The standard arrangement, reading from the inside out, is per-try Timeout → Retry → Circuit → Fallback, with an overall Timeout around everything. This ensures per-try timeouts apply inside retries, circuits track failures after retries exhaust, and fallbacks catch everything.
Standard Pipeline Order
Build pipelines with wrapping in mind: outer strategies wrap inner ones, and in Polly v8 the first strategy you add becomes the outermost layer. Reading the recommended order from the inside out (a composed example follows this list):
Timeout → Retry: the per-try timeout stops one slow attempt from eating the whole retry budget.
Retry → Circuit: the circuit tracks failures after retries are exhausted, so it doesn't open on a first failure that a retry would have absorbed.
Circuit → Fallback: the fallback catches circuit-open exceptions and serves degraded responses.
Outer Timeout: caps total time, including all retries and fallback execution.
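A sketch of the full composition under these rules (the strategy added first is the outermost wrapper; client, names, and thresholds are illustrative):
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments-full", (pb, ctx) =>
{
pb.AddTimeout(TimeSpan.FromSeconds(30)); // outermost: total request budget
pb.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
{
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<BrokenCircuitException>()
.HandleResult(r => !r.IsSuccessStatusCode),
FallbackAction = _ => Outcome.FromResultAsValueTask(
new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable))
});
pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
{
FailureRatio = 0.5,
MinimumThroughput = 10,
SamplingDuration = TimeSpan.FromSeconds(30),
BreakDuration = TimeSpan.FromSeconds(15)
});
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = 3,
BackoffType = DelayBackoffType.Exponential,
UseJitter = true
});
pb.AddTimeout(TimeSpan.FromSeconds(3)); // innermost: per-try budget
});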
Pipeline Debugging
Use the OnRetry, OnOpened/OnClosed, and OnFallback callbacks to log pipeline execution. This shows you which strategy handled each request and helps tune thresholds.
Per-Endpoint Policies
Not all endpoints need the same resilience. Critical writes need tight timeouts and no retries. Background jobs tolerate longer delays. Use endpoint metadata to select policies dynamically.
Metadata-Based Policy Selection
Tag endpoints with metadata, then resolve the policy pipeline based on that metadata.
Program.cs - Metadata Policies
// Define policy keys as metadata
public record ResiliencePolicyMetadata(string PolicyKey);
// Register named pipelines
builder.Services.AddResiliencePipeline("critical", pb =>
{
pb.AddTimeout(TimeSpan.FromSeconds(2));
// No retries for critical writes
});
builder.Services.AddResiliencePipeline("standard", pb =>
{
pb.AddTimeout(TimeSpan.FromSeconds(5));
pb.AddRetry(new RetryStrategyOptions { MaxRetryAttempts = 3 });
pb.AddCircuitBreaker(new CircuitBreakerStrategyOptions
{
FailureRatio = 0.5,
SamplingDuration = TimeSpan.FromSeconds(30),
MinimumThroughput = 10,
BreakDuration = TimeSpan.FromSeconds(15)
});
});
// Apply to endpoints
app.MapPost("/orders", async (CreateOrderRequest req) =>
{
// Critical path
return Results.Created($"/orders/123", new { Id = 123 });
})
.WithMetadata(new ResiliencePolicyMetadata("critical"));
app.MapGet("/orders/{id}", async (int id) =>
{
// Standard path
return Results.Ok(new { Id = id });
})
.WithMetadata(new ResiliencePolicyMetadata("standard"));
Endpoint Convention Extensions
Create extension methods for clean policy wiring.
Extensions/ResilienceExtensions.cs
public static class ResilienceExtensions
{
public static RouteHandlerBuilder WithCriticalResilience(this RouteHandlerBuilder builder)
{
return builder.WithMetadata(new ResiliencePolicyMetadata("critical"));
}
public static RouteHandlerBuilder WithStandardResilience(this RouteHandlerBuilder builder)
{
return builder.WithMetadata(new ResiliencePolicyMetadata("standard"));
}
}
// Usage
app.MapPost("/orders", handler).WithCriticalResilience();
app.MapGet("/orders/{id}", handler).WithStandardResilience();
Observability & Tuning
Resilience without metrics is blind guessing. Use OpenTelemetry to track policy executions, retry counts, circuit state, and hedge rates. Set up dashboards and alerts. Tune thresholds based on error budgets and SLOs.
OpenTelemetry Integration
Polly v8 emits telemetry events automatically. Wire it into OpenTelemetry for distributed tracing.
Track retry attempts, circuit opens, and fallback usage with custom metrics.
Telemetry/ResilienceMetrics.cs
using System.Diagnostics.Metrics;
public class ResilienceMetrics
{
private readonly Counter<long> _retryCounter;
private readonly Counter<long> _circuitOpenCounter;
private readonly Counter<long> _hedgeCounter;
public ResilienceMetrics(IMeterFactory meterFactory)
{
var meter = meterFactory.Create("OrdersApi.Resilience");
_retryCounter = meter.CreateCounter<long>("retry_attempts");
_circuitOpenCounter = meter.CreateCounter<long>("circuit_breaker_opens");
_hedgeCounter = meter.CreateCounter<long>("hedged_requests");
}
public void RecordRetry(string policyName) =>
_retryCounter.Add(1, new KeyValuePair<string, object?>("policy", policyName));
public void RecordCircuitOpen(string policyName) =>
_circuitOpenCounter.Add(1, new KeyValuePair<string, object?>("policy", policyName));
public void RecordHedge(string policyName) =>
_hedgeCounter.Add(1, new KeyValuePair<string, object?>("policy", policyName));
}
// Wire into policy callbacks: register the metrics service and resolve it once per handler
builder.Services.AddSingleton<ResilienceMetrics>();
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments-retry", (pb, ctx) =>
{
var metrics = ctx.ServiceProvider.GetRequiredService<ResilienceMetrics>();
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
OnRetry = args =>
{
metrics.RecordRetry("payments-retry");
return ValueTask.CompletedTask;
}
});
});
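To export these counters alongside Polly's built-in events, register both meters with OpenTelemetry. A sketch assuming the OpenTelemetry.Extensions.Hosting and OTLP exporter packages, and Polly's default meter name ("Polly"):
builder.Services.AddOpenTelemetry()
.WithMetrics(metrics => metrics
.AddMeter("Polly") // Polly v8 resilience events
.AddMeter("OrdersApi.Resilience") // the custom counters above
.AddOtlpExporter())
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());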
KPIs to Track
Monitor: retry rate, circuit breaker state changes, fallback usage, request latency (p50/p95/p99), error rate by policy. Set alerts on circuit opens and fallback spikes.
Chaos & Load Testing
Resilience policies are worthless if they're never tested under failure. Inject faults in staging. Add variable latency. Simulate service restarts. Validate that your SLOs hold up.
Chaos Middleware
Build middleware that randomly injects failures and latency. Enable it via feature flag in non-production environments.
Middleware/ChaosMiddleware.cs
public class ChaosMiddleware
{
private readonly RequestDelegate _next;
private readonly Random _random = new();
public ChaosMiddleware(RequestDelegate next) => _next = next;
public async Task InvokeAsync(HttpContext context)
{
// 10% failure rate
if (_random.NextDouble() < 0.10)
{
context.Response.StatusCode = 503;
await context.Response.WriteAsync("Chaos: Service Unavailable");
return;
}
// 20% slow requests (200-800ms)
if (_random.NextDouble() < 0.20)
{
var delay = _random.Next(200, 800);
await Task.Delay(delay);
}
await _next(context);
}
}
// Enable in staging only
if (builder.Environment.IsStaging())
{
app.UseMiddleware<ChaosMiddleware>();
}
Load Testing with k6
Use k6 or Bombardier to generate realistic load. Validate that retries don't cause cascading failures.
Some teams run chaos in production (Netflix, AWS). Start in staging. Once you trust your resilience, consider controlled production chaos during low-traffic windows. Always have kill switches.
Deployment & Production Practices
Resilience policies are code. Treat them like production config. Use feature flags for policy changes. Roll out new thresholds with canary deployments. Monitor policy effectiveness and iterate.
Configuration as Code
Store policy settings in appsettings.json. Use the Options pattern for reloadable config.
public class ResilienceOptions
{
public int TimeoutSeconds { get; set; }
public int RetryCount { get; set; }
public double CircuitBreakerFailureRatio { get; set; }
public int CircuitBreakerSamplingSeconds { get; set; }
public int CircuitBreakerBreakSeconds { get; set; }
}
// Bind in Program.cs
builder.Services.Configure<ResilienceOptions>(
"Payments",
builder.Configuration.GetSection("Resilience:Payments"));
// Use in pipeline
builder.Services.AddHttpClient<PaymentsClient>()
.AddResilienceHandler("payments", (pb, ctx) =>
{
var opts = ctx.ServiceProvider
.GetRequiredService<IOptionsMonitor<ResilienceOptions>>()
.Get("Payments");
pb.AddTimeout(TimeSpan.FromSeconds(opts.TimeoutSeconds));
pb.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = opts.RetryCount
});
});
Safe Rollout Checklist
Test policy changes in staging with chaos injection
Use feature flags to enable/disable policies without redeployment
Roll out to canary instances (5-10% traffic) first
Have a rollback plan (feature flag toggle or config revert)
Document policy rationale and tuning history
Feature Flags for Policies
Wrap policy registration in feature flag checks. If a policy causes issues, toggle it off without redeploying. Use LaunchDarkly, Azure App Config, or simple environment variables.
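A minimal sketch using a configuration flag read at startup (the key name is illustrative); the same shape works with LaunchDarkly or Azure App Configuration surfaced through IConfiguration:
var hedgingEnabled = builder.Configuration.GetValue<bool>("Features:EnableHedging");
builder.Services.AddHttpClient("catalog")
.AddResilienceHandler("catalog-hedging", (pb, ctx) =>
{
if (!hedgingEnabled)
return; // flag off: register no hedging strategy
pb.AddHedging(new HedgingStrategyOptions<HttpResponseMessage>
{
MaxHedgedAttempts = 2,
Delay = TimeSpan.FromMilliseconds(500)
});
});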
Frequently Asked Questions
Should I use Polly for all HTTP calls?
Use Polly for external HTTP calls and inter-service communication where transient failures are expected. Internal loopback calls or trusted services with stable SLAs may not need full resilience. Start with circuit breakers for external dependencies, then add retries and timeouts as needed. Avoid over-engineering stable, low-latency services.
What's the right retry count for production?
Start with 2–3 retries using decorrelated jitter backoff. More retries increase latency under load and can overwhelm failing services. Use retries only for idempotent operations (GET, PUT with idempotency keys). POST without idempotency should fail fast or use hedging instead.
How do circuit breakers affect user experience?
Circuit breakers fail fast when a service is down, preventing cascading failures and long timeouts. Users see immediate errors instead of hanging requests. Pair circuits with fallbacks (cached data, degraded responses) to maintain partial functionality. Monitor breaker state and alert on transitions to open state.
When should I use hedging vs retries?
Use hedging for read-only, idempotent requests where tail latency matters more than total request volume. Hedging fires parallel requests after p95-p99 delay, canceling slower ones. Use retries for write operations, non-idempotent requests, or when you want to avoid duplicate work. Hedging increases load; retries don't.
How do I tune policies without breaking production?
Start with conservative settings (high thresholds, low retry counts). Use feature flags to toggle policies dynamically. Monitor policy events (retries, breaker opens) with OpenTelemetry. Run chaos tests in staging to validate settings. Adjust thresholds based on error budgets and SLOs, not gut feel. Roll out changes gradually with canary deployments.