The Survival Kit for BackgroundService in Production
The BackgroundService base class makes starting a long-running background task in .NET trivially easy: inherit, override ExecuteAsync, register with AddHostedService. What it does not make obvious is everything that can go wrong between that first working implementation and a service that handles restarts, transient failures, and shutdown signals correctly in production.
A BackgroundService that ignores the cancellation token blocks graceful shutdown and gets force-killed by Kubernetes, corrupting in-flight work. One that swallows exceptions and retries without backoff hammers a failing dependency until the circuit breaks everywhere. One that uses async void internally crashes the process with no stack trace. One that drains its queue indefinitely on shutdown holds up a deployment for minutes. None of these are exotic failure modes — they are the predictable collision points between a minimal BackgroundService implementation and what production actually demands. This article gives you the patterns that survive all four.
Cancellation Token Propagation: The One Rule That Cannot Be Broken
The stoppingToken passed to ExecuteAsync is the host's mechanism for telling your service to stop. It is triggered by application shutdown, a SIGTERM from the OS or container orchestrator, or an explicit IHostApplicationLifetime.StopApplication() call. Every async operation inside ExecuteAsync must receive this token. Every one. A single await SomeOperation() without a cancellation token is a potential shutdown blocker — if that operation is in progress when the host signals stop, it will run to completion (or timeout) before ExecuteAsync can return, holding up the entire shutdown sequence.
// ════════════════════════════════════════════════════════════════════════════
// WRONG: stoppingToken not propagated — shutdown blocks until operation completes
// ════════════════════════════════════════════════════════════════════════════
public class NaiveWorker : BackgroundService
{
private readonly IMessageQueue _queue;
private readonly ILogger<NaiveWorker> _logger;
public NaiveWorker(IMessageQueue queue, ILogger<NaiveWorker> logger)
=> (_queue, _logger) = (queue, logger);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while (!stoppingToken.IsCancellationRequested)
{
// WRONG: no cancellation token — blocks shutdown if queue is empty
// and the dequeue call has a long internal timeout
var message = await _queue.DequeueAsync();
// WRONG: no cancellation token — blocks shutdown for the full
// duration of message processing regardless of shutdown signal
await ProcessMessageAsync(message);
// WRONG: no cancellation token — delay runs to completion before
// the loop condition is checked again
await Task.Delay(TimeSpan.FromSeconds(5));
}
}
private async Task ProcessMessageAsync(Message message) { /* ... */ }
}
// ════════════════════════════════════════════════════════════════════════════
// RIGHT: stoppingToken propagated to every async call
// ════════════════════════════════════════════════════════════════════════════
public class CorrectWorker : BackgroundService
{
private readonly IMessageQueue _queue;
private readonly ILogger<CorrectWorker> _logger;
public CorrectWorker(IMessageQueue queue, ILogger<CorrectWorker> logger)
=> (_queue, _logger) = (queue, logger);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
_logger.LogInformation("Worker started.");
// OperationCanceledException is the normal exit path when stoppingToken
// fires. Catch it here, log the shutdown, and return cleanly.
// Do NOT catch it inside the loop — let it propagate to ExecuteAsync.
try
{
while (!stoppingToken.IsCancellationRequested)
{
// stoppingToken cancels the dequeue wait immediately on shutdown
var message = await _queue.DequeueAsync(stoppingToken);
// stoppingToken propagated — processing is interrupted on shutdown
await ProcessMessageAsync(message, stoppingToken);
}
}
catch (OperationCanceledException)
{
// Normal shutdown path — stoppingToken was cancelled.
// Log and return. Do not re-throw — the host already knows it is stopping.
_logger.LogInformation("Worker stopping — cancellation requested.");
}
_logger.LogInformation("Worker stopped.");
}
private async Task ProcessMessageAsync(Message message, CancellationToken ct)
{
// Pass ct to every async call within processing.
// (_dbContext and _httpClient stand in for whatever dependencies you inject.)
await _dbContext.SaveChangesAsync(ct);
await _httpClient.PostAsync("/api/notify", content, ct);
// ... etc.
}
}
// ════════════════════════════════════════════════════════════════════════════
// PER-OPERATION TIMEOUT WITH LINKED TOKEN SOURCE
// ════════════════════════════════════════════════════════════════════════════
// When you need a per-operation timeout AND host shutdown cancellation,
// link both into a single token. The operation is cancelled by whichever
// fires first — the timeout or the host shutdown.
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
try
{
while (!stoppingToken.IsCancellationRequested)
{
var message = await _queue.DequeueAsync(stoppingToken);
// Per-operation timeout: 30 seconds or host shutdown, whichever is first
// Link to stoppingToken, then add the timeout via CancelAfter. This avoids
// leaking an undisposed timeout CancellationTokenSource.
using var operationCts = CancellationTokenSource.CreateLinkedTokenSource(stoppingToken);
operationCts.CancelAfter(TimeSpan.FromSeconds(30));
try
{
await ProcessMessageAsync(message, operationCts.Token);
}
catch (OperationCanceledException) when (!stoppingToken.IsCancellationRequested)
{
// The operation-level timeout fired, not the host shutdown.
// Log it, move to the next message.
_logger.LogWarning("Message {Id} processing timed out.", message.Id);
}
// If stoppingToken fired, OperationCanceledException propagates
// to the outer try/catch and exits the loop cleanly.
}
}
catch (OperationCanceledException)
{
_logger.LogInformation("Worker stopping — cancellation requested.");
}
}
// ════════════════════════════════════════════════════════════════════════════
// async void: THE PATTERN THAT KILLS YOUR SERVICE SILENTLY
// ════════════════════════════════════════════════════════════════════════════
// WRONG: async void — exceptions escape to the thread pool and crash the process
// or are silently swallowed depending on runtime version
private async void FireAndForgetWork() // ← never do this in a BackgroundService
{
await Task.Delay(1000);
throw new InvalidOperationException("This kills the process with no useful stack trace.");
}
// RIGHT: async Task — exceptions are observable, awaitable, and loggable
private async Task DoWorkAsync(CancellationToken ct)
{
await Task.Delay(1000, ct);
// If this throws, the exception propagates to the awaiting caller in ExecuteAsync
// where it can be caught, logged, and handled correctly.
}
// In ExecuteAsync: await DoWorkAsync(stoppingToken); — observable, correct.
The when (!stoppingToken.IsCancellationRequested) filter on the inner OperationCanceledException catch is the detail that makes per-operation timeouts composable with host shutdown. Without it, catching OperationCanceledException inside the loop swallows the host shutdown signal — the outer try/catch never fires, the loop continues, and your service ignores the shutdown entirely. The filter distinguishes between "the operation timed out" (handle it, continue the loop) and "the host is shutting down" (let it propagate, exit the loop). This pattern appears whenever you combine two cancellation sources and is worth internalising as a standard idiom.
Retry Policies: Backoff That Does Not Block Shutdown
A BackgroundService that processes messages from a queue or calls an external dependency will encounter transient failures. The correct response to a transient failure is a retry with exponential backoff — not an immediate retry that hammers a failing dependency, and not a bare catch that swallows the exception and moves on. The critical constraint is that every delay in the retry cycle must be cancellable by the stoppingToken. A retry delay of 64 seconds that ignores the stopping token means a shutdown during that delay waits the full 64 seconds before the host can proceed.
// ════════════════════════════════════════════════════════════════════════════
// OPTION A: Manual exponential backoff respecting stoppingToken
// ════════════════════════════════════════════════════════════════════════════
public class RetryWorker : BackgroundService
{
private const int MaxRetryAttempts = 5;
private readonly ILogger<RetryWorker> _logger;
private readonly IExternalService _service;
private readonly IMessageQueue _queue;
private readonly IDeadLetterQueue _deadLetterQueue;
public RetryWorker(
IExternalService service,
IMessageQueue queue,
IDeadLetterQueue deadLetterQueue,
ILogger<RetryWorker> logger)
=> (_service, _queue, _deadLetterQueue, _logger) = (service, queue, deadLetterQueue, logger);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
try
{
while (!stoppingToken.IsCancellationRequested)
{
var message = await _queue.DequeueAsync(stoppingToken);
await ProcessWithRetryAsync(message, stoppingToken);
}
}
catch (OperationCanceledException)
{
_logger.LogInformation("Worker stopping.");
}
}
private async Task ProcessWithRetryAsync(Message message, CancellationToken ct)
{
var attempt = 0;
while (true)
{
try
{
await _service.ProcessAsync(message, ct);
return; // success — exit retry loop
}
catch (OperationCanceledException)
{
throw; // host shutdown — do not retry, propagate immediately
}
catch (Exception ex) when (attempt < MaxRetryAttempts)
{
attempt++;
// Exponential backoff: 1s, 2s, 4s, 8s, 16s — capped at 60s
var delay = TimeSpan.FromSeconds(Math.Min(Math.Pow(2, attempt), 60));
_logger.LogWarning(ex,
"Processing attempt {Attempt}/{Max} failed. Retrying in {Delay}s.",
attempt, MaxRetryAttempts, delay.TotalSeconds);
// CRITICAL: pass ct to Task.Delay — shutdown cancels the wait immediately
// Task.Delay(delay) without ct blocks shutdown for the full delay duration
await Task.Delay(delay, ct);
}
catch (Exception ex)
{
// Max retries exceeded — log and move on (or dead-letter the message)
_logger.LogError(ex,
"Message {Id} failed after {Max} attempts. Moving to dead-letter.",
message.Id, MaxRetryAttempts);
await _deadLetterQueue.EnqueueAsync(message, ct);
return;
}
}
}
}
// ════════════════════════════════════════════════════════════════════════════
// OPTION B: Polly v8 ResiliencePipeline — declarative retry with backoff
// dotnet add package Polly
// dotnet add package Microsoft.Extensions.Http.Resilience (for HTTP clients)
// ════════════════════════════════════════════════════════════════════════════
// Register the pipeline in DI (Program.cs):
builder.Services.AddResiliencePipeline("worker-retry", (pipelineBuilder, context) =>
{
// Resolve a logger from DI for the OnRetry callback below
var logger = context.ServiceProvider.GetRequiredService<ILogger<PollyWorker>>();
pipelineBuilder.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = 5,
BackoffType = DelayBackoffType.Exponential,
Delay = TimeSpan.FromSeconds(1),
MaxDelay = TimeSpan.FromSeconds(60),
UseJitter = true, // randomises delay ±25% — prevents retry storms
ShouldHandle = new PredicateBuilder()
.Handle<HttpRequestException>() // adjust to the transient exceptions your dependency throws
.Handle<TimeoutException>(),
OnRetry = args =>
{
logger.LogWarning(
"Retry {Attempt} after {Delay}ms due to: {Exception}",
args.AttemptNumber,
args.RetryDelay.TotalMilliseconds,
args.Outcome.Exception?.Message);
return ValueTask.CompletedTask;
}
});
// Add timeout per attempt — combined with retry gives bounded total time
pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(30));
});
// Use in BackgroundService:
public class PollyWorker : BackgroundService
{
private readonly ResiliencePipeline _pipeline;
private readonly IMessageQueue _queue;
private readonly IExternalService _service;
private readonly ILogger<PollyWorker> _logger;
public PollyWorker(
ResiliencePipelineProvider<string> pipelineProvider,
IMessageQueue queue,
IExternalService service,
ILogger<PollyWorker> logger)
{
_pipeline = pipelineProvider.GetPipeline("worker-retry");
_queue = queue;
_service = service;
_logger = logger;
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
try
{
while (!stoppingToken.IsCancellationRequested)
{
var message = await _queue.DequeueAsync(stoppingToken);
// stoppingToken passed to ExecuteAsync — Polly aborts retry cycle on shutdown
await _pipeline.ExecuteAsync(
async ct => await _service.ProcessAsync(message, ct),
stoppingToken);
}
}
catch (OperationCanceledException)
{
_logger.LogInformation("Worker stopping.");
}
}
}
// ── Jitter note ────────────────────────────────────────────────────────────
// UseJitter = true is not optional when you have multiple worker instances.
// Without jitter, all instances fail at the same time and retry at the same
// time — creating synchronised retry waves that overwhelm the recovering
// dependency. Jitter distributes retry attempts across time, giving the
// dependency a chance to recover without a thundering herd.
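The same idea can be retrofitted onto the manual backoff in Option A with a few lines. A minimal sketch, assuming the attempt counter and 60-second cap from ProcessWithRetryAsync above (the helper name JitteredBackoff is ours, not a framework API):

```csharp
// Exponential backoff with "full jitter": the delay is drawn uniformly from
// [0, min(2^attempt, cap)], so instances that failed together retry apart.
// Random.Shared is thread-safe and available since .NET 6.
static TimeSpan JitteredBackoff(int attempt, double capSeconds = 60)
{
    var maxSeconds = Math.Min(Math.Pow(2, attempt), capSeconds);
    return TimeSpan.FromSeconds(Random.Shared.NextDouble() * maxSeconds);
}

// In the retry loop, replace the fixed delay with:
// await Task.Delay(JitteredBackoff(attempt), ct);
```

Full jitter trades delay predictability for maximum spread; if you need a floor on the delay, scale the random factor into a narrower band instead.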
The choice between manual retry and Polly is a maintenance question, not a correctness question — both approaches work when implemented correctly. Polly's advantage is that the retry semantics are declared in one place and are reusable across multiple services registered against the same pipeline name. The manual approach's advantage is zero additional dependencies and explicit control over every decision point. For teams already using Polly for HTTP resilience, using ResiliencePipeline in background services keeps resilience policy consistent across the codebase. For teams without Polly, the manual pattern is correct and readable without adding a dependency.
Graceful Shutdown: Drain the Queue or Stop Immediately
When the host signals shutdown, your BackgroundService faces a decision about work already in progress. Stop immediately means cancel all in-flight processing, return from ExecuteAsync, and let the queue or caller retry. Drain means stop accepting new work but complete everything already dequeued before returning. The right answer depends on whether your work items are idempotent, whether your queue provides at-least-once delivery, and how expensive duplicate processing is relative to the cost of a longer shutdown. Neither is universally correct — the decision belongs to the service's domain, not the framework.
// ════════════════════════════════════════════════════════════════════════════
// HOST SHUTDOWN TIMEOUT CONFIGURATION
// Default: 5 seconds. Extend when your work items need more time to drain.
// Must be shorter than Kubernetes terminationGracePeriodSeconds (default: 30s).
// ════════════════════════════════════════════════════════════════════════════
builder.Services.Configure<HostOptions>(options =>
{
// Allow 25 seconds for all hosted services to stop.
// Kubernetes terminationGracePeriodSeconds should be set to 30 or higher.
options.ShutdownTimeout = TimeSpan.FromSeconds(25);
});
// ════════════════════════════════════════════════════════════════════════════
// PATTERN A: Stop immediately — correct for idempotent, at-least-once queues
// ════════════════════════════════════════════════════════════════════════════
// When your queue (SQS, Service Bus, RabbitMQ) provides at-least-once delivery
// and your message handler is idempotent, the simplest shutdown strategy is
// correct: let OperationCanceledException propagate, return from ExecuteAsync.
// The message was not acknowledged — the queue redelivers it to another instance.
public class StopImmediatelyWorker : BackgroundService
{
private readonly IMessageQueue _queue;
private readonly ILogger<StopImmediatelyWorker> _logger;
public StopImmediatelyWorker(IMessageQueue queue, ILogger<StopImmediatelyWorker> logger)
=> (_queue, _logger) = (queue, logger);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
try
{
while (!stoppingToken.IsCancellationRequested)
{
var message = await _queue.DequeueAsync(stoppingToken);
await ProcessAsync(message, stoppingToken);
await _queue.AcknowledgeAsync(message, stoppingToken);
// If stoppingToken fires before Acknowledge, the message is
// redelivered by the queue broker — correct behaviour.
}
}
catch (OperationCanceledException)
{
_logger.LogInformation("Worker stopped. In-flight message will be redelivered.");
}
}
}
// ════════════════════════════════════════════════════════════════════════════
// PATTERN B: Drain in-flight work — correct for non-idempotent or costly work
// ════════════════════════════════════════════════════════════════════════════
// Use when: messages have external side effects (payment, email, third-party API)
// or when deduplication is expensive and redelivery is undesirable.
public class DrainingWorker : BackgroundService
{
private readonly ILogger<DrainingWorker> _logger;
private readonly IMessageQueue _queue;
private readonly IMessageProcessor _processor;
// Maximum time to spend draining in-flight work after shutdown signal.
// Must fit within the host ShutdownTimeout configured above.
private static readonly TimeSpan DrainTimeout = TimeSpan.FromSeconds(20);
public DrainingWorker(IMessageQueue queue, IMessageProcessor processor, ILogger<DrainingWorker> logger)
=> (_queue, _processor, _logger) = (queue, processor, logger);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
// Track all in-flight processing tasks so we can await them on shutdown
var inFlight = new ConcurrentBag<Task>();
try
{
while (!stoppingToken.IsCancellationRequested)
{
var message = await _queue.DequeueAsync(stoppingToken);
// Start processing — do NOT await here, so the dequeue loop continues.
// CancellationToken.None: processing runs to completion even after shutdown.
// The stoppingToken stops dequeuing, not processing.
var processingTask = ProcessAndAcknowledgeAsync(message, CancellationToken.None);
inFlight.Add(processingTask);
// Prune completed tasks to avoid unbounded growth of the bag
// (in production, use a more structured concurrent collection)
}
}
catch (OperationCanceledException)
{
_logger.LogInformation(
"Shutdown signalled. Draining {Count} in-flight messages.",
inFlight.Count(t => !t.IsCompleted));
}
// ── Drain phase ────────────────────────────────────────────────────
// Wait for all in-flight tasks, bounded by the drain timeout.
// After DrainTimeout, any remaining tasks are abandoned — the host
// will force-terminate after ShutdownTimeout regardless.
using var drainCts = new CancellationTokenSource(DrainTimeout);
try
{
await Task.WhenAll(inFlight).WaitAsync(drainCts.Token);
_logger.LogInformation("Drain complete. All in-flight messages processed.");
}
catch (OperationCanceledException)
{
var remaining = inFlight.Count(t => !t.IsCompleted);
_logger.LogWarning(
"Drain timeout reached. {Count} message(s) abandoned — will be redelivered.",
remaining);
}
}
private async Task ProcessAndAcknowledgeAsync(Message message, CancellationToken ct)
{
try
{
await _processor.ProcessAsync(message, ct);
await _queue.AcknowledgeAsync(message, ct);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to process message {Id} during drain.", message.Id);
await _queue.NegativeAcknowledgeAsync(message, ct);
}
}
}
// ════════════════════════════════════════════════════════════════════════════
// SHUTDOWN DECISION REFERENCE
// ════════════════════════════════════════════════════════════════════════════
//
// Condition                                          → Pattern
// ─────────────────────────────────────────────────    ──────────────────────
// Queue provides at-least-once + handler idempotent  → Stop immediately (A)
// Processing has non-idempotent side effects         → Drain (B)
// Processing is fast (< 1s per message)              → Stop immediately (A)
// Processing is slow or batched                      → Drain (B)
// Redelivery is cheap                                → Stop immediately (A)
// Redelivery causes user-visible duplicates          → Drain (B)
// Kubernetes pod rolling update                      → Either; drain reduces
//                                                      visible errors
The drain timeout is as important as the drain logic itself. Without a bounded drain time, a processing spike during a deployment rollout could cause ExecuteAsync to block long enough that the host exceeds its ShutdownTimeout and force-terminates anyway — defeating the purpose of draining. Set DrainTimeout to a value comfortably inside your ShutdownTimeout, and set ShutdownTimeout comfortably inside your Kubernetes terminationGracePeriodSeconds. The layered timeout structure — per-operation timeout, drain timeout, host shutdown timeout, Kubernetes termination grace period — gives every layer a chance to clean up before the next layer forces termination.
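On the orchestrator side, the outermost layer is the pod's termination grace period. A minimal Kubernetes deployment fragment showing where it lives (the image name and value of 40 are illustrative, chosen to sit comfortably above the 25-second ShutdownTimeout used in this article):

```yaml
# Give the pod more grace than the host's ShutdownTimeout, so the .NET host
# can finish its own shutdown sequence before the kubelet sends SIGKILL.
# Layering: DrainTimeout (20s) < ShutdownTimeout (25s) < terminationGracePeriodSeconds
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 40  # must exceed the host's ShutdownTimeout
      containers:
        - name: worker
          image: registry.example.com/worker:latest  # illustrative
```

Kubernetes sends SIGTERM at the start of the grace period and SIGKILL at its end; everything the article describes happens inside that window.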
Exception Behaviour & .NET 8 Host Configuration
A BackgroundService that throws an unhandled exception from ExecuteAsync behaves differently depending on the .NET version and the BackgroundServiceExceptionBehavior host option. Understanding this behaviour is the difference between a silently dead worker (the pre-.NET 6 default) and a host that correctly surfaces the failure. Since .NET 6, and unchanged in .NET 8, the default is to stop the host, which is the correct production behaviour, but it means unhandled exceptions that earlier versions swallowed silently will now take the application down.
// ════════════════════════════════════════════════════════════════════════════
// BACKGROUND SERVICE EXCEPTION BEHAVIOUR
// ════════════════════════════════════════════════════════════════════════════
builder.Services.Configure<HostOptions>(options =>
{
// Default since .NET 6 (including .NET 8): StopHost — unhandled exception in any
// triggers IHost.StopAsync, bringing down the entire application.
// This is correct production behaviour — a dead worker should not silently
// run alongside healthy services without anyone knowing.
options.BackgroundServiceExceptionBehavior =
BackgroundServiceExceptionBehavior.StopHost;
// Alternative: Ignore — the worker stops but the host continues.
// Only use if your BackgroundService is truly optional and the application
// is meaningful without it. Document why explicitly in code.
// options.BackgroundServiceExceptionBehavior =
// BackgroundServiceExceptionBehavior.Ignore;
options.ShutdownTimeout = TimeSpan.FromSeconds(25);
});
// ════════════════════════════════════════════════════════════════════════════
// STRUCTURED LOGGING FOR BACKGROUND SERVICE LIFECYCLE
// ════════════════════════════════════════════════════════════════════════════
public class WellLoggedWorker : BackgroundService
{
private readonly ILogger<WellLoggedWorker> _logger;
public WellLoggedWorker(ILogger<WellLoggedWorker> logger)
=> _logger = logger;
public override async Task StartAsync(CancellationToken cancellationToken)
{
_logger.LogInformation("Worker starting.");
await base.StartAsync(cancellationToken); // starts ExecuteAsync; it runs synchronously until its first await
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
_logger.LogInformation("Worker execute loop started.");
try
{
while (!stoppingToken.IsCancellationRequested)
{
// ... work loop ...
await Task.Delay(TimeSpan.FromSeconds(1), stoppingToken);
}
}
catch (OperationCanceledException)
{
// Normal — log and return
_logger.LogInformation("Worker execute loop cancelled.");
}
catch (Exception ex)
{
// Unhandled exception — log with full context before it propagates
// to the host and triggers StopHost behaviour.
_logger.LogCritical(ex, "Worker execute loop failed with unhandled exception.");
throw; // re-throw — let the host react per BackgroundServiceExceptionBehavior
}
}
public override async Task StopAsync(CancellationToken cancellationToken)
{
_logger.LogInformation("Worker stopping.");
await base.StopAsync(cancellationToken);
_logger.LogInformation("Worker stopped.");
}
}
// ════════════════════════════════════════════════════════════════════════════
// STARTUP VALIDATION — fail fast on missing dependencies
// ════════════════════════════════════════════════════════════════════════════
// If your BackgroundService requires a configuration value or external
// dependency that must be present at startup, validate in StartAsync rather
// than discovering the problem at the first loop iteration.
public class ValidatedWorker : BackgroundService
{
private readonly WorkerOptions _options;
private readonly ILogger<ValidatedWorker> _logger;
public ValidatedWorker(IOptions<WorkerOptions> options, ILogger<ValidatedWorker> logger)
{
_options = options.Value;
_logger = logger;
}
public override Task StartAsync(CancellationToken cancellationToken)
{
// Validate required configuration before the execute loop starts.
// Throws immediately at startup — visible in logs, not buried in loop iteration 1.
if (string.IsNullOrWhiteSpace(_options.QueueConnectionString))
throw new InvalidOperationException(
"WorkerOptions.QueueConnectionString is required but not configured.");
if (_options.MaxConcurrency < 1 || _options.MaxConcurrency > 32)
throw new InvalidOperationException(
$"WorkerOptions.MaxConcurrency must be between 1 and 32. Got: {_options.MaxConcurrency}");
// Do not log the raw connection string: it may contain credentials.
_logger.LogInformation(
"Worker validated. MaxConcurrency: {Concurrency}",
_options.MaxConcurrency);
return base.StartAsync(cancellationToken);
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
// Configuration is guaranteed valid here — StartAsync validated it.
// ... execute loop ...
await Task.CompletedTask;
}
}
The startup validation pattern in StartAsync is worth applying to every BackgroundService that depends on external configuration. An invalid connection string or an out-of-range configuration value discovered at iteration one of the work loop produces a log entry buried among normal startup messages, possibly with a confusing inner exception. The same validation in StartAsync throws before ExecuteAsync is called, appears prominently in the startup log sequence, and stops the application immediately with a clear error message — the exact behaviour you want when configuration is wrong. Fail fast, fail loudly, fail before the service has pretended to start successfully.
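A complementary approach worth knowing: Microsoft.Extensions.Options can run this kind of validation for you when the host starts. A sketch, assuming WorkerOptions is bound from a "Worker" configuration section (the section name is illustrative) and annotated with data-annotation attributes:

```csharp
using System.ComponentModel.DataAnnotations;

public class WorkerOptions
{
    [Required]
    public string QueueConnectionString { get; set; } = string.Empty;

    [Range(1, 32)]
    public int MaxConcurrency { get; set; } = 4;
}

// Program.cs — ValidateOnStart (available since .NET 6) runs the validators
// during host startup, failing fast before any hosted service begins executing.
builder.Services.AddOptions<WorkerOptions>()
    .BindConfiguration("Worker")
    .ValidateDataAnnotations()
    .ValidateOnStart();
```

The attribute-based route covers simple shape checks declaratively; keep hand-rolled StartAsync validation for checks that need runtime state, such as probing that the queue is actually reachable.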