Modern C# offers tools that eliminate heap allocations, slice buffers without copying, and process data with CPU vector instructions. If you're building parsers, network protocols, or data pipelines where performance matters, understanding Span<T>, Memory<T>, and System.IO.Pipelines changes what's possible. This isn't theoretical—by the end you'll have built a production-ready log ingestor that processes NDJSON streams with zero-allocation parsing, pooled buffers, and optional SIMD acceleration. We'll cover safety rules the compiler enforces, benchmarking with real numbers, and the production checklist that keeps allocations predictable.
This guide targets .NET 8 LTS with callouts for .NET 9 improvements where relevant. You'll see complete runnable code, not snippets. Each section builds toward the final ingestor, but includes focused micro-examples you can test in isolation. Whether you're optimizing hot paths identified by profiling or architecting high-throughput services from scratch, these patterns deliver measurable results.
Overview & Safety Model
High-performance C# revolves around avoiding heap allocations in hot loops. The garbage collector is fast, but allocating millions of short-lived objects creates pressure. Span<T> solves this by representing contiguous memory—arrays, stack buffers, or native memory—without allocation overhead. It's a ref struct, meaning it lives only on the stack and can't be boxed or stored in fields of regular classes.
🎯
Zero-Copy Slicing
Slice arrays, strings, and buffers without allocating new objects. Span wraps existing memory with length tracking.
🔒
Compiler Safety
ref struct lifetime rules prevent escaping references. You can't accidentally store stack pointers beyond valid scope.
⚡
Pooled Buffers
ArrayPool<T> recycles arrays across requests. Pair with Memory<T> for async-friendly buffer management.
🚀
SIMD Acceleration
Process multiple elements per instruction with Vector<T>. Works across x64, ARM, and other architectures.
Safety Rules You Need to Know
The compiler enforces strict lifetime rules for ref struct types like Span<T>. You can't return them from async methods, store them in heap-allocated objects, or box them into interfaces. This prevents dangling pointers and use-after-free bugs at compile time. If you need to cross async boundaries, use Memory<T> instead—it's a regular struct that wraps a managed reference and works everywhere.
Span Lifetime Rules
// ✅ Valid: Span in synchronous method
public void ProcessSync(ReadOnlySpan<byte> data)
{
    foreach (var b in data)
    {
        Console.Write(b);
    }
}
// ❌ Invalid: Span parameters aren't allowed in async methods
public async Task ProcessAsync(ReadOnlySpan<byte> data) // Compiler error CS4012
{
    await Task.Delay(100);
    // The stack frame could unwind across the await, so the compiler rejects this
}
// ✅ Valid: Memory works with async
public async Task ProcessAsyncCorrect(ReadOnlyMemory<byte> data)
{
    await Task.Delay(100);
    ProcessSync(data.Span); // Convert to Span only when needed
}
Why this restriction? Async methods compile to state machines stored on the heap. If Span<T> pointed to stack memory and you awaited, the stack frame could unwind, leaving a dangling pointer. Memory<T> holds a safe reference that survives across continuations.
GC Interaction Span<T> can wrap managed arrays without pinning them: the runtime tracks the span's interior pointer and updates it if the GC compacts the heap, so the array can still move safely. For unmanaged memory, construct a Span<T> over pointers obtained from Marshal.AllocHGlobal or NativeMemory.Alloc—just ensure the memory outlives the Span.
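The unmanaged case can be sketched as follows. This assumes .NET 6+ for NativeMemory and a project with AllowUnsafeBlocks enabled; the size and fill value are illustrative:

```csharp
using System;
using System.Runtime.InteropServices;

// Wrap native memory in a Span<byte>, use it, then free it.
// The span must never be used after NativeMemory.Free.
static unsafe byte FillNative(int size, byte value)
{
    byte* ptr = (byte*)NativeMemory.Alloc((nuint)size);
    try
    {
        var native = new Span<byte>(ptr, size); // span over unmanaged memory
        native.Fill(value);                     // write through the span
        return native[0];
    }
    finally
    {
        NativeMemory.Free(ptr);
    }
}

Console.WriteLine(FillNative(64, 0xAB)); // 171
```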
Span & ReadOnlySpan Fundamentals
Span<T> gives you a mutable view over contiguous memory. ReadOnlySpan<T> is the read-only variant. Both support slicing—extracting a subsection without copying bytes. This is huge for parsers: you can slice a line from a buffer, then slice tokens from that line, all without allocations.
Creating Spans
Span Creation Patterns
// From array
int[] numbers = { 1, 2, 3, 4, 5 };
Span<int> span = numbers.AsSpan();
// From array with range
Span<int> slice = numbers.AsSpan(1, 3); // [2, 3, 4]
// From stack (covered in stackalloc section)
Span<byte> buffer = stackalloc byte[256];
// From string (ReadOnlySpan<char>)
ReadOnlySpan<char> text = "Hello, World!".AsSpan();
ReadOnlySpan<char> word = text.Slice(0, 5); // "Hello"
Notice AsSpan() returns a view—the original array isn't copied. Changes through the span mutate the underlying array. For strings, you get ReadOnlySpan<char> because strings are immutable.
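A quick way to convince yourself the span is a view rather than a copy:

```csharp
using System;

int[] numbers = { 1, 2, 3, 4, 5 };
Span<int> middle = numbers.AsSpan(1, 3);
middle[0] = 42;                // writes through the view into numbers[1]
Console.WriteLine(numbers[1]); // 42
```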
Slicing Operations
Zero-Copy Slicing
ReadOnlySpan<char> log = "2025-01-13T10:15:30 INFO User logged in".AsSpan();
// Extract date without allocation
ReadOnlySpan<char> date = log.Slice(0, 10); // "2025-01-13"
// Extract level (starts at index 20, length 4)
ReadOnlySpan<char> level = log.Slice(20, 4); // "INFO"
// Extract message (from index 25 to end)
ReadOnlySpan<char> message = log.Slice(25); // "User logged in"
// Use range syntax (C# 8+)
ReadOnlySpan<char> dateRange = log[..10]; // Same as Slice(0, 10)
ReadOnlySpan<char> messageRange = log[25..]; // Same as Slice(25)
Each slice operation returns a new Span that points into the same memory. No string allocations, no copying. This pattern scales to megabyte buffers with the same constant-time performance.
Common Pitfalls
Span Can't Escape Method Scope You can't store Span<T> in a class field or return it from a method that might outlive the underlying buffer. The compiler blocks this. If you need longer-lived references, convert to Memory<T> or copy the data to a regular array with .ToArray().
Don't Do This
public class Parser
{
    private Span<byte> _buffer; // ❌ Compiler error CS8345
    // Span<T> can't be a field of a regular class
}
// ✅ Use Memory<T> for fields
public class FixedParser
{
    private Memory<byte> _buffer; // Valid
}
Practical Example: Tokenizing CSV
CSV Tokenizer with Span
public static List<string> ParseCsvLine(ReadOnlySpan<char> line)
{
    var tokens = new List<string>();
    int start = 0;
    for (int i = 0; i < line.Length; i++)
    {
        if (line[i] == ',')
        {
            tokens.Add(line.Slice(start, i - start).ToString());
            start = i + 1;
        }
    }
    // Add final token
    if (start < line.Length)
    {
        tokens.Add(line.Slice(start).ToString());
    }
    return tokens;
}
// Usage
ReadOnlySpan<char> csv = "Alice,30,Engineer".AsSpan();
var fields = ParseCsvLine(csv);
// ["Alice", "30", "Engineer"]
We slice without allocating intermediate strings until we call .ToString() on the final tokens. For high-throughput scenarios, you'd avoid even those allocations by processing spans directly instead of converting to strings.
Performance Tip Many BCL methods now accept ReadOnlySpan<char> overloads: int.Parse(), Guid.Parse(), DateTime.Parse(), etc. Use these instead of converting spans to strings first.
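A small illustration of those overloads, using a made-up record layout (the delimiter positions and field values are assumptions of this example):

```csharp
using System;
using System.Globalization;

// Parse three fields straight from slices of one span: no substrings.
ReadOnlySpan<char> record = "3f2504e0-4f89-11d3-9a0c-0305e82c3301|2025-01-13|777".AsSpan();
Guid id = Guid.Parse(record[..36]);                                   // 36-char GUID
DateTime day = DateTime.Parse(record[37..47], CultureInfo.InvariantCulture);
int count = int.Parse(record[48..]);
Console.WriteLine($"{id} {day:yyyy-MM-dd} {count}");
```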
Memory<T> & Pooling with IMemoryOwner
When you need to store buffer references beyond a single method or pass them through async calls, Memory<T> is the answer. It's a regular struct that wraps a reference to managed memory—think of it as a safe, GC-friendly pointer. You convert it to Span<T> only when you need to read or write.
Memory<T> Basics
Memory Usage Pattern
public class BufferManager
{
    private Memory<byte> _buffer;
    public void AllocateBuffer(int size)
    {
        _buffer = new byte[size];
    }
    public async Task ProcessAsync()
    {
        await FillBufferAsync(_buffer);
        // Convert to Span when doing actual work
        Span<byte> span = _buffer.Span;
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = (byte)(span[i] ^ 0xFF); // Process data
        }
    }
    private async Task FillBufferAsync(Memory<byte> buffer)
    {
        await Task.Delay(10); // Simulate I/O
        Random.Shared.NextBytes(buffer.Span);
    }
}
Memory<T> stores the reference safely across the await boundary. You access .Span when you need direct memory manipulation, which is always synchronous and stack-confined.
ArrayPool for Zero-Allocation Buffer Reuse
Allocating buffers repeatedly hammers the garbage collector. ArrayPool<T>.Shared maintains a pool of reusable arrays. You rent an array, use it, then return it. This is critical for high-throughput services handling thousands of requests per second.
ArrayPool Pattern
using System.Buffers;
public void ProcessData()
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
    try
    {
        // Rent may return a larger array; slice to the size you need
        Span<byte> span = buffer.AsSpan(0, 4096);
        FillWithData(span);
        ProcessBuffer(span);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
private void FillWithData(Span<byte> buffer)
{
    for (int i = 0; i < buffer.Length; i++)
    {
        buffer[i] = (byte)(i % 256);
    }
}
private void ProcessBuffer(ReadOnlySpan<byte> buffer)
{
    int sum = 0;
    foreach (var b in buffer)
    {
        sum += b;
    }
    Console.WriteLine($"Sum: {sum}");
}
Always use try-finally to ensure you return the buffer even if an exception occurs. Failing to return leaks pooled arrays, eventually exhausting the pool and forcing new allocations.
IMemoryOwner for Automatic Cleanup
IMemoryOwner with using Statement
using System.Buffers;
public void ProcessWithOwner()
{
    using IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(4096);
    Memory<byte> memory = owner.Memory.Slice(0, 4096);
    // Use memory - it's automatically returned when disposed
    Span<byte> span = memory.Span;
    Random.Shared.NextBytes(span);
    Console.WriteLine($"Processed {span.Length} bytes");
} // owner.Dispose() returns buffer to pool
IMemoryOwner<T> implements IDisposable, so you can use using statements for automatic cleanup. This is cleaner than manual try-finally blocks and pairs well with async methods.
Don't Store Pooled References Never keep references to rented arrays beyond the scope where you called Return or Dispose. The pool may hand that same array to another caller, causing data corruption. If you need the data longer, copy it to a new array or keep the IMemoryOwner alive.
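One safe pattern, sketched below: copy the bytes you need into a fresh array before returning the rented one. The ReadAndKeep helper and its sizes are hypothetical:

```csharp
using System;
using System.Buffers;

// Rent, use, copy out, return: the caller owns the returned array outright.
static byte[] ReadAndKeep(int payloadLength)
{
    byte[] rented = ArrayPool<byte>.Shared.Rent(payloadLength);
    try
    {
        Random.Shared.NextBytes(rented.AsSpan(0, payloadLength));
        // Copy only the bytes we need into an array we own
        return rented.AsSpan(0, payloadLength).ToArray();
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented); // safe: we no longer reference it
    }
}

byte[] kept = ReadAndKeep(128);
Console.WriteLine(kept.Length); // 128
```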
Choosing Between Approaches
Span<T>: Synchronous methods, hot loops, no async
Memory<T>: Async methods, storing references, passing buffers around
ArrayPool<T>: Reusable buffers with manual lifetime management
IMemoryOwner<T>: Reusable buffers with automatic disposal
stackalloc & ref struct Patterns
For small, short-lived buffers, stackalloc allocates directly on the stack with zero GC overhead. It's blazing fast but limited by stack size (typically 1 MB per thread). Use it for buffers under 1 KB in methods that aren't deeply recursive.
Basic stackalloc Usage
Stack-Allocated Buffer
public static string ToHexString(ReadOnlySpan<byte> bytes)
{
    // Stack buffer: two hex chars per input byte (keep inputs small)
    Span<char> buffer = stackalloc char[bytes.Length * 2];
    for (int i = 0; i < bytes.Length; i++)
    {
        buffer[i * 2] = GetHexChar(bytes[i] >> 4);
        buffer[i * 2 + 1] = GetHexChar(bytes[i] & 0xF);
    }
    return new string(buffer);
}
private static char GetHexChar(int value) =>
    value < 10 ? (char)('0' + value) : (char)('A' + value - 10);
The buffer vanishes when the method returns. No allocation, no GC, no overhead. Perfect for temporary work buffers in tight loops.
Dynamic Threshold Pattern
What if the size varies? Use a threshold: stackalloc for small sizes, ArrayPool for large ones.
Threshold-Based Allocation
using System.Buffers;
public void ProcessDynamicSize(int size)
{
    const int StackThreshold = 512;
    byte[]? rentedArray = null;
    Span<byte> buffer = size <= StackThreshold
        ? stackalloc byte[StackThreshold]
        : (rentedArray = ArrayPool<byte>.Shared.Rent(size));
    try
    {
        Span<byte> slice = buffer.Slice(0, size);
        // Process buffer
        for (int i = 0; i < slice.Length; i++)
        {
            slice[i] = (byte)(i % 256);
        }
        Console.WriteLine($"Processed {slice.Length} bytes");
    }
    finally
    {
        if (rentedArray != null)
        {
            ArrayPool<byte>.Shared.Return(rentedArray);
        }
    }
}
This pattern gives you stack speed for small buffers and pool efficiency for large ones. The threshold of 512 bytes is conservative—you can go higher (up to ~1 KB) if you're confident about call depth.
Custom ref struct Types
You can write your own ref struct types to encapsulate stack-bound logic. Useful for parsers or state machines that must stay stack-allocated.
Custom ref struct
public ref struct SpanParser
{
    private ReadOnlySpan<char> _data;
    private int _position;
    public SpanParser(ReadOnlySpan<char> data)
    {
        _data = data;
        _position = 0;
    }
    public bool TryReadInt(out int value)
    {
        value = 0;
        int start = _position;
        while (_position < _data.Length && char.IsDigit(_data[_position]))
        {
            _position++;
        }
        if (_position == start) return false;
        return int.TryParse(_data.Slice(start, _position - start), out value);
    }
    public void SkipWhitespace()
    {
        while (_position < _data.Length && char.IsWhiteSpace(_data[_position]))
        {
            _position++;
        }
    }
}
// Usage
ReadOnlySpan<char> input = "42 99 123".AsSpan();
var parser = new SpanParser(input);
if (parser.TryReadInt(out int first))
{
    Console.WriteLine($"First: {first}");
    parser.SkipWhitespace();
    if (parser.TryReadInt(out int second))
    {
        Console.WriteLine($"Second: {second}");
    }
}
SpanParser is a ref struct so it can hold ReadOnlySpan<char> as a field. It can't escape the stack, which is exactly what we want for a parser that points into a temporary buffer.
.NET 9 Enhancement .NET 9 improves stackalloc with better codegen for initialization patterns. The runtime also expands Span support in more BCL APIs. The patterns here work identically on .NET 8, but .NET 9 might show minor performance improvements in benchmarks.
Parsing with Span<char> & Utf8Parser
String parsing traditionally allocates substring after substring. With ReadOnlySpan<char> and methods like int.TryParse(ReadOnlySpan<char>, out int), you can parse without intermediate allocations. For UTF-8 data, Utf8Parser works directly on byte spans.
Parsing Numbers from Spans
Zero-Allocation Number Parsing
public static bool TryParseLogLine(ReadOnlySpan<char> line, out int level, out DateTime timestamp)
{
    level = 0;
    timestamp = default;
    // Expected format: "2025-01-13T10:15:30 INFO 42 Some message"
    // Parse timestamp (first 19 chars); require enough room for the level too
    if (line.Length < 26 || !DateTime.TryParse(line.Slice(0, 19), out timestamp))
    {
        return false;
    }
    // Skip to level number (after "INFO ")
    int levelStart = 25;
    int levelEnd = line.Slice(levelStart).IndexOf(' ');
    if (levelEnd == -1) return false;
    return int.TryParse(line.Slice(levelStart, levelEnd), out level);
}
// Usage
ReadOnlySpan<char> log = "2025-01-13T10:15:30 INFO 42 Request completed".AsSpan();
if (TryParseLogLine(log, out int level, out DateTime ts))
{
    Console.WriteLine($"Level {level} at {ts}");
}
No substring allocations. DateTime.TryParse and int.TryParse both have ReadOnlySpan<char> overloads that parse in place.
UTF-8 Parsing with Utf8Parser
Many protocols (HTTP, JSON, Protobuf) work with UTF-8 bytes. System.Buffers.Text.Utf8Parser parses integers, floats, dates, and guids directly from ReadOnlySpan<byte>.
UTF-8 Byte Parsing
using System.Buffers.Text;
public static bool TryParseUtf8Int(ReadOnlySpan<byte> utf8Bytes, out int value)
{
    return Utf8Parser.TryParse(utf8Bytes, out value, out int bytesConsumed);
}
// Example: Parse "12345" as UTF-8 bytes
byte[] data = "12345"u8.ToArray();
if (TryParseUtf8Int(data, out int result))
{
    Console.WriteLine($"Parsed: {result}"); // 12345
}
// Parse ISO 8601 date
byte[] dateBytes = "2025-01-13T10:15:30Z"u8.ToArray();
if (Utf8Parser.TryParse(dateBytes, out DateTime dt, out _))
{
    Console.WriteLine($"Date: {dt:O}");
}
Utf8Parser is faster than converting to string first because it avoids UTF-8 to UTF-16 transcoding. If you're reading from a network stream or file, you already have UTF-8 bytes—parse them directly.
Tokenizing with IndexOf and Slicing
Tokenize CSV with Span
public static void ParseCsvRow(ReadOnlySpan<char> row, Span<int> destination)
{
    int fieldIndex = 0;
    int start = 0;
    while (start < row.Length && fieldIndex < destination.Length)
    {
        int commaIndex = row.Slice(start).IndexOf(',');
        int end = commaIndex == -1 ? row.Length : start + commaIndex;
        ReadOnlySpan<char> field = row.Slice(start, end - start);
        if (int.TryParse(field, out int value))
        {
            destination[fieldIndex++] = value;
        }
        start = end + 1;
    }
}
// Usage
ReadOnlySpan<char> csv = "10,20,30,40,50".AsSpan();
Span<int> values = stackalloc int[5];
ParseCsvRow(csv, values);
foreach (var v in values)
{
    Console.WriteLine(v);
}
We allocate nothing except the stack buffer for results. For high-frequency parsing (thousands of rows per second), this makes a measurable difference.
Culture-Aware Parsing Use overloads that accept NumberStyles and IFormatProvider when parsing user input. For logs and protocols you control, invariant culture is faster: int.TryParse(span, NumberStyles.Integer, CultureInfo.InvariantCulture, out result).
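For example, parsing with explicit invariant culture (the values here are illustrative):

```csharp
using System;
using System.Globalization;

// Invariant-culture span parsing, the fast path for machine-generated formats
ReadOnlySpan<char> intField = "-1234".AsSpan();
int.TryParse(intField, NumberStyles.Integer, CultureInfo.InvariantCulture, out int parsed);

// Invariant culture always expects '.' as the decimal separator,
// regardless of the host machine's regional settings
ReadOnlySpan<char> price = "3.14".AsSpan();
double.TryParse(price, NumberStyles.Float, CultureInfo.InvariantCulture, out double value);
Console.WriteLine($"{parsed} {value}");
```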
MemoryMarshal & Unsafe: When and How
System.Runtime.InteropServices.MemoryMarshal and System.Runtime.CompilerServices.Unsafe bypass safety checks for extreme performance. Use them only when profiling proves it's necessary and you understand the risks. Common scenarios: reinterpreting byte arrays as value types, or skipping bounds checks in tight loops.
MemoryMarshal.Cast for Type Reinterpretation
Reinterpret Bytes as Integers
using System.Runtime.InteropServices;
public static void ProcessBinaryData(ReadOnlySpan<byte> bytes)
{
    // Reinterpret bytes as int32 values (4 bytes per int)
    ReadOnlySpan<int> ints = MemoryMarshal.Cast<byte, int>(bytes);
    foreach (var value in ints)
    {
        Console.WriteLine(value);
    }
}
// Example: 8 bytes = 2 ints
byte[] data = { 0x01, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00 };
ProcessBinaryData(data);
// Output: 1, 2 (little-endian)
MemoryMarshal.Cast reinterprets memory without copying. Any leftover bytes that don't fill a whole element are silently dropped from the resulting span, so ensure the byte length is a multiple of the target type's size if every byte matters.
Endianness Matters MemoryMarshal.Cast assumes the platform's native byte order. On little-endian systems (x64, ARM), multi-byte values are stored least-significant byte first. If you're reading big-endian data (network protocols), use BinaryPrimitives.ReadInt32BigEndian instead.
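A minimal sketch of the big-endian case using BinaryPrimitives; the 4-byte packet is made up for illustration:

```csharp
using System;
using System.Buffers.Binary;

// Network protocols send multi-byte integers big-endian (most significant byte first)
byte[] packet = { 0x00, 0x00, 0x00, 0x2A }; // the value 42 in network order

// Native-order read gives the wrong answer on little-endian hosts
int nativeOrder = BitConverter.ToInt32(packet);
// Explicit big-endian read is correct on every platform
int networkOrder = BinaryPrimitives.ReadInt32BigEndian(packet);
Console.WriteLine($"{nativeOrder} vs {networkOrder}");
```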
When to Use Unsafe
Unsafe methods skip bounds checking and can manipulate raw pointers. Useful in inner loops where bounds checks dominate runtime, but dangerous because you can corrupt memory if you miscalculate indices.
Unsafe Loop Optimization
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
public static void UnsafeFillArray(Span<int> span, int value)
{
    ref int ptr = ref MemoryMarshal.GetReference(span);
    for (int i = 0; i < span.Length; i++)
    {
        Unsafe.Add(ref ptr, i) = value;
    }
}
// Usage
Span<int> buffer = stackalloc int[100];
UnsafeFillArray(buffer, 42);
Console.WriteLine(buffer[50]); // 42
This bypasses array bounds checks. The JIT can also vectorize such loops more aggressively. However, modern JIT compilers (RyuJIT in .NET 8) already eliminate bounds checks in many cases when it can prove safety, so profile before going unsafe.
Practical Example: Summing Integers
Safe Sum with Bounds Checks
public static long SafeSum(ReadOnlySpan<int> values)
{
    long sum = 0;
    for (int i = 0; i < values.Length; i++)
    {
        sum += values[i]; // Bounds check on each access
    }
    return sum;
}
Unsafe Sum (No Bounds Checks)
public static long UnsafeSum(ReadOnlySpan<int> values)
{
    long sum = 0;
    ref int ptr = ref MemoryMarshal.GetReference(values);
    for (int i = 0; i < values.Length; i++)
    {
        sum += Unsafe.Add(ref ptr, i);
    }
    return sum;
}
In .NET 8, the JIT often optimizes both versions identically. Benchmark before deciding unsafe is worth the risk. Exceptions: SIMD loops and interop scenarios where you're already working with unmanaged memory.
When Unsafe is Justified Platform interop (P/Invoke with fixed buffers), custom SIMD implementations not covered by Vector<T>, and hot paths identified by profiling where the JIT demonstrably fails to elide bounds checks. Always validate with BenchmarkDotNet before shipping unsafe code.
SIMD Quick Wins with Vectorization
SIMD (Single Instruction, Multiple Data) lets you process multiple elements per CPU instruction. Modern CPUs support 128-bit (SSE), 256-bit (AVX2), and 512-bit (AVX-512) vector operations. .NET exposes these via System.Numerics.Vector<T> (portable) and System.Runtime.Intrinsics (explicit intrinsics).
Portable SIMD with Vector<T>
Start with Vector<T>. It adapts to the CPU's vector width automatically—16 bytes on SSE2, 32 bytes on AVX2, potentially 64 bytes on AVX-512. The JIT generates appropriate instructions.
Portable Vector Sum
using System.Numerics;
public static int VectorSum(ReadOnlySpan<int> values)
{
    int vectorSize = Vector<int>.Count; // e.g., 8 on AVX2 (256-bit / 32-bit int)
    int sum = 0;
    int i = 0;
    // Process full vectors
    for (; i <= values.Length - vectorSize; i += vectorSize)
    {
        var vector = new Vector<int>(values.Slice(i, vectorSize));
        sum += Vector.Sum(vector);
    }
    // Process remaining elements
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}
// Usage
int[] data = Enumerable.Range(1, 1000).ToArray();
int total = VectorSum(data);
Console.WriteLine($"Sum: {total}");
The vector loop processes 4–16 integers per iteration (depending on CPU). The scalar tail loop handles leftovers. On AVX2 hardware, this can be 4x faster than the scalar version.
SIMD for Delimiter Scanning
A common pattern: scanning a buffer for a delimiter character (newline, comma, null byte). SIMD can check 16–32 bytes at once.
SIMD Newline Scanner
using System.Numerics;
public static int FindNewline(ReadOnlySpan<byte> buffer)
{
    byte newline = (byte)'\n';
    int vectorSize = Vector<byte>.Count; // e.g., 32 on AVX2
    int i = 0;
    for (; i <= buffer.Length - vectorSize; i += vectorSize)
    {
        var vector = new Vector<byte>(buffer.Slice(i, vectorSize));
        var matches = Vector.Equals(vector, new Vector<byte>(newline));
        if (matches != Vector<byte>.Zero)
        {
            // Found a match in this vector; locate the exact index
            for (int j = 0; j < vectorSize; j++)
            {
                if (buffer[i + j] == newline)
                {
                    return i + j;
                }
            }
        }
    }
    // Scalar tail
    for (; i < buffer.Length; i++)
    {
        if (buffer[i] == newline) return i;
    }
    return -1;
}
// Test
byte[] data = "Hello\nWorld"u8.ToArray();
int index = FindNewline(data);
Console.WriteLine($"Newline at index: {index}"); // 5
Vector.Equals compares all bytes in one instruction. If any match, we scan that vector for the exact index. This scales to megabyte buffers where scanning byte-by-byte becomes a bottleneck.
Explicit Intrinsics for Control
When you need specific CPU instructions, use System.Runtime.Intrinsics. Guard with IsSupported checks to avoid crashes on older hardware.
Intrinsics with Fallback
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
public static int SumWithIntrinsics(ReadOnlySpan<int> values)
{
    if (Avx2.IsSupported && values.Length >= 8)
    {
        return SumAvx2(values);
    }
    else if (Sse2.IsSupported && values.Length >= 4)
    {
        return SumSse2(values);
    }
    else
    {
        return SumScalar(values);
    }
}
private static int SumAvx2(ReadOnlySpan<int> values)
{
    // 256-bit vectors (8 ints)
    Vector256<int> sum256 = Vector256<int>.Zero;
    int i = 0;
    for (; i <= values.Length - 8; i += 8)
    {
        var vec = Vector256.Create(values.Slice(i, 8));
        sum256 = Avx2.Add(sum256, vec);
    }
    // Horizontal sum
    int sum = 0;
    for (int j = 0; j < 8; j++)
    {
        sum += sum256.GetElement(j);
    }
    // Tail
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}
private static int SumSse2(ReadOnlySpan<int> values)
{
    // 128-bit vectors (4 ints)
    Vector128<int> sum128 = Vector128<int>.Zero;
    int i = 0;
    for (; i <= values.Length - 4; i += 4)
    {
        var vec = Vector128.Create(values.Slice(i, 4));
        sum128 = Sse2.Add(sum128, vec);
    }
    int sum = 0;
    for (int j = 0; j < 4; j++)
    {
        sum += sum128.GetElement(j);
    }
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}
private static int SumScalar(ReadOnlySpan<int> values)
{
    int sum = 0;
    foreach (var v in values)
    {
        sum += v;
    }
    return sum;
}
This provides AVX2, SSE2, and scalar fallbacks. Ship this and it'll use the best path available at runtime.
Optional (AVX-512) .NET 8 introduced Vector512<T> and the Avx512F intrinsics; .NET 9 widens their use inside the BCL. The patterns are identical—just check Avx512F.IsSupported and process 64-byte vectors. Most servers don't have AVX-512 yet, so treat it as an extra fast path, not a requirement.
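A hedged sketch of what that extra fast path could look like, using Avx512F and the Vector512 helpers available since .NET 8 (the SumAvx512 name and the scalar fallback are this example's inventions):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// AVX-512 fast path layered on the same pattern as the AVX2/SSE2 versions.
// Guarded by IsSupported; most current hardware falls through to the scalar path.
static int SumAvx512(ReadOnlySpan<int> values)
{
    if (!Avx512F.IsSupported || values.Length < 16)
        return SumScalarFallback(values);

    Vector512<int> acc = Vector512<int>.Zero;
    int i = 0;
    for (; i <= values.Length - 16; i += 16) // 16 ints per 512-bit vector
    {
        acc = Avx512F.Add(acc, Vector512.Create(values.Slice(i, 16)));
    }
    int sum = Vector512.Sum(acc);            // horizontal sum helper
    for (; i < values.Length; i++)           // scalar tail
    {
        sum += values[i];
    }
    return sum;
}

static int SumScalarFallback(ReadOnlySpan<int> values)
{
    int sum = 0;
    foreach (var v in values) sum += v;
    return sum;
}

Console.WriteLine(SumAvx512(new[] { 1, 2, 3, 4 })); // 10
```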
When SIMD Isn't Worth It
Small arrays (under 100 elements): setup cost dominates
Complex per-element logic: SIMD shines on arithmetic and comparisons, not branching
Profile first. Many "slow" loops are bottlenecked by cache misses or allocations, not instruction throughput. Fix those before reaching for SIMD.
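One low-ceremony way to check a suspected hot path for allocations before reaching for SIMD is to bracket it with GC.GetAllocatedBytesForCurrentThread (a sketch; the parsing snippet under test is illustrative):

```csharp
using System;

// Measure managed bytes allocated by the code between the two calls
long before = GC.GetAllocatedBytesForCurrentThread();

ReadOnlySpan<char> span = "10,20,30".AsSpan();
int comma = span.IndexOf(',');
int.TryParse(span[..comma], out int first);

long after = GC.GetAllocatedBytesForCurrentThread();
Console.WriteLine($"Allocated: {after - before} bytes, first = {first}");
```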
System.IO.Pipelines Primer
System.IO.Pipelines is a high-performance I/O abstraction that decouples reading and writing with built-in backpressure. It's what powers Kestrel's HTTP parser. You get a PipeReader for consuming data and a PipeWriter for producing it, both working with ReadOnlySequence<byte>—a type that handles fragmented buffers.
Why Pipelines?
Traditional streams require you to manage buffers, handle partial reads, and coordinate async cancellation. Pipelines do this automatically: the writer fills buffers, the reader consumes them, and the pipe handles flow control. Perfect for network protocols, file parsing, or any producer-consumer scenario.
Basic PipeReader Example
Reading Lines from a Pipe
using System.IO.Pipelines;
using System.Text;
public static async Task ReadLinesAsync(PipeReader reader)
{
    while (true)
    {
        ReadResult result = await reader.ReadAsync();
        ReadOnlySequence<byte> buffer = result.Buffer;
        // Process lines in buffer
        while (TryReadLine(ref buffer, out ReadOnlySequence<byte> line))
        {
            ProcessLine(line);
        }
        // Tell the pipe how much we consumed
        reader.AdvanceTo(buffer.Start, buffer.End);
        if (result.IsCompleted) break;
    }
    await reader.CompleteAsync();
}
private static bool TryReadLine(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
{
    SequencePosition? position = buffer.PositionOf((byte)'\n');
    if (position == null)
    {
        line = default;
        return false;
    }
    line = buffer.Slice(0, position.Value);
    buffer = buffer.Slice(buffer.GetPosition(1, position.Value));
    return true;
}
private static void ProcessLine(ReadOnlySequence<byte> line)
{
    string text = Encoding.UTF8.GetString(line);
    Console.WriteLine($"Line: {text}");
}
ReadAsync returns when data is available or the writer completes. AdvanceTo tells the pipe what you consumed and what you examined. This enables intelligent backpressure: if you don't consume data, the writer pauses instead of buffering unbounded memory.
PipeWriter for Producing Data
Writing to a Pipe
public static async Task WriteDataAsync(PipeWriter writer)
{
    for (int i = 0; i < 100; i++)
    {
        // Get memory from the pipe
        Memory<byte> memory = writer.GetMemory(256);
        // Write data
        string message = $"Message {i}\n";
        int bytesWritten = Encoding.UTF8.GetBytes(message, memory.Span);
        // Advance the writer
        writer.Advance(bytesWritten);
        // Flush periodically
        await writer.FlushAsync();
    }
    await writer.CompleteAsync();
}
GetMemory requests a buffer from the pipe (it may be pooled). You write to it, call Advance to commit the bytes, then FlushAsync to make them available to the reader.
Backpressure in Action
If the reader lags behind, FlushAsync pauses until buffer space frees up. This prevents unbounded memory growth. Contrast with BufferedStream or MemoryStream, which allocate until you run out of memory.
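If you need different thresholds than the defaults, Pipe accepts a PipeOptions; a minimal sketch (the byte counts are illustrative, not recommendations):

```csharp
using System.IO.Pipelines;

// Writer pauses in FlushAsync once 64 KB sits unread in the pipe,
// and resumes when the reader drains the backlog below 16 KB.
var pipe = new Pipe(new PipeOptions(
    pauseWriterThreshold: 64 * 1024,
    resumeWriterThreshold: 16 * 1024));
```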
Cancellation Support Pass a CancellationToken to ReadAsync and FlushAsync. When cancelled, the pipe raises OperationCanceledException and cleans up resources. No need for manual cleanup logic in your reader/writer loops.
Creating a Pipe
Pipe Setup
var pipe = new Pipe();
// Start reader and writer concurrently
Task writerTask = WriteDataAsync(pipe.Writer);
Task readerTask = ReadLinesAsync(pipe.Reader);
await Task.WhenAll(writerTask, readerTask);
Pipe creates a producer-consumer channel. Writer and reader run independently, coordinated by the pipe's internal buffer management.
Build the Log/CSV Ingestor
Now we assemble everything into a complete log ingestor. It reads NDJSON (newline-delimited JSON) or CSV from a stream, parses fields with Span<T>, uses ArrayPool for buffers, optionally scans delimiters with SIMD, and exposes metrics via an ASP.NET Core endpoint.
Core Ingestor with Pipelines
LogIngestor.cs
using System.Buffers;
using System.Buffers.Text;
using System.IO.Pipelines;
public class LogIngestor
{
    private long _linesProcessed;
    public long LinesProcessed => _linesProcessed;
    public async Task IngestAsync(Stream stream, CancellationToken ct = default)
    {
        var reader = PipeReader.Create(stream);
        while (!ct.IsCancellationRequested)
        {
            ReadResult result = await reader.ReadAsync(ct);
            ReadOnlySequence<byte> buffer = result.Buffer;
            while (TryReadLine(ref buffer, out ReadOnlySequence<byte> line))
            {
                ProcessLine(line);
                Interlocked.Increment(ref _linesProcessed);
            }
            reader.AdvanceTo(buffer.Start, buffer.End);
            if (result.IsCompleted) break;
        }
        await reader.CompleteAsync();
    }
    private bool TryReadLine(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
    {
        SequencePosition? position = buffer.PositionOf((byte)'\n');
        if (position == null)
        {
            line = default;
            return false;
        }
        line = buffer.Slice(0, position.Value);
        buffer = buffer.Slice(buffer.GetPosition(1, position.Value));
        return true;
    }
    private void ProcessLine(ReadOnlySequence<byte> line)
    {
        // Parse CSV: timestamp,level,message
        if (line.Length > 512) return; // skip oversized lines (or rent from ArrayPool instead)
        Span<byte> buffer = stackalloc byte[(int)line.Length];
        line.CopyTo(buffer);
        ReadOnlySpan<byte> span = buffer;
        // Find first comma
        int firstComma = span.IndexOf((byte)',');
        if (firstComma == -1) return;
        // Parse timestamp
        ReadOnlySpan<byte> timestampBytes = span.Slice(0, firstComma);
        // (For demo, skip parsing—just count)
        // Find second comma
        int secondComma = span.Slice(firstComma + 1).IndexOf((byte)',');
        if (secondComma == -1) return;
        // Parse level
        ReadOnlySpan<byte> levelBytes = span.Slice(firstComma + 1, secondComma);
        if (Utf8Parser.TryParse(levelBytes, out int level, out _))
        {
            // Process level (e.g., filter, aggregate)
        }
        // Message is the remainder
        ReadOnlySpan<byte> message = span.Slice(firstComma + secondComma + 2);
        // (Process message as needed)
    }
}
This ingestor reads lines via PipeReader, uses stackalloc for small buffers, and parses with Utf8Parser. Zero allocations per line in the common case (lines under 512 bytes).
Adding SIMD Delimiter Scanning
Replace PositionOf with a SIMD scanner for larger files where newline scanning dominates.
SIMD Newline Finder
using System.Numerics;
private bool TryReadLineSIMD(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
{
    if (buffer.IsSingleSegment)
    {
        ReadOnlySpan<byte> span = buffer.FirstSpan;
        int index = FindNewlineSIMD(span);
        if (index == -1)
        {
            line = default;
            return false;
        }
        line = buffer.Slice(0, index);
        buffer = buffer.Slice(buffer.GetPosition(index + 1));
        return true;
    }
    // Fallback for multi-segment buffers
    SequencePosition? position = buffer.PositionOf((byte)'\n');
    if (position == null)
    {
        line = default;
        return false;
    }
    line = buffer.Slice(0, position.Value);
    buffer = buffer.Slice(buffer.GetPosition(1, position.Value));
    return true;
}
private int FindNewlineSIMD(ReadOnlySpan<byte> span)
{
    byte newline = (byte)'\n';
    int vectorSize = Vector<byte>.Count;
    int i = 0;
    for (; i <= span.Length - vectorSize; i += vectorSize)
    {
        var vector = new Vector<byte>(span.Slice(i, vectorSize));
        var matches = Vector.Equals(vector, new Vector<byte>(newline));
        if (matches != Vector<byte>.Zero)
        {
            for (int j = 0; j < vectorSize; j++)
            {
                if (span[i + j] == newline) return i + j;
            }
        }
    }
    // Scalar tail
    for (; i < span.Length; i++)
    {
        if (span[i] == newline) return i;
    }
    return -1;
}
On AVX2 hardware, this scans 32 bytes per iteration instead of 1. For multi-megabyte logs, this cuts parsing time significantly.
Console Runner
Program.cs
using System.Diagnostics;
class Program
{
    static async Task Main(string[] args)
    {
        string filePath = args.Length > 0 ? args[0] : "sample.csv";
        var ingestor = new LogIngestor();
        var sw = Stopwatch.StartNew();
        await using var stream = File.OpenRead(filePath);
        await ingestor.IngestAsync(stream);
        sw.Stop();
        Console.WriteLine($"Processed {ingestor.LinesProcessed:N0} lines in {sw.ElapsedMilliseconds} ms");
        Console.WriteLine($"Throughput: {ingestor.LinesProcessed / sw.Elapsed.TotalSeconds:N0} lines/sec");
    }
}
Run this on a 1 GB log file and watch it process millions of lines with minimal allocations. Measure with dotnet-counters to verify GC pressure stays low.
Optional: ASP.NET Core Metrics Endpoint
WebHost.cs
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
public static class WebHost
{
    public static void AddMetricsEndpoint(WebApplication app, LogIngestor ingestor)
    {
        app.MapGet("/metrics", () =>
        {
            return Results.Ok(new
            {
                LinesProcessed = ingestor.LinesProcessed,
                Timestamp = DateTime.UtcNow
            });
        });
    }
}
// In Program.cs (web mode)
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();
var ingestor = new LogIngestor();
// Start ingestor in background
_ = Task.Run(async () =>
{
    await using var stream = File.OpenRead("sample.csv");
    await ingestor.IngestAsync(stream);
});
WebHost.AddMetricsEndpoint(app, ingestor);
app.Run();
Now you can query GET /metrics to see live ingestion progress. This pattern extends to Prometheus exporters, health checks, or OpenTelemetry spans.
Complete Code A full runnable project with sample data, tests, and benchmarks is available at github.com/dotnet-guide-com/tutorials under csharp-performance/log-ingestor.
Benchmarks & Diagnostics
Speculation about performance is worthless. Measure. BenchmarkDotNet is the standard for micro-benchmarks in .NET. It handles warmup, statistical analysis, and memory diagnostics automatically.
Setting Up BenchmarkDotNet
Install BenchmarkDotNet
dotnet add package BenchmarkDotNet
Benchmark Example
using System.Buffers.Text;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[MemoryDiagnoser]
public class ParsingBenchmarks
{
private readonly byte[] _csvLine = "2025-01-13,42,Sample message"u8.ToArray();
[Benchmark(Baseline = true)]
public void ParseWithString()
{
string line = Encoding.UTF8.GetString(_csvLine);
string[] parts = line.Split(',');
int level = int.Parse(parts[1]);
}
[Benchmark]
public void ParseWithSpan()
{
ReadOnlySpan<byte> span = _csvLine;
int firstComma = span.IndexOf((byte)',');
int secondComma = span.Slice(firstComma + 1).IndexOf((byte)',');
ReadOnlySpan<byte> levelBytes = span.Slice(firstComma + 1, secondComma);
Utf8Parser.TryParse(levelBytes, out int level, out _);
}
}
class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<ParsingBenchmarks>();
}
}
Run with dotnet run -c Release. BenchmarkDotNet outputs mean time, allocations, and statistical confidence intervals. Typical results: the Span version is 3–5x faster and allocates zero bytes.
Analyzing Results
Sample BenchmarkDotNet Output
| Method | Mean | Allocated |
|----------------- |---------:|----------:|
| ParseWithString | 245.3 ns | 152 B |
| ParseWithSpan | 48.7 ns | 0 B |
The Span version eliminates 152 bytes of allocations per invocation. Multiply by millions of log lines and you're saving gigabytes of GC pressure.
Low allocation rate (under 10 MB/sec) and infrequent Gen 2 collections indicate healthy memory usage. If you see Gen 2 collections every second, you're allocating too much—revisit your buffer pooling.
A CPU trace (for example, from dotnet-trace) identifies hot methods. If TryReadLine consumes 80% of CPU time, that's your optimization target. Add SIMD scanning there, re-profile, and verify the improvement.
Profiling Workflow 1. Identify slow scenario (e.g., "ingesting 1M lines takes 10 seconds"). 2. Profile with dotnet-trace to find hot methods. 3. Benchmark isolated hot path with BenchmarkDotNet. 4. Optimize (Span, SIMD, pooling). 5. Verify improvement with benchmark and end-to-end test. 6. Ship.
Production Checklist
High-performance code in production requires more than fast benchmarks. Here's what to verify before deploying.
Allocation Hygiene
✅ All hot paths use Span<T> or Memory<T> instead of string operations
✅ ArrayPool buffers are always returned in finally blocks
✅ No accidental boxing (check with [MemoryDiagnoser])
✅ stackalloc sizes stay under 1 KB
✅ No large object heap (LOH) allocations in loops (arrays > 85 KB)
Safety Checks
✅ No Span<T> crossing async boundaries
✅ Unsafe code isolated to reviewed methods with comments justifying use
✅ MemoryMarshal.Cast used only on aligned, size-compatible types
✅ SIMD intrinsics guarded by IsSupported checks with scalar fallbacks
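The MemoryMarshal.Cast item above deserves a concrete illustration. This sketch reinterprets a byte span as uints for word-at-a-time processing; note that the trailing bytes that don't fill a whole uint are not covered by the cast and must be handled separately:

```csharp
using System.Runtime.InteropServices;

static class CastExample
{
    public static int CountZeroWords(ReadOnlySpan<byte> bytes)
    {
        // Reinterprets the same memory as uints: no copy, length becomes bytes.Length / 4.
        ReadOnlySpan<uint> words = MemoryMarshal.Cast<byte, uint>(bytes);
        int zeros = 0;
        foreach (uint w in words)
            if (w == 0) zeros++;
        // Caveat: bytes.Length % 4 trailing bytes are not visible through 'words'.
        return zeros;
    }
}
```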
Exception Handling
Don't Throw in Hot Paths Exceptions allocate and unwind the stack. Use TryParse patterns and return error codes instead of throwing in loops that run millions of times. Reserve exceptions for truly exceptional cases (network failures, file corruption).
Error Handling Pattern
// ❌ Throws on every parse failure
public int ParseLevel(ReadOnlySpan<byte> bytes)
{
if (!Utf8Parser.TryParse(bytes, out int level, out _))
{
throw new FormatException("Invalid level");
}
return level;
}
// ✅ Returns error indicator
public bool TryParseLevel(ReadOnlySpan<byte> bytes, out int level)
{
return Utf8Parser.TryParse(bytes, out level, out _);
}
Culture Invariance
By default, parsing APIs use CultureInfo.CurrentCulture, which is appropriate for user input. Log and protocol parsing, however, should use CultureInfo.InvariantCulture for deterministic behavior across locales.
Invariant Parsing
// Use invariant culture for protocols
if (int.TryParse(span, NumberStyles.Integer, CultureInfo.InvariantCulture, out int value))
{
// Process value
}
Testing
Fuzz testing: feed random and malformed inputs to catch edge cases
Regression benchmarks: run BenchmarkDotNet in CI to catch performance regressions
CI Integration Run BenchmarkDotNet in CI with [SimpleJob] for faster execution. Compare results against baseline. Fail the build if allocations increase or throughput drops by >10%. This catches accidental perf regressions before they ship.
FAQ & Next Steps
When should I use Span<T> vs Memory<T>?
Use Span<T> for synchronous operations within a single method scope—it's a ref struct that can't cross async boundaries. Use Memory<T> when you need to store references across await points or return them from methods, as it's a regular struct that works everywhere. Convert Memory<T> to Span<T> via .Span when you need to process the data.
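A minimal sketch of the rule above (names are illustrative): hold a Memory<byte> across the await, and convert to Span<byte> only inside the synchronous parsing step.

```csharp
using System.Buffers;

static class SpanVsMemory
{
    public static async Task<int> ReadAndParseAsync(Stream stream)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            Memory<byte> buffer = rented;           // Memory<T> may cross awaits
            int read = await stream.ReadAsync(buffer);
            return Parse(buffer.Span[..read]);      // Span<T> only in sync code
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Placeholder for real parsing work on the span.
    private static int Parse(ReadOnlySpan<byte> data) => data.Length;
}
```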
Is SIMD worth the complexity for most applications?
Only if profiling shows hot loops processing arrays or buffers. Start with Vector<T> for portable code. If that's insufficient, use Vector128/Vector256 with IsSupported guards. Most apps won't need SIMD—focus on reducing allocations first with Span<T> and ArrayPool. SIMD delivers 2–8x speedups in tight numeric loops, but allocation elimination often yields bigger wins.
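The IsSupported guard mentioned above looks like this in practice (a sketch assuming the .NET 7+ Vector128 APIs; the scalar loop doubles as both the tail handler and the full fallback):

```csharp
using System.Runtime.Intrinsics;

static class ContainsZero
{
    public static bool Run(ReadOnlySpan<byte> data)
    {
        int i = 0;
        // Hardware-accelerated path, guarded so a scalar fallback always exists.
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
            {
                var v = Vector128.Create(data.Slice(i, Vector128<byte>.Count));
                if (Vector128.EqualsAny(v, Vector128<byte>.Zero))
                    return true;
            }
        }
        // Scalar tail (and complete fallback on unsupported hardware).
        for (; i < data.Length; i++)
            if (data[i] == 0) return true;
        return false;
    }
}
```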
How do I avoid memory leaks with ArrayPool?
Always pair Rent with Return in a try-finally block. Never store references to pooled arrays beyond the scope where you called Return. Consider using IMemoryOwner<T> from MemoryPool which implements IDisposable for automatic cleanup with using statements.
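The IMemoryOwner<T> approach mentioned above ties the pooled buffer's lifetime to a using scope, so the buffer returns to the pool even when an exception unwinds the method:

```csharp
using System.Buffers;

static class PoolOwnership
{
    public static async Task CopyAsync(Stream source, Stream destination)
    {
        // Dispose returns the buffer to the pool automatically.
        using IMemoryOwner<byte> owner = MemoryPool<byte>.Shared.Rent(8192);
        Memory<byte> buffer = owner.Memory;
        int read;
        while ((read = await source.ReadAsync(buffer)) > 0)
        {
            await destination.WriteAsync(buffer[..read]);
        }
    }
}
```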
Can I use System.IO.Pipelines with ASP.NET Core?
Yes. Pipelines power Kestrel internally. You can expose PipeReader from HttpRequest.BodyReader or wrap custom streams. Useful for streaming large uploads, protocol parsers, or real-time data processing endpoints without buffering entire payloads. See the ingestor example for a pattern you can adapt to web endpoints.
What's the largest safe size for stackalloc?
Keep stackalloc buffers under 1 KB. Larger allocations risk stack overflow, especially in recursive or deeply nested calls. For dynamic sizes, use ArrayPool<T>.Shared.Rent with a threshold check to decide between stack and pool. The default stack size is 1 MB per thread on most platforms, so leave headroom for other frames.
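The threshold pattern described above, sketched with the 1 KB cutoff used throughout this guide:

```csharp
using System.Buffers;

static class StackOrPool
{
    public static void Process(int size)
    {
        byte[]? rented = null;
        // Small buffers live on the stack; larger ones come from the pool.
        Span<byte> buffer = size <= 1024
            ? stackalloc byte[1024]
            : (rented = ArrayPool<byte>.Shared.Rent(size));
        try
        {
            Fill(buffer[..size]);
        }
        finally
        {
            if (rented is not null)
                ArrayPool<byte>.Shared.Return(rented);
        }
    }

    // Placeholder for the real work done on the buffer.
    private static void Fill(Span<byte> span) => span.Clear();
}
```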
Next Steps
You've built a production-ready log ingestor with zero-allocation parsing, SIMD scanning, and Pipelines. Here's where to go deeper:
Extend the ingestor: Add JSON parsing with System.Text.Json's Utf8JsonReader (also span-based)
Integrate OpenTelemetry: Add spans to track parsing latency and throughput
Distribute processing: Use System.Threading.Channels to parallelize line processing across cores
Add compression: Wrap streams with GZipStream or BrotliStream (both work with PipeReader)