Handling Character Encoding and Data Transformation in .NET

Preventing Data Corruption

If you've ever opened a file and seen question marks or strange symbols instead of readable text, you've encountered encoding problems. A customer uploads their data with accented characters, but your system saves it incorrectly. When they download it back, names like "José" become "Jos?" and complaints start rolling in. The issue isn't your storage or network layer. You're using the wrong character encoding.

.NET's Encoding classes solve this by handling the conversion between strings and bytes correctly. When you save text to files, send it over networks, or store it in databases, you control exactly how characters map to bytes. Using the right encoding prevents corruption and ensures international users see their data correctly.

You'll learn how different encodings work, when to use UTF-8 versus UTF-16, how to transform data with Base64, and how to avoid common pitfalls that corrupt text. By the end, you'll handle text data confidently across file systems, APIs, and databases.

Character Encoding Fundamentals

Character encoding maps characters to byte sequences. ASCII uses one byte per character and covers only 128 characters: unaccented English letters, digits, and basic punctuation. UTF-8 uses 1-4 bytes per character and supports all Unicode characters, making it ideal for international text. UTF-16 uses 2 or 4 bytes and is what .NET uses internally for string storage. Choosing the right encoding depends on your data and where it goes.

When you write a string to a file or network stream, you must convert it to bytes using an encoding. When you read bytes back, you must decode them with the same encoding. Mismatched encodings produce garbage characters or data loss. The Encoding class provides methods for these conversions.

Program.cs
using System;
using System.Text;

var text = "Hello, World! 你好世界 🌍";

// UTF-8 encoding (variable width, 1-4 bytes per char)
var utf8Bytes = Encoding.UTF8.GetBytes(text);
Console.WriteLine($"UTF-8: {utf8Bytes.Length} bytes");
Console.WriteLine($"Bytes: {BitConverter.ToString(utf8Bytes).Substring(0, 50)}...");

// UTF-16 encoding (.NET internal representation)
var utf16Bytes = Encoding.Unicode.GetBytes(text);
Console.WriteLine($"\nUTF-16: {utf16Bytes.Length} bytes");

// ASCII encoding (loses non-ASCII characters)
var asciiBytes = Encoding.ASCII.GetBytes(text);
var asciiText = Encoding.ASCII.GetString(asciiBytes);
Console.WriteLine($"\nASCII: {asciiBytes.Length} bytes");
Console.WriteLine($"ASCII result: {asciiText}");

// Decode bytes back to string
var decodedUtf8 = Encoding.UTF8.GetString(utf8Bytes);
var decodedUtf16 = Encoding.Unicode.GetString(utf16Bytes);

Console.WriteLine($"\nDecoded UTF-8: {decodedUtf8}");
Console.WriteLine($"Decoded UTF-16: {decodedUtf16}");
Console.WriteLine($"Match original: {decodedUtf8 == text}");
Output:
UTF-8: 31 bytes
Bytes: 48-65-6C-6C-6F-2C-20-57-6F-72-6C-64-21-20-E4-BD-A0...

UTF-16: 42 bytes

ASCII: 21 bytes
ASCII result: Hello, World! ???? ??

Decoded UTF-8: Hello, World! 你好世界 🌍
Decoded UTF-16: Hello, World! 你好世界 🌍
Match original: True

UTF-8 uses fewer bytes than UTF-16 for ASCII characters but more for East Asian characters. ASCII can't represent Chinese characters or emojis, so it replaces them with question marks. Always use UTF-8 for files and network data unless you have a specific reason to choose something else.

Reading and Writing Encoded Files

File I/O requires specifying encoding explicitly to avoid corruption. File.ReadAllText and File.WriteAllText default to UTF-8, but you can override this. StreamReader and StreamWriter give you more control over encoding, BOM handling, and buffer sizes.

The byte order mark (BOM) is a short byte sequence at the start of a text file that identifies its encoding (EF BB BF for UTF-8). Some applications expect it, while others fail when they encounter it. You control BOM inclusion when creating UTF-8 encodings.

Program.cs
using System;
using System.IO;
using System.Text;

var content = "Data with special chars: café, naïve, résumé";
var filePath = "test.txt";

// Write with UTF-8 (with BOM)
var utf8WithBom = new UTF8Encoding(true);
File.WriteAllText(filePath, content, utf8WithBom);
Console.WriteLine("Written with UTF-8 (BOM)");

// Read back
var readContent = File.ReadAllText(filePath);
Console.WriteLine($"Read: {readContent}");
Console.WriteLine($"Matches: {readContent == content}");

// Write with UTF-8 (no BOM) - better for APIs
var utf8NoBom = new UTF8Encoding(false);
File.WriteAllText(filePath, content, utf8NoBom);
Console.WriteLine("\nWritten with UTF-8 (no BOM)");

// Using StreamWriter for more control
using (var writer = new StreamWriter(filePath, false, utf8NoBom))
{
    writer.WriteLine("Line 1: English text");
    writer.WriteLine("Line 2: Español");
    writer.WriteLine("Line 3: 日本語");
}

// Read line by line with encoding detection
using (var reader = new StreamReader(filePath, detectEncodingFromByteOrderMarks: true))
{
    Console.WriteLine("\nReading lines:");
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        Console.WriteLine($"  {line}");
    }
    // WebName gives "utf-8"; printing the Encoding object shows its type name
    Console.WriteLine($"Detected encoding: {reader.CurrentEncoding.WebName}");
}

// Clean up
File.Delete(filePath);
Output:
Written with UTF-8 (BOM)
Read: Data with special chars: café, naïve, résumé
Matches: True

Written with UTF-8 (no BOM)

Reading lines:
  Line 1: English text
  Line 2: Español
  Line 3: 日本語
Detected encoding: utf-8

StreamReader's encoding detection examines the first few bytes for BOM markers and adjusts automatically. This works for UTF-8, UTF-16, and UTF-32 files. For files without BOM, specify the encoding explicitly to avoid defaulting to UTF-8 when the file uses something different.
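For a file with no BOM, detection has nothing to work with, so the explicit-encoding overload is the reliable path. A minimal sketch (the file name is hypothetical; Encoding.Latin1 is available on .NET 5 and later):

```csharp
using System;
using System.IO;
using System.Text;

// Simulate a legacy file: raw Latin-1 (ISO-8859-1) bytes, no BOM
var latin1 = Encoding.Latin1;
File.WriteAllBytes("legacy.txt", latin1.GetBytes("café"));

// Default UTF-8 decoding mangles the lone 0xE9 byte for 'é'
var misread = File.ReadAllText("legacy.txt");

// Passing the encoding explicitly recovers the original text
using var reader = new StreamReader("legacy.txt", latin1);
var correct = reader.ReadToEnd();

Console.WriteLine($"Misread: {misread}");   // é becomes the U+FFFD replacement char
Console.WriteLine($"Correct: {correct}");   // café
File.Delete("legacy.txt");
```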

Base64 Encoding for Binary Data

Base64 converts binary data to text using only ASCII characters. This lets you embed binary data in JSON, XML, or URLs that expect text. It's not encryption or compression, just a representation change. Base64 increases size by about 33% because it uses six bits per character instead of eight.

Use Base64 when you need to transmit binary data through text-only channels. Images in HTML, attachments in JSON APIs, and cryptographic keys in configuration files commonly use Base64. Always decode on the receiving end to get the original bytes back.

Program.cs
using System;
using System.Text;

// Encode binary data to Base64
var binaryData = new byte[] { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 };
var base64String = Convert.ToBase64String(binaryData);

Console.WriteLine($"Original bytes: {BitConverter.ToString(binaryData)}");
Console.WriteLine($"Base64: {base64String}");
Console.WriteLine($"Length increase: {binaryData.Length} → {base64String.Length}");

// Decode back to binary
var decodedBytes = Convert.FromBase64String(base64String);
Console.WriteLine($"Decoded matches: {BitConverter.ToString(decodedBytes) ==
    BitConverter.ToString(binaryData)}");

// Practical example: embed image data in JSON
var imageBytes = new byte[] { 137, 80, 78, 71, 13, 10, 26, 10 }; // PNG header
var imageBase64 = Convert.ToBase64String(imageBytes);
var json = $$"""
{
  "filename": "logo.png",
  "contentType": "image/png",
  "data": "{{imageBase64}}"
}
""";

Console.WriteLine($"\nJSON with embedded image:\n{json}");

// Extract from JSON (simplified)
var extractedBase64 = imageBase64;
var extractedBytes = Convert.FromBase64String(extractedBase64);
Console.WriteLine($"\nExtracted {extractedBytes.Length} bytes");

// URL-safe Base64 (replace + and / for URLs)
var urlUnsafe = Convert.ToBase64String(new byte[] { 0xFB, 0xFF });
var urlSafe = urlUnsafe.Replace('+', '-').Replace('/', '_').TrimEnd('=');
Console.WriteLine($"\nStandard Base64: {urlUnsafe}");
Console.WriteLine($"URL-safe Base64: {urlSafe}");
Output:
Original bytes: FF-D8-FF-E0-00-10
Base64: /9j/4AAQ
Length increase: 6 → 8

Decoded matches: True

JSON with embedded image:
{
  "filename": "logo.png",
  "contentType": "image/png",
  "data": "iVBORw0KGgo="
}

Extracted 8 bytes

Standard Base64: +/8=
URL-safe Base64: -_8

The URL-safe variant replaces characters that have special meaning in URLs. This prevents encoding issues when passing Base64 data in query strings or path parameters. Some APIs require URL-safe Base64, while others accept standard Base64. Check your API documentation to know which format to use.
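Decoding URL-safe Base64 needs a small helper because Convert.FromBase64String rejects the substituted characters and requires a length that is a multiple of four. A sketch with our own method names:

```csharp
using System;

static string ToUrlSafeBase64(byte[] data) =>
    Convert.ToBase64String(data).Replace('+', '-').Replace('/', '_').TrimEnd('=');

static byte[] FromUrlSafeBase64(string urlSafe)
{
    // Reverse the character substitution, then restore the stripped '=' padding
    var standard = urlSafe.Replace('-', '+').Replace('_', '/');
    var padded = standard.PadRight(standard.Length + (4 - standard.Length % 4) % 4, '=');
    return Convert.FromBase64String(padded);
}

var bytes = new byte[] { 0xFB, 0xFF };
var encoded = ToUrlSafeBase64(bytes);        // "-_8"
var decoded = FromUrlSafeBase64(encoded);
Console.WriteLine($"{encoded} -> {BitConverter.ToString(decoded)}");  // -_8 -> FB-FF
```

Recent runtimes (.NET 9 and later) also ship a dedicated Base64Url helper; on earlier versions a helper like the above is the usual approach.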

Security and Safety Considerations

Base64 is not encryption. Anyone can decode Base64 strings instantly without keys or passwords. Never use Base64 to protect sensitive data like passwords, API keys, or personal information. Use proper cryptographic functions from System.Security.Cryptography for security needs.

When processing user-provided Base64 data, validate the length before decoding. Malicious input can contain extremely long strings that consume excessive memory when decoded. Set reasonable limits based on your application's needs. A 100MB Base64 string decodes to about 75MB of binary data, which can cause OutOfMemoryException on constrained systems.
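A sketch of defensive decoding under these assumptions (the limit and method name are our own; Convert.TryFromBase64String validates input without throwing):

```csharp
using System;

const int MaxEncodedLength = 1024;  // hypothetical limit; size it for your application

byte[]? DecodeUntrusted(string input)
{
    // Reject oversized or empty input before allocating anything
    if (string.IsNullOrEmpty(input) || input.Length > MaxEncodedLength)
        return null;

    // Upper bound on decoded size: 3 bytes per 4 Base64 characters
    var buffer = new byte[input.Length * 3 / 4];
    if (!Convert.TryFromBase64String(input, buffer, out var written))
        return null;  // malformed Base64, no exception thrown

    return buffer[..written];
}

Console.WriteLine(DecodeUntrusted("SGVsbG8=")?.Length);      // 5 ("Hello")
Console.WriteLine(DecodeUntrusted("not base64!!") is null);  // True
```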

Encoding mismatches can expose sensitive data accidentally. If you encode data with UTF-8 but decode with ASCII, you might truncate or corrupt database values containing personally identifiable information. Always use the same encoding for writing and reading. Store encoding metadata alongside data when you can't guarantee consistent encoding across all systems.

Try It Yourself

Build a utility that converts between different encodings and Base64. This demonstrates how to handle real-world encoding scenarios.

Program.cs
using System;
using System.Text;

var converter = new EncodingConverter();

var original = "Test data: Café ☕ 中文";
Console.WriteLine($"Original: {original}\n");

// Convert to different encodings
var utf8 = converter.ConvertToBase64(original, Encoding.UTF8);
Console.WriteLine($"UTF-8 Base64: {utf8}");

var utf16 = converter.ConvertToBase64(original, Encoding.Unicode);
Console.WriteLine($"UTF-16 Base64: {utf16}\n");

// Convert back
var decoded = converter.ConvertFromBase64(utf8, Encoding.UTF8);
Console.WriteLine($"Decoded: {decoded}");
Console.WriteLine($"Matches: {decoded == original}\n");

// Show byte differences
converter.CompareEncodings(original);

class EncodingConverter
{
    public string ConvertToBase64(string text, Encoding encoding)
    {
        var bytes = encoding.GetBytes(text);
        return Convert.ToBase64String(bytes);
    }

    public string ConvertFromBase64(string base64, Encoding encoding)
    {
        var bytes = Convert.FromBase64String(base64);
        return encoding.GetString(bytes);
    }

    public void CompareEncodings(string text)
    {
        Console.WriteLine("Encoding comparison:");

        var encodings = new[]
        {
            ("UTF-8", Encoding.UTF8),
            ("UTF-16", Encoding.Unicode),
            ("UTF-32", Encoding.UTF32),
            ("ASCII", Encoding.ASCII)
        };

        foreach (var (name, encoding) in encodings)
        {
            var bytes = encoding.GetBytes(text);
            var decoded = encoding.GetString(bytes);
            var isLossless = decoded == text;

            Console.WriteLine($"  {name,-8}: {bytes.Length,3} bytes, " +
                $"Lossless: {isLossless}");
        }
    }
}
Output:
Original: Test data: Café ☕ 中文

UTF-8 Base64: VGVzdCBkYXRhOiBDYWbDqSDimJUg5Lit5paH
UTF-16 Base64: VABlAHMAdAAgAGQAYQB0AGEAOgAgAEMAYQBmAOkAIAAVJiAALU6HZQ==

Decoded: Test data: Café ☕ 中文
Matches: True

Encoding comparison:
  UTF-8   :  27 bytes, Lossless: True
  UTF-16  :  40 bytes, Lossless: True
  UTF-32  :  80 bytes, Lossless: True
  ASCII   :  20 bytes, Lossless: False
Project.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
</Project>

UTF-8 provides the best balance for international text. UTF-16 uses more bytes but matches .NET's internal representation. UTF-32 is wasteful for most text but simplifies character indexing. ASCII fails on non-English characters, showing False for lossless conversion.

Avoiding Common Mistakes

Assuming default encoding works everywhere is the most common mistake. File.ReadAllText defaults to UTF-8, but many Windows applications save files as Windows-1252 or UTF-16. Always specify encoding explicitly when you know what it should be. Use StreamReader with encoding detection as a fallback, but explicit specification is safer.

Mixing encodings creates corruption that's hard to debug. You write data with UTF-8 but read it as ASCII. Parts of the data look fine while special characters become garbage. This happens gradually as users add international names or special symbols. Test your encoding logic with non-ASCII characters like "café", "naïve", Chinese text, or emojis early in development.

Forgetting BOM requirements causes interoperability issues. Excel expects BOM in CSV files to detect UTF-8. Many Unix tools reject BOM as invalid data. Know your target system's requirements and configure encodings accordingly. When writing files for Excel, use UTF-8 with BOM. For web APIs and JSON, use UTF-8 without BOM.
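GetPreamble shows exactly which BOM bytes an encoding emits, which helps when verifying what a target system will receive:

```csharp
using System;
using System.Text;

// GetPreamble returns the BOM bytes the encoding writes at the start of a stream
var withBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true).GetPreamble();
var withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetPreamble();

Console.WriteLine($"With BOM: {BitConverter.ToString(withBom)}");  // EF-BB-BF
Console.WriteLine($"Without BOM: {withoutBom.Length} bytes");      // 0 bytes
```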

Using the wrong encoding for the problem domain wastes space. UTF-32 for English text uses four bytes per character when UTF-8 uses one. This matters for large files or high-volume network traffic. UTF-8 is the right default for almost everything external to your application. Only use other encodings when you have specific compatibility requirements.

Frequently Asked Questions (FAQ)

Should I use UTF-8 or UTF-16 for my application?

Use UTF-8 for external data like files, network protocols, and APIs because it's compact and widely supported. .NET strings use UTF-16 internally, but UTF-8 saves space when storing or transmitting text. For in-memory string operations, let .NET handle the encoding automatically.

What happens if I decode with the wrong encoding?

You'll get corrupted text with replacement characters or garbled symbols. UTF-8 bytes decoded with a single-byte encoding like Latin-1 produce mojibake ("café" becomes "cafÃ©"), while decoding as ASCII turns every non-ASCII byte into "?". Always store or transmit encoding metadata with your data, or use standard formats like JSON and XML that specify encoding in headers.
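As a sketch of the failure mode, decoding UTF-8 bytes with Latin-1 (a single-byte encoding) produces the classic mojibake:

```csharp
using System;
using System.Text;

var original = "café";
var utf8Bytes = Encoding.UTF8.GetBytes(original);   // 'é' encodes as two bytes, C3 A9

// A single-byte decoder turns each UTF-8 byte into its own character
var mojibake = Encoding.Latin1.GetString(utf8Bytes);
Console.WriteLine(mojibake);    // cafÃ©

// Recovery works only while no bytes have been lost or re-encoded in between
var recovered = Encoding.UTF8.GetString(Encoding.Latin1.GetBytes(mojibake));
Console.WriteLine(recovered);   // café
```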

Is Base64 encoding secure for passwords?

No, Base64 is encoding, not encryption. Anyone can decode Base64 instantly. Never use it for passwords or sensitive data. Use proper cryptographic hashing with salt for passwords, and encryption with keys for data that needs confidentiality. Base64 is for transport, not security.

How do I handle BOM (Byte Order Mark) in text files?

UTF8Encoding constructor accepts a parameter to include or exclude BOM. Use BOM when creating files for Windows applications that expect it. Skip BOM for web APIs and cross-platform data since many parsers don't expect it. StreamReader automatically detects and handles BOM when reading.

Can I convert between encodings without data loss?

Unicode encodings like UTF-8 and UTF-16 convert between each other without loss since they represent the same character set. Converting from Unicode to ASCII loses non-ASCII characters. Going from narrow encodings to Unicode is safe. Always convert through string objects rather than byte-to-byte manipulation.
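Encoding.Convert wraps that decode-and-re-encode round trip in a single call; a short sketch:

```csharp
using System;
using System.Text;

// Encoding.Convert decodes with the source encoding and re-encodes with the
// target, going through .NET's internal UTF-16 representation
var utf8Bytes = Encoding.UTF8.GetBytes("José");
var utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

Console.WriteLine($"UTF-8: {utf8Bytes.Length} bytes");     // 5 (é takes two bytes)
Console.WriteLine($"UTF-16: {utf16Bytes.Length} bytes");   // 8 (four chars x two bytes)
Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes)); // José
```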

Why does my text look corrupted after reading from a file?

File.ReadAllText uses UTF-8 by default. If your file uses a different encoding like Windows-1252 or UTF-16, specify it explicitly. Check the file's encoding in a hex editor or use StreamReader with encoding detection. Always match the read encoding to the write encoding used when creating the file.
