Handling Character Encoding and Data Transformation in .NET

Preventing Data Corruption

If you've ever opened a file and seen question marks or strange symbols instead of readable text, you've encountered encoding problems. A customer uploads their data with accented characters, but your system saves it incorrectly. When they download it back, names like "José" become "Jos?" and complaints start rolling in. The issue isn't your storage or network layer. You're using the wrong character encoding.

.NET's Encoding classes solve this by handling the conversion between strings and bytes correctly. When you save text to files, send it over networks, or store it in databases, you control exactly how characters map to bytes. Using the right encoding prevents corruption and ensures international users see their data correctly.

You'll learn how different encodings work, when to use UTF-8 versus UTF-16, how to transform data with Base64, and how to avoid common pitfalls that corrupt text. By the end, you'll handle text data confidently across file systems, APIs, and databases.

Character Encoding Fundamentals

Character encoding maps characters to byte sequences. ASCII uses one byte per character and covers only 128 characters: unaccented English letters, digits, and basic punctuation. UTF-8 uses 1-4 bytes per character and supports all Unicode characters, making it ideal for international text. UTF-16 uses 2 or 4 bytes and is what .NET uses internally for string storage. Choosing the right encoding depends on your data and where it goes.

When you write a string to a file or network stream, you must convert it to bytes using an encoding. When you read bytes back, you must decode them with the same encoding. Mismatched encodings produce garbage characters or data loss. The Encoding class provides methods for these conversions.

Program.cs
using System;
using System.Text;

var text = "Hello, World! 你好世界 🌍";

// UTF-8 encoding (variable width, 1-4 bytes per char)
var utf8Bytes = Encoding.UTF8.GetBytes(text);
Console.WriteLine($"UTF-8: {utf8Bytes.Length} bytes");
Console.WriteLine($"Bytes: {BitConverter.ToString(utf8Bytes).Substring(0, 50)}...");

// UTF-16 encoding (.NET internal representation)
var utf16Bytes = Encoding.Unicode.GetBytes(text);
Console.WriteLine($"\nUTF-16: {utf16Bytes.Length} bytes");

// ASCII encoding (loses non-ASCII characters)
var asciiBytes = Encoding.ASCII.GetBytes(text);
var asciiText = Encoding.ASCII.GetString(asciiBytes);
Console.WriteLine($"\nASCII: {asciiBytes.Length} bytes");
Console.WriteLine($"ASCII result: {asciiText}");

// Decode bytes back to string
var decodedUtf8 = Encoding.UTF8.GetString(utf8Bytes);
var decodedUtf16 = Encoding.Unicode.GetString(utf16Bytes);

Console.WriteLine($"\nDecoded UTF-8: {decodedUtf8}");
Console.WriteLine($"Decoded UTF-16: {decodedUtf16}");
Console.WriteLine($"Match original: {decodedUtf8 == text}");
Output:
UTF-8: 31 bytes
Bytes: 48-65-6C-6C-6F-2C-20-57-6F-72-6C-64-21-20-E4-BD-A0...

UTF-16: 42 bytes

ASCII: 21 bytes
ASCII result: Hello, World! ???? ??

Decoded UTF-8: Hello, World! 你好世界 🌍
Decoded UTF-16: Hello, World! 你好世界 🌍
Match original: True

UTF-8 uses fewer bytes than UTF-16 for ASCII characters but more for East Asian characters. ASCII can't represent Chinese characters or emojis, so it replaces them with question marks. Always use UTF-8 for files and network data unless you have a specific reason to choose something else.

Reading and Writing Encoded Files

File I/O requires specifying encoding explicitly to avoid corruption. File.ReadAllText and File.WriteAllText default to UTF-8, but you can override this. StreamReader and StreamWriter give you more control over encoding, BOM handling, and buffer sizes.

The byte order mark (BOM) is a short byte sequence at the start of a text file that identifies its encoding (EF BB BF for UTF-8). Some applications expect it, while others fail when they encounter it. You control BOM inclusion when creating UTF-8 encodings.

Program.cs
using System;
using System.IO;
using System.Text;

var content = "Data with special chars: café, naïve, résumé";
var filePath = "test.txt";

// Write with UTF-8 (with BOM)
var utf8WithBom = new UTF8Encoding(true);
File.WriteAllText(filePath, content, utf8WithBom);
Console.WriteLine("Written with UTF-8 (BOM)");

// Read back
var readContent = File.ReadAllText(filePath);
Console.WriteLine($"Read: {readContent}");
Console.WriteLine($"Matches: {readContent == content}");

// Write with UTF-8 (no BOM) - better for APIs
var utf8NoBom = new UTF8Encoding(false);
File.WriteAllText(filePath, content, utf8NoBom);
Console.WriteLine("\nWritten with UTF-8 (no BOM)");

// Using StreamWriter for more control
using (var writer = new StreamWriter(filePath, false, utf8NoBom))
{
    writer.WriteLine("Line 1: English text");
    writer.WriteLine("Line 2: Español");
    writer.WriteLine("Line 3: 日本語");
}

// Read line by line with encoding detection
using (var reader = new StreamReader(filePath, detectEncodingFromByteOrderMarks: true))
{
    Console.WriteLine("\nReading lines:");
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        Console.WriteLine($"  {line}");
    }
    // WebName gives "utf-8"; printing the Encoding object shows its type name
    Console.WriteLine($"Detected encoding: {reader.CurrentEncoding.WebName}");
}

// Clean up
File.Delete(filePath);
Output:
Written with UTF-8 (BOM)
Read: Data with special chars: café, naïve, résumé
Matches: True

Written with UTF-8 (no BOM)

Reading lines:
  Line 1: English text
  Line 2: Español
  Line 3: 日本語
Detected encoding: utf-8

StreamReader's encoding detection examines the first few bytes for BOM markers and adjusts automatically. This works for UTF-8, UTF-16, and UTF-32 files. For files without BOM, specify the encoding explicitly to avoid defaulting to UTF-8 when the file uses something different.
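For a file with no BOM, detection has nothing to work with, so the explicit-encoding overload is the reliable path. A minimal sketch (the file name is hypothetical; Encoding.Latin1 is available on .NET 5 and later):

```csharp
using System;
using System.IO;
using System.Text;

// Simulate a legacy file: raw Latin-1 (ISO-8859-1) bytes, no BOM
var latin1 = Encoding.Latin1;
File.WriteAllBytes("legacy.txt", latin1.GetBytes("café"));

// Default UTF-8 decoding mangles the lone 0xE9 byte for 'é'
var misread = File.ReadAllText("legacy.txt");

// Passing the encoding explicitly recovers the original text
using var reader = new StreamReader("legacy.txt", latin1);
var correct = reader.ReadToEnd();

Console.WriteLine($"Misread: {misread}");   // é becomes the U+FFFD replacement char
Console.WriteLine($"Correct: {correct}");   // café
File.Delete("legacy.txt");
```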

Base64 Encoding for Binary Data

Base64 converts binary data to text using only ASCII characters. This lets you embed binary data in JSON, XML, or URLs that expect text. It's not encryption or compression, just a representation change. Base64 increases size by about 33% because it uses six bits per character instead of eight.

Use Base64 when you need to transmit binary data through text-only channels. Images in HTML, attachments in JSON APIs, and cryptographic keys in configuration files commonly use Base64. Always decode on the receiving end to get the original bytes back.

Program.cs
using System;
using System.Text;

// Encode binary data to Base64
var binaryData = new byte[] { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 };
var base64String = Convert.ToBase64String(binaryData);

Console.WriteLine($"Original bytes: {BitConverter.ToString(binaryData)}");
Console.WriteLine($"Base64: {base64String}");
Console.WriteLine($"Length increase: {binaryData.Length} → {base64String.Length}");

// Decode back to binary
var decodedBytes = Convert.FromBase64String(base64String);
Console.WriteLine($"Decoded matches: {BitConverter.ToString(decodedBytes) ==
    BitConverter.ToString(binaryData)}");

// Practical example: embed image data in JSON
var imageBytes = new byte[] { 137, 80, 78, 71, 13, 10, 26, 10 }; // PNG header
var imageBase64 = Convert.ToBase64String(imageBytes);
var json = $$"""
{
  "filename": "logo.png",
  "contentType": "image/png",
  "data": "{{imageBase64}}"
}
""";

Console.WriteLine($"\nJSON with embedded image:\n{json}");

// Extract from JSON (simplified)
var extractedBase64 = imageBase64;
var extractedBytes = Convert.FromBase64String(extractedBase64);
Console.WriteLine($"\nExtracted {extractedBytes.Length} bytes");

// URL-safe Base64 (replace + and / for URLs)
var urlUnsafe = Convert.ToBase64String(new byte[] { 0xFB, 0xFF });
var urlSafe = urlUnsafe.Replace('+', '-').Replace('/', '_').TrimEnd('=');
Console.WriteLine($"\nStandard Base64: {urlUnsafe}");
Console.WriteLine($"URL-safe Base64: {urlSafe}");
Output:
Original bytes: FF-D8-FF-E0-00-10
Base64: /9j/4AAQ
Length increase: 6 → 8

Decoded matches: True

JSON with embedded image:
{
  "filename": "logo.png",
  "contentType": "image/png",
  "data": "iVBORw0KGgo="
}

Extracted 8 bytes

Standard Base64: +/8=
URL-safe Base64: -_8

The URL-safe variant replaces characters that have special meaning in URLs. This prevents encoding issues when passing Base64 data in query strings or path parameters. Some APIs require URL-safe Base64, while others accept standard Base64. Check your API documentation to know which format to use.
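Decoding URL-safe Base64 needs a small helper because Convert.FromBase64String rejects the substituted characters and requires a length that is a multiple of four. A sketch with our own method names:

```csharp
using System;

static string ToUrlSafeBase64(byte[] data) =>
    Convert.ToBase64String(data).Replace('+', '-').Replace('/', '_').TrimEnd('=');

static byte[] FromUrlSafeBase64(string urlSafe)
{
    // Reverse the character substitution, then restore the stripped '=' padding
    var standard = urlSafe.Replace('-', '+').Replace('_', '/');
    var padded = standard.PadRight(standard.Length + (4 - standard.Length % 4) % 4, '=');
    return Convert.FromBase64String(padded);
}

var bytes = new byte[] { 0xFB, 0xFF };
var encoded = ToUrlSafeBase64(bytes);        // "-_8"
var decoded = FromUrlSafeBase64(encoded);
Console.WriteLine($"{encoded} -> {BitConverter.ToString(decoded)}");  // -_8 -> FB-FF
```

Recent runtimes (.NET 9 and later) also ship a dedicated Base64Url helper; on earlier versions a helper like the above is the usual approach.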

Security and Safety Considerations

Base64 is not encryption. Anyone can decode Base64 strings instantly without keys or passwords. Never use Base64 to protect sensitive data like passwords, API keys, or personal information. Use proper cryptographic functions from System.Security.Cryptography for security needs.

When processing user-provided Base64 data, validate the length before decoding. Malicious input can contain extremely long strings that consume excessive memory when decoded. Set reasonable limits based on your application's needs. A 100MB Base64 string decodes to about 75MB of binary data, which can cause OutOfMemoryException on constrained systems.
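A sketch of defensive decoding under these assumptions (the limit and method name are our own; Convert.TryFromBase64String validates input without throwing):

```csharp
using System;

const int MaxEncodedLength = 1024;  // hypothetical limit; size it for your application

byte[]? DecodeUntrusted(string input)
{
    // Reject oversized or empty input before allocating anything
    if (string.IsNullOrEmpty(input) || input.Length > MaxEncodedLength)
        return null;

    // Upper bound on decoded size: 3 bytes per 4 Base64 characters
    var buffer = new byte[input.Length * 3 / 4];
    if (!Convert.TryFromBase64String(input, buffer, out var written))
        return null;  // malformed Base64, no exception thrown

    return buffer[..written];
}

Console.WriteLine(DecodeUntrusted("SGVsbG8=")?.Length);      // 5 ("Hello")
Console.WriteLine(DecodeUntrusted("not base64!!") is null);  // True
```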

Encoding mismatches can expose sensitive data accidentally. If you encode data with UTF-8 but decode with ASCII, you might truncate or corrupt database values containing personally identifiable information. Always use the same encoding for writing and reading. Store encoding metadata alongside data when you can't guarantee consistent encoding across all systems.

Try It Yourself

Build a utility that converts between different encodings and Base64. This demonstrates how to handle real-world encoding scenarios.

Program.cs
using System;
using System.Text;

var converter = new EncodingConverter();

var original = "Test data: Café ☕ 中文";
Console.WriteLine($"Original: {original}\n");

// Convert to different encodings
var utf8 = converter.ConvertToBase64(original, Encoding.UTF8);
Console.WriteLine($"UTF-8 Base64: {utf8}");

var utf16 = converter.ConvertToBase64(original, Encoding.Unicode);
Console.WriteLine($"UTF-16 Base64: {utf16}\n");

// Convert back
var decoded = converter.ConvertFromBase64(utf8, Encoding.UTF8);
Console.WriteLine($"Decoded: {decoded}");
Console.WriteLine($"Matches: {decoded == original}\n");

// Show byte differences
converter.CompareEncodings(original);

class EncodingConverter
{
    public string ConvertToBase64(string text, Encoding encoding)
    {
        var bytes = encoding.GetBytes(text);
        return Convert.ToBase64String(bytes);
    }

    public string ConvertFromBase64(string base64, Encoding encoding)
    {
        var bytes = Convert.FromBase64String(base64);
        return encoding.GetString(bytes);
    }

    public void CompareEncodings(string text)
    {
        Console.WriteLine("Encoding comparison:");

        var encodings = new[]
        {
            ("UTF-8", Encoding.UTF8),
            ("UTF-16", Encoding.Unicode),
            ("UTF-32", Encoding.UTF32),
            ("ASCII", Encoding.ASCII)
        };

        foreach (var (name, encoding) in encodings)
        {
            var bytes = encoding.GetBytes(text);
            var decoded = encoding.GetString(bytes);
            var isLossless = decoded == text;

            Console.WriteLine($"  {name,-8}: {bytes.Length,3} bytes, " +
                $"Lossless: {isLossless}");
        }
    }
}
Output:
Original: Test data: Café ☕ 中文

UTF-8 Base64: VGVzdCBkYXRhOiBDYWbDqSDimJUg5Lit5paH
UTF-16 Base64: VABlAHMAdAAgAGQAYQB0AGEAOgAgAEMAYQBmAOkAIAAVJiAALU6HZQ==

Decoded: Test data: Café ☕ 中文
Matches: True

Encoding comparison:
  UTF-8   :  27 bytes, Lossless: True
  UTF-16  :  40 bytes, Lossless: True
  UTF-32  :  80 bytes, Lossless: True
  ASCII   :  20 bytes, Lossless: False
Project.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
</Project>

UTF-8 provides the best balance for international text. UTF-16 uses more bytes but matches .NET's internal representation. UTF-32 is wasteful for most text but simplifies character indexing. ASCII fails on non-English characters, showing False for lossless conversion.

Avoiding Common Mistakes

Assuming default encoding works everywhere is the most common mistake. File.ReadAllText defaults to UTF-8, but many Windows applications save files as Windows-1252 or UTF-16. Always specify encoding explicitly when you know what it should be. Use StreamReader with encoding detection as a fallback, but explicit specification is safer.

Mixing encodings creates corruption that's hard to debug. You write data with UTF-8 but read it as ASCII. Parts of the data look fine while special characters become garbage. This happens gradually as users add international names or special symbols. Test your encoding logic with non-ASCII characters like "café", "naïve", Chinese text, or emojis early in development.

Forgetting BOM requirements causes interoperability issues. Excel expects BOM in CSV files to detect UTF-8. Many Unix tools reject BOM as invalid data. Know your target system's requirements and configure encodings accordingly. When writing files for Excel, use UTF-8 with BOM. For web APIs and JSON, use UTF-8 without BOM.
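GetPreamble shows exactly which BOM bytes an encoding emits, which helps when verifying what a target system will receive:

```csharp
using System;
using System.Text;

// GetPreamble returns the BOM bytes the encoding writes at the start of a stream
var withBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true).GetPreamble();
var withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetPreamble();

Console.WriteLine($"With BOM: {BitConverter.ToString(withBom)}");  // EF-BB-BF
Console.WriteLine($"Without BOM: {withoutBom.Length} bytes");      // 0 bytes
```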

Using the wrong encoding for the problem domain wastes space. UTF-32 for English text uses four bytes per character when UTF-8 uses one. This matters for large files or high-volume network traffic. UTF-8 is the right default for almost everything external to your application. Only use other encodings when you have specific compatibility requirements.

Frequently Asked Questions (FAQ)

Should I use UTF-8 or UTF-16 for my application?

Use UTF-8 for external data like files, network protocols, and APIs because it's compact and widely supported. .NET strings use UTF-16 internally, but UTF-8 saves space when storing or transmitting text. For in-memory string operations, let .NET handle the encoding automatically.

What happens if I decode with the wrong encoding?

You'll get corrupted text with replacement characters or garbled symbols. UTF-8 bytes decoded with a single-byte encoding like Latin-1 produce mojibake ("café" becomes "cafÃ©"), while decoding as ASCII turns every non-ASCII byte into "?". Always store or transmit encoding metadata with your data, or use standard formats like JSON and XML that specify encoding in headers.
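As a sketch of the failure mode, decoding UTF-8 bytes with Latin-1 (a single-byte encoding) produces the classic mojibake:

```csharp
using System;
using System.Text;

var original = "café";
var utf8Bytes = Encoding.UTF8.GetBytes(original);   // 'é' encodes as two bytes, C3 A9

// A single-byte decoder turns each UTF-8 byte into its own character
var mojibake = Encoding.Latin1.GetString(utf8Bytes);
Console.WriteLine(mojibake);    // cafÃ©

// Recovery works only while no bytes have been lost or re-encoded in between
var recovered = Encoding.UTF8.GetString(Encoding.Latin1.GetBytes(mojibake));
Console.WriteLine(recovered);   // café
```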

Is Base64 encoding secure for passwords?

No, Base64 is encoding, not encryption. Anyone can decode Base64 instantly. Never use it for passwords or sensitive data. Use proper cryptographic hashing with salt for passwords, and encryption with keys for data that needs confidentiality. Base64 is for transport, not security.

How do I handle BOM (Byte Order Mark) in text files?

UTF8Encoding constructor accepts a parameter to include or exclude BOM. Use BOM when creating files for Windows applications that expect it. Skip BOM for web APIs and cross-platform data since many parsers don't expect it. StreamReader automatically detects and handles BOM when reading.

Can I convert between encodings without data loss?

Unicode encodings like UTF-8 and UTF-16 convert between each other without loss since they represent the same character set. Converting from Unicode to ASCII loses non-ASCII characters. Going from narrow encodings to Unicode is safe. Always convert through string objects rather than byte-to-byte manipulation.
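Encoding.Convert wraps that decode-and-re-encode round trip in a single call; a short sketch:

```csharp
using System;
using System.Text;

// Encoding.Convert decodes with the source encoding and re-encodes with the
// target, going through .NET's internal UTF-16 representation
var utf8Bytes = Encoding.UTF8.GetBytes("José");
var utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

Console.WriteLine($"UTF-8: {utf8Bytes.Length} bytes");     // 5 (é takes two bytes)
Console.WriteLine($"UTF-16: {utf16Bytes.Length} bytes");   // 8 (four chars x two bytes)
Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes)); // José
```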

Why does my text look corrupted after reading from a file?

File.ReadAllText uses UTF-8 by default. If your file uses a different encoding like Windows-1252 or UTF-16, specify it explicitly. Check the file's encoding in a hex editor or use StreamReader with encoding detection. Always match the read encoding to the write encoding used when creating the file.
