Sunday, May 18, 2008

3 byte characters in 2 bytes?

Character encodings remain a challenge in many integration projects. A customer just asked: how can a 3-byte character (UTF-8) fit in a 2-byte (UTF-16) character?

A simple question, I thought: (modern) programming languages and operating systems use 2 bytes to represent a single character, which gives room for 2^16 characters. Although that does not cover all characters, I assumed 2 bytes were sufficient.

I just learned that some characters in UTF-16 are encoded as 4 bytes (2 x 2 bytes). These are called surrogate pairs. The first 2 bytes of such a pair are in the range D800-DBFF, the last 2 bytes in the range DC00-DFFF. This way, UTF-16 supports a little over a million characters.
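A small sketch of this in Java: U+1D11E (the musical G clef) lies outside the Basic Multilingual Plane, so in UTF-16 (and thus in a Java String) it ends up as two chars, one from each surrogate range.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Build a String holding the single code point U+1D11E.
        String clef = new String(Character.toChars(0x1D11E));
        // One code point, but two UTF-16 code units:
        System.out.println(clef.length());                        // 2
        System.out.println(Integer.toHexString(clef.charAt(0)));  // d834 (in D800-DBFF)
        System.out.println(Integer.toHexString(clef.charAt(1)));  // dd1e (in DC00-DFFF)
    }
}
```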

For exchanging data in application integration scenarios, UTF-8 is recommended:
- No byte order, so no need for a byte order marker
- No zero bytes (making life easy for all those C programmers)
- ASCII is represented unchanged
- Compact encoding of Western European characters
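The last two points are easy to verify in Java (a sketch; the example characters are my own choice):

```java
import java.io.UnsupportedEncodingException;

public class Utf8Lengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("a".getBytes("UTF-8").length); // 1 byte: plain ASCII, unchanged
        System.out.println("é".getBytes("UTF-8").length); // 2 bytes: U+00E9, Latin-1 supplement
        System.out.println("€".getBytes("UTF-8").length); // 3 bytes: U+20AC, euro sign
    }
}
```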

Diving into the Javadoc of the Character class, it turns out the class is aware of surrogates (at least since 1.5). And I assume that some systems already use 4 bytes internally to represent characters, just to avoid the complexity of these surrogate pairs (2 x 2-byte characters in UTF-16).
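For example, since 1.5 the Character and String classes can detect surrogates and reassemble a pair into the original code point (again using U+1D11E as the test character):

```java
public class SurrogateApi {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        // codePointAt reassembles the pair into the original code point:
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e
        // codePointCount gives the "real" character count, not the char count:
        System.out.println(s.codePointCount(0, s.length()));        // 1
    }
}
```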

So another item for my todo list: experiment a bit with converting text containing such surrogates from UTF-8 to UTF-16 and back, in particular in e.g. file adapters of integration solutions.
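As a first experiment, a minimal round trip (assuming UTF-16BE to keep the byte count free of a byte order marker): the same supplementary character takes 4 bytes in both encodings, and converting to UTF-8 and back is lossless.

```java
import java.io.UnsupportedEncodingException;

public class RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = new String(Character.toChars(0x1D11E));
        byte[] utf8 = original.getBytes("UTF-8");     // 4 bytes: F0 9D 84 9E
        byte[] utf16 = original.getBytes("UTF-16BE"); // 4 bytes: the surrogate pair D834 DD1E
        System.out.println(utf8.length + " " + utf16.length);
        // Decode the UTF-8 bytes back into a (UTF-16 based) Java String:
        String back = new String(utf8, "UTF-8");
        System.out.println(original.equals(back));    // true: nothing lost
    }
}
```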


Emile said...

Thanks Guy. It will help me when (again) struggling with character encoding issues. ;-)


Emile Hermans