Sunday, May 18, 2008

3 byte characters in 2 bytes?

Character encodings remain a challenge in many integration projects. Just had a customer asking: how can a 3 byte (UTF-8) character fit in a 2 byte (UTF-16) character?

A simple question, I thought: (modern) programming languages and operating systems use 2 bytes to represent a single character, which gives room for 2^16 characters. Although that does not cover all characters, I was assuming that 2 bytes were sufficient.

I just learned that some characters in UTF-16 are encoded as 4 bytes (2 x 2 bytes), a so-called surrogate pair. The first 2 bytes of such a pair (the high surrogate) are in the range D800-DBFF, the last 2 bytes (the low surrogate) are in the range DC00-DFFF. This way, UTF-16 supports a little over a million characters.
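As a quick illustration (my own sketch, using only java.lang.Character; the code point U+10400 is just an example), this is how Java splits a supplementary code point into its surrogate pair:

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10400; // any code point above U+FFFF needs a surrogate pair

        // Character.toChars (since Java 1.5) returns the UTF-16 representation
        char[] utf16 = Character.toChars(codePoint);

        // The high surrogate lands in D800-DBFF, the low surrogate in DC00-DFFF
        System.out.printf("U+%X -> high: %04X, low: %04X%n",
                codePoint, (int) utf16[0], (int) utf16[1]);
        // Prints: U+10400 -> high: D801, low: DC00
    }
}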

For the exchange of data in application integration scenarios, UTF-8 is recommended (a small sketch after this list illustrates the last two points):
- No byte order, no need for a byte order marker
- No zero bytes (making life easy for all those C programmers)
- ASCII represented unchanged
- Compact encoding of Western European characters
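
A small sketch of those last two points (my own example strings, standard String API only): the same text encoded in UTF-8 and UTF-16.

import java.io.UnsupportedEncodingException;

public class EncodingSizes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String ascii = "hello";   // plain ASCII
        String western = "café";  // Western European accented character

        // ASCII stays 1 byte per character in UTF-8; UTF-16 doubles it
        System.out.println(ascii.getBytes("UTF-8").length);    // 5
        System.out.println(ascii.getBytes("UTF-16BE").length); // 10

        // 'é' takes 2 bytes in UTF-8, still compact
        System.out.println(western.getBytes("UTF-8").length);    // 5
        System.out.println(western.getBytes("UTF-16BE").length); // 8
    }
}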

When diving into the Javadoc, the Character class turns out to be aware of surrogates (at least since Java 1.5). And I assume that some systems already use 4 bytes internally to represent characters, just to avoid the complexity of these surrogate pairs (2 x 2 byte characters in UTF-16).
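
To see this surrogate awareness in action, here is a minimal sketch (the sample string is my own example; the methods are on java.lang.Character and java.lang.String since 1.5):

public class SurrogateAwareness {
    public static void main(String[] args) {
        // U+10400 is written here as the surrogate pair D801 DC00
        String s = "A\uD801\uDC00B";

        System.out.println(s.length());                       // 4 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 3 real characters

        // Walk the string per code point rather than per char
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%X%n", cp);
            i += Character.charCount(cp); // 2 for supplementary code points
        }
    }
}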

So another item for my to-do list: experiment a bit with the conversion of text containing such surrogates from UTF-8 to UTF-16 and back, in particular in e.g. the file adapters of integration solutions.
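
A first version of that experiment could look like this (a minimal sketch, not tied to any particular integration product; U+10400 is again my example character):

import java.io.UnsupportedEncodingException;

public class RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // U+10400 ends up as a surrogate pair in the Java String (UTF-16)
        String original = new String(Character.toChars(0x10400));

        byte[] utf8 = original.getBytes("UTF-8"); // 4 bytes: F0 90 90 80
        String back = new String(utf8, "UTF-8");  // decode back to UTF-16

        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // prints F0 90 90 80
        }
        System.out.println();
        System.out.println(original.equals(back)); // true: nothing is lost
    }
}

If the round trip prints true, the surrogate pair survived the UTF-8 detour intact.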

2 comments:

Emile said...

Thanks Guy. It will help me when (again) struggling with character encoding issues. ;-)

Cheers,

Emile Hermans