Character encodings remain a challenge in many integration projects. A customer just asked me: how can a 3-byte character (UTF-8) fit in a 2-byte (UTF-16) character?
A simple question, I thought: many (modern) programming languages and operating systems use 2 bytes to represent a single character. This gives room for 2^16 characters. Although that does not cover all characters, I assumed 2 bytes would be sufficient.
I just learned that some characters in UTF-16 are encoded as 4 bytes (2 x 2 bytes). These pairs are called surrogates: the first 2 bytes (the high surrogate) are in the range D800-DBFF, the last 2 bytes (the low surrogate) are in the range DC00-DFFF. This way, UTF-16 was designed to support a little over a million characters.
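To make this concrete, here is a minimal Java sketch (assuming Java 6+ for the Charset-based getBytes; the class name is just for illustration). It prints a 3-byte UTF-8 character (the euro sign), which is a single 2-byte unit in UTF-16, next to a 4-byte UTF-8 character (the musical G clef, U+1D11E), which shows up as a surrogate pair in exactly those D800-DBFF / DC00-DFFF ranges:

```java
import java.nio.charset.Charset;

public class SurrogateDemo {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");

        // U+20AC (euro sign): 3 bytes in UTF-8, but a single 2-byte unit in UTF-16
        String euro = "\u20AC";
        // U+1D11E (musical G clef): 4 bytes in UTF-8, a surrogate pair in UTF-16
        String clef = new String(Character.toChars(0x1D11E));

        for (String s : new String[] { euro, clef }) {
            System.out.printf("UTF-8 bytes: %d, UTF-16 units: %d ->",
                    s.getBytes(utf8).length, s.length());
            for (int i = 0; i < s.length(); i++) {
                System.out.printf(" U+%04X", (int) s.charAt(i));
            }
            System.out.println();
        }
    }
}
```

The clef prints as U+D834 U+DD1E: a high surrogate followed by a low surrogate.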
For exchange of data in application integration scenarios, UTF-8 is recommended (a small sketch after the list illustrates these points):
- No byte order, no need for a byte order marker
- No zero bytes (making life easy for all those C programmers)
- ASCII represented unchanged
- Compact encoding of Western European characters
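A minimal sketch of these properties (again assuming Java 6+; the class name is mine): ASCII comes out unchanged in UTF-8, while Java's UTF-16 encoder prepends a byte order mark and produces zero bytes, and a Western European character like 'é' still takes only 2 bytes in UTF-8:

```java
import java.nio.charset.Charset;

public class Utf8Properties {
    public static void main(String[] args) {
        String ascii = "Hi";
        String french = "caf\u00E9"; // 'é' (U+00E9)

        print("ASCII as UTF-8 ", ascii.getBytes(Charset.forName("UTF-8")));   // 48 69: unchanged ASCII
        print("ASCII as UTF-16", ascii.getBytes(Charset.forName("UTF-16")));  // FE FF 00 48 00 69: BOM plus zero bytes
        print("cafe  as UTF-8 ", french.getBytes(Charset.forName("UTF-8")));  // 63 61 66 C3 A9: 'é' takes 2 bytes
    }

    static void print(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b & 0xFF));
        System.out.println(sb);
    }
}
```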
When diving into the Javadoc of Character, the Character class turns out to be aware of surrogates (at least since Java 1.5). And I assume that some systems already use 4 bytes internally to represent characters (i.e., UTF-32), just to avoid the complexity of these surrogates (2 x 2 byte characters in UTF-16).
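For example, the surrogate-aware methods added in Java 5 let you walk a string per code point instead of per 16-bit char. A small sketch (class name hypothetical):

```java
public class CodePointWalk {
    public static void main(String[] args) {
        // 'A' followed by U+1D11E, which needs a surrogate pair in UTF-16
        String s = "A" + new String(Character.toChars(0x1D11E));

        // Walk the string per code point, not per 16-bit char
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i); // combines a surrogate pair into one code point
            System.out.printf("U+%04X uses %d char(s); high surrogate here? %b%n",
                    cp, Character.charCount(cp), Character.isHighSurrogate(s.charAt(i)));
            i += Character.charCount(cp);
        }
    }
}
```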
So another item for my to-do list: experiment a bit with converting text containing such surrogates from UTF-8 to UTF-16 and back, in particular in e.g. the file adapters of integration solutions.
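A starting point for that experiment might look like this minimal round-trip sketch (Java 6+ assumed for the Charset-based String constructor; the class name is mine):

```java
import java.nio.charset.Charset;

public class RoundTrip {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset utf16 = Charset.forName("UTF-16");

        // Text with a supplementary character: 4 bytes in UTF-8, a surrogate pair in UTF-16
        String original = "clef: " + new String(Character.toChars(0x1D11E));

        // UTF-8 -> String -> UTF-16 -> String
        byte[] asUtf8 = original.getBytes(utf8);
        String decoded = new String(asUtf8, utf8);
        byte[] asUtf16 = decoded.getBytes(utf16);
        String roundTripped = new String(asUtf16, utf16);

        System.out.println("lossless round trip? " + original.equals(roundTripped));
        System.out.println("UTF-8: " + asUtf8.length + " bytes, UTF-16: " + asUtf16.length + " bytes");
    }
}
```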
Thanks Guy. It will help me when (again) struggling with character encoding issues. ;-)
Cheers,
Emile Hermans