6. Using Unicode with MySQL++

6.1. A Short History of Unicode

...with a focus on relevance to MySQL++

In the old days, computer operating systems dealt only with 8-bit character sets. That allows for just 256 possible characters, yet the modern Western languages alone have more characters between them than will fit in such a set. Add in all the other languages of the world plus the various symbols people use in writing, and you have a real mess!

Since no standards body held sway over things like international character encoding in the early days of computing, many different character sets were invented. These character sets weren’t even standardized between operating systems, so heaven help you if you needed to move localized Greek text on a DOS box to a Russian Macintosh! The only way we got any international communication done at all was to build standards on top of the common 7-bit ASCII subset. Either people used approximations like a plain “c” instead of the French “ç”, or they invented things like HTML entities (“&ccedil;” in this case) to encode these additional characters using only 7-bit ASCII.

Unicode solves this problem. It encodes every character used for writing in the world, using up to 4 bytes per character. Before emoji became popular, the subset covering the most economically valuable cases fit into the lower 65536 code points, so you could encode most texts using only two bytes per character. Many nominally Unicode-aware programs only support this subset, called the Basic Multilingual Plane, or BMP.

Unfortunately, Unicode was invented about two decades too late for Unix and C. Those decades of legacy created an immense inertia preventing a widespread move away from 8-bit characters. MySQL and C++ come out of these older traditions, and so they share the same practical limitations. MySQL++ doesn’t have any code in it for Unicode conversions, and it likely never will; it just passes data along unchanged from the underlying MySQL C API, so you still need to be aware of these underlying issues.

During the development of the Plan 9 operating system (a kind of successor to Unix), Ken Thompson invented the UTF-8 encoding. UTF-8 is a superset of 7-bit ASCII and is compatible with C strings, since it never uses a 0 byte within a character, unlike other multi-byte Unicode encodings such as UTF-16. As a result, many programs that deal in text cope with UTF-8 data even though they have no explicit support for UTF-8; this graceful behavior falls directly out of UTF-8’s design.
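
To see concretely why this works, here is a minimal standalone sketch (not from the MySQL++ sources) that dumps the bytes of a UTF-8 string. No byte is 0 except the terminator, so plain C string handling passes the data through intact:

// Print the bytes of a UTF-8 string. The loop stops only at the
// terminating 0 byte, just as strlen() and friends would.
#include <cstdio>

int main()
{
  const char* s = "N\xc3\xbcrnberger";  // "Nürnberger"; C3 BC encodes "ü"
  for (const char* p = s; *p; ++p)
    printf("%02x ", (unsigned char)*p);
  printf("\n");
  return 0;
}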

6.2. Unicode in MySQL

Since MySQL comes out of the Unix world, and it predates the widespread use of UTF-8 in Unix, the early versions of MySQL had no explicit support for Unicode. You could store raw UTF-8 strings from the start, but the server didn’t know how to do things like sort a column of UTF-8 strings correctly.

MySQL 4.1 added the first explicit support for Unicode. This version of MySQL supported only the BMP, meaning that if you told it to expect strings to be in UTF-8, it could only use up to 3 bytes per character.

MySQL 5.5 was the first release to support Unicode completely. Because the BMP-only Unicode support had been in the wild for about 6 years by that point, and switching to the new character set requires a table rebuild, the new character set was named “utf8mb4” rather than changing the longstanding meaning of “utf8” in MySQL. This release also added “utf8mb3” as a new alias for the old BMP-only character set.

Finally, in MySQL 8.0, “utf8mb4” became the default character set. For backwards compatibility, “utf8” remains an alias for “utf8mb3.”

As of MySQL++ 3.2.4, we’ve defined the MYSQLPP_UTF8_CS and MYSQLPP_UTF8_COL macros, which expand to “utf8mb4” and “utf8mb4_general_ci” when you build MySQL++ against MySQL 5.5 or newer, and to “utf8” and “utf8_general_ci” otherwise. We use these macros in our resetdb example; you’re welcome to use them in your code as well.
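
For instance, here is one way you might apply these macros when creating a table. The table itself is a made-up example, and “con” is assumed to be an open mysqlpp::Connection:

// Build a CREATE TABLE query whose character set tracks the version
// of MySQL that MySQL++ was built against. Because the macros expand
// to string literals, the preprocessor splices them into the query.
mysqlpp::Query query = con.query();
query << "CREATE TABLE phrase ("
    "  id INT UNSIGNED NOT NULL AUTO_INCREMENT, "
    "  phrase VARCHAR(100) NOT NULL, "
    "  PRIMARY KEY (id)) "
    "ENGINE = InnoDB "
    "CHARACTER SET " MYSQLPP_UTF8_CS " "
    "COLLATE " MYSQLPP_UTF8_COL;
query.execute();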

6.3. Unicode on Unixy Systems

Linux and Unix have system-wide UTF-8 support these days. If your operating system is of 2001 or newer vintage, it probably has such support.

On such a system, the terminal I/O code understands UTF-8 encoded data, so your program doesn’t require any special code to correctly display a UTF-8 string. If you aren’t sure whether your system supports UTF-8 natively, just run the simple1 example: if the first item has two high-ASCII characters in place of the “ü” in “Nürnberger Brats”, you know it’s not handling UTF-8.

If your Unix doesn’t support UTF-8 natively, it likely doesn’t support any form of Unicode at all, for the historical reasons I gave above. Therefore, you will have to convert the UTF-8 data to the local 8-bit character set. The standard Unix function iconv() can help here. If your system doesn’t have the iconv() facility, there is a free implementation available from the GNU Project. Another library you might check out is IBM’s ICU. This is rather heavy-weight, so if you just need basic conversions, iconv() should suffice.
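
For example, here is a bare-bones sketch of such a conversion with iconv(). The choice of ISO-8859-1 as the local character set is an assumption for illustration, and real code needs more thorough error handling:

#include <iconv.h>
#include <string>

// Convert a UTF-8 string to the local 8-bit character set, assumed
// here to be ISO-8859-1. Returns an empty string on failure.
std::string FromUTF8(const std::string& in)
{
  iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
  if (cd == (iconv_t)-1) return std::string();

  std::string out(in.size(), '\0');  // Latin-1 is never longer than UTF-8
  char* inp = const_cast<char*>(in.data());
  size_t in_left = in.size();
  char* outp = &out[0];
  size_t out_left = out.size();

  size_t rc = iconv(cd, &inp, &in_left, &outp, &out_left);
  iconv_close(cd);
  if (rc == (size_t)-1) return std::string();  // e.g. untranslatable character

  out.resize(out.size() - out_left);
  return out;
}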

6.4. Unicode on Windows

Each Windows API function that takes a string actually comes in two versions. One version supports only 1-byte “ANSI” characters (a superset of ASCII), so its name ends in 'A'. Windows also supports the 2-byte subset of Unicode called UCS-2[17]; some call these “wide” characters, and the names of the functions that accept them end in 'W'. The MessageBox() API, for instance, is actually a macro, not a real function. If you define the UNICODE macro when building your program, the MessageBox() macro evaluates to MessageBoxW(); otherwise, to MessageBoxA().
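
For example, all three of these calls can coexist in one program; only the macro form is affected by UNICODE:

// The A/W split in action. The TEXT() macro likewise expands to a
// narrow or wide string literal depending on UNICODE.
MessageBoxA(NULL, "Hello", "8-bit ANSI strings", MB_OK);
MessageBoxW(NULL, L"Hello", L"Wide (UCS-2) strings", MB_OK);
MessageBox(NULL, TEXT("Hello"), TEXT("Whichever UNICODE selects"), MB_OK);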

Since MySQL uses the UTF-8 Unicode encoding and Windows uses UCS-2, you must convert data when passing text between MySQL++ and the Windows API. Since there’s no point in trying for portability — no other OS I’m aware of uses UCS-2 — you might as well use platform-specific functions to do this translation. Since version 2.2.2, MySQL++ ships with two Visual C++ specific examples showing how to do this in a GUI program. (In earlier versions of MySQL++, we did Unicode conversion in the console mode programs, but this was unrealistic.)

How you handle Unicode data depends on whether you’re using the native Windows API, or the newer .NET API. First, the native case:

// Convert a C string in UTF-8 format to UCS-2 format.
void ToUCS2(LPWSTR pcOut, int nOutLen, const char* kpcIn)
{
  MultiByteToWideChar(CP_UTF8, 0, kpcIn, -1, pcOut, nOutLen);
}

// Convert a UCS-2 string to C string in UTF-8 format.
void ToUTF8(char* pcOut, int nOutLen, LPCWSTR kpcIn)
{
  WideCharToMultiByte(CP_UTF8, 0, kpcIn, -1, pcOut, nOutLen, 0, 0);
}

These functions leave out some important error checking, so see examples/vstudio/mfc/mfc_dlg.cpp for the complete version.
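
As a hypothetical example of these helpers in use, you might display a UTF-8 value from a MySQL++ result row through the wide-character API like so; the column name and buffer size are invented:

// Show one UTF-8 column value with the 'W' message box API.
void ShowItemName(const mysqlpp::Row& row)
{
  wchar_t name[100];
  ToUCS2(name, 100, row["item"].c_str());
  MessageBoxW(NULL, name, L"Item name", MB_OK);
}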

If you’re building a .NET application (such as, perhaps, because you’re using Windows Forms), it’s better to use the .NET libraries for this:

// Convert a C string in UTF-8 format to a .NET String in UCS-2 format.
String^ ToUCS2(const char* utf8)
{
  return gcnew String(utf8, 0, strlen(utf8), System::Text::Encoding::UTF8);
}

// Convert a .NET String in UCS-2 format to a C string in UTF-8 format.
System::Void ToUTF8(char* pcOut, int nOutLen, String^ sIn)
{
  array<Byte>^ bytes = System::Text::Encoding::UTF8->GetBytes(sIn);
  nOutLen = Math::Min(nOutLen - 1, bytes->Length);
  System::Runtime::InteropServices::Marshal::Copy(bytes, 0,
    IntPtr(pcOut), nOutLen);
  pcOut[nOutLen] = '\0';
}

Unlike the native API versions, these examples are complete, since the .NET platform handles a lot of things behind the scenes for us. We don’t need any error-checking code for such simple routines.
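
A quick hypothetical round trip with these helpers looks like this:

const char* in = "N\xc3\xbcrnberger";  // UTF-8, e.g. fresh from MySQL++
String^ wide = ToUCS2(in);             // a .NET string now, ready for the GUI
char buf[64];
ToUTF8(buf, sizeof(buf), wide);        // back to UTF-8, ready for a query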

All of this assumes you’re using Windows NT or one of its direct descendants: Windows 2000, Windows XP, Windows Vista, Windows 7, or any “Server” variant of Windows. Windows 95 and its descendants (98 and ME) do not support Unicode. They still have the 'W' APIs for compatibility, but these just smash the data down to 8 bits and call the 'A' version for you.

6.5. For More Information

The Unicode FAQs page has copious information on this complex topic.

When it comes to Unix and UTF-8 specific items, the UTF-8 and Unicode FAQ for Unix/Linux is a quicker way to find basic information.



[17] Since Windows 2000, Windows has actually used the UTF-16 encoding, not UCS-2. This means that if you use characters beyond the 16-bit BMP range, they get encoded as 4-byte characters. But again, since the most economically valuable subset of Unicode is the BMP if you ignore emoji, many programs ignore this distinction and assume Unicode strings on Windows are always 2 bytes per character.