MySQL++

Check-in [0bd33dc4fc]

Overview
Comment: Modified the "most economically valuable" stuff in the userman's Unicode chapter to handle the "except for emoji" case.
SHA3-256: 0bd33dc4fce31306a6af9c11415241c89ea1edef8ed09cac8bb47d0270a796c2
User & Date: tangent 2018-07-27 04:45:20
Context
2018-07-27
05:00 Squished Clang complaint in pedantic builds about beemutex's pmutex_ private member being unused when thread-awareness is not enabled. check-in: a014eece1d user: tangent tags: trunk
04:45 Modified the "most economically valuable" stuff in the userman's Unicode chapter to handle the "except for emoji" case. check-in: 0bd33dc4fc user: tangent tags: trunk
04:39 Polishing pass on the new Unicode material in the user manual. check-in: 8469cf623d user: tangent tags: trunk
Changes

Changes to doc/userman/unicode.dbx.

Old version (lines 26-46 and 139-159):

    common 7-bit ASCII subset.  Either people used approximations like a
    plain &#x201C;c&#x201D; instead of the French &#x201C;&ccedil;&#x201D;,
    or they invented things like HTML entities
    (&#x201C;&amp;ccedil;&#x201D; in this case) to encode these additional
    characters using only 7-bit ASCII.</para>

    <para>Unicode solves this problem. It encodes every character used
    for writing in the world, using up to 4 bytes per character.  The
    subset covering the most economically valuable cases takes two bytes
    per character, so many Unicode-aware programs only support this
    subset, storing characters as 2-byte values, rather than use 4-byte
    characters so as to cover all possible cases, however rare. This
    subset of Unicode is called the Basic Multilingual Plane, or
    BMP.</para>

    <para>Unfortunately, Unicode was invented about two decades too late
    for Unix and C. Those decades of legacy created an immense inertia
    preventing a widespread move away from 8-bit characters. MySQL and
    C++ come out of these older traditions, and so they share the same
    practical limitations. MySQL++ doesn&#x2019;t have any code in it
    for Unicode conversions, and it likely never will; it just passes
................................................................................
    in two versions. One version supports only 1-byte
    &#x201C;ANSI&#x201D; characters (a superset of ASCII), so they end
    in 'A'. Windows also supports the 2-byte subset of Unicode called
    <ulink
    url="http://en.wikipedia.org/wiki/UCS-2">UCS-2</ulink><footnote><para>Since
    Windows XP, Windows actually uses the <ulink
    url="http://en.wikipedia.org/wiki/UTF-16">UTF-16</ulink> encoding,
    not UCS-2.  This means that if you use characters beyond the 16-bit
    &ldquo;BMP&rdquo; range, they get encoded as 4-byte characters. But
    again, since the most economically valuable subset of Unicode is the
    BMP, many programs ignore this distinction and treat modern Windows
    as supporting 2-byte characters.</para></footnote>. Some call these
    &#x201C;wide&#x201D; characters, so the other set of functions end
    in 'W'. The <function><ulink
    url="http://msdn.microsoft.com/library/en-us/winui/winui/windowsuserinterface/windowing/dialogboxes/dialogboxreference/dialogboxfunctions/messagebox.asp">MessageBox</ulink>()</function>
    API, for instance, is actually a macro, not a real function. If you
    define the <symbol>UNICODE</symbol> macro when building your
    program, the <function>MessageBox()</function> macro evaluates to
    <function>MessageBoxW()</function>; otherwise, to
    <function>MessageBoxA()</function>.</para>

New version (lines 26-45 and 138-159):

    common 7-bit ASCII subset.  Either people used approximations like a
    plain &#x201C;c&#x201D; instead of the French &#x201C;&ccedil;&#x201D;,
    or they invented things like HTML entities
    (&#x201C;&amp;ccedil;&#x201D; in this case) to encode these additional
    characters using only 7-bit ASCII.</para>

    <para>Unicode solves this problem. It encodes every character used
    for writing in the world, using up to 4 bytes per character.  Before
    emoji became popular, the subset covering the most economically
    valuable cases fit into the lower 65536 code points, so you could
    encode most texts using only two bytes per character.  Many
    nominally Unicode-aware programs only support this subset, called
    the Basic Multilingual Plane, or BMP.</para>

    <para>Unfortunately, Unicode was invented about two decades too late
    for Unix and C. Those decades of legacy created an immense inertia
    preventing a widespread move away from 8-bit characters. MySQL and
    C++ come out of these older traditions, and so they share the same
    practical limitations. MySQL++ doesn&#x2019;t have any code in it
    for Unicode conversions, and it likely never will; it just passes
................................................................................
    in two versions. One version supports only 1-byte
    &#x201C;ANSI&#x201D; characters (a superset of ASCII), so they end
    in 'A'. Windows also supports the 2-byte subset of Unicode called
    <ulink
    url="http://en.wikipedia.org/wiki/UCS-2">UCS-2</ulink><footnote><para>Since
    Windows XP, Windows actually uses the <ulink
    url="http://en.wikipedia.org/wiki/UTF-16">UTF-16</ulink> encoding,
    not UCS-2. This means that if you use characters beyond the 16-bit
    BMP range, they get encoded as 4-byte characters. But again, since
    the most economically valuable subset of Unicode is the BMP if you
    ignore emoji, many programs ignore this distinction and assume
    Unicode strings on Windows are always 2 bytes per
    character.</para></footnote>. Some call these &#x201C;wide&#x201D;
    characters, so the other set of functions end in 'W'. The
    <function><ulink
    url="http://msdn.microsoft.com/library/en-us/winui/winui/windowsuserinterface/windowing/dialogboxes/dialogboxreference/dialogboxfunctions/messagebox.asp">MessageBox</ulink>()</function>
    API, for instance, is actually a macro, not a real function. If you
    define the <symbol>UNICODE</symbol> macro when building your
    program, the <function>MessageBox()</function> macro evaluates to
    <function>MessageBoxW()</function>; otherwise, to
    <function>MessageBoxA()</function>.</para>