Enter RPN: R47: Character Set

Motivation

The R47 character set is based on Unicode, but it is not fully compliant with any released standard, not even the original 1.0 from 1991, much less the present Version 17.0.0. There are a number of discrepancies worth knowing about as an end user, particularly if you intend to program your R47 and make use of its XPORTP feature to get an RTF file for viewing the code off-calculator.

Before we begin, you may wish to know about my rejig tool: its many features include solutions for the issues brought up here.

Unicode Discrepancies

The R47 makes heavy use of math symbols, superscripts, Greek letters, and so forth. In support of this, the project ships a custom font that must be installed on the host system in order for one of those RTF program files to display as intended.

(If you’re using the simulator, you should have done this already as part of the build instructions, else the simulated calculator display, bezel, and key caps will show incorrect glyphs for the same basic reason.)

As of this writing, this custom font defines 705 glyphs in a 15-bit subset of the available 21-bit range.

Unfortunately, there are a number of discrepancies, in several classes:

Count	Description
1	diacritic mismatch (ķ rendered as k̂)
1	“ℐ” drawn as double-struck capital I
1	“∜” drawn as xth-root, changing its meaning
1	“⇀” drawn as a short-armed arrow
1	“⇄” drawn in classic HP “swap” style
1	“ẝ” misused as the f-shift indicator
1	“Ϳ” misused as x-under-root (Coptic Greek “yot”)
1	“Ȳ” drawn as y-under-root (visually similar if you squint, but semantically different)
10	superscript Arabic digits at Roman Ⅰ thru Ⅹ
27	unassigned spots taken over; e.g. “x̅” in the block reserved for Coptic Greek
57	reassignments; e.g. Δ/∇-looking glyphs overlaying ⇉/⇋; x-over-y glyph overlaying ⧰
114	similar meaning but different rendering; e.g. “Ⓩ” used as “^Z”
217	TOTAL DISCREPANCIES

Much of this is harmless, as with the loss of lame characters like the parenthesized numerals, which can be adequately rendered without: Unicode ⑻ ≈ ASCII (8). Another example is that Unicode’s Roman numeral “Ⅷ” renders nearly identically to the plain ASCII alternative “VIII” in many fonts. We should not mourn the loss of these characters.

Where we have a problem is when meanings change.

Take the C47 font’s xth-root glyph, which overwrites what Unicode set aside as U+221C, the fourth-root glyph. One may say, “The C47/R47 doesn’t have a fourth-root feature, so we can safely take this over,” but that still makes us ask, “Why doesn’t the custom font put this nonstandard character up in the PUA where it belongs?”

I have no problem with the decision to leave the fourth-root character undefined in the custom font. My complaint is that it was overwritten with a glyph having a different meaning. If you copy-paste this character out of an XPORTP file into a document using a different font, it will visually change from xth-root to 4th-root! Pasting it back will restore the meaning, but why allow the confusion? If Unicode doesn’t define what you need, it is better that the paste shows an undefined character in the font you’re using, to clue you in to the problem, rather than hide it under another meaning.
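To see what a pasted character will mean outside the custom font, you can ask a standards-following Unicode database for its name. The following is a minimal sketch using Python's standard `unicodedata` module. The sample code points are the ones I take the table above to be referring to, and the names come from whichever Unicode version your Python ships, not from the C47 font:

```python
import unicodedata

# Code points from the discrepancy table whose C47 glyph differs in meaning
# from the standard Unicode assignment (selection is illustrative).
SAMPLES = ["\u221C", "\u2110", "\u037F", "\u0232", "\u1E9D"]


def standard_names(chars):
    """Map each character to its official Unicode name, keyed by 'U+XXXX'."""
    return {
        f"U+{ord(c):04X}": unicodedata.name(c, "<unassigned>")
        for c in chars
    }


if __name__ == "__main__":
    for code_point, name in standard_names(SAMPLES).items():
        print(code_point, name)
```

For U+221C this reports FOURTH ROOT, which is what any font other than the C47 one will draw in place of the xth-root glyph.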

Test Method

Those curious about the method used to come to the conclusions above may wish to study:

The program used to extract the defined code points from the C47 source code and produce the raw TSV file I examined for this study. The header comments go into further details on my method.
The TSV data file distilled from that process. You might wish to load that up into a spreadsheet and reapply the font changes suggested in that script’s header comment to make your own local evaluation, for instance.

A Plan to Improve This Situation

Reworking the R47 at this late date to drag it into line with Unicode would be a tremendous amount of work, to no visual benefit, but to considerable semantic benefit: copy-pasting text from an XPORTP output file into a document using a different font would not cause it to change meaning.

The following multi-step evolution would improve matters:

Strip the 16th bit.

The C47/R47 code restricts itself to a 15-bit subset¹ of Unicode by forcibly setting the high bit on all uint16_t character values beyond the 7-bit ASCII subset. This flags non-ASCII characters so that they can be recognized by testing the high bit in their first byte when stored in big-endian fashion, which in turn lets it use a single byte for text where ASCII suffices, saving considerable RAM and flash space.
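The scheme just described can be sketched in a few lines of Python. This is an illustration built only from the description above and the footnoted square-root example, not a transcription of the C47 source; the names `encode_char` and `decode` are made up for the purpose:

```python
HIGH_BIT = 0x8000  # flag set on every non-ASCII 16-bit character value


def encode_char(cp):
    """Encode one code point as described above (hypothetical helper)."""
    if cp < 0x80:
        return bytes([cp])  # plain ASCII: a single byte
    if cp >= HIGH_BIT:
        raise ValueError(f"U+{cp:04X} needs bit 15, which the flag occupies")
    return (cp | HIGH_BIT).to_bytes(2, "big")  # flagged, big-endian


def decode(data):
    """Decode a byte string into a list of code points."""
    out, i = [], 0
    while i < len(data):
        if data[i] & 0x80:  # high bit in the first byte: two-byte character
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            out.append(int.from_bytes(data[i:i + 2], "big") & 0x7FFF)
            i += 2
        else:  # ASCII: one byte
            out.append(data[i])
            i += 1
    return out


if __name__ == "__main__":
    print(encode_char(0x221A).hex())  # the square root symbol
```

Under these assumptions, the square root symbol encodes to the bytes A2 1A, matching the string literal cited in the footnote.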

The main downside — from our immediate perspective — is that reassigning the meaning of that top bit cuts off half the UCS-2 character space. That not only rules out these characters as sources of solutions to the discrepancies above, it means the C47 font can’t shove its nonstandard characters into a PUA block since the first is way up at U+E000–U+F8FF, requiring that sixteenth bit.
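A quick check makes the PUA problem concrete. The sketch below assumes only what is stated above: bit 15 is reserved as the non-ASCII flag, and the first PUA block spans U+E000 through U+F8FF. The helper names are hypothetical:

```python
PUA = range(0xE000, 0xF8FF + 1)  # Private Use Area in the 16-bit plane
FLAG = 0x8000                    # bit 15, reserved as the non-ASCII marker


def reachable(cp):
    """True if the code point fits in the 15 bits left for the value."""
    return 0 <= cp < FLAG


def alias_of(stored):
    """Code point a stored 16-bit value decodes to once the flag is masked."""
    return stored & 0x7FFF


if __name__ == "__main__":
    print(any(reachable(cp) for cp in PUA))
    print(f"stored 0xE000 decodes as U+{alias_of(0xE000):04X}")
```

No PUA code point is reachable, and the stored value 0xE000 is already spoken for: it is how the scheme would represent U+6000.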

Much the same benefit results from the more complicated UTF-8 encoding,² which solves the entire problem by representing every Unicode character in a variable-length form taking 1 to 4 bytes each. We can dream of a UTF-8 based R47, but it won’t happen any time soon.
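For comparison, here is how UTF-8 sizes a handful of characters, using nothing but Python's built-in codec. The sample set is my own choice, picked to hit each of the four lengths plus the first PUA code point:

```python
SAMPLES = {
    "ASCII letter A": "A",
    "e-acute U+00E9": "\u00E9",
    "square root U+221A": "\u221A",
    "first PUA code point U+E000": "\uE000",
    "beyond 16 bits U+1F600": "\U0001F600",
}


def utf8_len(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label}: {utf8_len(ch)} byte(s)")
```

ASCII stays at one byte, as in the C47 scheme, and the PUA becomes reachable. The cost is that a character like the square root symbol takes three bytes rather than two.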

Move all wholly custom characters to the PUA.

That’s what it’s for. No compatibility will be lost with this change, because there is no cross-font compatibility in this case regardless.

This includes the characters currently squatting on unassigned spots in Unicode. Future standards may provide new characters here, which we won’t be able to take advantage of under the current situation.
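As a rough picture of what such a move amounts to for exported text, the sketch below remaps a few of the overloaded code points into the PUA with `str.translate`. The PUA assignments are entirely hypothetical; neither the C47 project nor Unicode defines them, and a real allocation would need to be agreed with the font's maintainers:

```python
# HYPOTHETICAL remapping: the target code points are invented for illustration.
PUA_REMAP = {
    0x221C: 0xE000,  # xth-root glyph, currently on FOURTH ROOT
    0x037F: 0xE001,  # x-under-root glyph, currently on GREEK CAPITAL LETTER YOT
    0x0232: 0xE002,  # y-under-root glyph, currently on Y WITH MACRON
}


def to_pua(text):
    """Replace overloaded code points with their (hypothetical) PUA homes."""
    return text.translate(PUA_REMAP)


if __name__ == "__main__":
    sample = "\u221Cy"
    print([f"U+{ord(c):04X}" for c in to_pua(sample)])
```

Text converted this way shows an undefined-character box in fonts lacking those PUA glyphs, rather than silently changing meaning.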

Consider moving the rest on a case-by-case basis.

The bulk of the current discrepancies are harmless, so the main reason not to move everything to the PUA once it is open for use is to limit the disruption the move causes. The fewer things that change at once, the simpler debugging those changes will be.

(You may now wish to return to my R47 article index.)


¹ For instance, the square root symbol at U+221A is referenced in the code as "\xA2\x1A" = 0xA21A = 0x8000 | 0x221A.
² If the C47 scheme is standardized anywhere, I'm not aware of it.