<uchar.h>
Unicode utility functionsFunction | Description |
---|---|
c16rtomb() |
Convert a char16_t to a multibyte character |
c32rtomb() |
Convert a char32_t to a multibyte character |
mbrtoc16() |
Convert a multibyte character to a char16_t |
mbrtoc32() |
Convert a multibyte character to a char32_t |
These functions are restartable, meaning multiple threads can safely call them at once. They handle this by having their own conversion state variable (of type mbstate_t
) per call.
This header file defines four types.
Type | Description |
---|---|
char16_t |
Type to hold 16-bit characters |
char32_t |
Type to hold 32-bit characters |
mbstate_t |
Holds the conversion state for restartable funcitons (also defined in <wchar.h> ) |
size_t |
To hold various counts (also defined in <stddef.h> ) |
String literals for the character types are u
for char16_t
and U
for char32_t
.
*str1 = u"Hello, world!";
char16_t *str2 = U"Hello, world!";
char32_t
*chr1 = u'A';
char16_t *chr2 = U'B'; char32_t
Note that char16_t
and char32_t
might contain Unicode. Or not. If __STDC_UTF_16__
or __STDC_UTF_32__
is defined as 1
, then char16_t
and char32_t
use Unicode, respectively. Otherwise they don’t and the actual value stored depend on the locale. And if you’re not using Unicode, you have my commiserations.
This header file doesn’t exist on OS X—bummer. If you just want the types, you can:
#include <stdint.h>
typedef int_least16_t char16_t;
typedef int_least32_t char32_t;
But if you also want the functions, that’s all on you.
mbrtoc16()
mbrtoc32()
Convert a multibyte character to a char16_t
or char32_t
restartably
#include <uchar.h>
size_t mbrtoc16(char16_t * restrict pc16, const char * restrict s, size_t n,
* restrict ps);
mbstate_t
size_t mbrtoc32(char32_t * restrict pc32, const char * restrict s, size_t n,
* restrict ps); mbstate_t
Given a source string s
and a destination buffer pc16
(or pc32
for mbrtoc32()
), convert the first character of the source to char16_t
s (or char32_t
s for mbrtoc32()
).
Basically you have a regular character and you want it as char16_t
or char32_t
. Use these functions to do it. Note that only one character is converted no matter how many characters in s
.
As the functions scan s
, you don’t want them to overrun the end. So you pass in n
as the maximum number of bytes to inspect. The functions will quit after that many bytes or when they have a complete multibyte character, whichever comes first.
Since they’re restartable, pass in a conversion state variable for the functions to do their work.
And the result will be placed in pc16
(or pc32
for mbrtoc32()
).
When successful this function returns a number between 1
and n
inclusive representing the number of bytes that made up the multibyte character.
Or, also in the success category, they can return 0
if the source character is the NUL character (value 0
).
When not entirely successful, they can return a variety of codes. These are all of type size_t
, but negative values cast to that type.
Return Value | Description |
---|---|
(size_t)(-1) |
Encoding error—this isn’t a valid sequence of bytes. errno is set to EILSEQ . |
(size_t)(-2) |
n bytes were examined and were a partial valid character, but not a complete one. |
(size_t)(-3) |
A subsequent value of a character that can’t be represented as a single value. See below. |
Case (size_t)(-3)
is an odd one. Basically there are some characters that can’t be represented with 16 bits and so can’t be stored in a char16_t
. These characters are store in something called (in the Unicode world) surrogate pairs. That is, there are two 16-bit values back to back that represent a larger Unicode value.
For example, if you want to read the Unicode character \U0001fbc5
(which is a stick figure76—I’m just not putting it in the text because my font doesn’t render it) that’s more than 16 bits. But each call to mbrtoc16()
only returns a single char16_t
!
So subsequent calls to mbrtoc16()
resolves the next value in the surrogate pair and returns (size_t)(-3)
to let you know this has happened.
You can also pass NULL
for pc16
or pc32
. This will cause no result to be stored, but you can use it if you’re only interested in the return value from the functions.
Finally, if you pass NULL
for s
, the call is equivalent to:
(NULL, "", 1, ps) mbrtoc16
Since the character is a NUL in that case, this has the effect of setting the state in ps
to the initial conversion state.
Normal use case example where we get the first two character values from the multibyte string "€Zillion"
:
#include <uchar.h>
#include <stdio.h> // for printf()
#include <locale.h> // for setlocale()
#include <string.h> // for memset()
int main(void)
{
char *s = "\u20acZillion"; // 20ac is "€"
;
char16_t pc16size_t r;
;
mbstate_t mbs
(LC_ALL, "");
setlocale(&mbs, 0, sizeof mbs);
memset
// Examine the next 8 bytes to see if there's a character in there
= mbrtoc16(&pc16, s, 8, &mbs);
r
("%zu\n", r); // Prints a value >= 1 (3 in UTF-8 locale)
printf("%#x\n", pc16); // Prints 0x20ac for "€"
printf
+= r; // Move to next character
s
// Examine the next 8 bytes to see if there's a character in there
= mbrtoc16(&pc16, s, 8, &mbs);
r
("%zu\n", r); // Prints 1
printf("%#x\n", pc16); // Prints 0x5a for "Z"
printf}
Example with a surrogate pair. In this case we read plenty to get the entire character, but the result must be stored in two char16_t
s, requiring two calls to get them both.
#include <uchar.h>
#include <stdio.h> // for printf()
#include <string.h> // for memset()
#include <locale.h> // for setlocale()
int main(void)
{
char *s = "\U0001fbc5*"; // Stick figure glyph, more than 16 bits
char16_t pc16;
mbstate_t mbs;
size_t r;
setlocale(LC_ALL, "");
memset(&mbs, 0, sizeof mbs);
r = mbrtoc16(&pc16, s, 8, &mbs);
printf("%zd\n", r); // r is 4 bytes in UTF-8 locale
printf("%#x\n", pc16); // First value of surrogate pair
s += r; // Move to next character
r = mbrtoc16(&pc16, s, 8, &mbs);
printf("%zd\n", r); // r is (size_t)(-3) here to indicate...
printf("%#x\n", pc16); // ...Second value of surrogate pair
// Since r is -3, it means we're still processing the same
// character, so DON'T move to the next character this time
//s += r; // Commented out
r = mbrtoc16(&pc16, s, 8, &mbs);
printf("%zd\n", r); // 1 byte for "*"
printf("%#x\n", pc16); // 0x2a for "*"
}
Output on my system, indicating the first character is represented by the pair (0xd83e, 0xdfc5)
and the second character is represented by 0x2a
:
4
0xd83e
-3
0xdfc5
1 0x2a
c16rtomb()
c32rtomb()
Convert a char16_t
or char32_t
to a multibyte character restartably
#include <uchar.h>
size_t c16rtomb(char * restrict s, char16_t c16, mbstate_t * restrict ps);
size_t c32rtomb(char * restrict s, char32_t c32, mbstate_t * restrict ps);
If you have a character in a char16_t
or char32_t
, use these functions to convert them into a multibyte character.
These functions figure out how many bytes are needed for the multibyte character in the current locale and stores them in the buffer pointed to by s
.
But how big to make that buffer? Luckily there is a macro to help: it needs be no larger than MB_CUR_MAX
.
As a special case, if s
is NULL
, it’s the same as calling
(buf, L'\0', ps); // or...
c16rtomb(buf, L'\0', ps); c32rtomb
where buf
is a buffer maintained by the system that you don’t have access to.
This has the effect of setting the ps
state to the initial state.
Finally for surrogate pairs (where the character has been split into two char16_t
s), you call this once with the first of the pair—at this point, the function will return 0
. Then you call it again with the second of the pair, and the function will return the number of bytes and store the result in the array s
.
Returns the number of bytes stored in the array pointed to by s
.
Returns 0 if processing is not yet complete for the current character, as in the case of surrogate pairs.
If there is an encoding error, the functions return (size_t)(-1)
and errno
is set to EILSEQ
.
#include <uchar.h>
#include <stdlib.h> // for MB_CUR_MAX
#include <stdio.h> // for printf()
#include <string.h> // for memset()
#include <locale.h> // for setlocale()
int main(void)
{
char16_t c16 = 0x20ac; // Unicode for Euro symbol
char dest[MB_CUR_MAX];
size_t r;
mbstate_t mbs;
setlocale(LC_ALL, "");
memset(&mbs, 0, sizeof mbs); // Reset conversion state
// Convert
r = c16rtomb(dest, c16, &mbs);
printf("r == %zd\n", r); // r == 3 on my system
// And this should print a Euro symbol
printf("dest == \"%s\"\n", dest);
}
Output on my system:
r == 3 dest == "€"
This is a more complex example that converts a large-valued character in a multibyte string into a surrogate pair (as in the mbrtoc16()
example, above) and then converts it back again into a multibyte string to print.
#include <uchar.h>
#include <stdlib.h> // for MB_CUR_MAX
#include <stdio.h> // for printf()
#include <string.h> // for memset()
#include <locale.h> // for setlocale()
int main(void)
{
char *src = "\U0001fbc5*"; // Stick figure glyph, more than 16 bits
char dest[MB_CUR_MAX];
char16_t surrogate0, surrogate1;
mbstate_t mbs;
size_t r;
setlocale(LC_ALL, "");
memset(&mbs, 0, sizeof mbs); // Reset conversion state
// Get first surrogate character
r = mbrtoc16(&surrogate0, src, 8, &mbs);
// Get next surrogate character
src += r; // Move to next character
r = mbrtoc16(&surrogate1, src, 8, &mbs);
printf("Surrogate pair: %#x, %#x\n", surrogate0, surrogate1);
// Now reverse it
memset(&mbs, 0, sizeof mbs); // Reset conversion state
// Process first surrogate character
r = c16rtomb(dest, surrogate0, &mbs);
// r should be 0 at this point, because the character hasn't been
// processed yet. And dest won't have anything useful... yet!
printf("r == %zd\n", r); // r == 0
// Process second surrogate character
r = c16rtomb(dest, surrogate1, &mbs);
// Now we should be in business. r should have the number of
// bytes, and dest should hold the character.
printf("r == %zd\n", r); // r == 4 on my system
// And this should print a stick figure, if your font supports it
printf("dest == \"%s\"\n", dest);
}