Update String class to use C++20 char8_t* instead of char*

Started by June 08, 2023 08:52 PM
3 comments, last by kwikc 1 year, 6 months ago

Hi,
C++20 introduced char8_t to replace char for UTF-8 text.
Should a String class that stores UTF-8 now store char8_t instead of char, with every "text" or 'c' in the codebase replaced by u8"text" or u8'c'?
It's important to note that going from char* to char8_t* requires copying value by value to avoid undefined behavior, which is an operation I'd rather avoid for performance reasons.
Thanks!

It is not a change that can be made lightly. What are you doing with your strings? Can you assume, and check, that char is unsigned and 8 bits wide and thus has the same representation as char8_t? If it doesn't, which strings need to be converted in either direction?

Omae Wa Mou Shindeiru

I wrote that because of what the author of this C++ change says here:
https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char

It's important to note that going from char* to char8_t* requires copying value by value to avoid undefined behavior,

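If a char *p or char8_t *p8 contains only ASCII characters, the bytes are identical under either type because UTF-8 is compatible with ASCII, so (char8_t*)p or (char*)p8 will work in practice. Strictly speaking, only the (char*)p8 direction is guaranteed by the standard, since char may alias any object type and char8_t may not.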

Should a String class that stores UTF-8 now store char8_t instead of char, with every "text" or 'c' in the codebase replaced by u8"text" or u8'c'?

If the String class uses char*, I don't think it's necessary. For new UTF-8 strings like u8"text", use reinterpret_cast<char*> to store them and reinterpret_cast<char8_t*> after reading them back out. reinterpret_cast leaves the underlying bytes intact.

See here for UTF-8 encoding details: https://en.wikipedia.org/wiki/UTF-8

This topic is closed to new replies.
