Hi,
C++20 introduced char8_t to replace char in a UTF-8 environment.
Should a String class that stores UTF-8 now store char8_t instead of char, and should every "text" or 'c' be replaced by u8"text" or u8'c' across the entire codebase?
It's important to note that converting char* to char8_t* requires copying the values one by one to avoid undefined behavior, which is an operation to avoid for performance reasons.
Thanks!
Update String class to use C++20 char8_t* instead of char*
It is not a change that can be made lightly. What are you doing with your strings? Can you assume, and check, that char is unsigned and 8 bits and thus the same as char8_t? If it isn't, what strings need to be converted in either direction?
Omae Wa Mou Shindeiru
I wrote that because of the author of this C++ change:
https://stackoverflow.com/questions/57402464/is-c20-char8-t-the-same-as-our-old-char
It's important to note that converting char* to char8_t* requires copying the values one by one to avoid undefined behavior,
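If a char *p or char8_t *p8 points to only ASCII characters, the bytes are the same either way, because UTF-8 is compatible with ASCII, so (char8_t*)p or (char*)p8 sees identical values. Strictly speaking, only the (char*)p8 direction is covered by the aliasing rules, since char may alias any type; reading char objects through a char8_t* is formally undefined whatever the content.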
Should a String class that stores UTF-8 now store char8_t instead of char, and should every "text" or 'c' be replaced by u8"text" or u8'c' across the entire codebase?
If the String class uses char*, I don't think the change is necessary. For new UTF-8 literals such as u8"text", use reinterpret_cast<const char*> to store them, and reinterpret_cast<const char8_t*> when reading them back. reinterpret_cast leaves the underlying bytes intact.
See https://en.wikipedia.org/wiki/UTF-8 for UTF-8 encoding details.