How to convert multi-byte UTF-8 character representation to one byte while retaining (non)alphanumeric property?

For each byte in the string: If it is an ASCII byte, just copy it. If it is a UTF-8 head byte, decode starting from that byte to wchar_t using mbrtowc, determine an ASCII character whose classification matches by comparing the results of the isw*() functions, and copy that ASCII character to the output. If it is anything else, skip it.

There's no way to do this in general, as letters outside the ASCII range (such as α) may be accented as well (ἄ). But you can apply the NFD Unicode normalization to decompose accented codepoints into their constituents, then check whether the components lie within the ASCII range. ICU has normalization support.

The highest Unicode code point is 1114111 (0x10FFFF), so there are over a million possible code points, while a single byte can represent only 256 values. So the simple answer is that you can't do it that way. As far as I understand from the question, you want this for random access to characters in the string.

You could use 32-bit characters (correct me if I am wrong). Rather than handling it in your own code, use ICU and its converter to convert the string to UTF-32 (4 bytes per character).

ucnv_convertEx is the function to use for this.
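To show what a fixed-width conversion buys you, here is a small hand-rolled decoder in plain C. It is not ICU and not a substitute for `ucnv_convertEx`: the name `utf8_to_utf32` is hypothetical, and the sketch deliberately does not reject overlong encodings or surrogate code points, which a real converter would.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ICU API): decode `len` bytes of UTF-8 from `s`
 * into at most `cap` code points in `out`. Returns the number of code points
 * written, or (size_t)-1 on malformed input or insufficient space.
 * Simplification: overlong forms and surrogates are not rejected. */
static size_t utf8_to_utf32(const unsigned char *s, size_t len,
                            uint32_t *out, size_t cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = s[i++];
        uint32_t cp;
        size_t extra;                /* continuation bytes still expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid head */

        if (extra > len - i)
            return (size_t)-1;       /* sequence truncated */

        for (; extra > 0; extra--) {
            if ((s[i] & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(s[i++] & 0x3F);
        }

        if (n == cap)
            return (size_t)-1;       /* output buffer full */
        out[n++] = cp;
    }
    return n;
}
```

After decoding, the k-th character is simply `out[k]`, which is the random access that a variable-width encoding such as UTF-8 cannot offer directly.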

I don't want to convert to a fixed-width encoding, and I don't mind replacing multi-byte UTF-8 character representations with ASCII characters that don't really correspond to them, as long as the replacement character's "alphanumeric" property corresponds to the replaced character's. So e.g. Ë could be replaced with a. – hasseg Mar 12 at 5:49

And do you want this even for other languages, e.g. Asian languages? – Zimbabao Mar 12 at 5:58

