Checking Unicode string for whitespace - byte for byte?

For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high-bit set and all ASCII characters have the high bit unset. Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
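As a rough illustration of the byte-for-byte check, here is a minimal C++ sketch. The function names `is_ascii_space` and `find_ascii_space_utf8` are hypothetical and not from the original code base, and the sketch assumes the input is valid UTF-8 and that "whitespace" means the six ASCII whitespace characters only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for the six ASCII whitespace bytes
// (space, \t, \n, \v, \f, \r). All of them have the high bit unset.
inline bool is_ascii_space(unsigned char b) {
    return b == 0x20 || (b >= 0x09 && b <= 0x0D);
}

// Scans a UTF-8 string one byte at a time and returns the index of the
// first ASCII whitespace byte, or std::string::npos if there is none.
// This is safe for UTF-8 because every byte of a multibyte sequence
// has the high bit set and so can never equal an ASCII value.
inline std::size_t find_ascii_space_utf8(const std::string& utf8) {
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (is_ascii_space(static_cast<unsigned char>(utf8[i]))) {
            return i;
        }
    }
    return std::string::npos;
}
```

Calling `find_ascii_space_utf8("\xC3\x98ystein x")` skips over the two bytes of "Ø" and reports the space at byte index 8. Note that this only finds ASCII whitespace; non-ASCII Unicode spaces such as U+00A0 are multibyte in UTF-8 and would need a separate check.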

You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work in another language).

For checking whitespace, every second byte being null does not really matter, but I guess I'll just drop it. UTF-8 is the one most widely used anyway, and is much better than nothing. Thank you.

Answer accepted (you were first, so I guess that'd be most fair, though the other answers are good too). Good news. – Øystein Oct 30 '10 at 0:15

The null-byte issue causes other problems, because many ASCII-based functions interpret the null byte as the end-of-string marker. You won't avoid your particular problem, because sometimes the high byte will happen to be 0x20, which coincides with the space character. – Marcelo Cantos Oct 30 '10 at 0:32

Okay, so UTF-8 it is then :) – Øystein Oct 30 '10 at 0:35

In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never the second or third byte of a multibyte UTF-8 sequence.

However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of a two-byte encoding of an ASCII character.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces.

In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8. You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
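To make the C0 A0 example concrete, here is a small C++ sketch. Both function names are hypothetical. The first applies the two-byte decoding arithmetic without the overlong check that a conforming decoder would perform; the second flags the two lead bytes (0xC0 and 0xC1) that can only begin such a two-byte encoding of an ASCII value.

```cpp
#include <string>

// Hypothetical helper: applies the two-byte UTF-8 bit layout
// (110xxxxx 10xxxxxx) with no validation at all. A conforming decoder
// would reject results below 0x80 as overlong; this one does not.
inline unsigned naive_decode_two_bytes(unsigned char lead, unsigned char cont) {
    return ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
}

// Hypothetical helper: true if the buffer contains 0xC0 or 0xC1.
// Neither byte appears in correctly encoded UTF-8, so finding one means
// an ASCII character may have been hidden in a two-byte form.
inline bool has_overlong_ascii_lead(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b == 0xC0 || b == 0xC1) {
            return true;
        }
    }
    return false;
}
```

With these, `naive_decode_two_bytes(0xC0, 0xA0)` yields 0x20, which is the disguised space described above, while a plain byte scan for 0x20 would not see it.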

Yes, that will be the user's responsibility - not mine. – Øystein Oct 30 '10 at 0:12

Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.

For UTF-8, any continuation bytes always start with the bits 10, making them greater than 0x7f, so there's no chance they could be mistaken for an ASCII space. You can see this in the following table:

    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx

    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx

    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx

You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
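The bit patterns described here can be sketched as a small byte classifier in C++. The enum and function names are hypothetical, and the classification looks only at the leading bits, so it does not by itself reject bytes that are pattern-valid but never occur in correct UTF-8.

```cpp
// Hypothetical classification of a single byte by its leading bits.
enum class Utf8ByteKind { Ascii, Continuation, Lead2, Lead3, Lead4, Invalid };

inline Utf8ByteKind classify_utf8_byte(unsigned char b) {
    if ((b & 0x80) == 0x00) return Utf8ByteKind::Ascii;         // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return Utf8ByteKind::Continuation;  // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return Utf8ByteKind::Lead2;         // 110xxxxx
    if ((b & 0xF0) == 0xE0) return Utf8ByteKind::Lead3;         // 1110xxxx
    if ((b & 0xF8) == 0xF0) return Utf8ByteKind::Lead4;         // 11110xxx
    return Utf8ByteKind::Invalid;                               // 11111xxx
}
```

Only the `Ascii` case has the high bit clear, so a byte equal to 0x20 is always classified as `Ascii` and can only be a real space.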

See Wikipedia's UTF-8 article for more detail.

UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).

For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either. See Wikipedia's UTF-16 article for more detail.

Finally, UTF-32 (see Wikipedia) is big enough to represent all of the Unicode code points, so no special encoding is required.
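The difference between scanning UTF-16 by code unit and by byte can be shown with a short C++ sketch. Both function names are hypothetical; the second one deliberately does the unsafe thing so the false positives are visible.

```cpp
#include <cstddef>
#include <string>

// Counts spaces by comparing whole 16-bit code units: the safe approach.
inline std::size_t count_spaces_utf16_units(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit == u' ') {
            ++n;
        }
    }
    return n;
}

// Counts 0x20 bytes in the raw memory of the same string: the unsafe
// approach, shown only to demonstrate the false positives it produces.
inline std::size_t count_0x20_bytes(const std::u16string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size() * sizeof(char16_t); ++i) {
        if (p[i] == 0x20) {
            ++n;
        }
    }
    return n;
}
```

For a string holding only U+2020 (the dagger character), the unit-wise count is 0, but the byte-wise count is 2, because both bytes of that code unit are 0x20 regardless of byte order.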

– Øystein Oct 30 '10 at 0:19

@oystein, yes, that's why I said you shouldn't process them byte-by-byte - clarified. – paxdiablo Oct 30 '10 at 0:22

Sorry, I don't have a choice, but thanks for clarifying. – Øystein Oct 30 '10 at 0:27

@oystein, no problems, the bottom line is that what you're proposing is safe for UTF-8 but not for the other two encodings. But I'm not sure I understand your reluctance, most C compilers would have a native 16-bit and 32-bit data type that you could use, with very little sacrificed in speed. However, you know more about your requirements and constraints than I do, so I won't try to second-guess you. – paxdiablo Oct 30 '10 at 0:35

It's more about what's already written than what can be written. As commented below, rewriting and retesting thousands of lines of code just to get UTF-16 & UTF-32 support is not a nice thought... Unicode support is just something I figured I'd slap on now, if I could get it to work without rewriting too much. – Øystein Oct 30 '10 at 0:41

It is strongly suggested not to work against bytes when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide a mechanism for determining these kinds of things.

For example, in Java you can use the Character class's isSpace()/isSpaceChar()/isWhitespace() methods for your use case.

Bah, Java :) I'm afraid I'm in the dark oblivion of some pretty nasty low-level-C++ code here, so I'm kind of on my own. If I had other options I'd probably grab them - fast. – Øystein Oct 30 '10 at 0:26

then you should be using the i18n library icu-project.org/apiref/icu4c – Pangea Oct 30 '10 at 0:35

'fraid not, rewriting and retesting thousands of lines of code to use a different library just to get UTF-16 & UTF-32 support is not really an option... It might be usable for other people seeing this question, though. – Øystein Oct 30 '10 at 0:39

If Java truly supported Unicode natively, you'd think its char could (always) hold one. But it can't, so it's all just a terrible kludge. – tchrist Oct 30 '10 at 2:16
