You shouldn't really be guessing at what your input is. Define your code to accept either byte strings or Unicode character strings, and leave it to the caller to convert the input to the proper format (or provide some way for the caller to specify which kind of strings they're providing). If you define your code to accept byte strings, then any characters above \xFF are an error.
If you define your code to accept Unicode character strings, then you can convert them to bytes with Encode::encode_utf8() (and should do so regardless of how they're internally represented by Perl). In any case, calling utf8::is_utf8() is usually a mistake: your program should not care about the internal representation of strings, only about the actual data (a sequence of characters) they contain. Whether some of those characters (in particular, those in the range \x80 to \xFF) are internally represented by one or two bytes should not matter.
P.S. Reading perldoc Encode may help to clarify issues with bytes and characters in Perl.
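A minimal sketch of both points, under stated assumptions: assert_bytes is a hypothetical helper (not part of any module) that rejects characters above \xFF, and the second half shows that utf8::upgrade changes only the internal storage, so encode_utf8() yields the same bytes either way.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical helper: accept only byte strings, i.e. every character <= 0xFF.
sub assert_bytes {
    my ($s) = @_;
    die "Character above \\xFF in byte string\n" if $s =~ /[^\x00-\xFF]/;
    return $s;
}

my $bytes_ok = eval { assert_bytes("caf\xE9");  1 };   # all characters <= 0xFF
my $wide_ok  = eval { assert_bytes("\x{263A}"); 1 };   # U+263A is above 0xFF
print "byte string accepted\n" if $bytes_ok;
print "wide string rejected\n" if !$wide_ok;

# Two copies of the same one-character string, U+00E9.
my $plain    = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # changes only Perl's internal storage

# The internal flag differs between the two copies ...
my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# ... but they hold the same characters and encode to the same bytes.
my $same_chars = $plain eq $upgraded;
my $same_bytes = encode_utf8($plain) eq encode_utf8($upgraded);

print "internal flags: $flag_plain vs $flag_upgraded\n";
print "same characters\n"  if $same_chars;
print "same UTF-8 bytes\n" if $same_bytes;
```

Because the flag differs while the data does not, branching on utf8::is_utf8() would treat two equal strings differently, which is the mistake the answer warns about.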
perldoc -f length used to say, back in v5.8:
"... you will get the number of characters, not the number of bytes. To get the length in bytes, use 'do { use bytes; length(EXPR) }', see bytes."
The modern docs for length don't mention the bytes pragma:
"length() normally deals in logical characters, not physical bytes. For how many bytes a string encoded as UTF-8 would take up, use 'length(Encode::encode_utf8(EXPR))' (you'll have to 'use Encode' first). See Encode and perlunicode."
But I don't think that deprecates the do { use bytes; ... } solution.
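To make the difference concrete, here is a small sketch comparing the two approaches on the same one-character string. The variable names are illustrative only. It suggests that the use bytes count follows Perl's internal storage, while the encode_utf8() count depends only on the characters.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $plain    = "\xE9";      # one character, U+00E9
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, different internal storage

# Logical length: one character in both cases.
my $chars_plain    = length($plain);
my $chars_upgraded = length($upgraded);

# "use bytes" counts the internal bytes, so the answer depends on storage.
my $internal_plain    = do { use bytes; length($plain) };
my $internal_upgraded = do { use bytes; length($upgraded) };

# Explicit encoding: the UTF-8 size is the same for both copies.
my $utf8_plain    = length(encode_utf8($plain));
my $utf8_upgraded = length(encode_utf8($upgraded));

print "characters:      $chars_plain and $chars_upgraded\n";
print "use bytes:       $internal_plain and $internal_upgraded\n";
print "encode_utf8 len: $utf8_plain and $utf8_upgraded\n";
```

Under these assumptions the use bytes line prints 1 and 2 for equal strings, while the other two lines agree with each other, which is why the explicit encoding is the safer way to get a byte count.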
Yes, it does, actually. bytes::length("\xE9") returns 1, but length encode("UTF-8", "\xE9") returns 2. – tchrist Oct 14 '11 at 19:47
@mob bytes.pm has a big deprecation notice at the top of the POD in the current release and recommends dealing with encodings in a more explicit way instead. – hobbs Oct 14 '11 at 19:47
Would you all excuse me? I have to go fix some code. – mob Oct 14 '11 at 20:03
The sender:

    use Encode qw( encode_utf8 );

    sub pack_text {
        my ($text) = @_;
        my $bytes = encode_utf8($text);
        die "Text too long" if length($bytes) > 4294967295;
        return pack('N/a*', $bytes);
    }

The receiver:

    use Encode qw( decode_utf8 );

    sub read_bytes {
        my ($fh, $to_read) = @_;
        my $buf = '';
        while ($to_read > 0) {
            my $bytes_read = read($fh, $buf, $to_read, length($buf));
            die $! if !defined($bytes_read);
            die "Premature EOF" if !$bytes_read;
            $to_read -= $bytes_read;
        }
        return $buf;
    }

    sub read_uint32 {
        my ($fh) = @_;
        return unpack('N', read_bytes($fh, 4));
    }

    sub read_text {
        my ($fh) = @_;
        return decode_utf8(read_bytes($fh, read_uint32($fh)));
    }
– ikegami Oct 14 '11 at 23:17
I upvoted your answer as it's sound; the only thing worth mentioning is that the loop around read() is superfluous since you are using buffered IO (PerlIO), which will only fail on error or EOF. – chansen Oct 27 '11 at 20:23
@chansen, I don't think so. Say you ask for 400 bytes, there are 200 bytes in the buffer, and an error occurs or EOF is reached while getting more. I believe read will return the 200 bytes and say no error occurred. The next read will report the error or EOF. That is why I looped. – ikegami Oct 27 '11 at 20:42
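The length-prefixed sender and receiver from the answer can be exercised end to end with a short round trip. This is only a sketch: it restates the routines in compact form so that it runs on its own, and it assumes an in-memory filehandle as a stand-in for a real socket or pipe. The sample strings are made up.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Sender: 32-bit big-endian byte count followed by the UTF-8 bytes.
sub pack_text {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    die "Text too long" if length($bytes) > 4294967295;
    return pack('N/a*', $bytes);
}

# Receiver: keep reading until exactly $to_read bytes have arrived.
sub read_bytes {
    my ($fh, $to_read) = @_;
    my $buf = '';
    while ($to_read > 0) {
        my $bytes_read = read($fh, $buf, $to_read, length($buf));
        die $! if !defined($bytes_read);
        die "Premature EOF" if !$bytes_read;
        $to_read -= $bytes_read;
    }
    return $buf;
}

sub read_text {
    my ($fh) = @_;
    my $len = unpack('N', read_bytes($fh, 4));
    return decode_utf8(read_bytes($fh, $len));
}

# Two messages back to back, as they might appear on a stream.
my $original = "na\xEFve \x{263A}";
my $wire     = pack_text($original) . pack_text("second");

# Assumption: an in-memory handle stands in for the real connection.
open(my $fh, '<', \$wire) or die $!;
my $first  = read_text($fh);
my $second = read_text($fh);
close($fh);

print "round trip ok\n" if $first eq $original && $second eq "second";
```

The length prefix counts encoded bytes rather than characters, so the receiver knows exactly how much to read before decoding, and consecutive messages do not bleed into each other.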
I can't really give you an answer, but what I can give you is a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.