The primary reason that your published code fails (even with only ASCII characters!) is that re.split() will not split on a zero-width match. r'\b' matches zero characters:

    >>> re.split(r'\b', 'foo-BAR_baz')
    ['foo-BAR_baz']
    >>> re.split(r'\W+', 'foo-BAR_baz')
    ['foo', 'BAR_baz']
    >>> re.split(r'[\W_]+', 'foo-BAR_baz')
    ['foo', 'BAR', 'baz']

Also, you need flags=re.UNICODE to ensure that the Unicode definitions of \b, \W etc. are used. And using str() where you did is at best unnecessary. So it wasn't really a Unicode problem per se at all.
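For example (a quick Python 2.7 illustration of the flag's effect; the word café is just an example, not from the original question):

    >>> re.split(r'[\W_]+', u'caf\xe9 BAR')
    [u'caf', u'BAR']
    >>> re.split(r'[\W_]+', u'caf\xe9 BAR', flags=re.UNICODE)
    [u'caf\xe9', u'BAR']

Without the flag, \W treats the accented letter as a non-word character and splits inside the word; with re.UNICODE it is recognised as a letter.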
However, some answerers tried to address it as a Unicode problem, with varying degrees of success. Here's my take on the Unicode problem. The general solution to this kind of problem is to follow the standard, bog-simple advice that applies to all text problems:

- Decode your input from byte-strings to unicode strings as early as possible.
- Do all processing in unicode.
- Encode your output unicode into byte-strings as late as possible.

So: byte_string.decode('utf8').isupper() is the way to go. Hacks like byte_string.decode('ascii', 'ignore').isupper() are to be avoided; they can be all of complicated, unneeded, and failure-prone -- see below.
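As a minimal sketch of that pipeline (the file name and the UTF-8 encoding are assumptions for illustration, not part of the original code):

    import io

    with io.open('input.txt', encoding='utf8') as f:    # decode as early as possible
        for line in f:                                  # each line is already a unicode string
            line = line.strip()
            if line and line.isupper():                 # all processing in unicode
                print line.encode('utf8')               # encode as late as possible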
Some code:

    # coding: ascii
    import unicodedata

    tests = (
        (u'\u041c\u041e\u0421\u041a\u0412\u0410', True), # capital of Russia, all uppercase
        (u'R\xc9SUM\xc9', True),                         # RESUME with accents
        (u'R\xe9sum\xe9', False),                        # Resume with accents
        (u'R\xe9SUM\xe9', False),                        # ReSUMe with accents
        )
    for ucode, expected in tests:
        print
        print 'unicode', repr(ucode)
        for uc in ucode:
            print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
        u8 = ucode.encode('utf8')
        print 'utf8', repr(u8)
        actual1 = u8.decode('utf8').isupper()            # the natural way of doing it
        actual2 = u8.decode('ascii', 'ignore').isupper() # @jathanism
        print expected, actual1, actual2

Output from Python 2.7.1:

    unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
    U+041C CYRILLIC CAPITAL LETTER EM
    U+041E CYRILLIC CAPITAL LETTER O
    U+0421 CYRILLIC CAPITAL LETTER ES
    U+041A CYRILLIC CAPITAL LETTER KA
    U+0412 CYRILLIC CAPITAL LETTER VE
    U+0410 CYRILLIC CAPITAL LETTER A
    utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
    True True False

    unicode u'R\xc9SUM\xc9'
    U+0052 LATIN CAPITAL LETTER R
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
    utf8 'R\xc3\x89SUM\xc3\x89'
    True True True

    unicode u'R\xe9sum\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0073 LATIN SMALL LETTER S
    U+0075 LATIN SMALL LETTER U
    U+006D LATIN SMALL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9sum\xc3\xa9'
    False False False

    unicode u'R\xe9SUM\xe9'
    U+0052 LATIN CAPITAL LETTER R
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    U+0053 LATIN CAPITAL LETTER S
    U+0055 LATIN CAPITAL LETTER U
    U+004D LATIN CAPITAL LETTER M
    U+00E9 LATIN SMALL LETTER E WITH ACUTE
    utf8 'R\xc3\xa9SUM\xc3\xa9'
    False False True

The only differences with Python 3.x are syntactical -- the principle (do all processing in unicode) remains the same.
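For reference, here is a rough Python 3 sketch of the same checks (my own adaptation, not part of the original answer): print becomes a function, plain string literals are already Unicode, and only incoming bytes need an explicit decode.

    import unicodedata

    tests = (
        ('\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
        ('R\xc9SUM\xc9', True),                          # RESUME with accents
        ('R\xe9sum\xe9', False),                         # Resume with accents
        ('R\xe9SUM\xe9', False),                         # ReSUMe with accents
    )
    for text, expected in tests:
        for ch in text:
            print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))
        u8 = text.encode('utf8')                         # bytes on disk / on the wire
        print('utf8', repr(u8))
        print(expected, u8.decode('utf8').isupper())     # decode, then test in Unicode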
Thank you for schooling me. Ha! :) – jathanism Jun 19 at 17:04

This answer, so far, has proved the most helpful to my problem. Thank you. – matchew Jun 20 at 16:01
As one comment above illustrates, it is not the case that, for every character, exactly one of islower() and isupper() is true. Unified Han characters, for example, are considered "letters" but are neither lowercase, uppercase, nor titlecase. So your stated requirement, to treat upper- and lower-case text differently, should be clarified.
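For example, a single CJK ideograph (U+6F22, used here only as an illustration) fails both tests:

    >>> han = u'\u6f22'
    >>> han.isupper()
    False
    >>> han.islower()
    False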
I will assume the distinction is between upper-case letters and all other characters. Perhaps this is splitting hairs, but you ARE talking about non-English text here. First, I recommend using Unicode strings (the unicode() built-in) exclusively for the string-processing portions of your code. Discipline your mind to think of "regular" strings as byte-strings, because that's exactly what they are. All string literals not written u"like this" are byte-strings. This line of code:

    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']

would then become:

    tokens = [re.split(u'\\b', unicode(line.strip(), 'UTF-8')) for line in input if line != '\n']

You would also test tokens[i].isupper() rather than str(tokens[i]).isupper(). Based on what you have posted, it seems likely that other portions of your code would need to be changed to work with character strings instead of byte-strings.
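A small interactive illustration of that distinction (the UTF-8 bytes below spell RÉSUMÉ and are only an example):

    >>> type('abc')                # a plain literal is a byte-string in Python 2
    <type 'str'>
    >>> type(u'abc')               # only the u"..." form is a unicode string
    <type 'unicode'>
    >>> unicode('R\xc3\x89SUM\xc3\x89', 'UTF-8').isupper()   # decode explicitly, then test
    True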
I won't be able to test this solution until I return to the office, but it seems like this may also be a viable solution. The solution I had posted worked, but this may function better. Thanks. – matchew Jun 17 at 23:33

-1 for 2 reasons: (1) re.split(r'\b', ...) doesn't work. (2) unicode(blahblah) relies on the default encoding being UTF-8 -- it's ascii on e.g. Windows boxes, and in any case sysadmins can fiddle with site.py or whatever to change it. – John Machin Jun 18 at 22:48

(1) it seems to return the input string unchanged, so pointless, but I'm not sure "doesn't work" is justified (2) added encoding argument to unicode() built-in in my answer – wberry Jun 20 at 13:52
Simple solution. I think

    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']  # remove blank lines

becomes

    tokens = [line.strip() for line in input if line != '\n']

Then I am able to go with no need for str() or unicode(), as far as I can tell:

    if tokens[i].isupper():
        # do stuff

The word "token" and the re.split on word boundaries are legacy of when I was messing with nltk earlier this week. But ultimately I am processing lines, not tokens/words. This may change, but for now this seems to work. I will leave this question open for now, in the hope of alternative solutions and comments.
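Put together, the approach might look like this (a sketch only; the file name, encoding, and the "do stuff" placeholder are assumptions, not the original code):

    import io

    with io.open('corpus.txt', encoding='utf8') as input:          # `input` mirrors the question's variable name
        tokens = [line.strip() for line in input if line != '\n']  # remove blank lines

    for i in range(len(tokens)):
        if tokens[i].isupper():
            pass  # do stuff with the all-uppercase line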