Best way to decode unknown Unicode encoding in Python 2.5?

There are two general-purpose libraries for detecting unknown encodings:

- chardet, part of Universal Feed Parser
- UnicodeDammit, part of Beautiful Soup

chardet is supposed to be a port of the way that Firefox does it.

You can use the following regex to detect UTF-8 from byte strings:

    import re
    utf8_detector = re.compile(r"""^(?:
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*$""", re.X)

In practice, if you're dealing with English, I've found the following works 99.9% of the time (see the sketch below):

1. if it passes the above regex, it's ASCII or UTF-8
2. if it contains any bytes from 0x80-0x9F but not 0xA4, it's Windows-1252
3. if it contains 0xA4, assume it's latin-15
4. otherwise assume it's latin-1
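Here is a minimal sketch of that four-step heuristic as a function, building on the utf8_detector pattern just defined. The helper name guess_encoding is my own illustration, not code from the post; checking for 0xA4 before the 0x80-0x9F range implements rules 2 and 3 together:

    cp1252_range = re.compile(r'[\x80-\x9F]')

    def guess_encoding(data):
        """Guess the encoding of a Python 2 byte string per the rules above."""
        if utf8_detector.match(data):
            return 'utf-8'          # rule 1: valid ASCII or UTF-8
        if '\xa4' in data:
            return 'iso-8859-15'    # rule 3: 0xA4 is the euro sign in latin-15
        if cp1252_range.search(data):
            return 'windows-1252'   # rule 2: 0x80-0x9F are printable in cp1252
        return 'latin-1'            # rule 4: the catch-all

    # usage: unicode_text = data.decode(guess_encoding(data))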

I coded this up and put it here: pastebin.com/f76609aec – user132262 Nov 12 '09 at 12:50

There's a problem with the code you pasted: ^(?:\xA4)*$ will only match if the string consists entirely of \xA4 and no other characters. You just need re.compile(r'\xA4') and re.compile(r'[\x80-\xBF]') for the two other regular expressions. – ʞɔıu Nov 12 '09 at 13:35

Indeed. ASCII is a subset of UTF-8 and will also correctly decode as UTF-8, so you can leave ASCII out. 8-bit encodings such as latin-1 will decode to something in all cases, so put one of those last.

– Thomas Nov 11 '09 at 15:22.
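A minimal sketch of that try-each-encoding-in-order idea, for Python 2 byte strings (the helper name and the default candidate list are my own illustration, not from the thread):

    def decode_first_match(data, candidates=('utf-8', 'windows-1252', 'latin-1')):
        """Try each candidate encoding in order and return the first hit.

        Strict codecs such as utf-8 raise UnicodeDecodeError on invalid
        byte sequences, while latin-1 maps every byte to a character and
        never fails, so it must come last as the catch-all.
        """
        for encoding in candidates:
            try:
                return data.decode(encoding), encoding
            except UnicodeDecodeError:
                continue
        raise ValueError('no candidate encoding matched')

With latin-1 in the list the loop always returns, so the final raise only matters if you pass a stricter candidate list.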

Since you are using Python, you might try UnicodeDammit. It is part of Beautiful Soup, which you may also find useful. Like the name suggests, UnicodeDammit will try to do whatever it takes to get proper Unicode out of the crap you may find in the world.
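A minimal usage sketch against the Beautiful Soup 3 series, which was current in the Python 2.5 era; note that in Beautiful Soup 4 the import is bs4.UnicodeDammit and the attributes are named unicode_markup and original_encoding:

    # Beautiful Soup 3 API, as remembered; verify attribute names
    # against the version you have installed.
    from BeautifulSoup import UnicodeDammit

    dammit = UnicodeDammit('Sacr\xe9 bleu!')
    print repr(dammit.unicode)       # the decoded unicode string
    print dammit.originalEncoding    # UnicodeDammit's guess at the source encoding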

Tried that early on, but it failed quite a bit. – user132262 Nov 11 '09 at 16:20

Really! What were the problems? It may be easier to get that working than to roll your own. – Adam Goode Nov 11 '09 at 17:28

I've tackled the same problem and found that there's no way to determine content's encoding without metadata about it. That's why I ended up with the same approach you're trying here. My only additional advice is that, rather than ordering the list of possible encodings from most likely to least likely, you should order it by specificity.

I've found that certain character sets are subsets of others, so if you check utf_8 as your second choice, you'll never find the character sets that are subsets of utf_8 (I think one of the Korean character sets uses the same number space as utf).
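A quick demonstration of why specificity matters: latin-1 is the least specific codec, since it maps every byte to a character, so trying it before utf-8 would shadow the stricter match (the sample byte string is my own illustration):

    data = 'caf\xc3\xa9'  # the word 'café' encoded as UTF-8

    # latin-1 never raises, so if tried first it would 'win' here too,
    # producing mojibake instead of the intended text.
    print repr(data.decode('latin-1'))   # u'caf\xc3\xa9'

    # utf-8 is strict: it succeeds only on well-formed UTF-8, so it is
    # safe to try early in the list.
    print repr(data.decode('utf-8'))     # u'caf\xe9'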
