The first thing I'd try is to convert the text to NFKD normalization form, using the Normalize method on strings. This suggestion is mentioned in the answer to the question you linked, but I recommend using NFKD instead of NFD because NFKD will remove unwanted typographical distinctions (e.g., NBSP → space, or ℂ → C).
You might also be able to make generic replacements by Unicode category. For example, Pd's can be replaced by -, Nd's can be replaced by the corresponding 0-9 digit, and Mn's can be replaced with the empty string (to remove accents).
"But somebody might have written a suitable lookup table I can re-use." You could try using the data from the Unidecode program, or CLDR. Edit: There's a huge substitution chart here.
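As a rough illustration of the NFKD-plus-category idea, here is a minimal sketch. The class and method names (UnicodeFolding, FoldToAscii) are made up for this example and are not from any library; only String.Normalize and CharUnicodeInfo are real .NET APIs. It handles just the three categories mentioned above (Mn, Pd, Nd) and passes everything else through unchanged, so characters such as smart quotes would still need a lookup table.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class UnicodeFolding
{
    // Hypothetical helper (an assumption for illustration): folds a string
    // towards ASCII using NFKD plus a few per-category substitutions.
    public static string FoldToAscii(string input)
    {
        // NFKD splits 'é' into 'e' + U+0301 and maps NBSP to a plain space.
        string decomposed = input.Normalize(NormalizationForm.FormKD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.NonSpacingMark:      // Mn: drop combining accents
                    break;
                case UnicodeCategory.DashPunctuation:     // Pd: any dash becomes '-'
                    result.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber:  // Nd: any decimal digit becomes 0-9
                    result.Append((char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)));
                    break;
                default:                                  // everything else is left as-is
                    result.Append(c);
                    break;
            }
        }

        return result.ToString();
    }
}
```

Calling UnicodeFolding.FoldToAscii on a string containing an accented letter, a non-breaking space and an en dash should give back the unaccented letter, a plain space and a hyphen. The output is not guaranteed to be pure ASCII, so this is a first pass rather than a complete solution.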
Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback?" The question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks?" In other words, we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application.

To that end, I analyzed eight years' worth of messages sent through our system, looking for characters that aren't representable in ASCII encoding, using this test:

    /// <summary>
    /// Determine whether the supplied character is
    /// representable using ASCII encoding.
    /// </summary>
    bool IsAscii(char inputChar) {
        var ascii = new ASCIIEncoding();
        // Unrepresentable characters come back as '?', so they fail the comparison.
        var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
        return (asciiChar == inputChar);
    }

I then went through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoded approximation.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class StringExtensions {
        private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();

        /// <summary>Returns the specified string with characters not representable in ASCII codepage 437
        /// converted to a suitable representative equivalent. Yes, this is lossy.</summary>
        /// <param name="s">A string.</param>
        /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks
        /// 'normalized' to ASCII equivalents.</returns>
        /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text
        /// for sending to downlevel e-mail clients.</remarks>
        public static string Asciify(this string s) {
            return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
        }

        private static string Asciify(char x) {
            return Replacements.ContainsKey(x) ? (Replacements[x]) : (x.ToString());
        }

        static StringExtensions() {
            Replacements['’'] = "'"; // 75151 occurrences
            Replacements['–'] = "-"; // 23018 occurrences
            Replacements['‘'] = "'"; // 9783 occurrences
            Replacements['”'] = "\""; // 6938 occurrences
            Replacements['“'] = "\""; // 6165 occurrences
            Replacements['…'] = "..."; // 5547 occurrences
            Replacements['£'] = "GBP"; // 3993 occurrences
            Replacements['•'] = "*"; // 2371 occurrences
            Replacements['\u00A0'] = " "; // 1529 occurrences (non-breaking space)
            Replacements['é'] = "e"; // 878 occurrences
            Replacements['ï'] = "i"; // 328 occurrences
            Replacements['´'] = "'"; // 226 occurrences
            Replacements['—'] = "-"; // 133 occurrences
            Replacements['·'] = "*"; // 132 occurrences
            Replacements['„'] = "\""; // 102 occurrences
            Replacements['€'] = "EUR"; // 95 occurrences
            Replacements['®'] = "(R)"; // 91 occurrences
            Replacements['¹'] = "(1)"; // 80 occurrences
            Replacements['«'] = "\""; // 79 occurrences
            Replacements['è'] = "e"; // 79 occurrences
            Replacements['á'] = "a"; // 55 occurrences
            Replacements['™'] = "TM"; // 54 occurrences
            Replacements['»'] = "\""; // 52 occurrences
            Replacements['ç'] = "c"; // 52 occurrences
            Replacements['½'] = "1/2"; // 48 occurrences
            Replacements['\u00AD'] = "-"; // 39 occurrences (soft hyphen)
            Replacements['°'] = " degrees "; // 33 occurrences
            Replacements['ä'] = "a"; // 33 occurrences
            Replacements['É'] = "E"; // 31 occurrences
            Replacements['‚'] = ","; // 31 occurrences
            Replacements['ü'] = "u"; // 30 occurrences
            Replacements['í'] = "i"; // 28 occurrences
            Replacements['ë'] = "e"; // 26 occurrences
            Replacements['ö'] = "o"; // 19 occurrences
            Replacements['à'] = "a"; // 19 occurrences
            Replacements['¬'] = " "; // 17 occurrences
            Replacements['ó'] = "o"; // 15 occurrences
            Replacements['â'] = "a"; // 13 occurrences
            Replacements['ñ'] = "n"; // 13 occurrences
            Replacements['ô'] = "o"; // 10 occurrences
            Replacements['¨'] = ""; // 10 occurrences
            Replacements['å'] = "a"; // 8 occurrences
            Replacements['ã'] = "a"; // 8 occurrences
            Replacements['ˆ'] = ""; // 8 occurrences
            Replacements['©'] = "(c)"; // 6 occurrences
            Replacements['Ä'] = "A"; // 6 occurrences
            // Replacements['Ã?'] = "I"; // 5 occurrences (key garbled in the source; probably 'Í' or 'Ï')
            Replacements['ò'] = "o"; // 5 occurrences
            Replacements['ê'] = "e"; // 5 occurrences
            Replacements['î'] = "i"; // 5 occurrences
            Replacements['Ü'] = "U"; // 5 occurrences
            Replacements['Á'] = "A"; // 5 occurrences
            Replacements['ß'] = "ss"; // 4 occurrences
            Replacements['¾'] = "3/4"; // 4 occurrences
            Replacements['È'] = "E"; // 4 occurrences
            Replacements['¼'] = "1/4"; // 3 occurrences
            Replacements['†'] = "+"; // 3 occurrences
            Replacements['³'] = "'"; // 3 occurrences
            Replacements['²'] = "'"; // 3 occurrences
            Replacements['Ø'] = "O"; // 2 occurrences
            Replacements['¸'] = ","; // 2 occurrences
            Replacements['Ë'] = "E"; // 2 occurrences
            Replacements['ú'] = "u"; // 2 occurrences
            Replacements['Ö'] = "O"; // 2 occurrences
            Replacements['û'] = "u"; // 2 occurrences
            Replacements['Ú'] = "U"; // 2 occurrences
            Replacements['Œ'] = "Oe"; // 2 occurrences
            Replacements['º'] = "?"; // 1 occurrences
            Replacements['‰'] = "0/00"; // 1 occurrences
            Replacements['Å'] = "A"; // 1 occurrences
            Replacements['ø'] = "o"; // 1 occurrences
            Replacements['˜'] = "~"; // 1 occurrences
            Replacements['æ'] = "ae"; // 1 occurrences
            Replacements['ù'] = "u"; // 1 occurrences
            // Replacements['‹'] = ... (the rest of the list is cut off here)
        }
    }
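The survey step described above (scanning old messages and counting the characters that fall outside ASCII) could be reproduced roughly as follows. This is only a sketch under assumptions: the names NonAsciiSurvey and CountNonAscii are invented for the example, the messages are assumed to already be available as strings, and a simple "code unit above U+007F" check stands in for the ASCIIEncoding round trip used in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NonAsciiSurvey
{
    // Hypothetical helper (an assumption for illustration): counts every
    // character above U+007F across the supplied messages and returns the
    // counts, most frequent first.
    public static List<KeyValuePair<char, int>> CountNonAscii(IEnumerable<string> messages)
    {
        return messages
            .SelectMany(m => m)                 // flatten all messages into one char sequence
            .Where(c => c > '\u007F')           // keep only characters outside 7-bit ASCII
            .GroupBy(c => c)                    // one group per distinct character
            .Select(g => new KeyValuePair<char, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)    // most common offenders first
            .ToList();
    }
}
```

Iterating over the result and printing each key with its count gives a list in the same shape as the "occurrences" comments in the replacement table, which makes it easy to decide which characters are worth a hand-written mapping.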
You should never try to convert Unicode to ASCII, because you will end up creating more problems than you solve. It's like trying to fit 1,114,112 code points (Unicode 6.0) into just 128 characters. Do you think you will succeed?
BTW, there are lots of quotation marks in Unicode, not only the ones you mentioned, and if you want to do the conversion anyway, remember that the right conversions depend on the locale. Check ICU, which contains the most complete Unicode conversion routines.
Our market is quite geographically and culturally specific, and our customers aren't using Unicode on purpose. We get 10-15 calls a week about "gibberish in e-mails", compared to no complaints - ever - about not being able to send e-mails in Arabic or Hebrew. So yes, it's a stupid problem, but it's real :) – Dylan Beattie May 28 at 18:47
@Dylan there is no bad question :) BTW, can you say more about why you cannot use Unicode? I would really like to see why, and whether there are any workarounds. BTW, I updated my answer to include one place to check. – sorin May 28 at 19:03
Isn't ICU for C++ only (i.e. not for .NET)? – Uwe Keim May 28 at 19:08
No, there is no such thing for .NET, as stated at blogs.msdn.com/b/michkap/archive/2008/12/18/9234330.aspx - but if you write a server application you could add C components and call them. – sorin May 28 at 19:26