When you open the file with codecs.open(filename, 'w', encoding='utf8'), there is no point in writing byte arrays (str objects) into the file. Instead, write unicode objects, like this:

    import codecs

    corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8')
    # ...
    tagged_token = '\xdcml\xe4ut'
    tagged_token = tagged_token.decode('ISO-8859-1')  # byte string -> unicode
    corpusFile.write(tagged_token)
    corpusFile.write(u'\n')

Note that codecs.open always opens the underlying file in binary mode, so u'\n' is written as a single line feed, with no platform-specific translation. Alternatively, open a binary file and write byte arrays of already-encoded strings:

    corpusFile = open(filename, mode = 'wb')
    # ...
    tagged_token = '\xdcml\xe4ut'
    tagged_token = tagged_token.decode('ISO-8859-1')  # byte string -> unicode
    corpusFile.write(tagged_token.encode('utf-8'))    # unicode -> UTF-8 bytes
    corpusFile.write('\n')

This also writes platform-independent EOLs.

If you want a platform-dependent EOL, write os.linesep instead of '\n'. Note that the encoding naming in Notepad++ is misleading: ANSI as UTF-8 is what you want.
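To convince yourself that the two approaches store the same thing, you can write both files and compare the raw bytes. This is only a minimal sketch: the scratch directory, the file names and the read_bytes helper are made up for illustration, and the byte literals are written with a b prefix so the snippet also runs on Python 3.

```python
import codecs
import os
import tempfile

raw = b'\xdcml\xe4ut'              # the sample token, as ISO-8859-1 bytes
text = raw.decode('iso-8859-1')    # now a unicode string

tmpdir = tempfile.mkdtemp()        # hypothetical scratch directory
path_text = os.path.join(tmpdir, 'via_codecs.txt')
path_bytes = os.path.join(tmpdir, 'via_binary.txt')

# Approach 1: let codecs.open do the encoding and write unicode objects.
with codecs.open(path_text, mode='w', encoding='utf-8') as f:
    f.write(text)
    f.write(u'\n')

# Approach 2: open a binary file and write already-encoded bytes.
with open(path_bytes, mode='wb') as f:
    f.write(text.encode('utf-8'))
    f.write(b'\n')


def read_bytes(path):
    """Return the full raw content of a file."""
    with open(path, 'rb') as f:
        return f.read()


same = read_bytes(path_text) == read_bytes(path_bytes)
print(same)
```

If same is True, any remaining difference you see in an editor comes from how the editor guesses the encoding, not from how the file was written.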
No luck :( See my answer to your comment, in the question. – Metalcoder Nov 9 at 0:28
@Metalcoder Updated the answer with an explanation of why this code works ;) If you are certain the result is not UTF-8 (and if Notepad++ names it ANSI as UTF-8, it is UTF-8), can you post a hexdump of the file written by one of the two alternative programs in this answer? – phihag Nov 9 at 0:31
s/ISO-8859-1/cp1252/ – John Machin Nov 9 at 0:35
Or open("thefile", "rb").read().decode("utf8") – John Machin Nov 9 at 0:40
@Metalcoder You may want to read up on Unicode and character encodings. If you execute print repr(open(filename, "rb").read(200)), what is the output when you use the first and second program in this answer?
– phihag Nov 9 at 23:08
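The check requested in the comments (look at the first 200 raw bytes) can be wrapped in a small helper. This is a rough sketch, not something from the thread: head_bytes and describe_bytes are hypothetical names. It uses an incremental decoder so that a multi-byte character cut off at the 200-byte boundary is not reported as an error.

```python
import codecs


def head_bytes(path, count=200):
    """Return the first `count` raw bytes of a file."""
    with open(path, 'rb') as f:
        return f.read(count)


def describe_bytes(data):
    """Report whether the bytes start with a UTF-8 BOM and decode as UTF-8."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a character truncated at the very end
        decoder.decode(data, final=False)
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {'has_bom': has_bom, 'valid_utf8': valid_utf8}


# The sample token as UTF-8 bytes, and as raw ISO-8859-1 bytes
print(describe_bytes(b'\xc3\x9cml\xc3\xa4ut\n'))
print(describe_bytes(b'\xdcml\xe4ut\n'))
```

For a real file you would call describe_bytes(head_bytes(filename)). If valid_utf8 comes back False, the bytes on disk were never encoded as UTF-8 in the first place.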
Try writing the file with a UTF-8 signature (aka BOM):

    import codecs

    def storeTaggedCorpus(corpus, filename):
        corpusFile = codecs.open(filename, mode = 'w', encoding = 'utf-8-sig')
        for token in corpus:
            tagged_token = '/'.join(token)
            # print(type(tagged_token)); break
            # tagged_token = tagged_token.decode('cp1252')
            corpusFile.write(tagged_token)
            corpusFile.write(u"\n")
        corpusFile.close()

Note that this will only work properly if tagged_token is a unicode string. To check that, uncomment the first comment in the above code - it should print <type 'unicode'>.

If tagged_token is not a unicode string, then you will need to decode it first using the second commented line. (NB: I've assumed a "cp1252" encoding, but if you're certain it's "iso-8859-1", then of course you will need to change it.)
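What the 'utf-8-sig' codec actually adds can be seen without touching a file. A minimal sketch, done in memory on the assumption that the file codec behaves the same way as str.encode; the variable names are made up for illustration.

```python
import codecs

text = u'\xdcml\xe4ut'

plain = text.encode('utf-8')        # no signature
signed = text.encode('utf-8-sig')   # signature (BOM) in front

# The signature is the three bytes EF BB BF; the payload is unchanged.
extra = len(signed) - len(plain)
same_payload = signed == codecs.BOM_UTF8 + plain

# Decoding with 'utf-8-sig' strips the signature again,
# while plain 'utf-8' keeps it as a leading U+FEFF character.
roundtrip = signed.decode('utf-8-sig')
with_marker = signed.decode('utf-8')

print(extra, same_payload, roundtrip == text)
```

So the BOM does not change how the text itself is stored. It only gives other programs a hint about the encoding when they open the file.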
Oh, man, it prints <type 'str'>! I've tried switching to u'/' in the join, and it threw a UnicodeDecodeError. I didn't expect this, and I'm going to run some tests.
– Metalcoder Nov 10 at 22:28
@Metalcoder Switching to u'/' won't work, because the rest of the string won't be decoded properly. To decode it, remove the print statement and uncomment the second comment as shown above.
– ekhumoro Nov 10 at 23:05.
If you are seeing "mangled" characters from a file, you need to ensure that whatever you are using to view the file understands that the file is UTF-8-encoded. The files created by this code:

    import codecs

    for enc in "utf-8 utf-8-sig".split():
        with codecs.open(enc + ".txt", mode = 'w', encoding = enc) as corpusFile:
            tagged_token = '\xdcml\xe4ut'
            tagged_token = tagged_token.decode('cp1252')  # not 'ISO-8859-1'
            corpusFile.write(tagged_token)  # write unicode objects
            corpusFile.write(u'\n')

are identified thusly (the first result is for utf-8.txt, the second for utf-8-sig.txt):

    Notepad++ (version 5.7 (UNICODE)): UTF-8 without BOM, UTF-8
    Firefox (7.0.1): Western (ISO-8859-1), Unicode (UTF-8)
    Notepad (Windows 7): UTF-8, UTF-8

Putting a BOM in your UTF-8 file, while deprecated on Unix systems, gives you a much better chance on Windows that other software will be able to recognise your file as UTF-8-encoded.
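The "mangled" characters, and the reason for preferring cp1252 over ISO-8859-1, can both be reproduced in a few lines. This is just an illustrative sketch with made-up variable names, using b-prefixed literals so it also runs on Python 3.

```python
utf8_bytes = u'\xdcml\xe4ut'.encode('utf-8')

# A viewer that wrongly assumes cp1252 turns each non-ASCII letter
# into two unrelated characters.
mangled = utf8_bytes.decode('cp1252')

# cp1252 and ISO-8859-1 agree on 0xDC and 0xE4, so the sample token
# decodes identically with either one...
same_for_sample = b'\xdcml\xe4ut'.decode('cp1252') == b'\xdcml\xe4ut'.decode('iso-8859-1')

# ...but they differ in the range 0x80-0x9F, e.g. for the euro sign.
euro_cp1252 = b'\x80'.decode('cp1252')
euro_latin1 = b'\x80'.decode('iso-8859-1')

print(repr(mangled))
print(same_for_sample, euro_cp1252 == euro_latin1)
```

On Windows data, byte 0x80 is almost always meant as the euro sign, which is why cp1252 is usually the safer guess there.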
I've tried writing a BOM before posting this question, and got the same problems. But the BOM should only be an issue when finding out the encoding of the file. I believe that it would have no effect while storing something into it. Am I wrong?
– Metalcoder Nov 10 at 22:28.
I can't really give you an answer, but what I can give you is a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.