Python Unicode errors: converting a byte string to its printed value


If I have Unicode like this:

'\x00b\x007\x003\x007\x00-\x002\x00,\x001\x00p\x00w\x000\x000\x009\x00,\x00n\x00o\x00n\x00e\x00,\x00c\x00,\x005\x00,\x00j\x00,\x00j\x00,\x002\x009\x00,\x00g\x00a\x00r\x00y\x00,\x00 \x00w\x00i\x00l\x00l\x00i\x00a\x00m\x00s\x00,\x00 \x00p\x00a\x00r\x00e\x00n\x00t\x00i\x00,\x00 \x00f\x00i\x00n\x00n\x00e\x00y\x00 \x00&\x00 \x00l\x00e\x00w\x00i\x00s\x00,\x00u\x00s\x00,\x001\x00\r\x00' 

and it's read in from a CSV file in string format, I'd like to convert it to a human-readable form. It works when I print it, but I can't seem to figure out the command to save it to a variable in human-readable form. What is the best approach?

You don't have Unicode. Not yet. You have a series of bytes, and those bytes use the UTF-16 encoding. You need to decode those bytes first:

data.decode('utf-16-be') 

Printing works because your console ignores the \x00 null byte in the big-endian pair of each UTF-16 code unit.

Your data is missing the byte order mark (BOM), so I had to use utf-16-be, the big-endian variant of UTF-16, on the assumption that you cut the data at the right byte. It could be little-endian if you didn't.

As it was, I had to remove the last \x00 null byte to make it decode; you pasted an odd, rather than even, number of bytes, which cut one UTF-16 code unit (each 2 bytes) in half:

>>> s = '\x00b\x007\x003\x007\x00-\x002\x00,\x001\x00p\x00w\x000\x000\x009\x00,\x00n\x00o\x00n\x00e\x00,\x00c\x00,\x005\x00,\x00j\x00,\x00j\x00,\x002\x009\x00,\x00g\x00a\x00r\x00y\x00,\x00 \x00w\x00i\x00l\x00l\x00i\x00a\x00m\x00s\x00,\x00 \x00p\x00a\x00r\x00e\x00n\x00t\x00i\x00,\x00 \x00f\x00i\x00n\x00n\x00e\x00y\x00 \x00&\x00 \x00l\x00e\x00w\x00i\x00s\x00,\x00u\x00s\x00,\x001\x00\r\x00'
>>> s[:-1].decode('utf-16-be')
u'b737-2,1pw009,none,c,5,j,j,29,gary, williams, parenti, finney & lewis,us,1\r'
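For readers on Python 3, where byte strings are a separate `bytes` type, the same trim-then-decode step might look like the sketch below. The helper name `decode_utf16_be` and the shortened sample are made up for illustration, and the big-endian guess carries over from the session above.

```python
# Python 3 sketch; `decode_utf16_be` is a hypothetical helper, not a library function.
def decode_utf16_be(raw: bytes) -> str:
    """Decode big-endian UTF-16 bytes, dropping a dangling final byte if present."""
    if len(raw) % 2:
        # An odd length means the last 2-byte code unit was cut in half.
        raw = raw[:-1]
    return raw.decode('utf-16-be')


# A shortened stand-in for the pasted data: "gary,us,1\r" plus one stray null byte.
sample = 'gary,us,1\r'.encode('utf-16-be') + b'\x00'
text = decode_utf16_be(sample)
print(repr(text))
```

Trimming only makes sense when the cut happened at the end; if the first byte was the one lost, the data would need the little-endian codec instead.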

The file you read this from probably contains the BOM as its first two bytes. If so, just tell whatever you use to read the data to use the utf-16 codec, and it'll figure out the right variant from those first bytes.
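To see the BOM handling in isolation, here is a small Python 3 sketch. The sample text is invented, and the two byte strings are built by hand purely to stand in for the start of a real file.

```python
import codecs

text = 'gary,us,1'

# Hand-build both byte orders, each prefixed with its byte order mark.
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
decoded_big = big.decode('utf-16')
decoded_little = little.decode('utf-16')
print(decoded_big == decoded_little == text)
```

Both inputs decode to the same string even though their bytes are in opposite order, which is why the generic utf-16 codec is the safer choice whenever the BOM is still present.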

If you are using Python 2, you'd want to study the Examples section of the csv module documentation for code that can re-code your data into a form suitable for that module; if you include the UnicodeReader class from that section, you'd use it like this:

with open(yourdatafile, 'rb') as inputfile:
    reader = UnicodeReader(inputfile, encoding='utf-16')
    for row in reader:
        # row is now a list of unicode strings
        print row

Demo:

>>> from StringIO import StringIO
>>> import codecs
>>> f = StringIO(codecs.BOM_UTF16_BE + s[:-1])
>>> r = UnicodeReader(f, encoding='utf-16')
>>> next(r)
[u'b737-2', u'1pw009', u'none', u'c', u'5', u'j', u'j', u'29', u'gary', u' williams', u' parenti', u' finney & lewis', u'us', u'1']

If you are using Python 3, just set the encoding parameter of the open() function to utf-16 and use the csv module as-is.
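A minimal Python 3 sketch of that approach follows. It first writes a throwaway UTF-16 file so the example is self-contained; the temporary file and the sample row are stand-ins for your real data file, which is assumed to start with a BOM.

```python
import csv
import os
import tempfile

# Write a small UTF-16 file to read back; Python's 'utf-16' codec writes a BOM.
line = 'b737-2,1pw009,none,c,5,j,j,29,gary, williams, parenti, finney & lewis,us,1\r\n'
with tempfile.NamedTemporaryFile('w', encoding='utf-16', newline='',
                                 suffix='.csv', delete=False) as tmp:
    tmp.write(line)
    path = tmp.name

# newline='' lets the csv module handle line endings itself.
with open(path, encoding='utf-16', newline='') as inputfile:
    rows = list(csv.reader(inputfile))

os.remove(path)
print(rows[0])
```

Each row comes back as a list of ordinary `str` values, so no separate decode step is needed.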

