If I have Unicode like this:
'\x00b\x007\x003\x007\x00-\x002\x00,\x001\x00p\x00w\x000\x000\x009\x00,\x00n\x00o\x00n\x00e\x00,\x00c\x00,\x005\x00,\x00j\x00,\x00j\x00,\x002\x009\x00,\x00g\x00a\x00r\x00y\x00,\x00 \x00w\x00i\x00l\x00l\x00i\x00a\x00m\x00s\x00,\x00 \x00p\x00a\x00r\x00e\x00n\x00t\x00i\x00,\x00 \x00f\x00i\x00n\x00n\x00e\x00y\x00 \x00&\x00 \x00l\x00e\x00w\x00i\x00s\x00,\x00u\x00s\x00,\x001\x00\r\x00'
and it's read in from a CSV in string format, I'd like to convert it to human-readable form. It works when I print it, but I can't seem to figure out the right approach or command to save it in a variable in human-readable form. What's the best approach?
You don't have Unicode. Not yet. You have a series of bytes, and those bytes use the UTF-16 encoding. You need to decode those bytes first:
data.decode('utf-16-be')
Printing appears to work because your console ignores the \x00 half of the big-endian pair in each UTF-16 code unit.
Your data is missing the byte order mark, so I had to use utf-16-be, the big-endian variant of UTF-16, on the assumption that you cut the data at the right byte. It could be little-endian if you didn't.
As it was, I had to remove the last \x00 null byte to make it decode; you pasted an odd, rather than even, number of bytes, cutting one UTF-16 code unit (each 2 bytes) in half:
>>> s = '\x00b\x007\x003\x007\x00-\x002\x00,\x001\x00p\x00w\x000\x000\x009\x00,\x00n\x00o\x00n\x00e\x00,\x00c\x00,\x005\x00,\x00j\x00,\x00j\x00,\x002\x009\x00,\x00g\x00a\x00r\x00y\x00,\x00 \x00w\x00i\x00l\x00l\x00i\x00a\x00m\x00s\x00,\x00 \x00p\x00a\x00r\x00e\x00n\x00t\x00i\x00,\x00 \x00f\x00i\x00n\x00n\x00e\x00y\x00 \x00&\x00 \x00l\x00e\x00w\x00i\x00s\x00,\x00u\x00s\x00,\x001\x00\r\x00'
>>> s[:-1].decode('utf-16-be')
u'b737-2,1pw009,none,c,5,j,j,29,gary, williams, parenti, finney & lewis,us,1\r'
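The session above is Python 2, where a plain string literal holds bytes. As a minimal sketch of the same idea in Python 3, assuming the data arrives as a bytes object: the helper name decode_utf16_be and the shortened sample in raw are hypothetical, chosen only for illustration.

```python
# Shortened sample of the data above; 19 bytes, so the last code unit is cut in half.
raw = b'\x00b\x007\x003\x007\x00-\x002\x00,\x001\x00\r\x00'


def decode_utf16_be(data):
    """Decode big-endian UTF-16 bytes, dropping a trailing half code unit."""
    if len(data) % 2:
        # Odd length: the final byte is only half of a 2-byte code unit.
        data = data[:-1]
    return data.decode('utf-16-be')


text = decode_utf16_be(raw)
print(repr(text))

# If the data had been cut one byte later, the same bytes would read as little-endian.
shifted = raw[1:].decode('utf-16-le')
print(shifted == text)
```

The second decode illustrates why the choice of utf-16-be depends on where the data was cut: shifting the start by one byte turns every pair around.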
The file you read this from probably contains the BOM as its first 2 bytes. If so, just tell whatever you use to read this data to use the utf-16 codec, and it'll figure out the right variant from those first bytes.
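To illustrate how the generic utf-16 codec picks the variant from the BOM, here is a small Python 3 sketch. The payload value is a made-up sample; the BOM constants come from the standard codecs module.

```python
import codecs

payload = 'b737-2,1pw009'  # hypothetical sample text

# The same text stored in both byte orders, each prefixed with its BOM.
be = codecs.BOM_UTF16_BE + payload.encode('utf-16-be')
le = codecs.BOM_UTF16_LE + payload.encode('utf-16-le')

# The generic codec reads the BOM, chooses the byte order, and strips the BOM.
decoded_be = be.decode('utf-16')
decoded_le = le.decode('utf-16')
print(decoded_be == decoded_le == payload)

# A quick check for whether some data starts with a UTF-16 BOM at all.
has_bom = be[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)
```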
If you are using Python 2, you'd want to study the Examples section of the csv module documentation for code that can re-code your data into a form suitable for that module; if you include the UnicodeReader from that section, you'd use it like this:
with open(yourdatafile, 'rb') as inputfile:
    reader = UnicodeReader(inputfile, encoding='utf-16')
    for row in reader:
        # row is a list of unicode strings
        pass
Demo:
>>> from StringIO import StringIO
>>> import codecs
>>> f = StringIO(codecs.BOM_UTF16_BE + s[:-1])
>>> r = UnicodeReader(f, encoding='utf-16')
>>> next(r)
[u'b737-2', u'1pw009', u'none', u'c', u'5', u'j', u'j', u'29', u'gary', u' williams', u' parenti', u' finney & lewis', u'us', u'1']
If you are using Python 3, simply set the encoding parameter of the open() function to utf-16, and use the csv module as-is.
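A minimal sketch of the Python 3 approach, assuming the file starts with a BOM. The sample row and the temporary file are invented for the demonstration; passing newline='' to open() is what the csv module documentation recommends for files handed to csv.reader.

```python
import codecs
import csv
import os
import tempfile

# Write a small hypothetical sample file: BOM followed by big-endian UTF-16.
sample = 'b737-2,1pw009,none,c,5\r\n'
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as out:
    out.write(codecs.BOM_UTF16_BE + sample.encode('utf-16-be'))

# open() decodes the bytes; csv.reader then only ever sees text.
with open(path, encoding='utf-16', newline='') as inputfile:
    rows = list(csv.reader(inputfile))

os.remove(path)  # clean up the temporary file
print(rows)
```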