python - How to Concat a DataFrame with String to a DataFrame with Unicode and normalize datatype


I'm having issues when concatenating two DataFrames that contain different types of strings in Python 2. One holds normal Python 2 (byte) strings, the other Unicode strings. The concatenation works, but the element types inside the NumPy arrays remain what they were (by design, I'm sure).

import pandas as pd
from pandas import DataFrame, MultiIndex
from datetime import datetime as dt

df = DataFrame(data={'data': ['a', 'bbb', 'cc']},
               index=MultiIndex.from_tuples([(dt(2016, 1, 1), 2),
                                             (dt(2016, 1, 1), 3),
                                             (dt(2016, 1, 2), 2)],
                                            names=['date', 'id']))

df2 = DataFrame(data={'data': [u'aaaaaaa']},
                index=MultiIndex.from_tuples([(dt(2016, 1, 2), 4)],
                                             names=['date', 'id']))

df3 = pd.concat([df, df2])

output:

>>> df.data.values
array(['a', 'bbb', 'cc'], dtype=object)

>>> df2.data.values
array([u'aaaaaaa'], dtype=object)

>>> df3.data.values
array(['a', 'bbb', 'cc', u'aaaaaaa'], dtype=object)

As you can see, the resulting array is "mixed": it contains both byte strings and Unicode strings. Is there a way to force a typecast to one or the other? If not, is there an easy way to check whether one side is Unicode, and then convert the whole column to str or unicode?

(I care because pd.lib.infer_dtype will mark the dtype of the NumPy array as "mixed", and I need it marked as either 'string' or 'unicode' to differentiate it from other objects that can be stored in pandas/NumPy arrays.)
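In modern pandas the inference function lives at pandas.api.types.infer_dtype rather than pd.lib, and under Python 3 the analogous problem is a column mixing bytes and str. A minimal sketch of normalizing such a column so inference gives a uniform answer (the helper name normalize_to_text is made up for illustration):

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Python 3 analogue of the question: bytes and str mixed in one object column.
s = pd.Series([b'a', b'bbb', b'cc', u'aaaaaaa'], dtype=object)

def normalize_to_text(series):
    # Hypothetical helper: decode any bytes elements so the column is uniformly text.
    return series.map(lambda v: v.decode('utf8') if isinstance(v, bytes) else v)

normalized = normalize_to_text(s)
print(infer_dtype(normalized))  # a uniform text column is inferred as 'string'
```

Normalizing toward text (rather than bytes) is usually the safer direction, since decoding is explicit about the encoding while leaving already-decoded values untouched.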

Use applymap and encode:

df3.applymap(lambda s: s.encode('utf8')) 


df3.applymap(lambda s: s.encode('utf8')).data.values

array(['a', 'bbb', 'cc', 'aaaaaaa'], dtype=object)
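Going the other way, if you would rather end up with Unicode text than byte strings, decode only the elements that need it. The sketch below is written for Python 3, where the Python 2 str/unicode split corresponds to bytes/str; it uses Series.map on the column, which behaves the same across pandas versions:

```python
import pandas as pd

# Rebuild df3's mixed column (in Python 3, bytes stands in for a Python 2 str).
df3 = pd.DataFrame({'data': [b'a', b'bbb', b'cc', u'aaaaaaa']})

# Decode the bytes elements and leave text alone, so the column is uniformly str.
df3['data'] = df3['data'].map(
    lambda s: s.decode('utf8') if isinstance(s, bytes) else s)

print(df3.data.values)  # array(['a', 'bbb', 'cc', 'aaaaaaa'], dtype=object)
```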
