Russell Russell - 3 months ago 28
Python Question

Unicode in the standard TensorFlow format

Following the documentation here, I am trying to create features from unicode strings. This is the feature creation method:

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


Passing a unicode string to this function raises an exception:

  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 512, in init
    copy.extend(field_value)
  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/containers.py", line 275, in extend
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
  File "/home/rklopfer/.virtualenvs/tf/local/lib/python2.7/site-packages/google/protobuf/internal/type_checkers.py", line 108, in CheckValue
    raise TypeError(message)
TypeError: u'Gross' has type <type 'unicode'>, but expected one of: (<type 'str'>,)


Naturally, if I wrap the value in str(), it fails on the first non-ASCII character it encounters.

Answer

BytesList is defined in feature.proto, and its value field has type repeated bytes. This means you need to pass it something that is convertible to a list of byte sequences.

There is more than one way to turn a unicode string into bytes, so the conversion is ambiguous and is not done for you. Do it explicitly instead by choosing an encoding. For example, to use UTF-8:

value.encode("utf-8")
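To make the difference concrete, here is a minimal sketch using only the standard library. The helper name to_utf8_bytes is hypothetical (it is not part of TensorFlow or protobuf), and the sample string is just an illustration. It shows that an ASCII encode, which is what str() attempts implicitly on Python 2, fails on a non-ASCII character, while an explicit UTF-8 encode produces a byte string that round-trips.

```python
# -*- coding: utf-8 -*-

def to_utf8_bytes(value):
    """Hypothetical helper: return value as UTF-8 bytes.

    Byte strings pass through unchanged; text is encoded explicitly.
    """
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


text = u"Gro\u00df"  # contains one non-ASCII character

# An ASCII encode cannot represent the non-ASCII character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encode yields bytes that decode back to the same text.
encoded = to_utf8_bytes(text)
decoded = encoded.decode("utf-8")
```

The bytes returned by such a helper are what would be passed as value to _bytes_feature from the question. When reading the record back, decode with the same encoding to recover the original string.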