Dhruv Ghulati - 3 months ago
Python Question

Cannot get unique IDs for strings in Python 2.7

I am trying to generate a unique ID for each word in a list. I want these numbers to be globally unique: if the same word appears in another list, it should get the same ID. For example, the ID for "density" might be 151111911, and it should still be 151111911 when "density" occurs in a different list.

As you can see, my current method, which uses id and intern, is not working: the ID for rrb is exactly the same as the ID for lrb.

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

featureVector = mydefaultdict(mydouble)

for featureID,featureVal in enumerate(featureList):
    print "featureID is",featureID
    print "featureVal is ",featureVal
    print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))
    featureVector[featureID] = featureVal


featureID is 0
featureVal is guinea
Encoded feature value is 4569583120.0
featureID is 1
featureVal is bissau
Encoded feature value is 4569581632.0
featureID is 2
featureVal is compared
Encoded feature value is 4569583120.0
featureID is 3
featureVal is countriesthe
Encoded feature value is 4567944360.0
featureID is 4
featureVal is population
Encoded feature value is 4347153072.0
featureID is 5
featureVal is density
Encoded feature value is 4455561472.0
featureID is 6
featureVal is guinea
Encoded feature value is 4569581632.0
featureID is 7
featureVal is bissau
Encoded feature value is 4569583120.0
featureID is 8
featureVal is similar
Encoded feature value is 4496118144.0
featureID is 9
featureVal is iran
Encoded feature value is 4569583120.0
featureID is 10
featureVal is afghanistan
Encoded feature value is 4569581632.0
featureID is 11
featureVal is cameroon
Encoded feature value is 4569583120.0
featureID is 12
featureVal is panama
Encoded feature value is 4569581632.0
featureID is 13
featureVal is montenegro
Encoded feature value is 4569583120.0
featureID is 14
featureVal is guinea
Encoded feature value is 4569581632.0
featureID is 15
featureVal is belarus
Encoded feature value is 4569583120.0
featureID is 16
featureVal is palau
Encoded feature value is 4569581632.0
featureID is 17
featureVal is location_slot
Encoded feature value is 4567944360.0
featureID is 18
featureVal is south
Encoded feature value is 4569583120.0
featureID is 19
featureVal is africa
Encoded feature value is 4569581632.0
featureID is 20
featureVal is respective
Encoded feature value is 4569583120.0
featureID is 21
featureVal is population
Encoded feature value is 4347153072.0
featureID is 22
featureVal is density
Encoded feature value is 4455561472.0
featureID is 23
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 24
featureVal is capita
Encoded feature value is 4569581632.0
featureID is 25
featureVal is per
Encoded feature value is 4455914152.0
featureID is 26
featureVal is square
Encoded feature value is 4347127296.0
featureID is 27
featureVal is kilometer
Encoded feature value is 4569581632.0
featureID is 28
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 29
featureVal is global
Encoded feature value is 4346597072.0
featureID is 30
featureVal is rank
Encoded feature value is 4346629984.0
featureID is 31
featureVal is number_slot
Encoded feature value is 4569583120.0
featureID is 32
featureVal is years
Encoded feature value is 4569581632.0
featureID is 33
featureVal is growthguinea
Encoded feature value is 4567944360.0
featureID is 34
featureVal is bissau
Encoded feature value is 4569583120.0
featureID is 35
featureVal is population
Encoded feature value is 4347153072.0
featureID is 36
featureVal is density
Encoded feature value is 4455561472.0
featureID is 37
featureVal is positive
Encoded feature value is 4514096160.0
featureID is 38
featureVal is growth
Encoded feature value is 4569583120.0
featureID is 39
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 40
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 41
featureVal is last
Encoded feature value is 4346568112.0
featureID is 42
featureVal is years
Encoded feature value is 4569583120.0
featureID is 43
featureVal is lrb
Encoded feature value is 4537993216.0
featureID is 44
featureVal is rrb
Encoded feature value is 4537993216.0
featureID is 45
featureVal is LOCATION_SLOT~-appos+LOCATION~-prep_of
Encoded feature value is 4538026784.0
featureID is 46
featureVal is LOCATION~-prep_of+that~-prep_to
Encoded feature value is 6043251168.0
featureID is 47
featureVal is that~-prep_to+similar~prep_with
Encoded feature value is 6043251168.0
featureID is 48
featureVal is similar~prep_with+density~prep_of
Encoded feature value is 6043251168.0
featureID is 49
featureVal is density~prep_of+NUMBER~appos
Encoded feature value is 6043251168.0
featureID is 50
featureVal is NUMBER~appos+NUMBER~amod
Encoded feature value is 6043247024.0
featureID is 51
featureVal is NUMBER~amod+NUMBER_SLOT
Encoded feature value is 6043247024.0


What am I doing wrong? I need to convert these words into numbers because the feature list above will go into a classifier that requires numerical (vectorized) features.

Answer

From the Python 2 docs for intern():

Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.

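By the time the next string is interned, the previous one may already have been deleted, because nothing holds a reference to it any more. The new string can then be allocated at the same memory address and so occasionally gets the same id. To prevent this, keep references to the interned strings in a container. I'll use a dict: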

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

# dict of id: interned string pairs (holding the reference keeps each string alive)
seen = {}

for featureID,featureVal in enumerate(featureList):
    print "featureID is",featureID
    print "featureVal is ",featureVal
    interned = intern(str(featureVal.encode("utf-8")))
    interned_id = id(interned)

    # ensure that no other string with the same id has been seen
    assert interned_id not in seen or seen[interned_id] == featureVal

    # change this to seen[interned_id] = None and you'll (probably) get AssertionError
    # from the line above
    seen[interned_id] = interned

    print "Encoded feature value is", interned_id
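
Note that in CPython id() is based on the object's memory address, so even with the references kept alive the values will generally differ from one run of the program to the next. If the goal is an ID that stays the same for "density" across lists and runs, one possible approach is to derive the number from the word itself. The following is only a sketch under that assumption, not part of the code above: stable_id and Vocabulary are hypothetical names, and truncating a SHA-256 digest to 32 bits is an arbitrary choice that makes collisions unlikely but not impossible. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import hashlib


def stable_id(word):
    """Return an integer derived from the word's UTF-8 bytes (hypothetical helper).

    The same word always yields the same number, in any list and any process.
    """
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    # Keep only the first 8 hex digits (32 bits) so the number stays small;
    # this truncation is an assumption and allows rare collisions.
    return int(digest[:8], 16)


class Vocabulary(object):
    """Assign each distinct word a dense index (0, 1, 2, ...) on first sight."""

    def __init__(self):
        self.index = {}

    def id_for(self, word):
        # setdefault stores the current size only if the word is new,
        # otherwise it returns the index assigned earlier.
        return self.index.setdefault(word, len(self.index))


vocab = Vocabulary()
for word in [u"density", u"lrb", u"rrb", u"density"]:
    print(word, stable_id(word), vocab.id_for(word))
```

The hash-based number needs no shared state, so it suits IDs that must agree across separate lists or runs. The Vocabulary indices are compact and convenient as positions in a feature vector, but they are only consistent as long as the same Vocabulary object (or a saved copy of its dict) is reused.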