wen - 3 months ago
Python Question

Why do I get random results from seemingly non-random code with Python sklearn?

I updated the question based on the responses.



I have a list of strings named "str_tuple". I want to compute some similarity measures between the first element in the list and the rest of the elements. I run the following six-line code snippet.

What completely baffles me is that the outcome seems to be completely random every time I run the code. However, I cannot see any randomness introduced in my six-liner.

Update:



It was pointed out that TruncatedSVD() has a "random_state" argument. Specifying "random_state" gives a fixed result (which is completely true). However, if you change "random_state", the result changes. With other strings (e.g. str2), the result is the same no matter how you change "random_state". In fact, these strings are from the HOME_DEPOT Kaggle competition. I have a pd.Series containing thousands of such strings; most of them give non-random results and behave like str2 (no matter what "random_state" is set to). For some unknown reason, str1 is one of the examples that give a different result every time you change "random_state". I am starting to think that some intrinsic characteristic of str1 makes the difference.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# str1 yields random results
str1 = [u'l bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']
# str2 yields non-random result
str2 = [u'angl bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']

vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")
# replacing str1 with str2 gives a non-random result regardless of random_state
cmat = vectorizer.fit_transform(str1).astype(float) # sparse matrix
cmat = TruncatedSVD(2).fit_transform(cmat) # dense numpy array
cmat = Normalizer().fit_transform(cmat) # dense numpy array, rows scaled to unit length
sim = np.dot(cmat, cmat.T) # pairwise cosine similarities
sim[0,1:].tolist() # similarity of the first string to each of the others
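For readers following along: the last two steps of the snippet amount to a cosine similarity. A minimal sketch of that equivalence, using made-up 2-D points rather than the real projections (the array `points` and its values are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical 2-D projections for three documents (illustrative values only).
points = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])

unit_rows = Normalizer().fit_transform(points)  # each row scaled to L2 norm 1
sim = np.dot(unit_rows, unit_rows.T)            # dot products of unit vectors

# The same quantity from the definition cos(a, b) = a.b / (|a| |b|).
norms = np.linalg.norm(points, axis=1)
cosine = points @ points.T / np.outer(norms, norms)

print(np.allclose(sim, cosine))
```

Because each row is divided by its own length, a row whose length is close to zero ends up with a direction determined by whatever tiny values it contains.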

Answer

By default, TruncatedSVD uses a randomized algorithm (algorithm='randomized'). So, to get reproducible results you must pass a fixed random_state, i.e. an integer seed like the value you would give to numpy.random.seed.

cmat = TruncatedSVD(n_components=2, random_state=42).fit_transform(cmat)
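A quick way to check the reproducibility claim, as a minimal sketch on a made-up toy corpus (the strings in `docs` and the helper name `project` are assumptions for illustration, not taken from the question):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the Kaggle data).
docs = [
    "angle bracket",
    "strong tie angle",
    "steel angle bracket for corners",
    "strong steel tie",
]
counts = CountVectorizer().fit_transform(docs).astype(float)

def project(seed):
    """Project the count matrix onto 2 components with a fixed seed."""
    return TruncatedSVD(n_components=2, random_state=seed).fit_transform(counts)

# Two independent fits with the same seed should give the same array.
same_seed_matches = np.allclose(project(42), project(42))
print(same_seed_matches)
```

This only shows that a fixed seed makes a run repeatable; it says nothing about whether different seeds agree with each other, which is the part the question's update is about.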

Docs

class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)


For the output to be non-random, the tokens of the first element in the list must occur more than once, i.e. at least one of them must also appear in another element. That is to say, if the first element of str1 were angl, versatile or simpson, it would give non-random results. Since angl, the first token of str2, is repeated elsewhere in the list, str2 doesn't return random output.

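Hence, the randomness reflects how dissimilar the first element is from the rest of the list. When it shares no tokens with the other strings, its projection onto the two retained components is most likely close to zero, so Normalizer ends up rescaling what is essentially numerical noise. In those cases, fixing random_state is useful to get a reproducible output.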
[credit to @wen for pointing this out]
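The effect can be probed by looking at the row norms before Normalizer is applied. The following is a sketch under stated assumptions, not a diagnosis of the original data: the toy corpus in `docs` is made up so that the first string shares no token with the others, and `row_norms` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): the first string shares no token with the rest.
docs = [
    "wall bracket",
    "strong tie angle steel",
    "strong tie angle steel joint",
    "corner nail screw corner nail screw",
    "corner nail screw",
]
counts = CountVectorizer().fit_transform(docs).astype(float)

def row_norms(seed):
    """L2 norm of each document after projection onto 2 components."""
    svd = TruncatedSVD(n_components=2, random_state=seed)
    projected = svd.fit_transform(counts)
    return np.linalg.norm(projected, axis=1)

# Compare a few seeds: the first row should stay tiny, the others should not.
norms_by_seed = {seed: row_norms(seed) for seed in (0, 1, 2)}
for seed, norms in norms_by_seed.items():
    print(seed, norms)
```

If the first row's norm is many orders of magnitude smaller than the others, its normalized direction carries no real information, and any similarity computed from it will depend on the seed. One possible guard, under the same assumptions, is to treat such rows as having zero similarity instead of normalizing them.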