boltthrower - 2 months ago
Python Question

How to represent or shape data with >700 features for data mining?

I have a training data file in which each row holds a class label (0 or 1) followed by a string of numbers. The string encodes the molecular structure of a drug, and the class label indicates whether the drug is active.

The file looks like this: http://paste.ee/p/0G6fP




0 87 149 433 704 711 892 988 1056 1070 1234 1246 1289 1642 1669 1861 1924 1956 2081 2150 2909 3038 3070 3082 3589 3708 3709 3713 4011 4266 4404 4489 4534 4674 4688 5114 5133 5190 5253 5815 6114 6645 6750 6767 6862 6880 6960 6986 7028 7080 7112 7262 7426 7492 7494 7522 7614 8100 8258 8581 8631 8799 8824 8872 8958 9011 9146 9197 9202 9247 9249 9300 9324 9353 9391 9392 9669 10234 10314 10323 10341 10455 10471 10764 10811 10871 10938 10973 11210 11277 11317 11331 11470 11581 11588 11670 11820 12199 12250 12274 12372 12425 12471 12504 12505 12540 12575 12764 12801 13424 13457 13561 13587 13650 13700 13832 13873 13916 13974 14044 14203 14246 14386 14454 14676 14942 14952 15372 15555 15570 15938 16176 16233 16268 16274 16419 16765 16820 17236 17260 17287 17307 17319 17324 17369 17674 17714 17749 18091 18154 18327 18630 18957 19072 19395 19943 19962 20179 20355 20728 20807 20850 20958 21068 21424 21890 22029 22165 22314 22316 22548 22620 22764 22820 23018 23197 23326 23671 23707 24003 24178 24205 24258 24324 24347 24401 24405 24569 24820 24939 25172 25352 25541 25783 25952 26022 26376 26523 26764 26971 27111 27296 27330 27345 27414 27471 27491 27900 27961 27982 28070 28110 28115 28187 28250 28304 28366 28467 29026 29067 29100 29159 29169 29409 29483 29592 29601 30091 30201 30275 30315 30570 31499 31620 31713 31763 31779 32053 32072 32098 32167 32186 32199 32209 32287 32360 32378 32472 32531 32623 32648 32687 32783 32925 33298 33367 33406 33451 33767 33789 33814 33879 33930 34020 34173 34355 34633 34805 34830 35082 35615 35705 35975 36258 36295 36435 36605 36732 36931 37155 37242 37263 37347 37420 37431 37496 37589 37627 37824 38249 38385 38481 38551 38715 38752 38915 39157 39281 39426 39466 39474 39488 39854 39920 39974 40094 40169 40264 40530 40635 41160 41227 41237 41376 41807 41828 41989 42426 42497 42692 42790 43046 43078 43159 43387 43427 43437 43531 43550 43579 43712 43745 43754 44029 44044 44157 44319 44338 44462 44716 44750 44762 44948 44994 45072 45332 45372 45402 
45407 45438 45640 45722 45730 45770 45881 4595


Each string has a different number of segments (the sequence of numbers varies in length from row to row).
I need to do some feature reduction on this training set, possibly using random forests or another approach.
I'm unclear on how I should represent this data so that I can work on it and pass it to a model in scikit-learn. I tried putting it into a dataframe in Python, but that leads to a "jagged" dataframe, which is hard to work with. I also need to apply a variance threshold (VarianceThreshold).

Any suggestions on how to use this file?

Answer

You need to vectorize your data so that you have a rectangular matrix with one row per sample and one column for each possible value. You can do this using a CountVectorizer (this is usually used for processing text, but it will work for your data as well). The output will be a sparse matrix; depending on the model that you want to use, you may have to convert it to a dense array using X.toarray().

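from sklearn.feature_extraction.text import CountVectorizer

# df is assumed to be a DataFrame whose column 1 holds the number strings.
# binary=True stores presence/absence (0/1) instead of counts;
# vocabulary restricts the columns to the listed feature ids.
vec = CountVectorizer(binary=True, vocabulary=[str(i) for i in range(100000)])
X = vec.fit_transform(df[1])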
X
# <162x56905 sparse matrix of type '<class 'numpy.int64'>'
#   with 147915 stored elements in Compressed Sparse Row format>
X.toarray()
# array([[0, 0, 0, ..., 0, 1, 0],
#        [0, 0, 0, ..., 0, 0, 0],
#        [0, 0, 0, ..., 0, 0, 0],
#        ...,
#        [0, 0, 0, ..., 0, 0, 0],
#        [0, 0, 0, ..., 0, 0, 0],
#        [0, 0, 0, ..., 0, 0, 0]])
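
For the variance-threshold part of the question, here is a minimal end-to-end sketch on toy data. It is not the answerer's code: the helper parse_lines, the sample_lines list and the token_pattern setting are assumptions made for illustration. It assumes each line is a label followed by whitespace-separated feature ids, and it lets the vectorizer learn the vocabulary from the data instead of fixing it up front. The token_pattern is set to r"\d+" because the default pattern only keeps tokens of two or more characters, which would silently drop single-digit ids.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import VarianceThreshold


def parse_lines(lines):
    """Split 'label id id id ...' lines into integer labels and feature strings.

    Hypothetical helper: assumes the label is the first whitespace-separated
    token on each line and everything after it is the feature string.
    """
    labels, docs = [], []
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        labels.append(int(parts[0]))
        docs.append(parts[1] if len(parts) > 1 else "")
    return labels, docs


# Toy stand-in for the real file (made-up rows, not the asker's data).
sample_lines = [
    "0 87 149 433 704",
    "1 5 149 433 9000",
    "0 149 433 704",
    "1 5 149 433",
]
labels, docs = parse_lines(sample_lines)

# One column per distinct feature id, 1 if the id occurs in the row.
vec = CountVectorizer(binary=True, token_pattern=r"\d+")
X = vec.fit_transform(docs)
print(X.shape)  # (4, 6)

# Drop features whose variance does not exceed the threshold.
# With threshold=0.0 only constant columns are removed
# (here 149 and 433, which occur in every row).
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 4)

# Map the surviving columns back to the original feature ids.
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
kept = [name for name, keep in zip(names, selector.get_support()) if keep]
print(kept)
```

VarianceThreshold works on the sparse matrix directly, so there is no need to call toarray() first. For the real file you would read its lines (for example with open(...).readlines()) and pass them to parse_lines; a higher threshold removes features that are present, or absent, in almost every row. A tree-based model such as RandomForestClassifier should also accept the sparse matrix as input if you want to rank features by importance afterwards.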