not_a_robot not_a_robot - 12 days ago 7
Python Question

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a

predict
method back with the original data in a
pandas.DataFrame
object.

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)


To merge these predictions back with the original
df
, I try this:

df['y_hats'] = y_hats


But that raises:


ValueError: Length of values does not match length of index


I know I could split the
df
into
train_df
and
test_df
and this problem would be solved, but in reality I need to follow the path above to create the matrices
X
and
y
(my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my
df
, since the
y_hats
array is zero-indexed and seemingly all information about which rows were included in the
X_test
and
y_test
is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in
train
with
np.nan
values in the dataframe.

Answer

your y_hats length will only be the length on the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by examining the accuracy of your model on the X_test predictions compared to the X_test true values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2

EDIT per your comment, here is an updated result the returns the dataset with the prediction appended where they were in the test datset

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
Comments