I have a question about the following code.
I have a data set and a list. I want to compare each value in the data set against two conditions: if both are true, keep the original value from the data frame; otherwise set it to None. My code works fine on a small data set, but on my large data set it takes far too long and never produces a result. Is there a better solution?

    new_data=data
    for col in df.columns:
        for i in range(len(df)):
            if (df.iloc[i][col] >list_min[i] ) & (df.iloc[i][col]<list_max[i]):
                new_data.set_value(i,col,df.iloc[i][col])
            else:
                new_data.set_value(i,col,None)

The complete script:

    import numpy as np
    import pandas as pd

    data = pd.read_csv('./dataset/RMSSD/RMSSD_Exam_new.csv')
    i=0
    data = data.applymap(np.log)
    data = data.drop('time', axis=1)
    q75_list = []
    q25_list = []
    iqr_list = []
    min_list = []
    max_list = []
    new_data=data
    # Per-column quartiles
    for col in data.columns.values:
        q75_list.append(np.nanpercentile(data[col], 75))
        q25_list.append(np.nanpercentile(data[col], 25))
    # IQR bounds, one entry per column
    iqr_list = np.array(q75_list) - np.array(q25_list)
    min_list = np.array(q25_list) - (np.array(iqr_list * 1.5))
    max_list = np.array(q75_list) + (np.array(iqr_list * 1.5))
    print("Max :\n",max_list,"\n Min :\n",min_list)
    # Cell-by-cell check (this is the slow part)
    for col in data.columns:
        for (i, j) in [(i, j) for i in range(len(data)) for j in range(len(min_list))]:
            if (data.iloc[i][col] >min_list[j] ) & (data.iloc[i][col]<max_list[j]):
                new_data.set_value(i,col,data.iloc[i][col])
            else:
                new_data.set_value(i,col,None)
    new_data.to_csv('./dataset/outlier_result1.csv',index=False)

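If I understand correctly what you are doing, there are a couple of places where you could vectorize things. Since IQR = q75 - q25, the lower bound q25 - 1.5*IQR simplifies to 2.5*q25 - 1.5*q75, and the upper bound q75 + 1.5*IQR simplifies to 2.5*q75 - 1.5*q25. See if this speeds things up: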

    q75s = data.quantile(.75)
    q25s = data.quantile(.25)
    mins = 2.5*q25s - 1.5*q75s
    maxs = 2.5*q75s - 1.5*q25s
    newdata = data.copy()
    newdata[(data < mins) | (data > maxs)] = None
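
For anyone who wants to try the idea without the original CSV, here is a minimal self-contained sketch of the same vectorized approach. The helper name `mask_iqr_outliers`, its parameter `k`, and the small `demo` frame are made up for illustration; they are not from the question. It uses `DataFrame.where` instead of boolean assignment and keeps the strict inequalities from the question's loop, so a value exactly on a bound is also masked.

```python
import numpy as np
import pandas as pd


def mask_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of `frame` with values outside the per-column
    IQR fences replaced by NaN (hypothetical helper, for illustration)."""
    q25 = frame.quantile(0.25)   # Series indexed by column name
    q75 = frame.quantile(0.75)
    iqr = q75 - q25
    lower = q25 - k * iqr
    upper = q75 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series index with
    # the frame's columns, so each column is checked against its own bounds.
    inside = (frame > lower) & (frame < upper)
    # where() keeps values where the mask is True and puts NaN elsewhere;
    # it returns a new frame, so the input is left untouched.
    return frame.where(inside)


# Synthetic stand-in for the real data set (assumed values).
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, -50.0],
})
cleaned = mask_iqr_outliers(demo)
print(cleaned)
```

Because the whole comparison happens in a couple of array operations rather than one `iloc` lookup per cell, the run time should grow much more gently with the size of the frame than the nested loops do.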