alphanumeric - 1 month ago
Python Question

How to remove data from DataFrame permanently

After reading CSV data file with:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)


I get a DataFrame that is 99 rows long:

(99, 2)


To clean up the DataFrame, I apply the dropna() method, which reduces it to 33 rows:

df = df.dropna()
print(df.shape)


which prints:

(33, 2)


Now when I iterate over a column, it prints all 99 rows as if they had not been dropped:

for index, value in df['column1'].items():
    print(index)


which gives me this:

0
1
2
.
.
.
97
98
99


It appears that dropna() simply made the data "hidden", and the hidden data comes back when I iterate over the DataFrame. How can I make sure the dropped data is actually removed from the DataFrame rather than just hidden?

Answer

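You're being confused by the fact that dropna() preserves the original row labels, so the last row label is still 99. The rows really have been removed; the index simply has gaps where the dropped rows used to be.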

Example:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})
df

Out[2]:
    a
0   0
1   1
2 NaN
3 NaN
4   4

After calling dropna(), the remaining rows keep their original index labels:

In [3]:
df = df.dropna()
df

Out[3]:
   a
0  0
1  1
4  4
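
One way to convince yourself that the rows are really gone, not hidden, is to compare the number of rows with the labels that iteration yields. This is a minimal sketch that rebuilds the small example frame above; the names cleaned, labels and third_value are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})
cleaned = df.dropna()

# Iteration yields only the surviving rows, paired with their original labels.
labels = [label for label, value in cleaned['a'].items()]

print(len(cleaned))  # number of rows that are left
print(labels)        # original labels, with gaps where rows were dropped

# Positional access ignores the labels: iloc[2] is the third remaining row.
third_value = cleaned['a'].iloc[2]
print(third_value)
```

Three rows are left and the labels are 0, 1 and 4, matching Out[3]. The loop runs three times; it only looks like nothing was dropped if you read the last label as a row count.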

If you want the labels to be contiguous again, call reset_index(drop=True) to assign a new index. Passing drop=True discards the old labels instead of inserting them as a new column:

In [4]:
df = df.reset_index(drop=True)
df

Out[4]:
   a
0  0
1  1
2  4
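
The two steps can also be chained, so that no intermediate frame with gaps in its index is kept around. The sketch below uses the same example data as above; the names compact, kept and positions are illustrative, and the enumerate variant is just one possible alternative when you only need a running position and would rather leave the index alone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})

# Drop rows containing NaN and renumber the remaining rows from 0 in one expression.
compact = df.dropna().reset_index(drop=True)
print(list(compact.index))

# Alternative: keep the original labels but count positions while iterating.
kept = df.dropna()
positions = [pos for pos, (label, value) in enumerate(kept['a'].items())]
print(positions)
```

Which variant fits depends on whether the original labels still carry meaning, for example as line numbers in the CSV file. If they do, keeping them and counting with enumerate avoids losing that information.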