Castor - 1 year ago
Python Question

What's the best way to find a missing row in a pandas DataFrame?

I'm new to Python and pandas, and I'm trying to use pandas to check whether the data in a DataFrame is continuous. For example:

    thread  sequence    start      end
14       1       114  1647143  1672244
15       1       115  1672244  1689707
16       1       116  1689707  1713090
17       1       118  1735352  1760283
18       1       119  1760283  1788062
19       1       120  1788062  1789885
20       1       121  1789885  1790728


Every row has 4 columns. In general, sequence should increase in steps of 1, so if everything is correct it would look like 116, 117, 118, ..., like the output of range(). But the example here is missing the row with sequence == 117.

I tried to find it, but I don't know how to do it. Checking the sequence values one by one would be inefficient. The desired output would either report the missing row or fill the missing row with NaN.

Any good tips or suggestions would be helpful.

Answer Source

A faster method using RangeIndex:

In [415]: seq = pd.RangeIndex(df.sequence.min(), df.sequence.max())

In [416]: seq[~seq.isin(df.sequence)].values
Out[416]: array([117])
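
The question also asks about filling the missing row with NaN. Here is a minimal sketch of one way to do both, assuming the DataFrame is named df and built by hand from the sample in the question. The names full, missing and filled are only for illustration. Note that the stop value of RangeIndex is exclusive, so the sketch adds 1 to the maximum to cover the whole span.

```python
import pandas as pd

# Sample data copied from the question; the row with sequence == 117 is absent.
df = pd.DataFrame(
    {
        "thread": [1, 1, 1, 1, 1, 1, 1],
        "sequence": [114, 115, 116, 118, 119, 120, 121],
        "start": [1647143, 1672244, 1689707, 1735352, 1760283, 1788062, 1789885],
        "end": [1672244, 1689707, 1713090, 1760283, 1788062, 1789885, 1790728],
    },
    index=range(14, 21),
)

# Every sequence number that should be present (stop is exclusive, hence + 1).
full = pd.RangeIndex(df["sequence"].min(), df["sequence"].max() + 1, name="sequence")

# Sequence numbers that do not appear in the data.
missing = full[~full.isin(df["sequence"])].tolist()
print(missing)  # [117]

# Reindex on the full range so each absent sequence becomes a row of NaN.
filled = df.set_index("sequence").reindex(full).reset_index()
print(filled)
```

After the reindex, the other columns become float because integer columns cannot hold NaN. The original row labels (14 to 20) are also replaced by a fresh 0-based index.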