Runner Bean - 2 months ago
Python Question

Python: Pandas Series - Why use loc?

Why do we use 'loc' for pandas DataFrames? The following code seems to run at a similar speed with or without loc:

%timeit df_user1 = df.loc[df.user_id=='5561']

100 loops, best of 3: 11.9 ms per loop


%timeit df_user1_noloc = df[df.user_id=='5561']

100 loops, best of 3: 12 ms per loop

So why use loc?

  • Explicit is better than implicit.

    df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:

    In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
       False  True 
    0      3      1
    1      4      2
    2      5      3

    You might want to use df[[True]] to select the True column. Instead it raises a ValueError:

    In [230]: df[[True]]
    ValueError: Item wrong length 1 instead of 3.

    In contrast, the following does not raise ValueError even though the structure of df2 is almost the same:

    In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
       A  B
    0  1  3
    1  2  4
    2  3  5
    In [259]: df2[['B']]
       B
    0  3
    1  4
    2  5

    Also note that

    In [231]: df.loc[[True]]
       False  True 
    0      3      1

    Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is unambiguous -- unlike df[boolean_mask], you don't need to know if df.columns contains boolean values to understand what df.loc[boolean_mask] will do.

  • df.loc[row_indexer, column_indexer] can select rows and columns by label in a single call. df[indexer] can select only rows or only columns, depending on type(indexer) and on the type of df's column labels (again, are they boolean?).

    In [237]: df2.loc[[True,False,True], 'B']
    0    3
    2    5
    Name: B, dtype: int64

  • When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

    In [239]: df2.loc[1:2]
       A  B
    1  2  4
    2  3  5
    In [271]: df2[1:2]
       A  B
    1  2  4
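
The three points above can be checked together in a short script. This is only a minimal sketch: it reuses df and df2 from the answer, while the names mask, rows_plain, rows_loc, b_values, loc_slice, plain_slice and bool_label_error are introduced here purely for illustration. Error messages, and what .loc does with a boolean list shorter than the index, may differ between pandas versions, so the sketch only passes full-length masks to .loc.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# A full-length boolean mask selects the same rows with or without .loc.
mask = df2["A"] > 1
rows_plain = df2[mask]
rows_loc = df2.loc[mask]

# .loc can also pick a column in the same call; df[mask] alone cannot.
b_values = df2.loc[mask, "B"]

# Slices: .loc is label-based and includes both end-points.
# With this default integer index, df[...] slices by position
# and excludes the stop.
loc_slice = df2.loc[1:2]
plain_slice = df2[1:2]

# Corner case from the answer: boolean column labels.
df = pd.DataFrame({True: [1, 2, 3], False: [3, 4, 5]})
try:
    df[[True]]  # treated as a (too short) row mask, not a column list
    bool_label_error = None
except ValueError as exc:
    bool_label_error = exc

print(rows_plain.equals(rows_loc))
print(b_values.tolist())
print(len(loc_slice), len(plain_slice))
print(type(bool_label_error).__name__)
```

Writing df2.loc[mask] costs nothing in speed, as the timings in the question suggest, and it makes clear to the reader that rows are being selected, whatever the column labels happen to be.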