khpeek khpeek - 3 months ago 12x
Python Question

Interpretation of numpy polyfit on a big dataset

I'm analyzing a publicly available dataset: an assessment of properties in San Francisco for tax purposes. It can be downloaded as a CSV file, which by default is named 'Historic_Secured_Property_Tax_Rolls.csv'.

Using this file, I'm trying to figure out the annual growth rate of the Land Values, excluding zero values. The dataset is so large that I get errors if I try to plot it, so I'm first trying to rely on my understanding of how np.polyfit works.

I've used the following code to derive a linear fit of the natural logarithm of the 'Land Value' column plotted against the 'Fiscal Year' column:

import numpy as np
import pandas as pd

# Read in data downloaded from
df = pd.read_csv('Historic_Secured_Property_Tax_Rolls.csv')

df_nz = df[df['Closed Roll Assessed Land Value'] > 0] # Only consider non-zero Land Values

p = np.polyfit(df_nz['Closed Roll Fiscal Year'], np.log(df_nz['Closed Roll Assessed Land Value']), 1)

This yields the following values for p:

In [42]: p
Out[42]: array([ 4.18802559e-02, -7.23804441e+01])

As I understand it, the slope of the linear fit should be represented by p[1]. However, this would represent an implausible growth rate of about -7,238% per year. If it were p[0], however, it would be a more plausible 4.2% per year.

I'm wondering whether I have misinterpreted the result, and whether the growth rate is in fact represented by p[0] instead of p[1].

From the np.polyfit documentation:

p : ndarray, shape (M,) or (M, K)
    Polynomial coefficients, highest power first.  If `y` was 2-D, the
    coefficients for `k`-th data set are in ``p[:,k]``.

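This tells me that p[0], the 4.2%, is the coefficient on the linear (Fiscal Year) term, i.e. the slope of the fit to the log of the Land Value, and that p[1] is the intercept.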
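One way to convince yourself of the coefficient order is to run the same kind of fit on data where the answer is known in advance. The sketch below uses made-up values (not the San Francisco dataset) that grow by exactly 5% per year; the names years, values, slope, intercept and growth are only illustrative.

```python
import numpy as np

# Synthetic data (hypothetical, not the San Francisco dataset):
# values that grow by exactly 5% per year from a starting value of 1000.
years = np.arange(2007, 2016)
values = 1000.0 * 1.05 ** (years - 2007)

# Fit log(value) = slope * year + intercept.
# np.polyfit returns the highest power first, so the slope is p[0]
# and the intercept is p[1].
slope, intercept = np.polyfit(years, np.log(values), 1)

# The slope is a growth rate in log terms; exp(slope) - 1 converts it
# to an ordinary annual percentage growth rate.
growth = np.exp(slope) - 1

print("slope (p[0]):    ", slope)
print("intercept (p[1]):", intercept)
print("annual growth:   ", growth)
```

Here the slope comes out as log(1.05) and the intercept is a large negative number, because it is the fitted log value extrapolated all the way back to year 0. That is the same pattern as in the question. Applying the same conversion to the slope in the question, exp(0.0419) - 1 is about 4.3% per year, close to the 4.2% read directly off p[0].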
