khpeek - 1 year ago 154
Python Question

# Interpretation of numpy polyfit on a big dataset

I'm analyzing a publicly available dataset: an assessment of properties in San Francisco for tax purposes (https://data.sfgov.org/Housing-and-Buildings/Historic-Secured-Property-Tax-Rolls/wv5m-vpq2). It can be downloaded as a CSV file, which by default is named 'Historic_Secured_Property_Tax_Rolls.csv'.

Using this file, I'm trying to figure out the annual growth rate of the Land Values, excluding zero values. The dataset is so large that I get errors if I try to plot it, so I'm first trying to rely on my understanding of how `polyfit` works.

I've used the following code to derive a linear fit of the natural logarithm of the 'Land Value' column plotted against the 'Fiscal Year' column:

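``````
import numpy as np
import pandas as pd

# Assumes the CSV described above is in the working directory
df = pd.read_csv('Historic_Secured_Property_Tax_Rolls.csv')

df_nz = df[df['Closed Roll Assessed Land Value'] > 0]   # Only consider non-zero Land Values

# Degree-1 (linear) fit of log(Land Value) against Fiscal Year
p = np.polyfit(df_nz['Closed Roll Fiscal Year'], np.log(df_nz['Closed Roll Assessed Land Value']), 1)
``````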

This yields the following values for `p`:

``````
In [42]: p
Out[42]: array([  4.18802559e-02,  -7.23804441e+01])
``````

As I understand it, the slope of the linear fit should be represented by `p[1]`. However, read as a rate, -72.38 would represent an implausible growth rate of roughly -7,238% per year. If the slope were `p[0]`, it would be a more plausible 4.2% per year.

I'm wondering whether I have misinterpreted the result, and whether the growth rate is in fact represented by `p[0]` rather than `p[1]`?
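As a side note on reading a fit like this, a short derivation may help. It is only a sketch: $V(t)$, $a$, $b$ and $g$ are illustrative names (land value in fiscal year $t$, slope, intercept, annual growth rate) and do not appear in the code above.

```latex
\begin{align}
\ln V(t) &= a\,t + b \\
\frac{V(t+1)}{V(t)} &= e^{a} \\
g &= e^{a} - 1 \;\approx\; a \quad \text{for small } |a|
\end{align}
```

Under this model, a slope of $a = 4.18802559 \times 10^{-2}$ corresponds to $g \approx 4.28\%$ per year, whereas a slope of $a = -72.38$ would mean $g = e^{-72.38} - 1$, which is essentially $-100\%$ per year.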

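The `Returns` section of the `polyfit` documentation states that the polynomial coefficients are given highest power first.

This tells me that `p[0]`, the `4.2%`, is the slope: the coefficient on the fiscal year in the fit of the log of the land value. `p[1]` is the intercept.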
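One way to confirm the coefficient order without plotting the full dataset is to run the same kind of fit on a small series whose growth rate is known. The sketch below uses made-up data (5% annual growth from a starting value of 1,000 in 2007); the names `years`, `true_rate`, `values`, `slope`, `intercept` and `growth` are illustrative and not taken from the question.

```python
import numpy as np

# Synthetic, made-up data: a value growing 5% per year from 1,000 in 2007.
years = np.arange(2007, 2016)
true_rate = 0.05
values = 1_000 * (1 + true_rate) ** (years - 2007)

# Degree-1 fit of log(value) against year.
# np.polyfit returns coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(years, np.log(values), 1)

# Convert the slope of the log fit into an annual growth rate.
growth = np.exp(slope) - 1

print(f"slope={slope:.5f}  intercept={intercept:.2f}  growth={growth:.2%}")
```

Here the first coefficient recovers the known growth rate, while the second is a large negative number, much like the `-7.23804441e+01` in the question. That second value is the fitted log value extrapolated back to year 0, so it is not a rate.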