khpeek - 4 months ago
Python Question

Interpretation of numpy polyfit on a big dataset

I'm analyzing a publicly available dataset: an assessment of properties in San Francisco for tax purposes (https://data.sfgov.org/Housing-and-Buildings/Historic-Secured-Property-Tax-Rolls/wv5m-vpq2). It can be downloaded as a CSV file, which by default is named 'Historic_Secured_Property_Tax_Rolls.csv'.

Using this file, I'm trying to figure out the annual growth rate of the Land Values, excluding zero values. The dataset is so large that I get errors if I try to plot it, so I'm first trying to rely on my understanding of how polyfit works.

I've used the following code to derive a linear fit of the natural logarithm of the 'Land Value' column plotted against the 'Fiscal Year' column:

import numpy as np
import pandas as pd

# Read in data downloaded from https://data.sfgov.org/api/views/wv5m-vpq2/rows.csv?accessType=DOWNLOAD
df = pd.read_csv('Historic_Secured_Property_Tax_Rolls.csv')

df_nz = df[df['Closed Roll Assessed Land Value'] > 0] # Only consider non-zero Land Values

p = np.polyfit(df_nz['Closed Roll Fiscal Year'], np.log(df_nz['Closed Roll Assessed Land Value']), 1)


This yields the following values for p:

In [42]: p
Out[42]: array([ 4.18802559e-02, -7.23804441e+01])


As I understand it, the slope of the linear fit should be represented by p[1]. However, -72.38 would represent an implausible growth rate of roughly -7,238% per year. If the slope were p[0], however, it would be a more plausible 4.2% per year.

I'm wondering if I haven't somehow misinterpreted the result, and whether the growth rate is somehow represented by p[0] instead of p[1]?

Answer
From the numpy.polyfit docstring:

Returns
-------
p : ndarray, shape (M,) or (M, K)
    Polynomial coefficients, highest power first.  If `y` was 2-D, the
    coefficients for `k`-th data set are in ``p[:,k]``.

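This tells me that p[0], the coefficient of the highest power (here the 'Closed Roll Fiscal Year' term), is the slope, so the growth rate is about 4.2% per year. p[1] is the intercept: the fitted log Land Value extrapolated back to year 0, which is why it is so large and negative.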
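To check the coefficient order described in the docstring, here is a minimal sketch on synthetic data. The years, slope, and intercept below are made-up values chosen for illustration, not taken from the San Francisco dataset. It fits a line to log values with a known slope and confirms that the slope comes back as p[0].

```python
import numpy as np

# Synthetic data (hypothetical, not the San Francisco dataset):
# log values that grow by exactly 0.04 per year.
years = np.arange(2007, 2016)
true_slope = 0.04
true_intercept = -70.0
log_values = true_slope * years + true_intercept

# Degree-1 fit: coefficients come back highest power first,
# so p[0] is the slope and p[1] is the intercept.
p = np.polyfit(years, log_values, 1)
slope, intercept = p

# A slope in log space corresponds to an annual growth factor of exp(slope).
annual_growth = np.exp(slope) - 1
print(f"slope={slope:.4f}, intercept={intercept:.2f}, growth={annual_growth:.2%}")
```

For small slopes, exp(slope) - 1 is close to the slope itself, which is why reading 0.0419 directly as "about 4.2% per year" is a reasonable approximation.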

More to come...