Akash Chandra Akash Chandra - 17 days ago 5
Python Question

How to replace outlier data in pandas?

I have a stock data grabbed from Yahoo finance, adjusted close data is wrong somehow.

adj_close close ratio
date
2014-10-16 240.4076 2466.40 0.097473
2014-10-17 245.8173 2521.90 0.097473
2014-10-20 250.4522 2569.45 0.097473
2014-10-21 251.8850 2584.15 0.097473
2014-10-22 251.0175 2575.25 0.097473
2014-10-23 251.3392 2578.55 0.097473
2014-10-27 253.2155 2597.80 0.097473
2014-10-28 258.9616 2656.75 0.097473
2014-10-29 257.6944 2643.75 0.097473
2014-10-30 257.1339 2638.00 0.097473
2014-10-31 26.3450 2702.80 0.009747
2014-11-03 26.5463 2723.45 0.009747
2014-11-05 27.1160 2781.90 0.009747
2014-11-07 26.7320 2742.50 0.009747
2014-11-10 26.7027 2739.50 0.009747


Here's a plot of the adjusted close data:

plot of ad_close data

How can I replace data like this using any methods like interpolation or something?

Answer

Try this:

In [71]: import pandas_datareader.data as web

In [110]: df = web.DataReader('SBIN.NS', 'yahoo', '2014-10-21', '2014-11-25')

In [111]: df
Out[111]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200    26.3450   # <bad_data>
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600    26.5463
2014-11-04   272.3450   272.3450   272.3450   272.3450         0    26.5463
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100    27.1160
2014-11-06   278.1900   278.1900   278.1900   278.1900         0    27.1160
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000    26.7320
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800    26.7027
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900    26.7110
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200    26.7666
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000    26.5059
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600    27.1740
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100    28.6629
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300    28.7131
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900    28.3695   # </bad_data>
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

NOTE: Adj Close column has recovered starting from 2014-11-20, other columns - not, so i'll concentrate on Adj Close only:

let's find outliers (i'm checking for those that have changed for 50+% from the previous day - you may want to change this threshold):

In [112]: bad_idx = df.index[df['Adj Close'].pct_change().abs().ge(0.5)]

In [113]: bad_idx
Out[113]: DatetimeIndex(['2014-10-31', '2014-11-20'], dtype='datetime64[ns]', name='Date', freq=None)

In [114]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] *= 10

In [115]: df
Out[115]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200   263.4500
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600   265.4630
2014-11-04   272.3450   272.3450   272.3450   272.3450         0   265.4630
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100   271.1600
2014-11-06   278.1900   278.1900   278.1900   278.1900         0   271.1600
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000   267.3200
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800   267.0270
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900   267.1100
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200   267.6660
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000   265.0590
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600   271.7400
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100   286.6290
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300   287.1310
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900   283.6950
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

Here is another solution, which uses interpolation:

In [119]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] = np.nan

In [120]: df
Out[120]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200        NaN
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600        NaN
2014-11-04   272.3450   272.3450   272.3450   272.3450         0        NaN
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100        NaN
2014-11-06   278.1900   278.1900   278.1900   278.1900         0        NaN
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000        NaN
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800        NaN
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900        NaN
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200        NaN
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000        NaN
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600        NaN
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100        NaN
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300        NaN
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900        NaN
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

In [122]: df['Adj Close'] = df['Adj Close'].interpolate()

In [123]: df
Out[123]:
                 Open       High        Low      Close    Volume   Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300  251.885000
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100  251.017500
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200  251.339200
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0  251.339200
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500  253.215500
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400  258.961600
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900  257.694400
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200  257.133900
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200  259.297807
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600  261.461713
2014-11-04   272.3450   272.3450   272.3450   272.3450         0  263.625620
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100  265.789527
2014-11-06   278.1900   278.1900   278.1900   278.1900         0  267.953433
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000  270.117340
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800  272.281247
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900  274.445153
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200  276.609060
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000  278.772967
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600  280.936873
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100  283.100780
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300  285.264687
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900  287.428593
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500  289.592500
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200  297.780200
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400  301.045600
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600  296.756800