Stephen Stephen - 1 year ago 63
Python Question

Python: store a value in a variable so that you can recognize each reoccurence

If this question is unclear, I am very open to constructive criticism.

I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.

So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).

My current code below simply takes the first 5 rows because I know that's how many times this date occurs.

import os
from numpy import genfromtxt
import pandas as pd

path = 'Z:\\folderwithcsvfile'

for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
broken_df = pd.read_csv(file_path)
df3 = broken_df['DATE']
df4 = broken_df['TRADE ID']
df5 = broken_df['AVAILABLE STOCK']
df6 = broken_df['AMOUNT']
df7 = broken_df['SALE PRICE']
print (df3)
#print (df3.head(6))
print (df4.head(6))
print (df5.head(6))
print (df6.head(6))
print (df7.head(6))

Answer Source

This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filter DataFrame with just the correct columns.

Here's a simple CSV that is similar to your structure:


Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])

# find the latest datetime
latest_date = df['DATE'].max()

# use index filtering to only choose the columns that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)

# now you can perform your operations on latest_rows

In my example, this will print:

0 2016-10-11       123              123
1 2016-10-11       123              123
4 2016-10-11       123              123