Adam Warner Adam Warner - 2 years ago 109
Python Question

Data Conditional Statements

So I am working on web-scraping some sites and I look at substitution data and I want to know the score at the same time. So I have when the time the subs occur and the times the goals happen. I then want to link what the score is at the specific time of the substitution. Here is an example:

import pandas as pd
df_stack = ['31:12',
'34:12',
'34:12',
'57:50',
'57:50',
'67:03',
'68:48',
'77:18',
'80:00',
'90:00']

# This df_stack that is commented works.
#df_stack = ['34:40', '36:53', '55:38', '56:03', '67:31', '74:43', '84:38',
# '86:58', '86:58']

In = ['a']*len(df_stack)
Out = ['b']*len(df_stack)
Subs = pd.DataFrame(data = [In,Out]).T
Subs.columns = ['In','Out']
Subs.index = [df_stack]


### This score works

#Score = ['0-0','0-1','1-1']
#Score = pd.DataFrame(data = [Score]).T
#Score.columns = ['Score']
#Score.index = ['61:37','61:38','81:45']

### This Score Doesn't Work
Score = ['0-0','0-1','1-1','2-1']
Score = pd.DataFrame(data = [Score]).T
Score.columns = ['Score']
Score.index = ['58:39', '58:40', '83:31', '89:41']


k = 0
j = 0
q = 0

overall_score = []
time = []
for i in Subs.index.tolist():
try:
if i < Score.index.tolist()[k]:
overall_score.append(Score['Score'][k])
time.append([Score.index[k],i,k,'top',Score['Score'][k]])
q += 1
else:


if (k > 0 and i > Score.index.tolist()[k] and i < Score.index.tolist()[k+1]):
overall_score.append(Score['Score'][k])
time.append([Score.index[k],i,Score.index[k+1],k,'No Change',q,Score['Score'][k]])
j += 1
q += 1

if (k == 0 and i > Score.index.tolist()[k]):
k += 1
q += 1

overall_score.append(Score['Score'][k])
time.append([Score.index[k],i,Score.index[k+1],k,'First Goal',Score['Score'][k]])

if (j >= 1 and i > Score.index.tolist()[k+j]):
h = 0
h += k + j
if k >= len(Score):
h = len(Score)-1
overall_score.append(Score['Score'][h])
time.append([Score.index[h],i,k,'Another Goal',j,Score['Score'][k]])


except IndexError:
#overall_score.append(Score['Score'][k-1])
overall_score.append(Score['Score'][len(Score)-1])


I know it is a lot of code, but the desired output of overall_score should be:

['0-0', '0-0', '0-0', '0-0', '0-0', '0-1', '0-1', '0-1','0-1' '2-1']


There might be a much easier way to do this, I am also willing to put the whole scraping code online, but it is fairly long. So the substitutions with the overall score would look like:

In Out Score
31:12 a b 0-0
34:12 a b 0-0
34:12 a b 0-0
57:50 a b 0-0
57:50 a b 0-0
67:03 a b 0-1
68:48 a b 0-1
77:18 a b 0-1
80:00 a b 0-1
90:00 a b 2-1

Answer Source

SOLUTION 1

One solution is using the apply method on your dataframe, given that you have a function that applies the correct conditional logic to your rows.

This solution uses a dictionary of scores, where the key is the time, and the value is the score. The dictionary then gets passed as an additional argument to the function that will apply the logic to your dataframe.

I re-created your data below, but instead of using time as the index, I created an actual time column:

df_stack = ['31:12', '34:12', '34:12', '57:50', '57:50', '67:03', '68:48', '77:18', '80:00', '90:00']
subs = pd.DataFrame({'time': df_stack})
subs['in'] = 'a'
subs['out'] = 'b'

Now here's the scores dictionary:

scores = {'58:39': '0-0', '58:40': '0-1', '83:31': '1-1', '89:41': '2-1'}

Now this is the function that you will pass to apply. Note, that this function sorts the dictionary by key before iterating the values to determine the correct score. The function also assumes that all scores start at "0-0". You could also explicitly define this assumption in your dictionary as well by adding a key/value record of '00:00': '0-0'.

def map_score_to_time(time, scores):
    score_at_sub = '0-0'
    for score_time, score in sorted(scores.items(), key=lambda kv: kv[0]):
        if time >= score_time:
            score_at_sub = score
    return score_at_sub

Now, with your function defined, you can now apply to your dataframe:

subs['score'] = subs['time'].apply(map_score_to_time, scores=scores)

Result:

    time in out score
0  31:12  a   b   0-0
1  34:12  a   b   0-0
2  34:12  a   b   0-0
3  57:50  a   b   0-0
4  57:50  a   b   0-0
5  67:03  a   b   0-1
6  68:48  a   b   0-1
7  77:18  a   b   0-1
8  80:00  a   b   0-1
9  90:00  a   b   2-1

SOLUTION 2

This alternate solution assumes your scores is a dataframe, like how you have created in your example. However, for this solution to work, you have to explicitly define what the score is at time 00:00. Let's assume a game's score is always 0-0 at time 00:00.

Our subs dataframe is still going to be built the same as earlier, so let's build our scores_df dataframe. Note, I explicitly added a record to the dataframe to have a record for time 00:00.

scores_df = pd.DataFrame({'time': ['00:00', '58:39', '58:40', '83:31', '89:41'], 'score': ['0-0', '0-0', '0-1', '1-1', '2-1']})

Now, we have to do a cartesian join between the two dataframes. This is an intermediary step so that we can have the time column from subs and the time column from scores. In order do this join, we have to create a dummy join key so you'll want to create that for both dataframes.

# Create dummy keys
scores_df['key'] = 1
subs['key'] = 1

# Now join
merged_df = subs.merge(scores_df, how='inner', on='key')

After join, you want to filter out records where time_x (the time from subs) is greater than time_y (the time from scores), group by time_x, in, and out, and then grab the last record per group.

final_df = merged_df[merged_df['time_x'] > merged_df['time_y']].groupby(['time_x', 'in', 'out']).tail(1)

Result:

   time_x in out  key score time_y
0   31:12  a   b    1   0-0  00:00
10  34:12  a   b    1   0-0  00:00
20  57:50  a   b    1   0-0  00:00
27  67:03  a   b    1   0-1  58:40
32  68:48  a   b    1   0-1  58:40
37  77:18  a   b    1   0-1  58:40
42  80:00  a   b    1   0-1  58:40
49  90:00  a   b    1   2-1  89:41

Note that duplicate records by time_x, in, and out are dropped. You can drop the key and time_y columns if you want.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download