Adam Warner - 2 years ago 109

Python Question

So I am working on web-scraping some sites and I look at substitution data and I want to know the score at the same time. So I have when the time the subs occur and the times the goals happen. I then want to link what the score is at the specific time of the substitution. Here is an example:

`import pandas as pd`

df_stack = ['31:12',

'34:12',

'34:12',

'57:50',

'57:50',

'67:03',

'68:48',

'77:18',

'80:00',

'90:00']

# This df_stack that is commented works.

#df_stack = ['34:40', '36:53', '55:38', '56:03', '67:31', '74:43', '84:38',

# '86:58', '86:58']

In = ['a']*len(df_stack)

Out = ['b']*len(df_stack)

Subs = pd.DataFrame(data = [In,Out]).T

Subs.columns = ['In','Out']

Subs.index = [df_stack]

### This score works

#Score = ['0-0','0-1','1-1']

#Score = pd.DataFrame(data = [Score]).T

#Score.columns = ['Score']

#Score.index = ['61:37','61:38','81:45']

### This Score Doesn't Work

Score = ['0-0','0-1','1-1','2-1']

Score = pd.DataFrame(data = [Score]).T

Score.columns = ['Score']

Score.index = ['58:39', '58:40', '83:31', '89:41']

k = 0

j = 0

q = 0

overall_score = []

time = []

for i in Subs.index.tolist():

try:

if i < Score.index.tolist()[k]:

overall_score.append(Score['Score'][k])

time.append([Score.index[k],i,k,'top',Score['Score'][k]])

q += 1

else:

if (k > 0 and i > Score.index.tolist()[k] and i < Score.index.tolist()[k+1]):

overall_score.append(Score['Score'][k])

time.append([Score.index[k],i,Score.index[k+1],k,'No Change',q,Score['Score'][k]])

j += 1

q += 1

if (k == 0 and i > Score.index.tolist()[k]):

k += 1

q += 1

overall_score.append(Score['Score'][k])

time.append([Score.index[k],i,Score.index[k+1],k,'First Goal',Score['Score'][k]])

if (j >= 1 and i > Score.index.tolist()[k+j]):

h = 0

h += k + j

if k >= len(Score):

h = len(Score)-1

overall_score.append(Score['Score'][h])

time.append([Score.index[h],i,k,'Another Goal',j,Score['Score'][k]])

except IndexError:

#overall_score.append(Score['Score'][k-1])

overall_score.append(Score['Score'][len(Score)-1])

I know it is a lot of code, but the desired output of overall_score should be:

`['0-0', '0-0', '0-0', '0-0', '0-0', '0-1', '0-1', '0-1','0-1' '2-1']`

There might be a much easier way to do this, I am also willing to put the whole scraping code online, but it is fairly long. So the substitutions with the overall score would look like:

`In Out Score`

31:12 a b 0-0

34:12 a b 0-0

34:12 a b 0-0

57:50 a b 0-0

57:50 a b 0-0

67:03 a b 0-1

68:48 a b 0-1

77:18 a b 0-1

80:00 a b 0-1

90:00 a b 2-1

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

**SOLUTION 1**

One solution is using the `apply`

method on your dataframe, given that you have a function that applies the correct conditional logic to your rows.

This solution uses a dictionary of scores, where the key is the time, and the value is the score. The dictionary then gets passed as an additional argument to the function that will apply the logic to your dataframe.

I re-created your data below, but instead of using time as the index, I created an actual `time`

column:

```
df_stack = ['31:12', '34:12', '34:12', '57:50', '57:50', '67:03', '68:48', '77:18', '80:00', '90:00']
subs = pd.DataFrame({'time': df_stack})
subs['in'] = 'a'
subs['out'] = 'b'
```

Now here's the `scores`

dictionary:

```
scores = {'58:39': '0-0', '58:40': '0-1', '83:31': '1-1', '89:41': '2-1'}
```

Now this is the function that you will pass to `apply`

. Note, that this function sorts the dictionary by key before iterating the values to determine the correct score. The function also assumes that all scores start at "0-0". You could also explicitly define this assumption in your dictionary as well by adding a key/value record of `'00:00': '0-0'`

.

```
def map_score_to_time(time, scores):
score_at_sub = '0-0'
for score_time, score in sorted(scores.items(), key=lambda kv: kv[0]):
if time >= score_time:
score_at_sub = score
return score_at_sub
```

Now, with your function defined, you can now apply to your dataframe:

```
subs['score'] = subs['time'].apply(map_score_to_time, scores=scores)
```

Result:

```
time in out score
0 31:12 a b 0-0
1 34:12 a b 0-0
2 34:12 a b 0-0
3 57:50 a b 0-0
4 57:50 a b 0-0
5 67:03 a b 0-1
6 68:48 a b 0-1
7 77:18 a b 0-1
8 80:00 a b 0-1
9 90:00 a b 2-1
```

**SOLUTION 2**

This alternate solution assumes your scores is a dataframe, like how you have created in your example. However, for this solution to work, you have to explicitly define what the score is at time `00:00`

. Let's assume a game's score is always `0-0`

at time `00:00`

.

Our `subs`

dataframe is still going to be built the same as earlier, so let's build our `scores_df`

dataframe. Note, I explicitly added a record to the dataframe to have a record for time `00:00`

.

```
scores_df = pd.DataFrame({'time': ['00:00', '58:39', '58:40', '83:31', '89:41'], 'score': ['0-0', '0-0', '0-1', '1-1', '2-1']})
```

Now, we have to do a cartesian join between the two dataframes. This is an intermediary step so that we can have the time column from `subs`

and the time column from `scores`

. In order do this join, we have to create a dummy join key so you'll want to create that for both dataframes.

```
# Create dummy keys
scores_df['key'] = 1
subs['key'] = 1
# Now join
merged_df = subs.merge(scores_df, how='inner', on='key')
```

After join, you want to filter out records where `time_x`

(the time from `subs`

) is greater than `time_y`

(the time from `scores`

), group by `time_x`

, `in`

, and `out`

, and then grab the last record per group.

```
final_df = merged_df[merged_df['time_x'] > merged_df['time_y']].groupby(['time_x', 'in', 'out']).tail(1)
```

Result:

```
time_x in out key score time_y
0 31:12 a b 1 0-0 00:00
10 34:12 a b 1 0-0 00:00
20 57:50 a b 1 0-0 00:00
27 67:03 a b 1 0-1 58:40
32 68:48 a b 1 0-1 58:40
37 77:18 a b 1 0-1 58:40
42 80:00 a b 1 0-1 58:40
49 90:00 a b 1 2-1 89:41
```

Note that duplicate records by `time_x`

, `in`

, and `out`

are dropped. You can drop the `key`

and `time_y`

columns if you want.

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**