PL3 PL3 - 3 years ago 106
Python Question

python regex match and replace beginning and end of string but keep the middle

I have a dataframe with holiday names. The problem is that some holidays are observed on a different day, sometimes on the day of another holiday. Here are some example problems:

1 "Independence Day (Observed)"
2 "Christmas Eve, Christmas Day (Observed)"
3 "New Year's Eve, New Year's Day (Observed)"
4 "Martin Luther King, Jr. Day"

I want to remove ' (Observed)' and, only when ' (Observed)' is matched, also remove everything up to and including the comma. Output should be:

1 "Independence Day"
2 "Christmas Day"
3 "New Year's Day"
4 "Martin Luther King, Jr. Day"

I was able to do both independently:

.replace(to_replace=r' \(Observed\)', value='', regex=True)
.replace(to_replace=r'.+, ', value='', regex=True)

but that caused a problem with 'Martin Luther King, Jr. Day'.
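
The problem can be reproduced without pandas. This is only a minimal sketch, assuming plain re.sub behaves like the two .replace(..., regex=True) calls above; the naive_clean name is made up for illustration:

```python
import re

def naive_clean(name):
    # Hypothetical helper: apply the two substitutions one after the other.
    name = re.sub(r' \(Observed\)', '', name)
    # The greedy '.+, ' removes everything up to the last comma,
    # whether or not the name had ' (Observed)' in it.
    return re.sub(r'.+, ', '', name)

print(naive_clean("Christmas Eve, Christmas Day (Observed)"))  # Christmas Day
print(naive_clean("Martin Luther King, Jr. Day"))              # Jr. Day
```

The second substitution has no way of knowing whether the first one matched, which is why the two steps need to be combined into one pattern.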

Answer Source

import re

holidays = [
    "Independence Day (Observed)",
    "Christmas Eve, Christmas Day (Observed)",
    "New Year's Eve, New Year's Day (Observed)",
    "Martin Luther King, Jr. Day",
]

for holiday in holidays:
    # Keep only group 2; strings that don't end in " (Observed)" are left unchanged.
    print(re.sub(r'^(.*?, )?(.*?)( \(Observed\))$', r'\2', holiday))


> python 
Independence Day
Christmas Day
New Year's Day
Martin Luther King, Jr. Day


  • ^: Match at start of string.
  • (.*?, )?: Match anything followed by a comma and a space. Make it a lazy match, so it doesn't consume the portion of the string we want to keep. The last ? makes the whole thing optional, because some of the sample input doesn't have a comma at all.
  • (.*?): Grab the part we want for later use in a capturing group. This part is also a lazy match because...
  • ( \(Observed\)): Require " (Observed)" on the end, declared in a separate group here. The lazy match in the prior piece won't consume this. Strings without this suffix don't match the pattern at all, so re.sub returns them unchanged.
  • $: Match at end of string.
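
To see what each group captures, the pattern can be inspected directly. This is a sketch rather than part of the original answer; PATTERN and strip_observed are names introduced here for illustration:

```python
import re

# Same pattern as above, compiled once and written as a raw string.
PATTERN = re.compile(r'^(.*?, )?(.*?)( \(Observed\))$')

def strip_observed(name):
    # Hypothetical helper: keep group 2 on a match, otherwise return name unchanged.
    return PATTERN.sub(r'\2', name)

holidays = [
    "Independence Day (Observed)",
    "Christmas Eve, Christmas Day (Observed)",
    "Martin Luther King, Jr. Day",
]

for holiday in holidays:
    match = PATTERN.match(holiday)
    # groups() is (prefix or None, kept part, suffix); no match means no substitution.
    groups = match.groups() if match else None
    print(repr(holiday), groups, repr(strip_observed(holiday)))
```

One caveat: a name with its own comma that is also observed, such as "Martin Luther King, Jr. Day (Observed)", would be cut down to "Jr. Day", because the optional first group takes everything up to the first comma. The same pattern with r'\2' as the value should also work in the .replace(..., regex=True) call from the question, though that is not tested here.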