eijen eijen - 1 year ago 83
Python Question

Hyphen at beginning of regex causes it to stop matching (python 2.7) - but at the end it's fine?

I'm writing a simple script to dump the tracks, artists, and times of a Bandcamp album (https://nihonkizuna.bandcamp.com/album/nihon-kizuna), but I'm having trouble with the regex. For context, the track titles are in the format "Artist - Title". I'm trying to split the dumped track titles so that I have the artists in one list and the titles in another, and then write these and the times to a CSV file.

For some reason, the expression:

(.*) -

Finds the artist correctly, but:

- (.*)

Fails to find the title correctly. Instead I get:

AttributeError: 'NoneType' object has no attribute 'group'

I've tried escaping the hyphen, but Python returns None for the match whenever the hyphen is the first character of the pattern. I've tried testing it by running the regex against an actual title, "- 9 Samurai", and it still fails.

import pandas as pd
from lxml import html
import re
import requests

page = requests.get("https://nihonkizuna.bandcamp.com/album/nihon-kizuna")
tree = html.fromstring(page.content)

tracks = tree.xpath('//table[@id ="track_table"]//td[@class="title-col"]/div[@class="title"]/a/span/text()')
time = tree.xpath('//table[@id ="track_table"]//td[@class="title-col"]/div[@class="title"]/span/text()')
newtimes = []
artists = []
newtracks = []

for item in time:
    newitem = item.strip()

for item in tracks:
    track_item = re.match("(.*) -", item)
    newitem2 = re.match("- (.*)", item)

raw_data = {"track": newtracks, "artist": artists, "time": newtimes}

df = pd.DataFrame(raw_data, columns = ["track", "artist", "time"])
df.index += 1

df.to_csv(raw_input("Input the csv path."))

Answer Source

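As the documentation for re.match states:

If zero or more characters at the beginning of string match the regular expression pattern, (...).

In other words, re.match only tries the pattern at the very start of the string. "(.*) -" works because the artist comes first, but "- (.*)" cannot match a string that begins with the artist name, so re.match returns None and calling .group on it raises the AttributeError.

Use re.search instead; it scans through the string for the first position where the pattern matches.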
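A minimal sketch of the difference, assuming a made-up track string (the artist "Example Artist" is hypothetical; only the "Artist - Title" format and the title "9 Samurai" come from the question). The helper split_track is not part of the original script; it is one possible way to get both parts from a single pattern, and it assumes the first " - " is the separator.

```python
import re

# Hypothetical track string in the "Artist - Title" format from the question.
item = "Example Artist - 9 Samurai"

# re.match only tries the pattern at position 0, so "- (.*)" finds nothing here.
assert re.match(r"- (.*)", item) is None

# re.search scans the string for the first position where the pattern matches.
title = re.search(r"- (.*)", item).group(1)

# The artist pattern starts at the beginning of the string, so re.match is fine.
artist = re.match(r"(.*) -", item).group(1)

print(artist)
print(title)


def split_track(text):
    """Split "Artist - Title" into (artist, title) with one pattern.

    Hypothetical helper: splits on the first " - " and returns
    (None, text) when no separator is present.
    """
    m = re.match(r"(.*?) - (.*)", text)
    if m is None:
        return None, text.strip()
    return m.group(1), m.group(2)
```

Checking the match object for None before calling .group avoids the AttributeError when a track has no " - " separator.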
