Blairg23 - 4 months ago
Python Question

Suggestions on Word Parsing

I have a set of folders and files with arbitrary names. My end goal is to parse through the folders and files and create a nicely sorted and named set of folders. The names sometimes use spaces as delimiters and sometimes periods (I haven't found any examples with other delimiters). I want to display these filenames without delimiters and with only the real words (specifically the title of the file, plus a date if the date is relevant). I'm not worrying about the dates for now; I have a lookup table to figure out the dates based on the correctly spelled title.

Examples of titles:

  1. a.bad.title.asdf.1975
    (where asdf is the author or website the file was scraped from).

The title should read:
A Bad Title (1975)

  2. another bad title 1975

Should read:
Another Bad Title (1975)

  3. a really.bad title[1975]

Should read:
A Really Bad Title (1975)

What I've tried:

Possible Solution: Split each name on the delimiters to pull out the separate words, then look each element of the resulting array up in a large dictionary I have to decide whether it is a real word.

Problem 1: With this approach, a name such as a.bad.title.1975 becomes
('a', 'bad', 'title', '1975')
and I can work with it without a problem. However, a really.bad title[1975] becomes
('a', 'really', 'bad', 'title[1975]')
and can't be dealt with.
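
One possible way around this, as a minimal sketch: treat brackets as delimiters alongside spaces and periods, so the year splits off from the last word. The function name `split_title` and the exact set of bracket characters are assumptions for illustration, not something taken from the real data.

```python
import re

def split_title(raw):
    """Split a raw name on whitespace, periods, and brackets, dropping empty pieces."""
    # Assumption: (), [] and {} never occur inside a real word of the title.
    return [tok for tok in re.split(r"[\s.\[\](){}]+", raw) if tok]

print(split_title("a really.bad title[1975]"))
# ['a', 'really', 'bad', 'title', '1975']
```

The filter on empty strings matters because a name that ends in a delimiter (here the closing bracket) leaves an empty piece at the end of the split.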

Problem 2: Some titles are numbers or contain numbers, like 2001: A Space Odyssey, so I can't simply keep only the elements that are real words.

EDIT (Examples of problem 2):

Filename 1:

Filename 2:
2012 [2009].txt

Filename 3:


In other words, I want to be able to remove a date or stray numbers, but keep a number when it belongs to the title (some titles are dates or years). On top of that, some words in the title are attached to the year without spaces and can't be split off using the delimiters alone.
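
A rough heuristic for the "year as a title" case, sketched under loud assumptions: the name has already been split into tokens and any file extension removed, the release year (when present) is the last token, and release years fall in 1900-2099. The names `YEAR` and `split_year` are hypothetical.

```python
import re

# Assumption: release years are four digits in the range 1900-2099.
YEAR = re.compile(r"(?:19|20)\d{2}")

def split_year(tokens):
    """Return (title_tokens, year). Only a trailing year-like token counts as the date."""
    # A lone number (e.g. a file named just "2012") is kept as the title.
    if len(tokens) > 1 and YEAR.fullmatch(tokens[-1]):
        return tokens[:-1], tokens[-1]
    return tokens, None

print(split_year(["2012", "2009"]))                    # (['2012'], '2009')
print(split_year(["2001", "a", "space", "odyssey"]))   # (['2001', 'a', 'space', 'odyssey'], None)
```

This will still misfire on a title that ends in a year and carries no release date, so the result would need checking against the lookup table.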

My last idea is possibly giving scores to each possible title based on how many words they have in common, but that still leaves the "year as a title" problem unsolved.
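
The scoring idea could look something like the sketch below, which uses Jaccard similarity (shared words divided by all distinct words) as the score. The contents of `known_titles` are made up for illustration and stand in for the real lookup table; `overlap_score` is a hypothetical helper.

```python
import re

def overlap_score(tokens, known_title):
    """Jaccard similarity between the candidate's words and a known title's words."""
    a = {t.lower() for t in tokens}
    b = set(re.findall(r"\w+", known_title.lower()))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical stand-in for the lookup table of correctly spelled titles.
known_titles = ["A Bad Title", "Another Bad Title", "A Really Bad Title"]

tokens = ["a", "bad", "title", "asdf", "1975"]
best = max(known_titles, key=lambda t: overlap_score(tokens, t))
print(best)  # A Bad Title
```

Extra tokens such as the scraper name or the year lower every score equally, so they do not change which title wins, only how confident the match looks.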

If anyone has any suggestions that might help me think about this problem, please let me know!


Quick n' Dirty:

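import re

for title in [
        "a.bad.title.asdf.1975",
        "another bad title 1975",
        "a really.bad title[1975]"]:
    # \w+ matches runs of letters, digits, and underscores, so spaces,
    # periods, and brackets all act as delimiters.
    print(" ".join(map(str.title, re.findall(r"\w+", title))))

Output:

A Bad Title Asdf 1975
Another Bad Title 1975
A Really Bad Title 1975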

In this form, it should be easy to compare against known titles.
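
As one way to do that comparison, here is a minimal sketch using the standard library's `difflib.get_close_matches`. The `known_titles` list is a made-up stand-in for the real lookup table, and the `cutoff` of 0.6 is simply difflib's default, not a tuned value.

```python
import difflib

# Hypothetical stand-in for the table of correctly spelled titles.
known_titles = ["A Bad Title", "Another Bad Title", "A Really Bad Title"]

def closest_title(cleaned, candidates, cutoff=0.6):
    """Return the best fuzzy match for a cleaned-up name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_title("A Bad Title Asdf 1975", known_titles))
```

Leftover tokens like the scraper name drag the similarity ratio down, so the cutoff may need lowering for long junk suffixes.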