SuperNoobAttack - 4 days ago
Python Question

Using a regex to grab folder titles from a text file (Python)

I am attempting to use a regex to read through a text file and make a folder in a certain directory based on what the regex finds. The text file is the HTML source code of the page I want to grab the folder titles from (that's why the regex is searching for such an odd value).

This is the file I'm reading from (it's super long).

Here is my code:

import os
import re
with open('folders.txt','r', encoding='utf-8') as f:
    lines = f.readlines()

match = re.search(r'>[\w\.-]+</a></td>', lines)
match = match.rstrip("</a></td>")
match = match.lstrip(">")
newpath = r'C:\Desktop\scriptFolders\%s' %match
if not os.path.exists(newpath): os.makedirs(newpath)


When I throw this code into a shell it gives me the following error:

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "C:\Python34\lib\re.py", line 170, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer


How far off track am I?

Answer

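There are a number of mistakes in your code, and a few things that could be improved. The TypeError itself comes from passing lines, which is a list, to re.search, which expects a single string. The rstrip and lstrip calls would fail next, because re.search returns a match object rather than a string. The rest is easier to show than to explain in prose, so here's a working version of the code, with comments highlighting the changes and the reasons behind them.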

import os
import re

# Precompile the regex so it only happens once. This saves a bit of time,
# especially if your file is large.
# I've also modified the regex to include a capture group [1] for the part
# between the > and the <, allowing us to grab the string there later. There
# are other ways to do it (e.g. with lookbehind and lookahead), but this is the
# simplest.
regex = re.compile(r'>([\w\.-]+)</a></td>')

with open('folders.txt', 'r', encoding='utf-8') as f:
    # Loop through the lines in f.
    # Alternatively, you can also do
    #     lines = f.readlines()
    #     for line in lines:
    #         ...
    # but it's less memory-efficient because it puts the whole file in memory.
    for line in f:
        match = regex.search(line)
        # re.search returns a match object [2], or None if the string doesn't
        # match the regex.
        if not match:  # Throw away non-matching lines.
            continue
        # Get the value of capture group #1.
        match = match.group(1)
        newpath = r'C:\Desktop\scriptFolders\%s' % match
        if not os.path.exists(newpath):
            os.makedirs(newpath)
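
To see why the original attempt fails, here is a small sketch that reproduces each problem in isolation. The sample line is made up for illustration (it is not taken from your folders.txt); the pattern is the one from the question.

```python
import re

pattern = r'>[\w\.-]+</a></td>'

# Hypothetical sample standing in for the list returned by f.readlines().
lines = ['<td><a href="x">data</a></td>\n']

# 1. re.search expects a string, so passing the whole list raises TypeError.
try:
    re.search(pattern, lines)
    raised = False
except TypeError:
    raised = True

# 2. Searching a single line works, but the result is a match object;
#    the matched text has to be pulled out with .group().
match = re.search(pattern, lines[0])
text = match.group()

# 3. rstrip/lstrip remove any of the listed characters, not a fixed
#    suffix/prefix. Every character of this folder name happens to be in
#    the set "</a>td", so the whole string is stripped away.
stripped = text.rstrip("</a></td>").lstrip(">")

print(raised, repr(text), repr(stripped))
```

The third point is the subtle one: even with the first two fixed, any folder name ending in a, t or d would be silently shortened, which is why the capture group is the safer route.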

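For completeness, the lookbehind/lookahead variant mentioned in the comments might look roughly like this. It's only a sketch: the HTML fragment is invented for the example, and it uses findall on the whole text instead of looping line by line, which assumes the file is small enough to hold in memory.

```python
import re

# (?<=>) asserts that the match is preceded by '>', and (?=</a></td>) that it
# is followed by '</a></td>'. Neither is included in the matched text.
lookaround = re.compile(r'(?<=>)[\w.-]+(?=</a></td>)')

# The capture-group version from the code above, for comparison.
grouped = re.compile(r'>([\w.-]+)</a></td>')

# Hypothetical HTML fragment, not the real page source.
sample = (
    '<tr><td><a href="/a">alpha</a></td></tr>\n'
    '<tr><td>no link here</td></tr>\n'
    '<tr><td><a href="/b">beta-2.0</a></td></tr>\n'
)

# With lookarounds, findall returns the whole match; with a single capture
# group, it returns the contents of that group. Both give the same names.
names = lookaround.findall(sample)
names_from_group = grouped.findall(sample)

print(names)
```

Both patterns pick out the same folder names here; the capture group is usually easier to read, so which one to use is mostly a matter of taste.
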
References:

  1. Capture groups
  2. Match objects