mazkopolo - 2 months ago
Python Question

python extract URLs from a text file with no html tags

Most of the posts I have found here rely on HTML tags to find the URLs in a text file, but not every text file has HTML tags around its URLs. I am looking for a solution that works in both situations. I am using the following regex:

'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

to obtain the URLs from a text file with the code below, but the problem is that it also captures unwanted characters such as '>'.

Here is my code:

import re

def extractURLs(fileContent):
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
    print(urls)
    return urls

URLs = []  # accumulated results; must exist before it is extended
with open("emailBody.txt") as myFile:
    fileContent = myFile.read()
URLs = URLs + extractURLs(fileContent)


An example of the output is shown below:

http://saiconference.com/ficc2018/submit
http://52.21.30.170/sendy/unsubscribe/qhiz2s763l892rkps763chacs52ieqkagf8rbueme9n763jv6da/hs1ph7xt5nvdimnwwfioya/qg0qteh7cllbw8j6amo892ca>
https://www.youtube.com/watch?v=gvwyoqnztpy>
http://saiconference.com/ficc
http://saiconference.com/ficc>
http://saiconference.com/ficc2018/submit>


As you can see there are some characters (such as '>') that are causing problems. What am I doing wrong?

Answer

Quick solution, assuming '>' is the only character that appears at the end: url.rstrip('>')

rstrip removes every trailing occurrence of the given character from a single string. So you will have to iterate through the list and strip each URL.
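
A minimal sketch of that loop, assuming the URLs are already in a list. The sample values are taken from the output in the question, and strip_trailing is a hypothetical helper name, not something from the original code:

```python
def strip_trailing(urls, chars=">"):
    """Return a new list with trailing characters from `chars` removed from each URL."""
    # str.rstrip removes every trailing character found in `chars`,
    # so "abc>>" becomes "abc", not "abc>".
    return [url.rstrip(chars) for url in urls]


urls = [
    "http://saiconference.com/ficc>",
    "http://saiconference.com/ficc2018/submit",
]
print(strip_trailing(urls))
# ['http://saiconference.com/ficc', 'http://saiconference.com/ficc2018/submit']
```

URLs that do not end in '>' pass through unchanged, so it is safe to run this over the whole list.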

Edit: Just got a PC with Python, so here is a regex answer, after testing.

import re

def extractURLs(fileContent):
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
    cleanUrls = []
    for url in urls:
        lastChar = url[-1]  # get the last character
        # if the last character is not (^ means "not") a letter, a digit,
        # or a '/' (some URLs end with one; add your own characters as needed), strip it
        if re.match(r'[^a-zA-Z0-9/]', lastChar):
            cleanUrls.append(url[:-1])  # strip only the last character, whatever it is
        else:
            cleanUrls.append(url)  # otherwise, append the URL unchanged
    print(cleanUrls)
    return cleanUrls

URLs = extractURLs("http://saiconference.com/ficc2018/submit>")

But if it's just one character, it is simpler to use .rstrip().
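
As for why '>' is matched in the first place: inside the character class [$-_@.&+], the part $-_ is read as a range from '$' (0x24) to '_' (0x5F), and that range happens to include '<', '>', '=', '?', '/' and ':'. The sketch below is one possible alternative, not a complete URL grammar. It lists the allowed characters explicitly with the hyphen escaped. The names URL_PATTERN and extract_urls, the exact set of allowed characters, and the example.com URL are assumptions for illustration; adjust them to your data.

```python
import re

# Assumption: the set of characters allowed after the scheme. The hyphen is
# escaped, so "$\-_" means three literal characters and not a range that
# happens to include '>'. '/', '?', '=' and ':' are now listed explicitly,
# because the original pattern only matched them through that range.
URL_PATTERN = re.compile(
    r"https?://(?:[a-zA-Z0-9$\-_@.&+!*(),/?=:#~;]|%[0-9a-fA-F]{2})+"
)


def extract_urls(text):
    """Return all http(s) URLs in `text`, minus trailing sentence punctuation."""
    # No .lower() here: paths and query strings can be case-sensitive.
    return [url.rstrip(".,;:!") for url in URL_PATTERN.findall(text)]


sample = (
    "Submit at <http://saiconference.com/ficc2018/submit> "
    "or see https://example.com/Watch?v=AbC123."
)
print(extract_urls(sample))
# ['http://saiconference.com/ficc2018/submit', 'https://example.com/Watch?v=AbC123']
```

Note that the code in the question lowercases the whole file before matching, which also lowercases paths and query strings; the sketch leaves the text as is and matches both letter cases in the pattern instead.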