Most of the posts I have found here rely on HTML tags to find the URLs in a text file. But not all text files have HTML tags around their URLs. I am looking for a solution that works in both situations. The regex I am using is:
urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
myFile = open("emailBody.txt")
fileContent = myFile.read()
URLs = URLs + extractURLs(fileContent)
Quick solution, assuming '>' is the only character that appears at the end: use str.rstrip('>').
It strips trailing '>' characters from a single string. So, you will have to iterate through the list and strip each URL.
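A minimal sketch of that loop, assuming the URLs are already in a list. The sample list and the name clean_urls are made up for illustration; in the question the list would come from re.findall():

```python
# Hypothetical sample list; in the question this would come from re.findall().
urls = [
    "http://saiconference.com/ficc2018/submit>",
    "http://example.com/page",
]

# rstrip('>') removes every trailing '>' and leaves other strings unchanged.
clean_urls = [url.rstrip(">") for url in urls]
print(clean_urls)
```

URLs that do not end in '>' pass through untouched, so it is safe to apply this to the whole list.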
Edit: Just got a PC with Python, so here is a regex answer, after testing.
import re

def extractURLs(fileContent):
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', fileContent.lower())
    cleanUrls = []
    for url in urls:
        lastChar = url[-1]  # get the last character
        # If the last character is not (^ means "not") a letter, a digit,
        # or a '/' (some URLs end with one; add your own characters as needed),
        # then enter the if branch.
        if re.match(r'[^a-zA-Z0-9/]', lastChar):
            cleanUrls.append(url[:-1])  # strip the last character, whatever it is
        else:
            cleanUrls.append(url)  # otherwise, append the URL unchanged
    print(cleanUrls)
    return cleanUrls

URLs = extractURLs("http://saiconference.com/ficc2018/submit>")
But if it's just one character, it is simpler to use .rstrip().
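If more than one kind of trailing character can show up, .rstrip() also accepts a set of characters, so the two ideas can be combined. The following is only a sketch under some assumptions: the names extract_urls, trailing and sample are hypothetical, the set of characters to strip is a guess at common trailing punctuation, and unlike the code above it does not lowercase the text, so the case of the URL path is preserved.

```python
import re

# Same pattern as in the answer above, written as a raw string.
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def extract_urls(text, trailing=">)]}.,;'\""):
    """Find URLs in text and strip any run of trailing punctuation.

    `trailing` is an assumed set of characters; adjust it to your data.
    """
    found = re.findall(URL_PATTERN, text)
    # rstrip() treats its argument as a set of characters, not a suffix.
    return [url.rstrip(trailing) for url in found]

sample = 'See <http://saiconference.com/ficc2018/submit>, or (https://example.com/a/b).'
print(extract_urls(sample))
```

The pattern picks up '>' in the first place because the range $-_ inside [$-_@.&+] covers every character between '$' and '_', which includes '>', ')' and ','. Note that stripping ')' will also cut URLs that legitimately end with a closing parenthesis, so the character set is a trade-off.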