Noobie Noobie - 1 year ago 92
Python Question

How to extract a headline from a URL?

I have a dataset of headlines, such as

I need to extract the proper headline from these kinds of links, that is:

  • this-is-a-very-nice-headline-my-friend

  • another-very-nice

  • hello-another-one-here

  • hello-one-here-that-is-cool

  • the-real-one

  • the-good-one

  • hello-world-here-is-a-weird-character

so the rule seems to find the longest string of the form
- that has a
at the right or left border and without considering

  1. words with more than 3 digits (for instance
    in the first link, or
    in the third one ,

  2. excluding stuff like

How can I do that using regex in Python? I believe regex is the only viable solution here unfortunately. Packages such as
can capture the path of the URL, but then I am back to using regex to get the headline.

Many thanks!

Jan Jan
Answer Source

After all, regular expressions might not be your best bet.
However, with the specifications you came up with, you could do the following:

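import re

# NOTE: the URL list is truncated in the source; fill in your own URLs here.
urls = ['']

# A run of word characters/dashes preceded by "/" and followed by one of . ? / # or the end
regex = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')
# Three or more consecutive digits, plus any adjacent dashes
digits = re.compile(r'-?\d{3,}-?')

for url in urls:
    substrings = regex.findall(url)
    if not substrings:
        continue  # nothing between slashes, skip this URL
    longest = max(substrings, key=len)
    headline = digits.sub('', longest)
    print(headline)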

This will print


See a demo on


Here, the regex uses lookarounds to look for a / behind and one of .?/# (or the end of the string) ahead. Any run of word characters and dashes in between is captured.
This is not very specific, but if you're looking for the longest substring and eliminate runs of three or more consecutive digits afterwards, it might be a good starting point.
As already said in the comments, you might perhaps be better off using linguistic tools.
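
To make the lookaround behaviour concrete, here is a minimal Python 3 sketch that wraps the two patterns in a function and prints the intermediate matches. The function name extract_headline and the example URL are made up for illustration; they are not from the original post, so treat this as a starting point rather than a finished solution.

```python
import re

# Same two patterns as in the answer.
SEGMENT = re.compile(r'(?<=/)([-\w]+)(?=[.?/#]|$)')
DIGITS = re.compile(r'-?\d{3,}-?')


def extract_headline(url):
    """Return the longest slash-delimited segment with long digit runs removed.

    Returns None when the URL contains no matching segment.
    (Hypothetical helper, not part of the original answer.)
    """
    segments = SEGMENT.findall(url)
    if not segments:
        return None
    longest = max(segments, key=len)
    return DIGITS.sub('', longest)


# Hypothetical URL, invented purely for this example.
url = 'http://www.example.com/news/12345-some-made-up-headline.html?id=7'

print(SEGMENT.findall(url))   # ['www', 'news', '12345-some-made-up-headline']
print(extract_headline(url))  # some-made-up-headline
```

Note that the host part (www) is matched too, because it also sits between a / and a dot; taking the longest match is what filters it out. Also be aware that a digit run in the middle of a segment is removed together with both neighbouring dashes, so a-1234-b becomes ab. If that matters for your data, you may need to replace the run with a single dash instead.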
