tadm123 tadm123 - 10 months ago 51
Python Question

Matching words with Regex (Python 3)

I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.


Given the following input strings, find all possible output words 5 characters or longer.

  1. qwertyuytresdftyuioknn

  2. gijakjthoijerjidsdfnokg

Your program should find all possible words (5+ characters) that can be derived from the strings supplied.
Use http://norvig.com/ngrams/enable1.txt as your search dictionary.
The order of the output words doesn't matter.

  1. queen question

  2. gaeing garring gathering gating geeing gieing going

Assumptions about the input strings:

  • QWERTY keyboard

  • Lowercase a-z only, no whitespace or punctuation

  • The first and last characters of the input string will always match
    the first and last characters of the desired output word.

  • Don't assume users take the most efficient path between letters

  • Every letter of the output word will appear in the input string

Attempted solution:

First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):

import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):

I'm then trying to match the words I need using regex. The problem is that I don't know how to format my
to achieve this.

import re
input = 'qwertyuytresdftyuioknn' #example
fp= open('words.txt')
string = fp.read()

regex = re.compile(input[0]+'\w{3,}'+input[-1]) #wrong need help here

As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with
. Any help into this would be greatly appreciated.

Answer Source

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.


If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.