tadm123 tadm123 - 25 days ago 8
Python Question

Matching words with Regex (Python 3)

I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.

Problem:

Given the following input strings, find all possible output words 5 characters or longer.


  1. qwertyuytresdftyuioknn

  2. gijakjthoijerjidsdfnokg



Your program should find all possible words (5+ characters) that can be derived from the strings supplied.
Use http://norvig.com/ngrams/enable1.txt as your search dictionary.
The order of the output words doesn't matter.


  1. queen question

  2. gaeing garring gathering gating geeing gieing going
    goring



Assumptions about the input strings:


  • QWERTY keyboard

  • Lowercase a-z only, no whitespace or punctuation

  • The first and last characters of the input string will always match
    the first and last characters of the desired output word.

  • Don't assume users take the most efficient path between letters

  • Every letter of the output word will appear in the input string



Attempted solution:

First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):

import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
fp.write(chunk)
fp.close()


I'm then trying to match the words I need using regex. The problem is that I don't know how to format my
re.compile()
to achieve this.

import re
input = 'qwertyuytresdftyuioknn' #example
fp= open('words.txt')
string = fp.read()

regex = re.compile(input[0]+'\w{3,}'+input[-1]) #wrong need help here
regex.findall(string)


As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with
\w{3,}
. Any help into this would be greatly appreciated.

Answer

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.

Addition:

If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.