rassar rassar - 12 days ago 5
Python Question

Generate strings that always contain a certain letter in Python

I want to generate words from some letters plus one particular letter, and every generated word must contain that letter. I am generating a very large number of words, so it is very inefficient to write:

(word for word in self.getWords(letters, 8) if letter in word)


or something equivalent.

The getWords code:

import itertools

def getWords(self, iterable, maxDepth):
    allWords = []
    # Build every permutation of length 3 up to maxDepth.
    for depth in range(3, maxDepth + 1):
        for word in itertools.permutations(iterable, depth):
            allWords.append("".join(word))
    return allWords


I would like getWords to consider only words that contain letter. Is there a way to use itertools to achieve this?

Answer

First, generate the subset of words which contain the letter you want:

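def subset(char, words):
    # Keep only the words that contain char (compared case-insensitively).
    return {word for word in words if char in word.lower()}

# words is assumed to be any iterable of candidate strings.
bsub = subset("b", words)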

Then you can take a random sample of those words:

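import random
# Take 100 random words which contain the letter b (needs at least 100 candidates).
# random.sample requires a sequence; sets are rejected since Python 3.11.
result = random.sample(sorted(bsub), 100)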

Alternatively, by modifying getWords we can filter out words that don't contain the required letter:

def getWords(self, iterable, requiredLetter, maxDepth):
    allWords = set()
    for depth in range(3, maxDepth + 1):
        # permutations takes (iterable, r) and yields tuples of characters.
        for word in itertools.permutations(iterable, depth):
            if requiredLetter in word:
                allWords.add("".join(word))  # or "".join(word).lower() if it's case insensitive
    return allWords
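
The filter above still builds every permutation and then discards the ones without the letter. One way to avoid that is to pick the other letters with itertools.combinations and then permute each pick together with the required letter. This is a sketch, and it assumes the letters are distinct and the required letter appears exactly once in the pool. The name words_with_letter and its parameters are hypothetical, not part of the original code.

```python
import itertools

def words_with_letter(letters, required, min_len=3, max_len=8):
    """Yield each permutation of `letters` (lengths min_len..max_len) containing `required`.

    Assumption: `letters` holds distinct characters and includes `required`.
    """
    others = [c for c in letters if c != required]
    for depth in range(min_len, max_len + 1):
        # Choose the depth-1 companion letters first...
        for rest in itertools.combinations(others, depth - 1):
            # ...then arrange them together with the required letter.
            for perm in itertools.permutations(rest + (required,)):
                yield "".join(perm)

# Example: all 3- to 5-letter words over "abcde" that contain "b".
words = list(words_with_letter("abcde", "b", 3, 5))
```

For n distinct letters, only k/n of the length-k permutations contain a given letter, so this skips the (n-k)/n that the filter would throw away. If the pool has repeated letters, the sketch would need deduplication and a different handling of the required letter.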

It's also worth mentioning: if every word in allWords is unique, converting it to a set loses nothing and reduces each membership test from O(n) to O(1) on average.

Sets are faster because Python doesn't have to iterate through the entire collection to test membership. Strings are immutable and therefore hashable, so a set can look a word up by its hash in constant time on average.
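
To see the difference, here is a rough timing sketch using timeit. The word list is made up for illustration, and the numbers will vary by machine, so treat it as a demonstration rather than a benchmark.

```python
import timeit

# Hypothetical data: 100,000 distinct strings.
words_list = [f"w{i}" for i in range(100_000)]
words_set = set(words_list)
target = "w99999"  # last element, so the list scan has to walk the whole list

# Time 20 membership tests against each container.
list_time = timeit.timeit(lambda: target in words_list, number=20)
set_time = timeit.timeit(lambda: target in words_set, number=20)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The list lookup has to compare against every element before it finds the target, while the set lookup jumps straight to the right bucket.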

In your case, you aren't doing a membership test, so converting to a set won't give an appreciable speedup. Building a subset once and choosing from it will, because each generated word then needs no validity check.
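
As a small illustration of that last point, the sketch below contrasts re-testing on every draw with filtering once up front. The word list and the helper names sample_by_rejection and build_pool are made up for the example.

```python
import random

def sample_by_rejection(words, letter, rng):
    """Draw random words until one contains `letter` (tests validity on every draw)."""
    while True:
        word = rng.choice(words)
        if letter in word:
            return word

def build_pool(words, letter):
    """Filter once; every later draw from the pool is valid by construction."""
    return [w for w in words if letter in w]

words = ["apple", "berry", "cherry", "banana", "grape", "kiwi"]
rng = random.Random(0)  # seeded only to make the example repeatable

pool = build_pool(words, "b")
picks = [rng.choice(pool) for _ in range(5)]  # no per-draw check needed
```

The rejection version loops forever if no word contains the letter, which is one more reason to build the pool first and check that it is non-empty.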