ilalex - 4 months ago 21x
Python Question

# Fastest method to generate big random string with lower Latin letters

I'm trying to solve this problem from Timus Online Judge. To solve this problem you need generate a sequence of 1 000 000 lowercase Latin letters and write it to stdin in 1 second.

It is easy to solve this problem with C++ or Java. I have python solution here:

``````import os
from random import randint

s = ''.join(chr(97 + randint(0, 25)) for i in range(1000000))
os.write(1, bytes(s, 'utf8'))
``````

It takes 1.7s:

``````\$ time python3.3 1219.py > /dev/null

real    0m1.756s
user    0m1.744s
sys     0m0.008s
``````

And I got "Time limit exceeded" in result. So the question is "How to do it faster?"

UPD1:
Using
`randint(97, 122)`
reduces time at 16ms. Now it is 1.740s

UPD2:
Solution by @Martijn Pieters takes 0.979s, but it doesn't pass test either.

UPD3
Martijn Pieters suggested a very good solutions, but it's still slow:

``````from sys import stdin
from random import choice
from string import ascii_lowercase

s = ''.join([choice(ascii_lowercase) for _ in range(1000000)])
stdout.write(s)
``````

Takes 0.924s

``````from sys import stdout
from random import choice
from string import ascii_lowercase

for _ in range(1000000):
stdout.write(choice(ascii_lowercase))
``````

Takes 1.173s

``````from sys import stdout
from random import choice
from string import ascii_lowercase
bal = [c.encode('ascii') for c in ascii_lowercase]
out = stdout.buffer

for _ in range(1000000):
out.write(choice(bal))
``````

Takes 1.155s

``````from sys import stdout
from random import choice
from string import ascii_lowercase

bal = [c.encode('ascii') for c in ascii_lowercase]
stdout.buffer.write(b''.join([choice(bal) for _ in range(1000000)]))
``````

Takes 0.901s

UPD4

Some guy just solved problem on Timus. I hope he will share his solution :)

UPD5
Thanks to Ashwini Chaudhary for sharing his Python 2.x solution with us:

``````from random import choice
from string import ascii_lowercase
lis=list(ascii_lowercase)
print ''.join(choice(lis) for _ in xrange(1000000))
``````

It takes 0.527s on my computer and it passes tests on Timus. But problem with Python3.x still remains.

UPD6
Thanks to Markku K. this code:

``````import os
from random import random
from string import ascii_lowercase

bal = [c.encode('ascii') for c in ascii_lowercase]
os.write(1, b''.join([bal[int(random() * 26)] for _ in range(1000000)]))
``````

Takes 0.445s, but still didn't pass the test

Here's Python 3 code that generates 1000000 "random" lowercase letters in `0.28` seconds (see also `0.11`-seconds solution at the end; @Ashwini Chaudhary's code from the question takes `0.55` seconds on my machine, @Markku K.'s code -- `0.53`):

``````#!/usr/bin/env python3
import os
import sys

def write_random_lowercase(n):
min_lc = ord(b'a')
len_lc = 26
ba = bytearray(os.urandom(n))
for i, b in enumerate(ba):
ba[i] = min_lc + b % len_lc # convert 0..255 to 97..122
sys.stdout.buffer.write(ba)

write_random_lowercase(1000000)
``````

`% len_lc` skews the distribution (see at the end on how to fix it) though It still satisfies the conditions (ascii, lowercase, frequencies of 1, 2, 3 letter sequences):

``````\$ python3 generate-random.py | python3 check-seq.py
``````

where `check-seq.py`:

``````#!/usr/bin/env python3
import sys
from collections import Counter
from string import ascii_lowercase

def main():
limits = [40000, 2000, 100]

s = sys.stdin.buffer.readline() # a single line
assert 1000000 <= len(s) <= 1000002 # check length +/- newline
s.decode('ascii','strict') # check ascii
assert set(s) == set(ascii_lowercase.encode('ascii')) # check lowercase

for n, lim in enumerate(limits, start=1):
freq = Counter(tuple(s[i:i+n]) for i in range(len(s)))
assert max(freq.values()) <= lim, freq

main()
``````

Note: on acm.timus.ru `generate-random.py` gives "Output limit exceeded".

To improve performance, you could use `bytes.translate()` method (`0.11` seconds):

``````#!/usr/bin/env python3
import os
import sys

# make translation table from 0..255 to 97..122
tbl = bytes.maketrans(bytearray(range(256)),
bytearray([ord(b'a') + b % 26 for b in range(256)]))
# generate random bytes and translate them to lowercase ascii
sys.stdout.buffer.write(os.urandom(1000000).translate(tbl))
``````

## How to fix `% len_lc` skew

`256` (number of bytes) is not evenly divisible by `26` (number of lower Latin letters) therefore the formula `min_lc + b % len_lc` makes some values appear less often than others e.g.:

``````#!/usr/bin/env python3
"""Find out skew: x = 97 + y % 26 where y is uniform from [0, 256) range."""
from collections import Counter, defaultdict

def find_skew(random_bytes):
char2freq = Counter(chr(ord(b'a') + b % 26) for b in random_bytes)
freq2char = defaultdict(set)
for char, freq in char2freq.items():
return {f: ''.join(sorted(c)) for f, c in freq2char.items()}

print(find_skew(range(256)))
# -> {9: 'wxyz', 10: 'abcdefghijklmnopqrstuv'}
``````

Here, the input `range(256)` is uniformly distributed (each byte occurs exactly once) but `'wxyz'` letters in the output are less often then the rest `9` vs. `10` occurrences. To fix it, unaligned bytes could be dropped:

``````print(find_skew(range(256 - (256 % 26))))
# -> {9: 'abcdefghijklmnopqrstuvwxyz'}
``````

Here, the input is uniformly distributed bytes in the range `[0, 234)` the output is uniformly distributed ascii lowercase letters.

`bytes.translate()` accepts the second argument to specify bytes to delete:

``````#!/usr/bin/env python3
import os
import sys

nbytes = 256
nletters = 26
naligned = nbytes - (nbytes % nletters)
tbl = bytes.maketrans(bytearray(range(naligned)),
bytearray([ord(b'a') + b % nletters
for b in range(naligned)]))
bytes2delete = bytearray(range(naligned, nbytes))
R = lambda n: os.urandom(n).translate(tbl, bytes2delete)

def write_random_ascii_lowercase_letters(write, n):
"""*write* *n* random ascii lowercase letters."""
while n > 0:
# R(n) expected to drop `(nbytes - nletters) / nbytes` bytes
# to compensate, increase the initial size
n -= write(memoryview(R(n * nbytes // naligned + 1))[:n])

write = sys.stdout.buffer.write
write_random_ascii_lowercase_letters(write, 1000000)
``````

If the random generator (`os.urandom` here) produces long sequences of the bytes that are outside of the aligned range (`>=234`) then the `while` loop may execute many times.

The time performance can be improved another order of maginute if `random.getrandbits(8*n).to_bytes(n, 'big')` is used instead of `os.urandom(n)`. The former uses Mersenne Twister as the core generator that may be faster than `os.urandom()` that uses sources provided by the operating system. The latter is more secure if you use the random string for secrets.