Haxet - 1 year ago 53

Python Question

Is this considered an efficient way to check if two strings are permutations of eachother? What would be the difference between using two dictionaries instead? Is Counter more optimized for python?

I guess I should clarify, the reason for me posting this is this comment : Check if two unordered lists are equal mentions that it's more efficient to check if two strings are the same by using sort(s1)==sort(s2).Since this is o(nlog(n)), you would assume Counter would have a better run time.

Answer Source

In the other question you referenced, J.F. Sebastian theorized that using `sort`

would be faster than using `Counter`

in nearly all cases, but he wisely noted that if you really care you should measure: "In practice it might always be faster then `collections.Counter()`

(despite asymptotically `O(n)`

time being better then `O(n*log(n))`

for `.sort()`

). Measure it, if it is important."

Such a measurement is shown below. The high points:

Up to list sizes of roughly

`10^6`

,`sort`

was faster.That advantage steadily dwindles as the list size grows.

These experiments use lists that have many discrete values. The benchmarks might be different if the lists had a moderately high percentage of repeated values.

These measurements suggest that the overhead of creating a

`Counter`

is generally larger than the cost of just sorting the data in place. One way to think about this is to view the overhead of the`Counter`

creation relative to the`log(n)`

portion of the`sort`

approach. Not until`n`

gets fairly large does the algorithmic advantage of the`Counter`

data structure begin to pay off enough to justify that overhead.**But remember**: for small data sizes, the speed advantage might be irrelevant -- unless you are repeating the operation at high volume. In this light, the "algorithmically correct" approach (use a`Counter`

) might still be better from a code design point of view for most use cases -- because it's faster if the data is large and it's fast-enough if the data is small.

The code:

```
import timeit, sys, random
from collections import Counter
def prep():
random.shuffle(xs)
random.shuffle(ys)
def use_counter():
prep()
return Counter(xs) == Counter(ys)
def use_sort():
prep()
xs.sort()
ys.sort()
return xs == ys
experiments = [
(3, 10, 100000),
(3, 100, 10000),
(3, 1000, 1000),
(3, 10000, 100),
(3, 100000, 10),
(3, 1000000, 1),
(1, 10000000, 1),
]
for e in experiments:
repeat, list_size, timeit_n = e
xs = list(range(list_size))
ys = list(range(list_size))
r1 = timeit.repeat(use_counter, repeat = repeat, number = timeit_n)
r2 = timeit.repeat(use_sort, repeat = repeat, number = timeit_n)
print
print e
print 'total ', sum(r1), sum(r2)
print 'm1/m2 ', min(r1) / min(r2)
```

Example output (ratio > 1 means `Counter`

is slower):

```
(3, 10, 100000)
total 5.06751918793 2.15432405472
m1/m2 2.34850470872
(3, 100, 10000)
total 3.16299915314 2.06651735306
m1/m2 1.52879303981
(3, 1000, 1000)
total 3.017786026 2.42989587784
m1/m2 1.24086325316
(3, 10000, 100)
total 3.06426525116 2.74061489105
m1/m2 1.11802891855
(3, 100000, 10)
total 3.66198205948 3.35467290878
m1/m2 1.1028159291
(3, 1000000, 1)
total 5.19361901283 5.08777713776
m1/m2 1.03125948765
(1, 10000000, 1)
total 20.0118789673 24.6061840057
m1/m2 0.813286569044
```