
# Why is sorted(s1) == sorted(s2) better to use than Counter(s1) == Counter(s2)?

Is this considered an efficient way to check whether two strings are permutations of each other? What would be the difference between using two dictionaries instead? Is `Counter` more optimized in Python?

I guess I should clarify the reason for posting this: a comment on Check if two unordered lists are equal mentions that it's more efficient to check whether two strings are the same by using `sorted(s1) == sorted(s2)`. Since that is O(n log n), you would assume `Counter`, which is O(n), would have a better run time.
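
For concreteness, here is a minimal sketch of the three approaches I mean (the function names are mine, and the dict version is essentially a hand-rolled `Counter`):

``````
from collections import Counter

def perm_sorted(s1, s2):
    # O(n log n): permutations of each other sort to the same sequence.
    return sorted(s1) == sorted(s2)

def perm_counter(s1, s2):
    # O(n): permutations have equal character multisets.
    return Counter(s1) == Counter(s2)

def perm_dicts(s1, s2):
    # O(n): the same idea with two plain dicts.
    d1, d2 = {}, {}
    for c in s1:
        d1[c] = d1.get(c, 0) + 1
    for c in s2:
        d2[c] = d2.get(c, 0) + 1
    return d1 == d2
``````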

In the other question you referenced, J.F. Sebastian theorized that using `sort` would be faster than using `Counter` in nearly all cases, but he wisely noted that if you really care you should measure: "In practice it might always be faster then `collections.Counter()` (despite asymptotically `O(n)` time being better then `O(n*log(n))` for `.sort()`). Measure it, if it is important."

Such a measurement is shown below. The high points:

• Up to list sizes of roughly `10^6`, `sort` was faster.

• These experiments use lists that have many distinct values. The results might differ if the lists had a moderately high percentage of repeated values.

• These measurements suggest that the overhead of creating a `Counter` is generally larger than the cost of just sorting the data in place. One way to think about this is to weigh the overhead of the `Counter` creation against the `log(n)` portion of the `sort` approach: not until `n` gets fairly large does the algorithmic advantage of the `Counter` begin to pay off enough to justify that overhead (a back-of-envelope version of this argument is sketched after this list).

• But remember: for small data sizes, the speed advantage might be irrelevant unless you are repeating the operation at high volume. In this light, the "algorithmically correct" approach (use a `Counter`) might still be better from a code design point of view for most use cases: it's faster when the data is large and fast enough when the data is small. A minimal helper along those lines is sketched after the benchmark output.
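
To make that overhead-versus-`log(n)` point concrete, here is a back-of-envelope sketch. The cost model is an assumption, not a measurement: if sorting costs roughly `a * n * log2(n)` and building two `Counter`s costs roughly `b * n`, the two break even where `log2(n) == b / a`. The output below puts the measured crossover somewhere between `10^6` and `10^7`:

``````
import math

# Hypothetical cost model (a and b are unknown machine-dependent
# constants, not measured values):
#   sort:    a * n * log2(n)
#   Counter: b * n
# They break even where log2(n) == b / a, so a crossover near
# n = 10**7 would imply b / a is roughly log2(10**7).
print(math.log2(10**6))   # ~19.9
print(math.log2(10**7))   # ~23.3
``````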

The code:

``````
import random
import timeit
from collections import Counter

def prep():
    # Re-shuffle both lists so each trial starts from unsorted data.
    random.shuffle(xs)
    random.shuffle(ys)

def use_counter():
    prep()
    return Counter(xs) == Counter(ys)

def use_sort():
    prep()
    xs.sort()
    ys.sort()
    return xs == ys

# Each experiment: (timeit repeat count, list size, timeit number).
experiments = [
    (3, 10, 100000),
    (3, 100, 10000),
    (3, 1000, 1000),
    (3, 10000, 100),
    (3, 100000, 10),
    (3, 1000000, 1),
    (1, 10000000, 1),
]

for e in experiments:
    repeat, list_size, timeit_n = e
    xs = list(range(list_size))
    ys = list(range(list_size))
    r1 = timeit.repeat(use_counter, repeat=repeat, number=timeit_n)
    r2 = timeit.repeat(use_sort, repeat=repeat, number=timeit_n)
    print()
    print(e)
    print('total  ', sum(r1), sum(r2))
    print('m1/m2  ', min(r1) / min(r2))
``````

Example output (ratio > 1 means `Counter` is slower):

``````
(3, 10, 100000)
total   5.06751918793 2.15432405472
m1/m2   2.34850470872

(3, 100, 10000)
total   3.16299915314 2.06651735306
m1/m2   1.52879303981

(3, 1000, 1000)
total   3.017786026 2.42989587784
m1/m2   1.24086325316

(3, 10000, 100)
total   3.06426525116 2.74061489105
m1/m2   1.11802891855

(3, 100000, 10)
total   3.66198205948 3.35467290878
m1/m2   1.1028159291

(3, 1000000, 1)
total   5.19361901283 5.08777713776
m1/m2   1.03125948765

(1, 10000000, 1)
total   20.0118789673 24.6061840057
m1/m2   0.813286569044
``````
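
Putting that design point into code: a minimal `Counter`-based helper. The name and the early length check are my own additions, not part of the benchmark above:

``````
from collections import Counter

def is_permutation(s1, s2):
    # Strings of different lengths can never be permutations; this
    # cheap check avoids building the Counters at all in that case.
    if len(s1) != len(s2):
        return False
    return Counter(s1) == Counter(s2)

print(is_permutation('listen', 'silent'))   # True
print(is_permutation('listen', 'silentx'))  # False
``````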