Haxet Haxet - 7 days ago 5
Python Question

Why is Sort(s1)==Sort(s2) better to use than Counter(s1)==Counter(s2)

Is this considered an efficient way to check if two strings are permutations of eachother? What would be the difference between using two dictionaries instead? Is Counter more optimized for python?

I guess I should clarify, the reason for me posting this is this comment : Check if two unordered lists are equal mentions that it's more efficient to check if two strings are the same by using sort(s1)==sort(s2).Since this is o(nlog(n)), you would assume Counter would have a better run time.

FMc FMc
Answer

In the other question you referenced, J.F. Sebastian theorized that using sort would be faster than using Counter in nearly all cases, but he wisely noted that if you really care you should measure: "In practice it might always be faster then collections.Counter() (despite asymptotically O(n) time being better then O(n*log(n)) for .sort()). Measure it, if it is important."

Such a measurement is shown below. The high points:

  • Up to list sizes of roughly 10^6, sort was faster.

  • That advantage steadily dwindles as the list size grows.

  • These experiments use lists that have many discrete values. The benchmarks might be different if the lists had a moderately high percentage of repeated values.

  • These measurements suggest that the overhead of creating a Counter is generally larger than the cost of just sorting the data in place. One way to think about this is to view the overhead of the Counter creation relative to the log(n) portion of the sort approach. Not until n gets fairly large does the algorithmic advantage of the Counter data structure begin to pay off enough to justify that overhead.

  • But remember: for small data sizes, the speed advantage might be irrelevant -- unless you are repeating the operation at high volume. In this light, the "algorithmically correct" approach (use a Counter) might still be better from a code design point of view for most use cases -- because it's faster if the data is large and it's fast-enough if the data is small.

The code:

import timeit, sys, random
from collections import Counter

def prep():
    random.shuffle(xs)
    random.shuffle(ys)

def use_counter():
    prep()
    return Counter(xs) == Counter(ys)

def use_sort():
    prep()
    xs.sort()
    ys.sort()
    return xs == ys

experiments = [
    (3, 10, 100000),
    (3, 100, 10000),
    (3, 1000, 1000),
    (3, 10000, 100),
    (3, 100000, 10),
    (3, 1000000, 1),
    (1, 10000000, 1),
]

for e in experiments:
    repeat, list_size, timeit_n = e
    xs = list(range(list_size))
    ys = list(range(list_size))
    r1 = timeit.repeat(use_counter, repeat = repeat, number = timeit_n)
    r2 = timeit.repeat(use_sort,    repeat = repeat, number = timeit_n)
    print
    print e
    print 'total  ', sum(r1), sum(r2)
    print 'm1/m2  ', min(r1) / min(r2)

Example output (ratio > 1 means Counter is slower):

(3, 10, 100000)
total   5.06751918793 2.15432405472
m1/m2   2.34850470872

(3, 100, 10000)
total   3.16299915314 2.06651735306
m1/m2   1.52879303981

(3, 1000, 1000)
total   3.017786026 2.42989587784
m1/m2   1.24086325316

(3, 10000, 100)
total   3.06426525116 2.74061489105
m1/m2   1.11802891855

(3, 100000, 10)
total   3.66198205948 3.35467290878
m1/m2   1.1028159291

(3, 1000000, 1)
total   5.19361901283 5.08777713776
m1/m2   1.03125948765

(1, 10000000, 1)
total   20.0118789673 24.6061840057
m1/m2   0.813286569044
Comments