GBR24 - 2 months ago
Python Question

Sorting words in a list of strings based on their relative frequencies, not regular sorting?

Suppose I have a pandas.Series object:

import pandas as pd

s = pd.Series(["hello there you would like to sort me",
"sorted i would like to be", "the banana does not taste like the orange",
"my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum",
"orange is my favorite color"])


I want to sort the words inside each row based on the frequency with which each word occurs in the entire Series.

I can create a dictionary of the word: frequency key-value pairs easily:

from collections import Counter

def create_word_freq_dict(series):
    return Counter(word for row in series for word in row.lower().split())

word_counts = create_word_freq_dict(s)


Without procedurally going through each row in the Series, how can I sort the words in each row by their relative frequencies? That is to say, for example, "hello" occurs more frequently than "friend," and so should be further to the left in the resultant "sorted" string.

This is what I have:

for row in s:
    ordered_words = []
    words = row.split()
    if len(words) == 1:
        ordered_words.append(words[0])
    else:
        i = 1
        prevWord = words[0]
        prevWord_freq = word_counts[prevWord]
        while i < len(words):
            currWord = words[i]
            currWord_freq = word_counts[currWord]
            if currWord_freq > prevWord_freq:
                prevWord = currWord
                prevWord_freq = currWord_freq
                words.append(currWord)
            ...


It's not complete yet, but is there a better way (as opposed to recursion) of sorting in this manner?

Answer

All you have to do is sort each row's words with a custom ordering based on your counter:

s = ["hello there you would like to sort me", 
    "sorted i would like to be", "the banana does not taste like the orange", 
    "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum", 
    "orange is my favorite color"]


from collections import Counter

def create_word_freq_dict(series):
    return Counter(word for row in series for word in row.lower().split())

word_counts = create_word_freq_dict(s)

for row in s:
    # Negate the count so the most frequent words come first
    print(sorted(row.lower().split(), key=lambda word: -word_counts[word]))

All this does is call sorted with a custom key function, which ignores the word itself and instead uses the word_counts mapping to decide which word comes first. Because sorted is stable, words with the same count keep their original order.

This prints:

['hello', 'like', 'there', 'would', 'to', 'you', 'sort', 'me']
['like', 'would', 'to', 'sorted', 'i', 'be']
['like', 'orange', 'the', 'banana', 'the', 'does', 'not', 'taste']
['hello', 'my', 'friend', 'said']
['hello', 'there', 'amigo']
['orange', 'apple', 'apple', 'banana', 'peach', 'pear', 'plum']
['orange', 'my', 'is', 'favorite', 'color']
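
The same ordering can also be expressed as an explicit two-argument comparator. A minimal sketch, assuming Python 3, where a comparison function is handed to sorted through functools.cmp_to_key; the function name by_descending_count is made up for illustration:

```python
from collections import Counter
from functools import cmp_to_key

s = ["hello there you would like to sort me",
     "sorted i would like to be", "the banana does not taste like the orange",
     "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum",
     "orange is my favorite color"]

word_counts = Counter(word for row in s for word in row.lower().split())

def by_descending_count(x, y):
    # A negative result puts x before y, so the more frequent word comes first.
    return word_counts[y] - word_counts[x]

# cmp_to_key turns the two-argument comparator into a key function for sorted.
result = sorted("my friend said hello".split(), key=cmp_to_key(by_descending_count))
print(result)
```

For a simple count-based ordering a plain key function is usually shorter, but the comparator form may be handy if the ordering rule grows more involved.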

To check that it really sorts by frequency, pair each word with its count:

for row in s:
    sorted_row = sorted(row.lower().split(), key=lambda word: -word_counts[word])
    # Pair each word with its count to make the ordering visible
    print(list(zip(sorted_row, [word_counts[word] for word in sorted_row])))

produces

[('hello', 3), ('like', 3), ('there', 2), ('would', 2), ('to', 2), ('you', 1), ('sort', 1), ('me', 1)]
[('like', 3), ('would', 2), ('to', 2), ('sorted', 1), ('i', 1), ('be', 1)]
[('like', 3), ('orange', 3), ('the', 2), ('banana', 2), ('the', 2), ('does', 1), ('not', 1), ('taste', 1)]
[('hello', 3), ('my', 2), ('friend', 1), ('said', 1)]
[('hello', 3), ('there', 2), ('amigo', 1)]
[('orange', 3), ('apple', 2), ('apple', 2), ('banana', 2), ('peach', 1), ('pear', 1), ('plum', 1)]
[('orange', 3), ('my', 2), ('is', 1), ('favorite', 1), ('color', 1)]
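
Since the question asks for sorted strings rather than lists, the words can be joined back together after sorting. A rough sketch, assuming Python 3 and a plain list standing in for the Series; sort_row is a hypothetical helper name, not something from pandas:

```python
from collections import Counter

rows = ["hello there you would like to sort me",
        "sorted i would like to be", "the banana does not taste like the orange",
        "my friend said hello", "hello there amigo", "apple apple banana orange peach pear plum",
        "orange is my favorite color"]

word_counts = Counter(word for row in rows for word in row.lower().split())

def sort_row(row, counts):
    """Return the row's words as one string, most frequent word first."""
    words = row.lower().split()
    # sorted() is stable, so words with equal counts keep their original order.
    return " ".join(sorted(words, key=lambda word: -counts[word]))

sorted_rows = [sort_row(row, word_counts) for row in rows]
for line in sorted_rows:
    print(line)
```

With the original pandas.Series, the same helper could be applied per row with s.apply(lambda row: sort_row(row, word_counts)). That still visits every row internally, but it keeps the loop out of your own code and returns a new Series of sorted strings.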