Dance Party2 - 16 days ago
Python Question

Pandas N-Grams to Columns

Given the following data frame:

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
      strings
0       Hello
1  Helloworld


I'd like to split each string into chunks of 3 characters and use those chunks as column headers to create a matrix of 1s and 0s, depending on whether a given row contains that chunk.

Like this:

      strings  Hel  low  orl
0       Hello    1    0    0
1  Helloworld    1    1    1


Notice that the string "Hello" has a 0 in the "low" column, because a 1 is assigned only when the chunk is one of the string's own exact 3-character chunks. If a chunk occurs more than once (e.g. if the string were "HelHel"), it should still get only a 1, though it would also be nice to know how to count the occurrences and assign a 2 instead.
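To make the difference between a 1/0 flag and a count concrete, here is a minimal sketch using only the standard library. The helper name `chunk_counts` and its `size` parameter are made up for this illustration and are not part of the question's code.

```python
from collections import Counter

def chunk_counts(s, size=3):
    """Count non-overlapping chunks of exactly `size` characters (hypothetical helper)."""
    # Step through the string in strides of `size`; a short tail is dropped.
    chunks = [s[i:i + size] for i in range(0, len(s) - size + 1, size)]
    return Counter(chunks)

counts = chunk_counts("HelHel")          # how often each chunk occurs
flags = {chunk: 1 for chunk in counts}   # presence only: 1 per distinct chunk
```

Here `counts` holds the number of occurrences per chunk, while `flags` collapses every chunk that occurs at least once to 1.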

Ultimately, I'm trying to prepare my data for use in an LSHForest via scikit-learn.
Therefore, I anticipate many different string values.

Here's what I've tried so far:

#Split into chunks of exactly 3
def split(s, chunk_size):
    a = zip(*[s[i::chunk_size] for i in range(chunk_size)])
    return [''.join(t) for t in a]
cols=[split(s,3) for s in f['strings']]
cols

[['Hel'], ['Hel', 'low', 'orl']]

#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(cols))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq

['orl', 'Hel', 'low']


So now, all I need to do is create a column in f for each element in colsunq and add 1 if the string in the 'strings' column has a match with the chunk for each given column header.
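One possible way to finish that last step, as a rough sketch rather than a definitive solution: keep each row's chunk list around and test membership per column. The `split` below is a simplified slicing variant of the function above, and the columns are sorted only to make the order predictable.

```python
import pandas as pd

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

def split(s, chunk_size):
    # Non-overlapping chunks of exactly `chunk_size` characters; a short tail is dropped.
    return [s[i:i + chunk_size]
            for i in range(0, len(s) - chunk_size + 1, chunk_size)]

# One list of chunks per row.
chunks = f['strings'].apply(split, chunk_size=3)

# Unique chunks across all rows, sorted so the column order is stable.
colsunq = sorted(set(c for row in chunks for c in row))

# One 0/1 column per chunk: 1 if the chunk is among that row's own chunks.
for col in colsunq:
    f[col] = chunks.apply(lambda row, col=col: int(col in row))
```

Checking membership in the row's chunk list, rather than using a substring test on the raw string, keeps the "exact chunk" behaviour described in the question. With many distinct chunks, adding columns one at a time may be slow, so building the whole matrix in one step is probably preferable for large inputs.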

Thanks in advance!

Note:
In case shingling is preferred:

#Shingle into strings of exactly 3
def shingle(word):
    # Every overlapping window of exactly 3 characters
    return [word[i:i + 3] for i in range(len(word) - 3 + 1)]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
a=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(a))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']
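If shingling is used, the indicator matrix could be built in one step with pandas string methods. This is a sketch that assumes the `|` character never appears in the strings, since it is used as a temporary separator; the `size` parameter is added here for illustration.

```python
import pandas as pd

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

def shingle(word, size=3):
    # Every overlapping window of exactly `size` characters.
    return [word[i:i + size] for i in range(len(word) - size + 1)]

shingles = f['strings'].apply(shingle)

# Join each row's shingles with '|' and let pandas expand them into
# one 0/1 indicator column per distinct shingle.
# Assumption: '|' does not occur in the data.
matrix = shingles.str.join('|').str.get_dummies(sep='|')
```

`matrix` has one row per input string and one column per distinct shingle; `f.join(matrix)` would attach it to the original frame.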

Answer
def str_chunk(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = j, j + k

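def chunkit(s, k):
    # Materialize the generator as a list of chunks
    return list(str_chunk(s, k))

def count_chunks(s, k):
    # The top-level pd.value_counts is deprecated in recent pandas;
    # Series.value_counts gives the same counts
    return pd.Series(chunkit(s, k), dtype=object).value_counts()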

demonstration

f.strings.apply(chunkit, k=3)

0              [Hel]
1    [Hel, low, orl]
Name: strings, dtype: object

f.strings.apply(count_chunks, k=3).fillna(0)
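The `fillna(0)` result contains floats. As a sketch of one way to get integer counts, a 1/0 version, and the original strings side by side (the names `counts`, `binary`, and `result` are made up for this example, and `chunkit` is written with plain slicing to keep the block self-contained):

```python
import pandas as pd

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

def chunkit(s, k):
    # Non-overlapping chunks of exactly k characters.
    return [s[i:i + k] for i in range(0, len(s) - k + 1, k)]

def count_chunks(s, k):
    return pd.Series(chunkit(s, k), dtype=object).value_counts()

# Counts per chunk; chunks missing from a row become 0 and are cast to int.
counts = f['strings'].apply(count_chunks, k=3).fillna(0).astype(int)

# Cap every count at 1 to get the presence/absence matrix.
binary = counts.clip(upper=1)

# Attach the indicator columns to the original frame.
result = f.join(binary)
```

Use `counts` when repeated chunks should be tallied (a string like "HelHel" would get a 2 under "Hel") and `binary` when only presence matters.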

