Shubhangi Sharma Shubhangi Sharma - 6 months ago 26
R Question

Count comma separated unique values in a string

The first two columns of dataframe make a composite key and there's a column of type char which contains comma separated integers. My objective is to make a column which contains the count of unique integers in the string.
I know the approach of converting string to columns using str_split_fixed and then counting the unique values but due to the length of string a large number of columns are added and everything lags. Is there any other method?
The actual data set contains 500k rows and 53 columns.
Sample dataset :

df

c1 c2 c3
aa 11 1,13,4,5,4,7,9
bb 22 2,5,2,4,5,7,11,
cc 33 11,14,3,1,
dd 44 1,1,2,4,5,6,15,
ee 55 4,3,3,1,14,17,


desired output:

c1 c2 c3 c4
------ | ------ | ------ | -----
aa | 11 | 1,13,4,5,4,7,9 | 6
------ | ------ | ------ | -----
bb | 22 | 2,5,2,4,5,7,11, | 5
------ | ------ | ------ | -----
cc | 33 | 11,14,3,1, | 4
------ | ------ | ------ | -----
dd | 44 | 1,1,2,4,5,6,15, | 6
------ | ------ | ------ | -----
ee | 55 | 4,3,3,1,7,17,7, | 5
------ | ------ | ------ | -----


Any help would be appreciated!

Answer Source

We can use stri_extract to extract all the numbers, then loop through the list, find the length of unique elements

library(stringi)
df1$Count <- sapply(stri_extract_all_regex(df1$col3, "[0-9]+"), 
                     function(x) length(unique(x)))