Héctor Héctor - 3 months ago 47
R Question

tidyr::separate does'nt work with large data.frame

I have a database with the following structure.I want to separate the last column, a part number and the other with the description, separating by "-". In the example it works ok. But when I use this in my database (50,700 cases) I get an error message.

rut<-c("50001780", "50001810", "50001820",
"50001850", "50001890", "50001940")
econ.activ<-c("552010 - RESTAURANTES",
"930990 - OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.",
"523911 - COMERCIO AL POR MENOR DE ARTICULOS FOTOGRAFICOS",
"930990 - OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.",
"930990 - OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.",
"930990 - OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.")

df<-data.frame(rut,econ.activ)

df %>% separate(col = "econ.activ", into = c("folio", "descripción"), sep = "\\-")


This is the error message

Warning message:
Too many values at 174 locations: 3645, 4401, 5118, 10013, 10018, 10023, 10905, 10907, 10921, 10923, 10928, 10930, 10935, 10937, 10942, 10944, 11586, 13556, 13557, 13614, ...


Thank's in advance.

Answer

There could be more than one - in some rows and that may be the reason for the warning. One option is cSplit from splitstackshape. It will split the 'econ.activ' into 'n' number of columns based on the maximum number (i.e. 'n') of '-' in a particular row. By default, it will pad NA to columns if there are less number of '-' in certain rows.

library(splitstackshape)
cSplit(df, "econ.activ", sep="-")

If we need to reproduce the warning, just insert another - in one of the elements

df$econ.activ <- as.character(df$econ.activ)
df$econ.activ[3] <- "930990 - OTRAS - SEP"
df %>% 
   separate(col = "econ.activ", into = c("folio", "descripción"), sep = "\\-")
---
---

Warning message: Too many values at 1 locations: 3

It is because the separate have an argument extra which is by default warn. There are other options like drop or merge. If we use the drop it remove the substring from the second - onwards

df %>% 
  separate(col = "econ.activ", into = c("folio", "descripción"), sep = "\\-", extra="drop")
#       rut   folio                                       descripción
#1 50001780 552010                                       RESTAURANTES
#2 50001810 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#3 50001820 930990                                             OTRAS 
#4 50001850 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#5 50001890 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#6 50001940 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.

and with extra = "merge", it retains that substring in the "descripción" column. So, basically, it depends on what the OP wants as output

df %>% 
    separate(col = "econ.activ", into = c("folio", "descripción"), 
                 sep = "\\-", extra="merge")
#      rut   folio                                       descripción
#1 50001780 552010                                       RESTAURANTES
#2 50001810 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#3 50001820 930990                                        OTRAS - SEP
#4 50001850 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#5 50001890 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.
#6 50001940 930990   OTRAS ACTIVIDADES DE SERVICIOS PERSONALES N.C.P.