AlamoSoul AlamoSoul - 1 month ago 6
R Question

How to separate data into multiple columns with strange separator using separate()?

Hey so I have a tibble with head() printed like this:

# A tibble: 6 × 1
id.make.model.year
<chr>
1 27550?????AM General?????DJ Po Vehicle 2WD?????1984
2 28426?????AM General?????DJ Po Vehicle 2WD?????1984
3 27549?????AM General?????FJ8c Post Office?????1984
4 28425?????AM General?????FJ8c Post Office?????1984
5 1032?????AM General?????Post Office DJ5 2WD?????1985
6 1033?????AM General?????Post Office DJ8 2WD?????1985


with only one column. I want to separate this into four columns with those four column names. I tried to use
separate()


A %>%
separate(id.make.model.year,into=c("id","make"),sep="?????")


and

A %>%
separate(id.make.model.year,into=c("id","make"),sep="\\?????")


but they both return the following error:


Error in stringi::stri_split_regex(value, sep, n_max) :
Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)


Yet another try...:

A %>%
separate(id.make.model.year,into=c("id","make"),sep="[?????]")


which returns

# A tibble: 33,439 × 2
id make
* <chr> <chr>
1 27550
2 28426
3 27549
4 28425
5 1032
6 1033
7 3347
8 13309
9 13310
10 13311
# ... with 33,429 more rows
Warning message:
Too many values at 33439 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...


I also tried dropping sep, but all the spaces are clearly counted as separators.

What's the right way to do this? Thanks in advance.

Answer

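The regex to match one question mark is \? or [?] (inside an R string the backslash must be doubled, so "\\?"). However, if you have five of them, [?????] still matches only one occurrence of that character, just as [aaaaa] would match only one letter a, not five, because [...] defines a character class. That is why your third attempt split at every single ? and left make empty. So to capture the five repetitions I think you want \?{5} or [?]{5} (or \?\?\?\?\? or [?][?][?][?][?]). Note also that into needs all four names, c("id","make","model","year"), if you want four columns.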
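To illustrate the difference between a character class and a repeated match, here is a small base R sketch. The string x is a made-up sample copied from the first printed row, and it assumes the separators really are literal question marks.

```r
# Sample value taken from the first printed row (assumes literal "?" characters)
x <- "27550?????AM General"

# A character class matches ONE "?" at a time, so the string is split at
# every single question mark, leaving empty pieces in between.
by_class <- strsplit(x, "[?????]")[[1]]

# Requiring exactly five in a row treats "?????" as one separator.
by_five <- strsplit(x, "[?]{5}")[[1]]

print(by_class)
print(by_five)
```

The first call returns the id, several empty strings, and then the make; the second returns just the two fields.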

Until you post data with dput() I can't confirm.
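In the meantime, here is a minimal sketch of what I would try. The tibble A below is a hypothetical stand-in rebuilt from two rows of your printed head(), and it assumes the separators are five literal question marks; if they are really some other character that only prints as ?, the pattern will need adjusting.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the real data, rebuilt from the printed head()
A <- tibble(id.make.model.year = c(
  "27550?????AM General?????DJ Po Vehicle 2WD?????1984",
  "1032?????AM General?????Post Office DJ5 2WD?????1985"
))

# Split on exactly five literal question marks; "[?]{5}" would work too.
# `into` lists all four target column names.
result <- separate(
  A,
  id.make.model.year,
  into = c("id", "make", "model", "year"),
  sep = "\\?{5}"
)

print(result)
```

All four columns come back as character; adding convert = TRUE to separate() should turn id and year into integers.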