Mislav - 2 months ago 16
R Question

I am using R, version 3.3.1. I have the following column:

``````
my_column <-
c("1. Suvlasnički dio: 1/21. Suvlasnički dio: 1/2 ", "CRKVENAC ANDRIJA, GAREŠNICA KBR. 42CRKVENAC ANDRIJA, GAREŠNICA KBR. 42 ",
"2. Suvlasnički dio: 1/22. Suvlasnički dio: 1/2 ", "CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42 ",
"*1. Vlasnički dio: 1/1*1. Vlasnički dio: 1/1 ", "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 ",
"2. Suvlasnički dio: 1/22. Suvlasnički dio: 1/2 ", "ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 ",
"3. Suvlasnički dio: 1/23. Suvlasnički dio: 1/2 ", "ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156 "
)
``````

The strings in the column start with a letter, a number, `*` followed by a number, or `*` followed by a letter. I would like to remove all strings that start with a number or with `*` followed by a number. I tried the following code:

``````
my_column[grepl(pattern = "(?=^[^\\*]\\D{2})(?=^\\D)", x = my_column, perl = TRUE)]
# [1] "CRKVENAC ANDRIJA, GAREŠNICA KBR. 42CRKVENAC ANDRIJA, GAREŠNICA KBR. 42 "
# [2] "CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42 "
# [3] "ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
# [4] "ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156 "
``````

But it returns only the strings that start with a letter, and not the ones that start with `*` followed by a letter. Why?

From the start (`^`) of the string, we match zero or more `*` characters (`\\**`) followed by a digit (`[0-9]`), and then negate the result with `!` to extract the remaining elements.

``````
my_column[!grepl("^(\\**[0-9])", my_column)]
#[1] "CRKVENAC ANDRIJA, GAREŠNICA KBR. 42CRKVENAC ANDRIJA, GAREŠNICA KBR. 42 "
#[2] "CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42 "
#[3] "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 "
#[4] "ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
#[5] "ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156 "
``````
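To see what each piece of the pattern contributes, here is a minimal sketch on a small ASCII-only vector. The vector `x` and its values are made up for illustration and are not the OP's data; they only cover the four cases from the question (number, `*`number, letter, `*`letter).

```r
# Hypothetical sample data: number, *number, letter, *letter
x <- c("1. Share: 1/2", "*1. Share: 1/1", "SMITH JOHN", "*JONES ANN")

# ^      anchors the match at the start of the string
# \\**   zero or more literal asterisks
# [0-9]  a single digit
starts_with_number <- grepl("^\\**[0-9]", x)
starts_with_number
# [1]  TRUE  TRUE FALSE FALSE

# Negating the logical vector keeps the letter and *letter strings
kept <- x[!starts_with_number]
kept
# [1] "SMITH JOHN" "*JONES ANN"
```

The capture group in `"^(\\**[0-9])"` is not needed for a plain match test, so it is left out here.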

Alternatively, use `grep` with `invert=TRUE` and `value=TRUE`, which returns the non-matching elements directly:

``````
grep("^(\\**[0-9])", my_column, invert=TRUE, value=TRUE)
#[1] "CRKVENAC ANDRIJA, GAREŠNICA KBR. 42CRKVENAC ANDRIJA, GAREŠNICA KBR. 42 "
#[2] "CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42CRKVENAC LJUBICA ROĐ. VERTUŠ, GAREŠNICA KBR. 42 "
#[3] "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 "
#[4] "ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANĐAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
#[5] "ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156ANĐAL LIDIJA, GAREŠNICA, MATIJE GUPCA KBR. 156 "
``````
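As a quick sanity check, the two forms should give the same result. The sketch below compares them on a made-up vector (`x` is hypothetical, not the OP's column) and also shows what `grep` returns when `value=TRUE` is omitted.

```r
# Hypothetical sample data: number, *number, letter, *letter
x <- c("1. Share: 1/2", "*1. Share: 1/1", "SMITH JOHN", "*JONES ANN")
pattern <- "^\\**[0-9]"

# Logical indexing with a negated grepl()
via_grepl <- x[!grepl(pattern, x)]

# grep() returning the non-matching values directly
via_grep <- grep(pattern, x, invert = TRUE, value = TRUE)

identical(via_grepl, via_grep)
# [1] TRUE

# Without value = TRUE, grep() returns the positions of the kept elements
kept_idx <- grep(pattern, x, invert = TRUE)
kept_idx
# [1] 3 4
```

The index form can be handy when the same rows need to be selected from a data frame rather than from a single vector.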

NOTE: This assumes, based on the OP's remark that the attempt returns only the strings starting with a letter and not the ones starting with `*` and a letter, that the `*`letter strings should be kept.
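As for why the original attempt drops the `*`letter strings: its first lookahead, `(?=^[^\\*]\\D{2})`, requires the first character to be something other than `*`, so every string starting with `*` is rejected regardless of what follows. If a lookaround-based pattern is preferred, one possible alternative (a sketch, not part of the answer above) is a single negative lookahead. The vector `x` is again hypothetical sample data.

```r
# Hypothetical sample data: number, *number, letter, *letter
x <- c("1. Share: 1/2", "*1. Share: 1/1", "SMITH JOHN", "*JONES ANN")

# The pattern from the question: [^\\*] forbids a leading asterisk,
# so "*JONES ANN" is excluded along with the number strings
op_pattern <- "(?=^[^\\*]\\D{2})(?=^\\D)"
op_result <- x[grepl(op_pattern, x, perl = TRUE)]
op_result
# [1] "SMITH JOHN"

# Negative lookahead: match at the start only if what follows is NOT
# "optional asterisks, then a digit"
neg_pattern <- "^(?!\\**[0-9])"
neg_result <- x[grepl(neg_pattern, x, perl = TRUE)]
neg_result
# [1] "SMITH JOHN" "*JONES ANN"
```

Lookarounds need `perl = TRUE`; the `!grepl()` and `grep(..., invert = TRUE)` forms work with the default regex engine.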