Mislav Mislav - 2 months ago 16
R Question

Word doesn't start with number or asterisk and number (regex)

I am using R, version 3.3.1. I have the following column:

my_column <-
c("1. SuvlasniÄŤki dio: 1/21. SuvlasniÄŤki dio: 1/2 ", "CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42 ",
"2. SuvlasniÄŤki dio: 1/22. SuvlasniÄŤki dio: 1/2 ", "CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42 ",
"*1. Vlasnički dio: 1/1*1. Vlasnički dio: 1/1 ", "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 ",
"2. SuvlasniÄŤki dio: 1/22. SuvlasniÄŤki dio: 1/2 ", "ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 ",
"3. SuvlasniÄŤki dio: 1/23. SuvlasniÄŤki dio: 1/2 ", "ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156 "
)


The strings in the column start with a letter, a number, *number, or *letter. I would like to remove all strings that start with a number or with * followed by a number. I tried the following code:

my_column[grepl(pattern = "(?=^[^\\*]\\D{2})(?=^\\D)", x = my_column, perl = TRUE)]
# [1] "CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42 "
# [2] "CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42 "
# [3] "ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
# [4] "ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156 "


But it returns only the strings that start with a letter, and not the ones that start with * followed by a letter. Why?

Answer

From the start (^) of the string, we match zero or more * (\\**) followed by a digit ([0-9]), and negate (!) the result to extract the remaining elements.

my_column[!grepl("^(\\**[0-9])", my_column)]
#[1] "CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42 "
#[2] "CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42 "
#[3] "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 "
#[4] "ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
#[5] "ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156 "
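To see how each part of the pattern behaves, here is a minimal sketch on a small ASCII-only vector. The vector `x` and its contents are made up for illustration and are not the OP's data; they only cover the four kinds of prefix described in the question.

```r
# Hypothetical sample: number, *number, letter, *letter prefixes
x <- c("1. part", "*1. part", "SMITH JOHN", "*SMITH JOHN")

# ^      anchors at the start of the string
# \\**   zero or more literal asterisks
# [0-9]  one digit
starts_with_number <- grepl("^\\**[0-9]", x)
starts_with_number
# [1]  TRUE  TRUE FALSE FALSE

# Negating the logical vector keeps the letter and *letter strings
kept <- x[!starts_with_number]
kept
# [1] "SMITH JOHN"  "*SMITH JOHN"
```

Because `\\**` allows any number of asterisks, a string such as "**1" would also be removed. If at most one leading asterisk should count, `\\*?` could be used in its place.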

Or use grep with invert=TRUE and value=TRUE to return the non-matching elements directly:

grep("^(\\**[0-9])", my_column, invert=TRUE, value=TRUE)
#[1] "CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42CRKVENAC ANDRIJA, GAREĹ NICA KBR. 42 "
#[2] "CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42CRKVENAC LJUBICA ROÄ. VERTUĹ , GAREĹ NICA KBR. 42 "
#[3] "*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3*MUHVIĆ IVAN, ANTUNOV, GAREŠNICA, MATIJE GUPCA 3 "
#[4] "ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4ANÄAL-MLINARIĆ BRIGITA, BJELOVAR, V. LISINSKOG KBR. 4 "
#[5] "ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156ANÄAL LIDIJA, GAREĹ NICA, MATIJE GUPCA KBR. 156 "

NOTE: This is based on the OP's post: "But it returns only strings that start with letter and not * letter words?", so strings that start with * followed by a letter are kept.
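As for why the pattern in the question drops the *letter strings: its first lookahead begins with `[^\\*]`, which requires the first character to be something other than an asterisk. If a lookahead-based pattern is preferred, a single negative lookahead is one possible alternative. The following is only a sketch on a made-up ASCII vector (`x` is hypothetical, not the OP's data), using `perl = TRUE` as in the question.

```r
# Hypothetical sample: number, *number, letter, *letter prefixes
x <- c("1. part", "*1. part", "SMITH JOHN", "*SMITH JOHN")

# Pattern from the question: [^\\*] rejects any string whose first character is "*"
original <- grepl("(?=^[^\\*]\\D{2})(?=^\\D)", x, perl = TRUE)
x[original]
# [1] "SMITH JOHN"

# Negative lookahead: the start must NOT be followed by an optional "*" and a digit
fixed <- grepl("^(?!\\*?\\d)", x, perl = TRUE)
x[fixed]
# [1] "SMITH JOHN"  "*SMITH JOHN"
```

Unlike `\\**` above, `\\*?` treats only a single optional asterisk as part of the numeric prefix.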
