Dmitry Leykin - 3 months ago
R Question

single letter regex operations in R

I'm trying to identify, in Hebrew text, cases where a word (of 2 or more letters) is followed by a single letter. I need to match these instances and then concatenate the single letter to its preceding word. Any text might contain multiple such cases.
Example:

texts <- c("שלום חברי צה ל היקרים", "נכון לא נכון קשק ש בבטחון", "צה ל ינצח ")


I need to replace it with:

texts <- c("שלום חברי צהל היקרים", "נכון לא נכון קשקש בבטחון", "צהל ינצח ")


Thank you for the suggestions

Answer

The Hebrew letters sit in the Unicode range U+05D0-U+05F2 (U+05D0-U+05EA are the letters alef through tav, U+05F0-U+05F2 are Yiddish ligatures), so you can put that range in a character class to match a single Hebrew letter. Requiring whitespace on each side of the letter makes the pattern match only a one-letter word. The capture group holds the letter plus the space after it, so substituting the match with the capture group removes just the space before the letter.

gsub("\\s([\u05D0-\u05F2]\\s)", "\\1", texts)  # hebrew letter unicode range
# [1] "שלום חברי צהל היקרים"     "נכון לא נכון קשקש בבטחון" "צהל ינצח "

The full Hebrew Unicode block is U+0590-U+05FF, which also covers vowel points, cantillation marks and punctuation. You can adjust the range according to what you need.

gsub("\\s([\u0590-\u05FF]\\s)", "\\1", texts)  
# [1] "שלום חברי צהל היקרים"     "נכון לא נכון קשקש בבטחון" "צהל ינצח " 
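
Both patterns above require whitespace after the single letter, so a lone letter at the very end of a string is left alone, and neither checks that the preceding word has at least 2 letters. A minimal sketch of one way to cover both cases is below. It assumes `perl = TRUE` (for the lookahead) and a UTF-8 session. The helper name `join_single_letters` is hypothetical, and restricting the class to U+05D0-U+05EA is my assumption, not part of the original answer.

```r
# Hypothetical helper: glue a stand-alone Hebrew letter onto the preceding word.
join_single_letters <- function(x) {
  letter <- "[\u05D0-\u05EA]"  # assumption: letters alef..tav only
  # Group 1: last two letters of the preceding word (so it has 2+ letters).
  # Group 2: the single letter, which must be followed by whitespace or the
  #          end of the string (lookahead, so that whitespace is not consumed).
  pattern <- paste0("(", letter, "{2})\\s+(", letter, ")(?=\\s|$)")
  gsub(pattern, "\\1\\2", x, perl = TRUE)
}

texts <- c("שלום חברי צה ל היקרים", "נכון לא נכון קשק ש בבטחון", "צה ל ינצח ")
fixed <- join_single_letters(texts)
print(fixed)

# A single letter at the end of the string is also joined.
print(join_single_letters("צה ל"))
```

On the example vector this should give the same result as the patterns above. The lookahead leaves the trailing whitespace in place, so the spacing before the next word is unchanged.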