sohnyrin - 5 months ago
Perl Question

Using Perl to replace accented words with unaccented words plus an associated number

In any accented word, the accent must be removed and a corresponding number appended to the end of the word.

For example, gàr must appear as gar3.

▶ Words carry only acute or grave accents, which translate to 2 and 3 respectively, added at the end of the word.

▶ Words may be adjacent to spaces, tabs, line breaks, hyphens (long or short), parentheses, question marks, etc.

▶ Words include non-ASCII characters such as shin (š, an s with an upside-down hat, i.e. a caron).

Can anyone suggest the right structure, regex and replacement pattern?

Thanks!

Here is a sample for testing:

14 IGI <DIŠ>⌈x⌉-èr-ra
15 IGI <DIŠ>bu-ṣí-ia
16 IGI <DIŠ>su-ka-lum
17 IGI <DIŠ>ì-lí-tu-[...x-...x]

It should result in:

14 IGI <DIŠ>⌈x⌉-er3-ra
15 IGI <DIŠ>bu-ṣi2-ia
16 IGI <DIŠ>su-ka-lum
17 IGI <DIŠ>i3-li2-tu-[...x-...x]


This is an inactive question, but since it might be useful for people searching for similar problems, here's the code that does exactly what you asked for:

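use strict;
use warnings;
use utf8;                             # this source file contains UTF-8 literals
use Unicode::Normalize;
binmode(STDOUT, ':encoding(UTF-8)');  # avoids "Wide character in print" warnings

my $text = 'IGI <DIŠ>bu-ṣí-ia';       # your input data

my $x = NFD($text);                   # Normalization Form D (1)
# Remove the accent and put the number after the remaining letters of the word,
# so that gàr becomes gar3 rather than ga3r. ${1}3 is needed because $13 would
# be read as capture group 13.
$x =~ s/\x{300}([\pL\pM]*)/${1}3/g;   # grave accent becomes a trailing 3 (2)
$x =~ s/\x{301}([\pL\pM]*)/${1}2/g;   # acute accent becomes a trailing 2 (2)
$x = NFC($x);                         # Normalization Form C (1)
print "$x\n";                         # prints "IGI <DIŠ>bu-ṣi2-ia"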
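
As a rough sketch, the same NFD / substitute / NFC idea can be run over the whole sample from the question. The helper name accents_to_digits is made up for illustration. It assumes a "word" is an unbroken run of letters and combining marks, so the digit lands after the last letter of the syllable (èr becomes er3).

```perl
use strict;
use warnings;
use utf8;                                   # the sample lines below are UTF-8 literals
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper (not part of any module): strip grave/acute accents
# and append 3/2 after the remaining letters of the same word.
sub accents_to_digits {
    my ($text) = @_;
    my $x = NFD($text);                     # base letters followed by combining marks
    # One accent per pass; repeating means a second accent inside an
    # already-rewritten word is not skipped.
    1 while $x =~ s/([\x{300}\x{301}])([\pL\pM]*)/$2 . ($1 eq "\x{301}" ? '2' : '3')/e;
    return NFC($x);                         # recompose what is left (š, ṣ, ...)
}

binmode(STDOUT, ':encoding(UTF-8)');

my @sample = (
    '14 IGI <DIŠ>⌈x⌉-èr-ra',
    '15 IGI <DIŠ>bu-ṣí-ia',
    '16 IGI <DIŠ>su-ka-lum',
    '17 IGI <DIŠ>ì-lí-tu-[...x-...x]',
);

print accents_to_digits($_), "\n" for @sample;
# 14 IGI <DIŠ>⌈x⌉-er3-ra
# 15 IGI <DIŠ>bu-ṣi2-ia
# 16 IGI <DIŠ>su-ka-lum
# 17 IGI <DIŠ>i3-li2-tu-[...x-...x]
```

Other diacritics, such as the caron on š or the dot under ṣ, are left alone because only U+0300 and U+0301 are matched. When reading from a file instead of string literals, the input handle would also need a UTF-8 layer.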

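1 I'm not a Unicode expert, so I can't explain clearly and properly what those functions do exactly. In short, NFD splits each accented character into its base letter followed by combining marks, and NFC recombines them. The Unicode::Normalize documentation or a web search will give you a better idea.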

2 Check the Unicode table for values 300 and 301: U+0300 is the combining grave accent and U+0301 is the combining acute accent.
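
To make the two notes a bit more concrete, here is a small sketch that prints the code points of gàr before and after normalization. The helper codepoints is a made-up name for illustration; NFD and NFC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: list the code points of a string as U+XXXX.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

my $word = "g\x{E0}r";                       # "gàr" with a precomposed à (U+00E0)

print codepoints($word), "\n";               # U+0067 U+00E0 U+0072
print codepoints(NFD($word)), "\n";          # U+0067 U+0061 U+0300 U+0072
print codepoints(NFC(NFD($word))), "\n";     # U+0067 U+00E0 U+0072
```

After NFD the grave accent is its own character (U+0300), which is why a plain substitution can find it and turn it into a digit.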