CodeFarmer - 5 months ago
Perl Question

How to batch-delete foreign-language lines in a subtitle file

I have a subtitle sample like this:

2
00:01:32,288 --> 00:01:33,208
¬O¥L­Ì¶Ü¡H
How are you?

3
00:01:36,768 --> 00:01:39,648
€Ñ°Ú¡A¥L­Ì¥ŽºâŽN³o»ò°µ¶Ü¡H
âŽN³o»ò°µ¶Ü¡H
I am fine
And you ?


Here is my solution, but it's incomplete:

#!/usr/bin/perl
use strict;
use warnings;

my (%content, %count);
my $lineIndex = 0;
while (my $line = <>) {
    $lineIndex++;                     # line index starts from 1
    $content{$lineIndex} = $line;     # keep a copy of every line
    for my $i (0 .. length($line) - 1) {
        my $char = substr $line, $i, 1;
        if ($char =~ /\W/) {
            $count{$lineIndex}++;     # number of special chars on this line
        }
    }
}

# if a line contains more than 14 special chars, skip it
for my $i (keys %count) {
    if ($count{$i} > 14) {            # <---------------- see here
        delete $content{$i};          # delete from content
    }
}

# output the remaining lines in numeric order
# (a plain "sort keys" compares as strings and would put line 10 before line 2)
for my $j (sort { $a <=> $b } keys %content) {
    print $content{$j};
}





My solution has this problem:
���O�J�b�յۺ��X is not matched, because it contains no more than 14 special characters.
If I lower the threshold to a small number, e.g. 6, a timestamp line containing 00:01:33,208 is matched as well and gets deleted from the content.

Is there a good way to check characters in UTF-8?

Answer

Here's a much simpler solution:

while($line = <>) {
    print $line unless $line =~ /[^\x00-\x7e]/;
}

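The character class [\x00-\x7e] covers the basic ASCII characters (including control characters, but not DEL, \x7f; use [\x00-\x7f] for the full ASCII range). The negated class [^\x00-\x7e] therefore matches any line that contains at least one byte outside that range, and such lines are not printed.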
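The same idea can be sketched as a small reusable filter under `use strict`. This is only an illustration, not part of the original answer: the helper names `is_ascii_line` and `filter_ascii_lines` are hypothetical, the range is widened to \x7f to cover all of ASCII, and the non-ASCII sample line is written as hex escapes assumed to stand in for the garbled text in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains only ASCII bytes (\x00-\x7f).
sub is_ascii_line {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7f]/;
}

# Hypothetical helper: return only the ASCII-only lines, in their original order.
sub filter_ascii_lines {
    my @lines = @_;
    return grep { is_ascii_line($_) } @lines;
}

# One subtitle entry modelled on the question's sample. The third line is an
# illustrative non-ASCII byte string, written as hex escapes so that this
# script itself stays pure ASCII.
my @sample = (
    "2\n",
    "00:01:32,288 --> 00:01:33,208\n",
    "\xac\x4f\xa5\x4c\xad\xcc\xb6\xdc\xa1\x48\n",
    "How are you?\n",
);

# Prints the index, the timestamp line and "How are you?"; the third line is dropped.
print filter_ascii_lines(@sample);
```

Because every line is tested on its own, no threshold is needed and timestamp lines are never affected. One caveat of this approach: any line containing even a single non-ASCII character (an accented letter, a curly quote, or a UTF-8 byte order mark on the first line of the file) is dropped as well, so it may need loosening for subtitles that are not plain ASCII English.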