CodeFarmer - 4 months ago
Perl Question

How to batch-delete foreign-language lines in subtitles

I have a subtitle sample like this:

00:01:32,288 --> 00:01:33,208
How are you?

00:01:36,768 --> 00:01:39,648
I am fine
And you ?

Here is my solution, but it's incomplete:

#!/usr/bin/perl -w
$lineIndex = 0;
while ($line = <>) {
    $lineIndex++; # line index starts from 1
    $content{$lineIndex} = $line; # copy to content
    for ($i = 0; $i < length($line); $i++) {
        $char = substr $line, $i, 1;
        if ($char =~ /\W/) {
            #print $char;
            $count{$lineIndex}++; # how many special chars this line has
        }
    }
}
# if a line contains more than 14 special chars, skip it
print "\n";
for $i (keys %count) {
    if ($count{$i} > 14) { #<----------------see here
        delete $content{$i}; # delete from content
    }
}

for $j (sort { $a <=> $b } keys %content) { # output in numeric line order
    print $content{$j};
}

My solution has this problem:
A line such as ���O�J�b�յۺ��X is not matched, because it has 14 or fewer special characters.
If I lower the threshold to a small number, e.g. 6, a timestamp line like 00:01:33,208 is matched and therefore deleted from the content.

Is there a good way to check characters in UTF-8?


Here's a much simpler solution:

while ($line = <>) {
    print $line unless $line =~ /[^\x00-\x7e]/;
}

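The character set [\x00-\x7e] covers every ASCII character except DEL (\x7f), including the other control characters. The negated class [^\x00-\x7e] therefore matches any line that contains a byte outside that range, which is true of every non-ASCII character encoded in UTF-8, so those lines are not printed. Timestamp lines and English text are pure ASCII and pass through unchanged.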
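As a rough, self-contained sketch of the same idea, the check can be wrapped in a small helper and tried on in-memory sample lines. The helper name is_ascii_line and the sample data are assumptions made for illustration, not part of the original answer; the sketch also uses \x00-\x7f rather than \x00-\x7e, so that DEL counts as ASCII too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): true when the line
# contains only bytes in the ASCII range \x00-\x7f.
sub is_ascii_line {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7f]/;
}

# Sample input, assumed for illustration. The third entry is a
# non-ASCII line written as raw UTF-8 bytes.
my @lines = (
    "00:01:32,288 --> 00:01:33,208\n",
    "How are you?\n",
    "\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x90\x97\n",
    "\n",
);

# Keep only the pure-ASCII lines and print them.
my @kept = grep { is_ascii_line($_) } @lines;
print @kept;
```

This prints the timestamp line, the English line, and the blank line, and drops the non-ASCII one. To filter a real subtitle file, the sample array could be replaced by a while (my $line = <>) loop, as in the answer above.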