Piskot - 4 months ago
Pascal Question

Some characters count twice

Right now I'm trying to find the longest sentence in a text and print out its number of characters, including spaces and the like. The problem is that characters like 'š' or 'á' are counted twice. I tried subtracting one in those cases, but that doesn't work either, because the subtraction is also applied twice. Any idea how I could fix that? Here is the code for the counter.

for i:=1 to length(text) do
  case text[i] of
    '.','!','?': begin
      if len>p2 then p2:=len;
      len:=0
    end;
    else inc(len);
  end;


p2 holds the length of the longest sentence found so far, and len holds the length of the current sentence.

Answer

This works for me with ANSI characters, including those with diacritics. As you've not mentioned any specific character set, and your question is simply tagged as Pascal, it should work for you as well. If you're dealing with other character sets, then you need to indicate which specific Pascal compiler you're using, as support for multi-byte characters differs between Pascal dialects.

function LongestSentenceCharCount(const Text: string): Integer;
var
  Len: Integer;
  LongLen: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  LongLen := 0;
  for i := 1 to Len do
  begin
    if Text[i] in ['.', '!', '?'] then
    begin
      if CurrLen > LongLen then
        LongLen := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen);

  end;
  Result := LongLen;
end;
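
The question doesn't say which compiler is in use. If it is Free Pascal or Lazarus (an assumption on my part), a string typically holds UTF-8 bytes, so 'š' and 'á' each occupy two bytes and both Length and Text[i] see each of them twice. A minimal sketch of one possible workaround is to skip UTF-8 continuation bytes (those of the form 10xxxxxx) while counting. LongestSentenceUtf8 is a hypothetical name, and the test string spells 'áš' as explicit bytes so the result does not depend on the source file's encoding:

```pascal
program Utf8SentenceLen;

{$mode objfpc}{$H+}

// Hypothetical helper: length of the longest sentence in a UTF-8 byte
// string, counted in code points rather than bytes.
function LongestSentenceUtf8(const Text: AnsiString): Integer;
var
  i, CurrLen: Integer;
  b: Byte;
begin
  Result := 0;
  CurrLen := 0;
  for i := 1 to Length(Text) do
  begin
    b := Ord(Text[i]);
    if Text[i] in ['.', '!', '?'] then
    begin
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    else if (b and $C0) <> $80 then
      Inc(CurrLen); // count lead and ASCII bytes; skip 10xxxxxx continuation bytes
  end;
end;

begin
  // 'Ahoj, jak se máš? Hello World.' with á = C3 A1 and š = C5 A1
  WriteLn(LongestSentenceUtf8('Ahoj, jak se m'#$C3#$A1#$C5#$A1'? Hello World.')); // prints 16
end.
```

This counts code points, so a base letter followed by a separate combining mark would still count as two.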

To deal with Unicode text in multi-byte encodings such as UTF-8 or UTF-16:

The following is based on code that Seppy Bloom (at the time Team Leader for RTL/VCL at Embarcadero) donated to Cary Jensen for his white paper (PDF) Delphi Unicode Migration for Mere Mortals: Stories and Advice from the Front Lines. It uses the normalization functions available in Windows Vista and later. I've adapted my function above to use Seppy's code (included below), along with a sample app that demonstrates it. The code was developed, compiled and tested in Delphi 10.1 Berlin, so if you're using a different compiler you'll have to adjust it, and it won't work on Windows versions older than Vista.

program Project1;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, WinAPI.Windows;

const
  NormalizationOther = 0;
  NormalizationC     = 1;
  NormalizationD     = 2;
  NormalizationKC    = 5;
  NormalizationKD    = 6;

function IsNormalizedString(NormForm: Integer; lpString: LPCWSTR;
  cwLength: Integer): BOOL; stdcall; external 'Normaliz.dll';

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR;
  cwSrcLength: Integer; lpDstString: LPWSTR;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function NormalizedStringLength(const Str: string): Integer;
var
  Buf: string;
begin
  if not IsNormalizedString(NormalizationC, PChar(Str), -1) then
  begin
    SetLength(Buf, NormalizeString(NormalizationC, PChar(Str),
                                   Length(Str), nil, 0));
    Result := NormalizeString(NormalizationC, PChar(Str),
                                   Length(Str), PChar(Buf), Length(Buf));
  end
  else
    Result := Length(Str);
end;

function LongestSentenceLen(const Text: string): Integer;
var
  Len: Integer;
  i, CurrLen: Integer;
begin
  Len := Length(Text);
  CurrLen := 0;
  Result := 0;
  for i := 1 to Len do
  begin
    // Replaces 'if Text[i] in ['.', '!', '?']', which will work
    // but generates a compiler warning.
    if CharInSet(Text[i], ['.', '!', '?']) then 
    begin
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen, NormalizedStringLength(Text[i]));
  end;
end;

var
  Test: string;

begin
  Test := 'Ahoj, jak se máš? Hello World.';
  WriteLn(Test);
  WriteLn(Format('Longest: %d', [LongestSentenceLen(Test)]));
  ReadLn;
end.

The output of the above is:

Ahoj, jak se más? Hello World.
Longest: 16
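
One caveat with the program above: NormalizedStringLength receives a single UTF-16 code unit at a time, so a decomposed sequence such as 'a' followed by U+0301 (combining acute accent) is never seen as a whole and still counts as two. A possible variation, sketched here under the same Delphi and Windows Vista or later assumptions, is to normalize the entire text to form C once and then count. ToNFC and LongestSentenceLenNFC are hypothetical names, and I have not run this variant:

```pascal
program NormalizeWholeText;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, WinAPI.Windows;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR;
  cwSrcLength: Integer; lpDstString: LPWSTR;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: returns Str in normalization form C,
// or Str unchanged if the API reports an error.
function ToNFC(const Str: string): string;
var
  Estimated, Actual: Integer;
begin
  Result := Str;
  if Str = '' then
    Exit;
  // With a nil destination the call returns an estimated buffer size.
  Estimated := NormalizeString(NormalizationC, PChar(Str), Length(Str), nil, 0);
  if Estimated <= 0 then
    Exit;
  SetLength(Result, Estimated);
  Actual := NormalizeString(NormalizationC, PChar(Str), Length(Str),
                            PChar(Result), Estimated);
  if Actual > 0 then
    SetLength(Result, Actual) // trim to the real normalized length
  else
    Result := Str;
end;

// Hypothetical variant of LongestSentenceLen that normalizes first.
function LongestSentenceLenNFC(const Text: string): Integer;
var
  Norm: string;
  i, CurrLen: Integer;
begin
  Norm := ToNFC(Text);
  CurrLen := 0;
  Result := 0;
  for i := 1 to Length(Norm) do
    if CharInSet(Norm[i], ['.', '!', '?']) then
    begin
      if CurrLen > Result then
        Result := CurrLen;
      CurrLen := 0;
    end
    else
      Inc(CurrLen);
end;

begin
  // Decomposed input: 'a' + U+0301 and 's' + U+030C instead of 'á' and 'š'.
  WriteLn(LongestSentenceLenNFC('Ahoj, jak se ma'#$0301's'#$030C'? Hello World.'));
  ReadLn;
end.
```

This still counts UTF-16 code units, so characters outside the Basic Multilingual Plane (stored as surrogate pairs) would count as two.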