user3142695 - 2 months ago
Javascript Question

Split a string into lines and sentences while ignoring abbreviations

There is some string content, which I have to split. First of all I need to split the string content into lines.

This is how I do that:

str.split('\n').forEach((item) => {
    if (item) {
        // TODO: split also each line into sentences

        let data = {
            type   : 'item',
            content: [{
                content  : item,
                timestamp: Math.floor(Date.now() / 1000)
            }]
        };

        // Save `data` to DB
    }
});


But now I also need to split each line into sentences. The difficulty for me is doing the split correctly. My idea is to use '. ' (dot and space) to split the line.
BUT there is also an array of abbreviations, which should NOT split the line:

const abbr = ['vs.', 'min.', 'max.']; // Just an example; there are 70 abbreviations in that array


... and there are a few more rules:


  1. Any number followed by a dot, or a single letter followed by a dot, should also not split the line: 1., 2., 30., A., b.

  2. Upper and lower case should be ignored: "Max. Lorem ipsum" should not be split, and neither should "Lorem max. ipsum".



Example

const str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';


The result of that should be four data-objects:

{ type: 'item', content: [{ content: 'Just some examples:', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'This example has min. 2 lines.', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'Max. 10 lines.', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'There are some words: 1. Foo and 2. bar.', timestamp: 123 }] }

Answer

You can first detect the abbreviations and the numberings in the string, and replace the dot by a dummy string in each one. After splitting the string on the remaining dots, which signal the end of a sentence, you can restore the original dots. Once you have the sentences, you can split each one on newline characters like you do in your original code.
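
To make the idea concrete before the full version, here is a minimal sketch of the mask, split, restore sequence for a single hard-coded abbreviation. The names `MASK` and `maskSplitRestore` are made up for this illustration, and it assumes the placeholder never occurs in the input.

```javascript
// Placeholder for protected dots; assumed not to appear in the input text.
const MASK = "_DOT_";

function maskSplitRestore(text) {
    // 1. Mask the dot of the one abbreviation protected here ("min."),
    //    keeping the original upper/lower case of the word.
    const masked = text.replace(/\b(min)\./gi, (match, word) => word + MASK);
    // 2. Split on the remaining dots, which end sentences.
    const pieces = masked.split(".");
    // 3. Restore the masked dots, drop empty pieces, re-append the final dot.
    return pieces
        .map((p) => p.split(MASK).join(".").trim())
        .filter((p) => p.length > 0)
        .map((p) => p + ".");
}

const demo = maskSplitRestore("Use min. two lines. Then stop.");
console.log(demo); // [ 'Use min. two lines.', 'Then stop.' ]
```
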

The code below allows the dot to appear anywhere in the abbreviations (not necessarily at the end).

var i, regx, abbrParts;
const DOT = "_DOT_";
const abbr = ['vs.', 'min.', 'max.'];

var str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar. And also A. some letters.';

console.log("String: " + str);

// Replace dot in abbreviations found in string
for (i = 0; i < abbr.length; i++) {
    abbrParts = abbr[i].split(".");
    regx = new RegExp("(\\W*" + abbrParts[0] + ")(\\.)(" + abbrParts[1] + "\\W*)", "gi");
    str = str.replace(regx, "$1" + DOT + "$3");
}

// Replace dot in numbers found in string
str = str.replace(/(\W*\d+)(\.)/gi, "$1" + DOT);

// Replace dot in letter numbering found in string
str = str.replace(/(\W+[a-zA-Z])(\.)/gi, "$1" + DOT);

// Split the string at dots
var parts = str.split(".");

// Restore dots in sentences
var sentences = [];
regx = new RegExp(DOT, "gi");
for (i = 0; i < parts.length; i++) {
    if (parts[i].trim().length > 0) {
        sentences.push(parts[i].replace(regx, ".").trim() + ".");
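        // Index by sentences, not parts: empty parts are skipped, so the two can diverge
        console.log("Sentence " + sentences.length + ": " + sentences[sentences.length - 1]);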
    }
}
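
To get the `data` objects from the question, the same masking idea can be applied line by line. The following is only a sketch under a few assumptions: the helper names (`escapeRegExp`, `maskDots`, `splitSentences`, `toItems`) are hypothetical, abbreviations are only protected at the start of a word, and each sentence keeps its own final dot instead of having one appended.

```javascript
const DOT_MASK = "_DOT_"; // placeholder; assumed never to occur in the input
const ABBREVIATIONS = ["vs.", "min.", "max."]; // stand-in for the full list

// Escape regex metacharacters so an abbreviation is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every dot inside a matched token with the placeholder.
function maskDots(token) {
    return token.split(".").join(DOT_MASK);
}

function splitSentences(line) {
    let masked = line;
    // Protect abbreviations, case-insensitively, at a word start only.
    for (const a of ABBREVIATIONS) {
        masked = masked.replace(new RegExp("\\b" + escapeRegExp(a), "gi"), maskDots);
    }
    // Protect "1.", "30.", "A.", "b.": a number or a single letter plus dot.
    masked = masked.replace(/\b(?:\d+|[a-z])\./gi, maskDots);
    // Take runs of non-dot characters together with the dot that ends them.
    const pieces = masked.match(/[^.]+\.?/g) || [];
    return pieces
        .map((p) => p.split(DOT_MASK).join(".").trim())
        .filter((p) => p.length > 0);
}

function toItems(text) {
    const timestamp = Math.floor(Date.now() / 1000);
    const items = [];
    for (const line of text.split("\n")) {
        for (const sentence of splitSentences(line)) {
            items.push({ type: "item", content: [{ content: sentence, timestamp }] });
        }
    }
    return items;
}

const example = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
const items = toItems(example);
console.log(items.map((it) => it.content[0].content));
```

For the example string from the question this yields the four expected sentences, with "Just some examples:" kept without a trailing dot. Note that, as the rules in the question imply, a sentence that really ends in a number or a single letter (for example "It happened in 2020. Then ...") will not be split at that point.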