Azaghal - 1 month ago
Perl Question

Perl XML::Twig: find a substring located before a child element in mixed content

I'm working on an XML file with some mixed content (elements containing text, one child tag, then text again).

I would like to extract, for each parent element, the word (substring) coming right before the child element.

Example of XML input:

<root>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
</root>


Example of text output:

all
all
all
all


I know that applying text_only to the parent element will give me "there is text all around it", so I don't have to deal with the child element anymore, but then I don't know how to locate the preceding word.

Should I replace the child element with some kind of textual marker like | and just go through the remaining text as a single string?

I'm not asking for a full "ready-made" answer, but some directions would sure be helpful.

Answer

You can find each child element and then look at the text of its sibling on the left, that is, its previous sibling. Conveniently, XML::Twig has a method, prev_sibling_text, that gives you just that, since in this mixed content the previous sibling is a text node anyway. From there, it's just a matter of picking out the last word.

use strict;
use warnings;
use feature 'say';
use XML::Twig;

my $twig = XML::Twig->new(
    TwigHandlers => {
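        child => sub {
            # $_ is the <child> element; take the text node just before it,
            # split it on whitespace and print the last word.
            say +( split /\s/, $_->prev_sibling_text )[-1];
        },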
    }
);

$twig->parse( \*DATA );

__DATA__
<root>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
<parent> there is text all <child>text</child> around it</parent>
</root>
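
If the text before the child might be empty or whitespace-only, the split-and-index step above would yield undef. As a rough sketch of the "locating the last word" step on its own, here is a small helper in core Perl only. The name last_word is hypothetical (it is not part of XML::Twig), and "word" is assumed to mean a run of non-whitespace characters, so trailing punctuation would be kept.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (not an XML::Twig method): return the last
# whitespace-delimited word of a string, or undef if there is none.
sub last_word {
    my ($text) = @_;
    return undef unless defined $text;

    # (\S+) captures a run of non-whitespace characters;
    # \s*\z allows only optional whitespace between it and the end of the string.
    return $text =~ /(\S+)\s*\z/ ? $1 : undef;
}

say last_word(' there is text all ') // '(none)';   # prints "all"
say last_word('   ')                 // '(none)';   # prints "(none)"
```

Inside the handler above, this could be called as last_word( $_->prev_sibling_text ), with the // fallback deciding what to print when there is no preceding word.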