yoeddy - 1 year ago
Perl Question

XML::Twig parsing same name tag in same path

I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch, but the company said they couldn't extract patient demographic data from the database. We asked if they could get us name, address, and DOB in a CSV file of some sort, very basic stuff, yet they claim they can't do that (crazy considering they are using an SQL database).
Anyway - the patients were handed over as XML files, and there are 40,000+ of them. They contain a lot more than the demographics, though.
After doing some research, and having done extensive Perl programming 15 years ago (I admit it got rusty over the years), I thought this would be a good task to get done in Perl, and I came across the XML::Twig module, which seems to be able to do the trick.
Unfortunately the XML of interest looks like this:


<patient extension="Patient ID Number"> <!-- Patient ID is a 5-digit number -->
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
<birthTime value="YEARMMDD"/>

More fields for address etc. follow in the XML file.


Here is what I coded:

my $twig = XML::Twig->new( twig_handlers => {
    'patient/name/family' => \&get_family_name,
    'patient/name/given'  => \&get_given_name,
} );

my @fields;

sub get_family_name { my ($twig, $data) = @_; $fields[0] = $data->text; $twig->purge; }
sub get_given_name  { my ($twig, $data) = @_; $fields[1] = $data->text; $twig->purge; }

I have no problems reading out all the information that has unique tags (family, city, zip code, etc.), but XML::Twig only returns the middle initial for the "given" tag.
How can I address the first occurrence of "given" and assign it to $fields[1], and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial?

Also, how do I extract the "Patient ID" or the "birthTime" value with XML::Twig? I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.

I looked at "Perl, XML::Twig, how to reading field with the same tag", which was very helpful, but since my duplicate tags are in the same path the situation is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?

Thank you for your help in advance!

Answer Source

It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small
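As a rough illustration of the whole-document approach, here is a minimal sketch that parses without any handlers and then walks the finished tree. The sample data is made up, and the patients wrapper and the name element around family/given are assumptions (the name element is guessed from the handler paths in the question)

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample data: the <patients> and <name> wrappers are assumptions
my $xml = <<'XML';
<patients>
  <patient extension="12345">
    <name>
      <family>Smith</family>
      <given>John</given>
      <given>Q</given>
    </name>
    <birthTime value="19700131"/>
  </patient>
</patients>
XML

# No handlers: build the whole tree in memory, then query it
my $twig = XML::Twig->new;
$twig->parse($xml);

my @rows;
for my $pat ( $twig->root->children('patient') ) {
    my $name = $pat->first_child('name');

    # Attributes are read with att(), element content with the *_text methods
    push @rows, join ',',
        $pat->att('extension'),
        $name->first_child_text('family'),
        $pat->first_child('birthTime')->att('value');
}

print "$_\n" for @rows;
```

With files as small as a single patient record this keeps the code simple, at the cost of holding each document in memory while it is queried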

It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required

I've chosen to build a hash of information %patient out of each patient element and push it onto an array @patients that contains all the data in the file. If you have only one patient per file then this will need to be changed

I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable
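If you would rather keep the first name and the middle initial apart, a sketch along these lines may work. It relies on children_trimmed_text returning one string per matching child in list context. The sample element is invented, and the variable names $first, $middle and $joined are only for illustration

```perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample element; the <name> wrapper is assumed from the
# handler paths used in the question
my $twig = XML::Twig->new;
$twig->parse('<name><family>Smith</family><given> John </given><given>Q</given></name>');

# In list context this returns the trimmed text of every <given> child,
# in document order
my @given = $twig->root->children_trimmed_text('given');

# First occurrence and second occurrence; $middle is undef when the
# record has no second <given>
my ($first, $middle) = @given;

# The single-string form, with intervening spaces
my $joined = join ' ', @given;

print "first=$first joined=$joined\n";
```

Dropping the middle initial is then just a matter of ignoring $middle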

The following code is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs

use strict;
use warnings 'all';

use XML::Twig;

my @patients;

my $twig = XML::Twig->new(
    twig_handlers => { patient => \&get_patient }
);

sub get_patient {
    my ($twig, $pat) = @_;

    my %patient;

    $patient{id} = $pat->att('extension');

    my $name         = $pat->first_child('name');
    $patient{family} = $name->first_child_trimmed_text('family');
    $patient{given}  = join ' ', $name->children_trimmed_text('given');

    $patient{dob}    = $pat->first_child('birthTime')->att('value');

    push @patients, \%patient;