Martin Toth Martin Toth -4 years ago 76
SQL Question

SAS hierarchical structure sum

I have a dataset with a hierarchical codelist variable.
The logics of hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable.
There are 6 (code length from 1 to 6) "aggregate" levels and the terminal level (code length of 10 characters).

I need to update the nodes variable (count of terminal nodes - the aggregate levels do not count in the "higher" aggregates, only the terminal nodes) - so the sum of counts in one level, for example every level 5's total count is the same as every level 6's.
And I need to calculate (sum up) the weight to "higher" level nodes.

NOTE: I offset the output table's NODES and WEIGHT variable so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).

EDIT1: there can be multiple observations with the same code. A unique observations is a combination of 3 variables code + var1 + var2.

Input table:

ID level code var1 var2 nodes weight
1 1 1 . . 999 999
2 2 11 . . 999 999
3 3 111 . . 999 999
4 4 1111 . . 999 999
5 5 11111 . . 999 999
6 6 111111 . . 999 999
7 10 1111119999 01 1 1 0.1
8 10 1111119999 01 2 1 0.1
9 6 111112 . . 999 999
10 10 1111120000 01 1 1 0.5
11 5 11119 . . 999 999
12 6 111190 . . 999 999
13 10 1111901000 01 1 1 0.1
14 10 1111901000 02 1 1 0.2


Desired output table:

ID level code var1 var2 nodes weight
1 1 1 . . 5 1.0
2 2 11 . . 5 1.0
3 3 111 . . 5 1.0
4 4 1111 . . 5 1.0
5 5 11111 . . 3 0.7
6 6 111111 . . 2 0.2
7 10 1111119999 01 1 1 0.1
8 10 1111119999 01 2 1 0.1
9 6 111112 . . 1 0.5
10 10 1111120000 01 1 1 0.5
11 5 11119 . . 2 0.3
12 6 111190 . . 2 0.3
13 10 1111901000 01 1 1 0.1
14 10 1111901000 02 1 1 0.2


And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request.
Any suggestions on speeding up the code, or any other solutions are welcome.

%macro doit;

data temporary;
set have;
run;

%do i=6 %to 2 %by -1;
%if &i = 6 %then %let x = 10;
%else %let x = (&i+1);

proc sql noprint;
select count(code)
into :cc trimmed
from have
where level = &i;

select code
into :id1 - :id&cc
from have
where level = &i;
quit;

%do j=1 %to &cc.;

%let idd = &&id&j;

proc sql;
update have t1
set nodes = (
select sum(nodes)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%")),
set weight = (
select sum(weight)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%"))
where (t1.level = &i and t1.code like "&idd");
quit;
%end;
%end;
%mend doit;

Answer Source

Below is a brute-force hash approach to doing a similar Cartesian product as in the SQL. Load a hash table of the terminal nodes. Then read through the dataset of nodes, and for each node that is not a terminal node, iterate through the hash table, identifying all of the child terminal nodes.

I think the approach @joop is describing may be more efficient, as this approach doesn't take advantage of the tree structure. So there is a lot of re-calculating. With 5000 records and 3000 terminal nodes, this would do 2000*3000 comparisons. But might not be that slow since the hash table is in memory, so you're not going to have excessive I/O ....

data want (drop=_:);

   *hash table of terminal nodes;
   if (_n_ = 1) then do;
      if (0) then set have (rename=(code=_code weight=_weight));
      declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
      declare hiter iter('h');
      h.definekey('ID');
      h.definedata('_code','_weight');
      h.definedone();
   end;

   set have;

   *for each non-terminal node, iterate through;
   *hash table of all terminal nodes, looking for children;
   if level ne 10 then do;
      call missing(weight, nodes);

      do _n_ = iter.first() by 0 while (_n_ = 0);
         if trim(code) =: _code then do;  
           weight=sum(weight,_weight);
           nodes=sum(nodes,1);
         end;
         _n_ = iter.next();
      end;
   end;
   output;
run;
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download