Foxer Foxer - 3 years ago 76
SQL Question

SAS: most efficient method to output first non-missing across multiple columns

The data I have are millions of rows and rather sparse with anywhere between 3 and 10 variables needing processed. My end result needs to be one single row containing the first non-missing value for each column. Take the following test data:

** test data **;
data test;
length ID $5 AID 8 TYPE $5;
input ID $ AID TYPE $;
datalines;
A . .
. 123 .
C . XYZ
;
run;


The end result should look like such:

ID AID TYPE
A 123 XYZ


Using macro lists and loops I can brute force this result with multiple merge statements where the variable is non-missing and
obs=1
but this is not efficient when the data are very large (below I'd loop over these variables rather than write multiple
merge
statements):

** works but takes too long on big data **;
data one_row;
merge
test(keep=ID where=(ID ne "") obs=1) /* character */
test(keep=AID where=(AID ne .) obs=1) /* numeric */
test(keep=TYPE where=(TYPE ne "") obs=1); /* character */
run;


The
coalesce
function seems very promising, but I believe I need it in combination with
array
and
output
to build this single-row result. The function also differs (
coalesce
and
coalescec
depending on variable type) whereas it does not matter using
proc sql
. I get an error using
array
since all variables in the array list are not the same type.

Joe Joe
Answer Source

Exactly what is most efficient will largely depend on the characteristics of your data. In particular, whether the first nonmissing value for the last variable is usually relatively "early" in the dataset, or if you usually will have to trawl through the entire dataset to get to it.

I assume your dataset is not indexed (as that would simplify things greatly).

One option is the standard data step. This isn't necessarily fast, but it's probably not too much slower than most other options given you're going to have to read most/all of the rows no matter what you do. This has a nice advantage that it can stop when every row is complete.

data want;
  if 0 then set test; *defines characteristics;
  set test(rename=(id=_id aid=_aid type=_type)) end=eof;  
  id=coalescec(id,_id);
  aid=coalesce(aid,_aid);
  type=coalescec(type,_type);
  if cmiss(of id aid type)=0 then do;
    output;
    stop;
  end;
  else if eof then output;
  drop _:;
run;

You could populate all of that from macro variables from dictionary.columns, or even might use temporary arrays, though I think that gets too messy.

Another option is the self update, except it needs two changes. One, you need something to join on (as opposed to merge which can have no by variable). Two, it will give you the last nonmissing value, not the first, so you'd have to reverse-sort the dataset.

But assuming you added x to the first dataset, with any value (doesn't matter, but constant for every row), it is this simple:

data want;
  update test(obs=0) test;
  by x;
run;

So that has the huge advantage of simplicity of code, exchanged for some cost of time (reverse sorting and adding a new variable).

If your dataset is very sparse, a transpose might be a good compromise. Doesn't require knowing the variable names as you can process them with arrays.

data test_t;
  set test;
  array numvars _numeric_;
  array charvars _character_;
  do _i = 1 to dim(numvars);
    if not missing(numvars[_i]) then do;
      varname = vname(numvars[_i]);
      numvalue= numvars[_i];
      output;
    end;
  end;
  do _i = 1 to dim(charvars);
    if not missing(charvars[_i]) then do;
      varname = vname(charvars[_i]);
      charvalue= charvars[_i];
      output;
    end;
  end;
  keep numvalue charvalue varname;
run;


proc sort data=test_t;
  by varname;
run;

data want;
  set test_t;
  by varname;
  if first.varname;
run;

Then you proc transpose this to get the desired want (or maybe this works for you as is). It does lose the formats/etc. on the value, so take that into account, and your character value length probably needs to be set to something appropriately long - and then set back (you can use an if 0 then set to fix it).

A similar hash approach would work roughly the same way; it has the advantage that it would stop much sooner, and doesn't require resorting.

data test_h;
  set test end=eof;

  array numvars _numeric_;
  array charvars _character_;
  length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;
  if _n_=1 then do;
    declare hash h(ordered:'a');
    h.defineKey('varname');
    h.defineData('varname','numvalue','charvalue');
    h.defineDone();
  end;

  do _i = 1 to dim(numvars);
    if not missing(numvars[_i]) then do;
      varname = vname(numvars[_i]);
      rc = h.find();
      if rc ne 0 then do;
        numvalue= numvars[_i];
        rc=h.add();
      end;    
    end;
  end;
  do _i = 1 to dim(charvars);
    if not missing(charvars[_i]) then do;
      varname = vname(charvars[_i]);
      rc = h.find();
      if rc ne 0 then do;
        charvalue= charvars[_i];
        rc=h.add();
      end;    
    end;
  end;

  if eof or h.num_items = dim(numvars) + dim(charvars) then do;
     rc = h.output(dataset:'want');
  end;
run;

There are lots of other solutions, just depending on your data which would be most efficient.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download