Micromann - 13 days ago
R Question

How to use lapply to remove columns with too many missing values in a list in R?

I have a list of data frames called ls.df.val.dcas. Each data frame has various columns, some of which contain missing values (NA). I would like to apply lapply() to the list to remove every column in which more than X% (e.g. 40%) of the values are NA. To show what the data frames in the list look like, here is an example:

$ SK_VALUES_IMV_EU28_INTRA :'data.frame': 74 obs. of 65 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] 1078759 1850083 1872924 1038070 626471 ...
..$ 2208 : num [1:74] 3329179 7061890 1351550 1371469 1557605 ...
..$ 220710 : num [1:74] 1030704 1804495 1831958 972263 574855 ...
..$ 220720 : num [1:74] 48055 45588 40966 65807 51616 ...
..$ 220820 : num [1:74] 380843 1014933 71804 126348 138138 ...
..$ 220830 : num [1:74] 380007 459653 155033 205879 297446 ...
..$ 220840 : num [1:74] 41561 88449 31549 60768 117534 ...
..$ 220850 : num [1:74] 94483 340439 44949 32949 37550 ...
..$ 220860 : num [1:74] 371217 728521 143974 179311 254546 ...
..$ 220870 : num [1:74] 731231 1374532 228087 227772 230129 ...
..$ 22082014: num [1:74] NA 2531 1776 NA NA ...
$ RO_VALUES_IMV_EU28_EXTRA :'data.frame': 74 obs. of 44 variables:
..$ PERIOD : Date[1:74], format: "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" ...
..$ 2207 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 2208 : num [1:74] 312035 840540 315008 884357 100836 ...
..$ 220710 : num [1:74] NA NA NA NA NA 5 NA NA NA NA ...
..$ 220720 : num [1:74] NA NA NA NA NA NA NA NA NA NA ...
..$ 220820 : num [1:74] 3570 698 483 1087 1802 ...


My incomplete solution counts the NA values in each column of each data frame, computes the percentage of NA, and then removes the columns where that percentage exceeds X%.

# Count the non-missing values in each column (note the "!": these are not NA counts)
ls.Nan <- lapply(ls.df.val.dcas, function(x) colSums(!is.na(x)))
# Get the dimensions (rows, columns) of each data frame
ls.size <- lapply(ls.df.val.dcas, function(x) dim(x))

# The first element of each size vector is the number of rows,
# so this gives the share of non-missing values per column.
ls.percen <- mapply(function(x,y) x/y[1] , x=ls.Nan, y=ls.size)
# Keep the columns whose share of non-missing values is at least NPI

mis.list <- sapply(ls.df.val.dcas, "]]" sapply(ls.percen, function(x) x >= NPI))


I get the following error from running the last line.

Error: unexpected symbol in "mis.list <- sapply(ls.df.val.dcas, "]]" sapply"


Ultimately I would also like to merge all of these steps into a single function and call lapply once. Right now, though, I am struggling to understand how indexing works when lapply is applied to a list of data frames. It would be great if someone could demonstrate with an example how to use lapply at different levels of a list: for instance, how the function should be written to change an element of the list, a data frame within the list, or a column within a data frame in the list.


EDIT
Following the comment below about the missing comma after "]]", I corrected the code but still get an error:


> mis.list <- sapply(ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))
Error in get(as.character(FUN), mode = "function", envir = envir) :
object ']]' of mode 'function' was not found


By the way, NPI is just a percentage threshold for the NAs in a column. For instance, I have set it to NPI = 0.35.

Since I suspect the error is related to the structure of my data, here is more information on the structure of ls.percen.

> str(ls.percen)
List of 69
$ AT_VALUES_IMV_EU28_EXTRA : Named num [1:59] 1 0.635 1 0.378 0.338 ...
..- attr(*, "names")= chr [1:59] "PERIOD" "2207" "2208" "220710" ...
$ AT_VALUES_IMV_EU28_INTRA : Named num [1:67] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:67] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_EXTRA : Named num [1:57] 1 1 1 1 0.365 ...
..- attr(*, "names")= chr [1:57] "PERIOD" "2207" "2208" "220710" ...
$ BE_VALUES_IMV_EU28_INTRA : Named num [1:69] 1 0.986 0.986 0.986 0.986 ...
..- attr(*, "names")= chr [1:69] "PERIOD" "2207" "2208" "220710" ...

42-
Answer

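The first error is a simple typo rather than an indexing problem: the message says a comma is missing after "]]". With the comma added, the line parses:

mis.list <- sapply( ls.df.val.dcas, "]]", sapply(ls.percen, function(x) x >= NPI))

It still fails, though, because "]]" is not the name of a function. The extraction functions are "[" and "[[", which is why R then reports that object ']]' of mode 'function' was not found.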
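One way the three-step approach from the question could be completed is sketched below. It relies on "[" being an ordinary function and uses Map to pair each data frame with its own logical mask. The list dfs is a made-up stand-in for ls.df.val.dcas, and NPI is read here as the minimum share of non-missing values a column needs in order to be kept, which is an assumption about the asker's intent.

```r
# Toy stand-in for ls.df.val.dcas (hypothetical data, not the asker's).
dfs <- list(
  a = data.frame(x = c(1, NA, 3, 4), y = c(NA, NA, NA, 4)),
  b = data.frame(x = c(NA, NA, NA, 4), y = c(1, 2, 3, 4))
)
NPI <- 0.35  # assumed meaning: minimum share of non-missing values

# Share of non-missing values per column, one named vector per data frame.
pct <- lapply(dfs, function(d) colMeans(!is.na(d)))

# One logical mask per data frame.
keep <- lapply(pct, function(p) p >= NPI)

# Walk over data frames and masks in parallel; "[" does the column selection.
res <- Map(function(d, k) d[, k, drop = FALSE], dfs, keep)

# "[[" is also a function name, so it can be passed to lapply directly:
xs <- lapply(dfs, "[[", "x")  # the x column of every data frame
```

Map is used instead of sapply because two lists have to be traversed together, and drop = FALSE keeps the result a data frame even when only one column survives.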

We don't see a definition of 'NPI'. It might be simpler to merge the first two 'lapply' calls and return the desired list of shortened data frames directly:

mis.lst <- lapply( ls.df.val.dcas,
                  # keep the columns whose share of NA is at most 40%
                  function(x) x[ , colMeans(is.na(x)) <= .40, drop = FALSE ] )
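Since the question also asks for a single function that is passed to lapply once, here is a small sketch of that idea. The helper name drop_sparse_cols, its max_na argument, and the toy list are all hypothetical. It follows the reading of the question in which a column is dropped when its share of NA exceeds the threshold.

```r
# Hypothetical helper: drop the columns whose share of NA exceeds max_na.
drop_sparse_cols <- function(df, max_na = 0.40) {
  na_share <- colMeans(is.na(df))          # share of NA per column
  df[, na_share <= max_na, drop = FALSE]   # drop = FALSE keeps a data frame
}

# Made-up data shaped loosely like the str() output in the question.
periods <- seq(as.Date("2010-01-01"), by = "month", length.out = 5)
toy <- list(
  first = data.frame(PERIOD = periods,
                     `2207` = c(1, 2, NA, 4, 5),       # 20% NA, kept
                     `220720` = c(NA, NA, NA, 4, 5),   # 60% NA, dropped
                     check.names = FALSE),
  second = data.frame(PERIOD = periods,
                      `2208` = c(NA, NA, NA, 4, 5),    # 60% NA, dropped
                      check.names = FALSE)
)

# Extra arguments after the function are passed on to every call.
cleaned <- lapply(toy, drop_sparse_cols, max_na = 0.40)
```

Keeping the threshold as an argument means the same helper can be reused with, say, max_na = 0.65 without editing the function body.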

You can use a logical vector in the "j" (column) position of the two-argument form of "[" to select the columns to keep.
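As for the different levels of granularity asked about in the question, the sketch below walks through them on made-up data: the list itself, a data frame inside the list, and a column inside a data frame. All names are hypothetical, and replacing NA with 0 is only an illustration of a column-level change, not a recommendation for the asker's data.

```r
dfs <- list(
  a = data.frame(x = c(1, NA, 3), y = c(4, 5, 6)),
  b = data.frame(x = c(7, 8, 9), y = c(NA, NA, 12))
)

# 1. List level: "[[" returns the element, "[" returns a shorter list.
one_df   <- dfs[["a"]]   # a data frame
sub_list <- dfs["a"]     # a list of length one

# 2. Data-frame level: the function receives one data frame and returns one.
first_rows <- lapply(dfs, function(d) d[1:2, , drop = FALSE])

# 3. Column level: an inner lapply runs over the columns of each data frame;
#    assigning into d[] keeps the data frame structure and column names.
filled <- lapply(dfs, function(d) {
  d[] <- lapply(d, function(col) ifelse(is.na(col), 0, col))
  d
})

# A logical vector in the "j" position selects columns.
only_x <- dfs$a[, c(TRUE, FALSE), drop = FALSE]
```

In each case the function passed to lapply sees exactly one element of whatever is being iterated over, so the level you work at is decided by what you hand to lapply: the list gives data frames, a data frame gives columns.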