HNSKD HNSKD - 2 months ago 5
R Question

How do you find a maximum character of a vector based on user-defined hierarchy in R?

How do you find a maximum character of a vector based on user-defined hierarchy in R?

I have a variable, say

Code
which I would like to impose a hierarchy on so that MSP<1A<1B<1C<2A<2B<2C<...<7C. I want to apply a maximum function on a vector containing these elements. While there's no problem for those characters that are in the form [digit][letter], there is a problem for "MSP". In R, max("2A","MSP") gives "MSP", but I want it to be "2A" instead.

In other words, R will rank it this way 1A<...MSP. This is due to the fact that in R, the hierarchy is "integer < double < character".

Dataframe:

No<-rep(c(1,2,3),c(4,4,3))
Date<-rep(c("05/09/2013","20/05/2013","23/05/2013","28/05/2013","03/06/2013"),c(2,2,2,2,3))
Time<-rep(c("14:15:00","09:40:00","14:30:00","13:10:00","08:45:00"),c(2,1,3,2,3))
Code<-c("MSP","3A","5B",NA,NA,NA,"7C","3B","MSP","MSP",NA)
df<-data.frame(No,Date,Time,Code,stringsAsFactors=FALSE)
df$Date<-dmy(df$Date)
df$Time<-hms(df$Time)
df

# No Date Time Code
# 1 1 2013-09-05 14H 15M 0S MSP
# 2 1 2013-09-05 14H 15M 0S 3A
# 3 1 2013-05-20 9H 40M 0S 5B
# 4 1 2013-05-20 14H 30M 0S <NA>
# 5 2 2013-05-23 14H 30M 0S <NA>
# 6 2 2013-05-23 14H 30M 0S <NA>
# 7 2 2013-05-28 13H 10M 0S 7C
# 8 2 2013-05-28 13H 10M 0S 3B
# 9 3 2013-06-03 8H 45M 0S MSP
# 10 3 2013-06-03 8H 45M 0S MSP
# 11 3 2013-06-03 8H 45M 0S <NA>


My Code:
I would like to take the maximum on
Code
(based on the user-defined rank) within each No and Date. So, the rows that share the same No and the same Date will have the same Code. Hence, I grouped it first.

I created a new variable called
CodeAnother
using the
ifelse
function.


  1. The condition is such that if there is at least one "MSP" and not all elements are "MSP"

  2. Yes: Take max function by excluding "MSP"

  3. No: Take max function the usual way

    library(dplyr)
    dfnew<-df %>%
    group_by(No,Date) %>%
    mutate(IndicatorMSP=(Code=="MSP" & !is.na(Code))) %>%
    mutate(CodeNo=sum(!is.na(Code))) %>%
    mutate(CodeAnother=ifelse(sum(IndicatorMSP)>=1 & sum(IndicatorMSP)<CodeNo,
    max(Code[!(Code=="MSP") & !is.na(Code)]),
    max(Code[!is.na(Code)])))



I would like to know if there is a nicer way of achieving this using nicer code.

Answer

factors are, for once, your friend:

An example where I reorder letters:

factoringVariable <- sample(letters)

> factoringVariable
[1] "z" "k" "p" "s" "f" "v" "j" "b" "o" "l" "u" "m" "w" "c" "n" "t" "r" "x" "a" "i" "y" "q" "h" "d" "e" "g"

> sort(factor(letters,levels = factoringVariable))
[1] z k p s f v j b o l u m w c n t r x a i y q h d e g

so in your case:

factoringVariable <- c('MSP', sort(unlist(outer(1:7,LETTERS[1:3],paste0))))

> factoringVariable
 [1] "MSP" "1A"  "1B"  "1C"  "2A"  "2B"  "2C"  "3A"  "3B"  "3C"  "4A"  "4B"  "4C"  "5A"  "5B"  "5C"  "6A"  "6B"  "6C"  "7A" 
[21] "7B"  "7C" 

Now that I've set my order:

df$Code <- factor(df$Code, levels = factoringVariable)

and then you can just use top_n function in dplyr (with -1 to get the bottom 1)

library(dplyr)
dfnew<-df %>%
  group_by(No,Date) %>%
  top_n(-1,Code)

output:

> dfnew
Source: local data frame [5 x 4]
Groups: No, Date [4]

     No       Date     Time   Code
  <dbl>      <chr>    <chr> <fctr>
1     1 05/09/2013 14:15:00    MSP
2     1 20/05/2013 09:40:00     5B
3     2 28/05/2013 13:10:00     3B
4     3 03/06/2013 08:45:00    MSP
5     3 03/06/2013 08:45:00    MSP

EDIT: I realize now you want to assign all the largest value, in which case, we can't use top_n

instead:

library(dplyr)
dfnew<-df %>%
  group_by(No,Date) %>%
  mutate(CodeAll = sort(Code, partial = 1)[1])

EDIT 2: you can actually speed it up more (if you need the speed) by using partial sorting, since you're only going to take the first one anyways

Comments