Dyllan - 3 months ago
R Question

Calculating Percent Overlap Between Two Columns

I have two variables: one continuous (ranging from -2 to 2) and one dichotomous (A and B). The two variables are highly correlated, with most of the observations coded as "B" being positive and most of the observations coded as "A" being negative. I would like to calculate the proportion of overlap between the two variables in R. More precisely, I would like to find how many observations lie between the most negative observation on the continuous scale that is coded as "B" and the most positive observation on the continuous scale that is coded as "A".

What would be the best way to tackle this in R?

For example, if I have the following data:

Continuous Variable   Dichotomous Variable
.189                  B
-.7                   A
.5                    B
-.3                   A
-.5                   A
-.1                   B
.2                    A
-.05                  A


Because the lowest value coded "B" is -.1 and the highest value coded "A" is .2, I would like to count the observations that fall between those two values. In this case the answer would be 25%, because two of the 8 observations (.189 and -.05) lie in that range.

Would running a loop be the best method?

I apologize in advance if this is not clearly explained and I appreciate any suggestions you might provide.

Answer
df <- data.frame(cont=c(0.189,-0.7,0.5,-0.3,-0.5,-0.1,0.2,-0.05),dich=c('B','A','B','A','A','B','A','A'));
(sum(findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L)-1L)/nrow(df)*100;
## [1] 25

Let's break that down one piece at a time:


min(df$cont[df$dich=='B'])
## [1] -0.1

Get the minimum continuous value for group B.


max(df$cont[df$dich=='A'])
## [1] 0.2

Get the maximum continuous value for group A.


c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A']))
## [1] -0.1  0.2

Combine the two values into a two-element vector.

Note that my solution does not check that this two-element vector is sorted in ascending order. Your question seems to assume that the smallest B value is less than the largest A value, and that assumption is effectively embedded in my solution. If you need to check it, precompute the two values and compare them first. If the assumption is violated, skip the remainder of the solution, because findInterval() raises an error when its vec argument is not sorted in ascending order.
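One way to make that check explicit is sketched below. The names lo, hi, and pct are my own, not part of the solution above, and reporting 0 percent when the groups do not overlap is only an assumption about what you would want in that case.

```r
df <- data.frame(
  cont = c(0.189, -0.7, 0.5, -0.3, -0.5, -0.1, 0.2, -0.05),
  dich = c('B', 'A', 'B', 'A', 'A', 'B', 'A', 'A')
)

# Precompute the two bounds (lo and hi are illustrative names).
lo <- min(df$cont[df$dich == 'B'])  # smallest B value
hi <- max(df$cont[df$dich == 'A'])  # largest A value

if (lo < hi) {
  # The bounds are in ascending order, so findInterval() accepts them.
  pct <- (sum(findInterval(df$cont, c(lo, hi)) == 1L) - 1L) / nrow(df) * 100
} else {
  # Assumption: treat "no overlap" as 0 percent and skip findInterval().
  pct <- 0
}
pct
```

The comparison uses lo < hi rather than lo <= hi because, when the two bounds are equal, no observation lands in interval 1 and subtracting 1 would give a negative count.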


findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))
## [1] 1 0 2 0 0 1 2 1

Find which elements are (0) below the smallest B, (1) at or above the smallest B and below the largest A, and (2) at or above the largest A. We're looking for the 1s.
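To see how the interval codes line up with the two groups, the codes can be cross-tabulated against the dichotomous column. This is only an illustrative sketch; the names iv and tab are my own.

```r
df <- data.frame(
  cont = c(0.189, -0.7, 0.5, -0.3, -0.5, -0.1, 0.2, -0.05),
  dich = c('B', 'A', 'B', 'A', 'A', 'B', 'A', 'A')
)

# Interval code (0, 1, or 2) for every observation.
iv <- findInterval(
  df$cont,
  c(min(df$cont[df$dich == 'B']), max(df$cont[df$dich == 'A']))
)

# Rows are groups, columns are interval codes.
tab <- table(group = df$dich, interval = iv)
tab
```

The column for interval 1 shows how many A and B observations fall inside the overlap region, with the smallest B value itself still counted at this stage.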


findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

Test which elements fall in interval 1.


sum(findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L)
## [1] 3

Count the number of elements that fall in interval 1.

Note that we get 3 rather than 2 because findInterval() includes the lower bound of the interval by default, so the smallest B value matches. We'll subtract off that unwanted match in the next step.

If you need different treatment of the endpoints, you can try messing around with the rightmost.closed, all.inside, and left.open arguments of findInterval() to get what you need.
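As a rough illustration of how those arguments move the endpoints around, here is a small sketch on the example data. The bounds are typed in directly for brevity, and the variable names are my own.

```r
cont <- c(0.189, -0.7, 0.5, -0.3, -0.5, -0.1, 0.2, -0.05)
bounds <- c(-0.1, 0.2)  # smallest B and largest A from the example data

# Default: intervals are closed on the left, so -0.1 falls in interval 1
# and 0.2 falls in interval 2.
default_iv <- findInterval(cont, bounds)

# left.open = TRUE: intervals are open on the left and closed on the right,
# so -0.1 drops to interval 0 and 0.2 moves into interval 1.
left_open_iv <- findInterval(cont, bounds, left.open = TRUE)

# rightmost.closed = TRUE: the last interval is closed on both sides,
# so both -0.1 and 0.2 fall in interval 1.
closed_iv <- findInterval(cont, bounds, rightmost.closed = TRUE)

# One row per setting, one column per observation.
rbind(default_iv, left_open_iv, closed_iv)
```

In this sketch every setting still counts at least one of the two endpoints as part of interval 1, so some adjustment is needed if both endpoints should be excluded.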


sum(findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L)-1L
## [1] 2

Subtract 1 to remove the smallest B value, since we want to exclude it. This assumes that exactly one observation equals the smallest B value.
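A minimal sketch of an alternative that avoids the subtraction altogether: plain logical comparisons with strict inequalities, which exclude both endpoints even when several observations share the smallest B value. This is a different technique from the findInterval() approach, and the names lo, hi, and n_between are my own.

```r
df <- data.frame(
  cont = c(0.189, -0.7, 0.5, -0.3, -0.5, -0.1, 0.2, -0.05),
  dich = c('B', 'A', 'B', 'A', 'A', 'B', 'A', 'A')
)

lo <- min(df$cont[df$dich == 'B'])  # smallest B value
hi <- max(df$cont[df$dich == 'A'])  # largest A value

# Count observations strictly between the two bounds.
n_between <- sum(df$cont > lo & df$cont < hi)

# Convert the count to a percentage of all rows.
pct_between <- n_between / nrow(df) * 100
pct_between
```

When lo is greater than hi, no value can satisfy both conditions, so the count is simply 0 and no separate ordering check is required.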


(sum(findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L)-1L)/nrow(df)
## [1] 0.25

Divide by the total number of rows in the data.frame to get a fraction.


(sum(findInterval(df$cont,c(min(df$cont[df$dich=='B']),max(df$cont[df$dich=='A'])))==1L)-1L)/nrow(df)*100;
## [1] 25

Multiply by 100 to get a percentage.