bidisha ..som - 2 months ago 7
R Question

Randomly select a certain percentage of rows and create new columns

I have a species column containing 10 species names. I have to distribute the species into four columns randomly such that each column will take a specific percentage of species.

Let's say the first column takes 20%, the second 30%, the third 40% and the last 10%. The four columns will be four different environments i.e.:

Restricted, Tidalflat, beach, estuary


Hence the column intake will be predefined but the selection will be random.

My input data will look like this:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
'Nassarius','Cardium','Cardium')


Result should look like this:

[image: expected result]

Answer

Some simple setup:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
rspecies <- sample(species)

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')

probs <- c(.2, .3, .4, .1)

nrs <- round(length(species) * probs)
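
Note that `round(length(species) * probs)` happens to sum to 10 here, but for other lengths the rounded counts may not add up to the number of species, and `rep(envirs, nrs)` would then have the wrong length. One possible workaround is largest-remainder rounding, sketched below. This is not part of the original answer, and the helper name `alloc_counts` is hypothetical:

```r
# Hypothetical helper: split n items into groups so the counts sum to n exactly.
alloc_counts <- function(n, probs) {
  raw <- n * probs / sum(probs)      # ideal (fractional) group sizes
  counts <- floor(raw)               # start from the rounded-down sizes
  leftover <- n - sum(counts)        # items still unassigned
  if (leftover > 0) {
    # give one extra item to the groups with the largest fractional parts
    extra <- order(raw - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[extra] <- counts[extra] + 1
  }
  counts
}

probs <- c(.2, .3, .4, .1)
alloc_counts(10, probs)  # same counts as round() for the 10-species example
alloc_counts(11, probs)  # sums to 11, whereas round(11 * probs) sums to 10
```

The result can be used in place of `nrs` in the code that follows.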

Now, a data.frame with one column per environment is not a very good way of expressing your data, because your data is not rectangular: the environments hold different numbers of species, so the columns would not all have the same number of observations.

You can either present the data in long form:

df <- data.frame(species = rspecies, envir = rep(envirs, nrs), stringsAsFactors = FALSE)
df
     species      envir
1    Tellina Restricted
2     Natica Restricted
3       Arca  Tidalflat
4     Mactra  Tidalflat
5    Tellina  Tidalflat
6       Arca      Beach
7  Nassarius      Beach
8    Cardium      Beach
9    Cardium      Beach
10    Natica    Estuary

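Or as a list (the output below comes from a different random draw than the table above, so the assignments differ):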

split(rspecies, df$envir)
$Beach
[1] "Mactra" "Natica" "Arca"   "Arca"  

$Estuary
[1] "Tellina"

$Restricted
[1] "Nassarius" "Cardium"  

$Tidalflat
[1] "Cardium" "Natica"  "Tellina"
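
If a table with one column per environment is still needed, as in the question, one option is to pad the shorter groups with `NA` so that all columns have equal length. A minimal sketch, which repeats the setup so that it runs on its own; the names `groups`, `max_len` and `wide` are introduced here for illustration only:

```r
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
nrs <- c(2, 3, 4, 1)                     # group sizes from the setup above
rspecies <- sample(species)              # random order, as before

# factor() with explicit levels keeps the columns in the order of envirs
groups <- split(rspecies, factor(rep(envirs, nrs), levels = envirs))
max_len <- max(lengths(groups))

# indexing past the end of a vector yields NA, which pads the short groups
wide <- as.data.frame(lapply(groups, `[`, seq_len(max_len)),
                      stringsAsFactors = FALSE)
wide
```

The `NA` entries are only padding; the long form remains easier to filter and summarise.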

Edit:

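One way to accommodate a different number of species is to make the assignment probabilistic: each species is assigned to an environment drawn with the given probabilities. The group sizes are then no longer fixed, and they only match the target percentages approximately. This works better the larger the actual dataset is, because the realized proportions get closer to the targets as the number of rows grows.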

species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium', 'Cardium')
length(species2)

[1] 11

grps <- sample(envirs, size = length(species2), prob = probs, replace = TRUE)
df2 <- data.frame(species = species2, envir = grps, stringsAsFactors = FALSE) 
df2 <- df2[order(df2$envir), ]
df2
     species      envir
5       Arca      Beach
10   Cardium      Beach
1     Natica    Estuary
11   Cardium    Estuary
3     Mactra Restricted
7    Tellina Restricted
2    Tellina  Tidalflat
4     Natica  Tidalflat
6       Arca  Tidalflat
8  Nassarius  Tidalflat
9    Cardium  Tidalflat
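
To see why the probabilistic assignment works better on larger datasets, the realized share of each environment can be compared with `probs`. A small sketch, repeating `envirs` and `probs` so that it runs on its own; `realized_share` is a hypothetical helper, and the seed and sample sizes are arbitrary choices:

```r
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)

# Hypothetical helper: draw n environment labels and return their observed shares.
realized_share <- function(n) {
  grps <- sample(envirs, size = n, prob = probs, replace = TRUE)
  # factor() with fixed levels keeps environments that were never drawn (share 0)
  as.vector(prop.table(table(factor(grps, levels = envirs))))
}

set.seed(1)                 # arbitrary seed, only to make the run repeatable
small <- realized_share(11)
large <- realized_share(10000)

# columns: target share, share with 11 rows, share with 10000 rows
data.frame(envir = envirs, target = probs, n11 = small, n10000 = large)
```

With only 11 rows the shares can be far from the targets, and an environment may even receive no species at all; with thousands of rows they settle close to `probs`.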