8bytez - 7 months ago 23

R Question

I have a data set similar to this one:

```
x <- sample(c("A", "B", "C", "D"), 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
y <- sample(1:40, 1000, replace=TRUE)
d <- data.frame(x,y)
str(d)
# 'data.frame': 1000 obs. of 2 variables:
# $ x: Factor w/ 4 levels "A","B","C","D": 1 3 3 2 3 3 3 3 4 3 ...
# $ y: int 28 35 14 4 34 36 30 35 26 9 ...
table(d$x)
# A B C D
# 115 204 637 44
```

In my real data set I have several thousand of these categories (like A, B, C, D).

The `str()` output of the real data:

```
str(realdata)
# 'data.frame': 346340 obs. of 91 variables:
# $ author : Factor w/ 42590 levels "-jon-","--LZR--",..: 1962 3434 1241 7666 6235 2391 1196 2779 1881 339 ...
# $ created_utc : Factor w/ 343708 levels "2015-05-01 02:00:41",..: 14815 23163 2281 3569 5922 7211 15783 5512 13485 8591 ...
# $ group : Factor w/ 5 levels "xyz","abc","bnm",..: 2 2 2 2 2 2 2 2 2 2 ...
# ....
```

Now I want to subset the data so that I keep only the rows of those authors (or of those `$x` levels in the example `d`) that have more than 100 entries.

I tried the following:

`dnew <- subset(realdata, table(realdata$author) > 100)`

It gives me a result, but it seems that not all entries of the authors were included. I only get 1.3% of the rows of the complete data set. I checked manually (with Excel), and it should be far more than that (approx. 30%): the manual analysis showed that 1.2% of the authors account for 30% of the entries. So it seems it only gave me one row for each author who has more than 100 entries, but not all of their entries.

Do you know of a way to fix this?

Answer

I. Data frame `d` with four levels

```
table(d$x)
# A B C D
# 92 232 630 46
```

II. Checking which levels have more than 100 records

```
which(table(d$x)>100)
# B C
# 2 3
```

III. Subsetting `d` to keep only the records belonging to levels which have more than 100 records, i.e. levels `B` and `C`

```
result <- d[ d$x %in% names(table(d$x))[table(d$x) > 100] , ]
dim(result)
# [1] 862 2
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 2 3 3 3 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
```

Levels `A` and `D` still persist, with 0 records

```
table(result$x)
# A B C D
# 0 232 630 0
```
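An equivalent way to write the step III filter is to attach the group size to every row and compare that. This is only a sketch of an alternative, not part of the original answer; the names `n_per_row` and `result2` are made up for illustration. For the data in the question, the column would be `realdata$author` instead of `d$x`.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(
  x = factor(sample(c("A", "B", "C", "D"), 1000, replace = TRUE,
                    prob = c(0.1, 0.2, 0.65, 0.05))),
  y = sample(1:40, 1000, replace = TRUE)
)

# Look up, for every row, how many rows share its level
n_per_row <- as.vector(table(d$x)[as.character(d$x)])
result2 <- d[n_per_row > 100, ]

# The %in% approach from step III selects the same rows
result <- d[d$x %in% names(table(d$x))[table(d$x) > 100], ]
identical(result, result2)
```

Both versions build a logical vector with one entry per row, which is what the row index of `d[ , ]` expects.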

IV. Removing the levels with 0 records using `factor()`

```
result$x <- factor(result$x)
str(result)
# 'data.frame': 862 obs. of 2 variables:
# $ x: Factor w/ 2 levels "B","C": 2 1 2 2 1 1 1 2 2 2 ...
# $ y: int 29 32 27 40 30 38 8 16 2 23 ...
table(result$x)
# B C
# 232 630
```
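Base R's `droplevels()` is another way to do step IV. Called on a data frame, it drops the unused levels of every factor column at once, which may be convenient for a wide data set like the one in the question. A small sketch, again with an arbitrary seed and the example data rebuilt so that it runs on its own:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(
  x = factor(sample(c("A", "B", "C", "D"), 1000, replace = TRUE,
                    prob = c(0.1, 0.2, 0.65, 0.05))),
  y = sample(1:40, 1000, replace = TRUE)
)

# Keep only rows whose level occurs more than 100 times
result <- d[d$x %in% names(table(d$x))[table(d$x) > 100], ]

# Drop unused levels in all factor columns of the data frame
result <- droplevels(result)
table(result$x)
```

After this call, `table(result$x)` lists only the levels that still have rows.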