myloginid myloginid - 2 months ago 20
R Question

Sample Rows with SQLDF

sqldf has a limit option to get 'X' rows. Can we also do a 'x%' sample using sqldf?

e.g.

> sqldf("select * from iris limit 3")

Loading required package: tcltk
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

> sqldf("select * from iris sample 0.01")
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near ".1": syntax error


Is there some workaround for this?

Manish

CL. CL.
Answer

This is basically a SQL question. As sqldf uses SQLite by default (this is not absolutely correct, see the documentation for drv in ?sqldf for more details), the question boils down to "How to select rows in random order in SQLite?". A simple answer can be found in this SO post:

SELECT * FROM table ORDER BY RANDOM() LIMIT 1;

Using the example from the question:

> sqldf("select * from iris order by RANDOM() limit 3")
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          6.3         3.3          4.7         1.6 versicolor
2          6.0         2.7          5.1         1.6 versicolor
3          5.0         2.0          3.5         1.0 versicolor

EDIT: If the percentage instead of the number of rows is given, the following could be used as a R solution. Pure SQLite solutions may exist.

percentage <- 0.02
mylimit <- round(nrow(iris) * percentage, 0)
sqldf(sprintf("select * from iris order by RANDOM() limit %d", mylimit))
Comments