I have a table with 21638 unique* rows:
vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header = TRUE)
This table has five columns, the first of which holds the respondent ID numbers. I want to check if any respondents appear twice, or if all respondents are unique.
To count unique IDs I can use
length(unique(vocabulary$id))
and to check if there are any duplicates I might do
length(unique(vocabulary$id)) == nrow(vocabulary)
which returns TRUE if there are no duplicates (which there aren't).
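As a small illustration of this check, here is a minimal sketch on a made-up ID vector (toy_ids is a hypothetical name, not part of the survey data):

```r
# Hypothetical toy IDs; the value 3 appears twice.
toy_ids <- c(1, 2, 3, 3, 4)

# Number of distinct IDs.
n_unique <- length(unique(toy_ids))

# TRUE only when every ID is distinct.
all_unique <- n_unique == length(toy_ids)

print(n_unique)
print(all_unique)
```

The comparison tells me whether duplicates exist, but not which IDs or which rows are affected.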
Is there a direct way to return the values or line numbers of duplicates?
Some further explanation:
There is an interpretation problem with using the function duplicated(), because it only returns the duplicates in the strict sense, excluding the "originals". For example,
sum(duplicated(vocabulary$id))
might return "5" as the number of duplicate rows. The problem is that if you only know the number of duplicates, you won't know how many rows they duplicate. Does "5" mean that there are five rows with one duplicate each, or that there is one row with five duplicates? And since you won't have the IDs or line numbers of the duplicates, you wouldn't have any means of finding the "originals".
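To make the ambiguity concrete, here is a minimal sketch with two made-up vectors (five_pairs and one_sextet are hypothetical names, not survey data). It assumes the function in question is base R's duplicated(), which flags every occurrence of a value after the first:

```r
# Hypothetical toy vectors, not taken from the survey data.
# Five different IDs, each appearing twice.
five_pairs <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
# One ID appearing six times (one "original" plus five duplicates).
one_sextet <- c(7, 7, 7, 7, 7, 7)

# duplicated() marks every occurrence after the first as TRUE,
# so summing the result counts duplicates but not "originals".
n_dup_pairs <- sum(duplicated(five_pairs))
n_dup_sextet <- sum(duplicated(one_sextet))

print(n_dup_pairs)
print(n_dup_sextet)
```

Both sums come out the same, so the count alone cannot distinguish the two situations.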
*I know there are no duplicate IDs in this survey, but it is a good example, because using any of the answers given elsewhere to this question will output a haystack to your screen in which you'll be quite unable to find any possible rare duplicate needles.