Buckeye14Guy - 2 months ago
R Question

Using a string distance technique to create a factor variable in R

I am a new R enthusiast working on expanding my knowledge. I am reading "An Introduction to Data Cleaning with R" by Edwin de Jonge and Mark van der Loo. I am working on exercise 2.4 and would appreciate it if someone could confirm my approach to this specific problem.
This is the original data:

// Survey data. Created : 21 May 2013
// Field 1: Gender
// Field 2: Age (in years)
// Field 3: Weight (in kg)
M;28;81.3
male;45;
Female;17;57,2
fem.;64;62.8


This is a cleaner version that I was able to construct:

df:
  Gender Age..in.years. Weight..in.kg.
1      M             28           81.3
2   male             45           <NA>
3 Female             17           57,2
4   fem.             64           62.8


This is what I get from recoding the Gender column using adist:

D:
  rawtext  coded
1       M   male
2    male   male
3  Female female
4    fem. female
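
For reference, a minimal sketch of how a recoding like this can be produced with base R's adist. The variable names (rawtext, codes, d) and the use of ignore.case = TRUE are assumptions for illustration, not necessarily the exact code behind the table above.

```r
# Raw gender strings from the survey data and the target codes
rawtext <- c("M", "male", "Female", "fem.")
codes <- c("male", "female")

# Edit distance between every raw value (rows) and every code (columns),
# ignoring case so that "M" and "Female" compare against lowercase codes
d <- adist(rawtext, codes, ignore.case = TRUE)

# For each raw value, pick the code with the smallest distance
coded <- codes[apply(d, 1, which.min)]

D <- data.frame(rawtext = rawtext, coded = coded)
print(D)
```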


Now I have to transform the Gender column into a factor variable with labels man and woman.
I am not sure how to proceed. My idea is to replace the Gender column of the data with the following factor:

f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))


which returns:

[1] man man woman woman
Levels: man woman


Am I correct, or plain wrong? Is there a way to use transform to change the Gender variable in df directly? In other words, is it better to do:

df$Gender <- plyr::revalue(D$coded, c(male = "man", female = "woman"))


Or is there another way to change the observations of the Gender variable to "man" or "woman" without using multiple ifelse commands?

I am trying to get answers by learning more about factors but nothing quite similar to this pops up anywhere.
Thanks.

Answer

The line

f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))

did work, and it does not depend on luck: factor() pairs levels and labels by position, so "male" always becomes "man" and "female" always becomes "woman", whatever order the values or levels of D$coded are in. The man and woman labels would be transposed only if you left out the levels argument, because labels would then be applied to the default, alphabetically sorted levels c("female", "male").
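
One way to check how factor() pairs levels with labels is to try it on a copy of the coded column stored as a factor, whose default levels are alphabetical. This is a small sketch; coded, f_explicit and f_default are illustrative names, not objects from the question.

```r
# Coded column stored as a factor; default levels are alphabetical: female, male
coded <- factor(c("male", "male", "female", "female"))
print(levels(coded))

# Explicit levels: labels are paired with levels by position, so male -> man
f_explicit <- factor(coded, levels = c("male", "female"), labels = c("man", "woman"))
print(as.character(f_explicit))

# No levels given: labels are applied to the default levels (female, male),
# so the two labels end up swapped
f_default <- factor(coded, labels = c("man", "woman"))
print(as.character(f_default))
```

Comparing the two printed vectors shows that it is the levels argument, not the stored order in the input, that decides which label each value receives.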

When revaluing levels of a factor, it's safer and simpler to use the revalue function from the plyr package:

f <- plyr::revalue(D$coded, c(male = "man", female = "woman"))
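
As for changing Gender inside df directly with transform, here is a base R sketch that avoids both plyr and nested ifelse calls by using a named lookup vector. The data frame below is a simplified stand-in (the column name Age and the object lookup are assumptions), and coded stands for the result of the adist recoding.

```r
# Simplified stand-in for the cleaned survey data
df <- data.frame(Gender = c("M", "male", "Female", "fem."),
                 Age = c(28, 45, 17, 64))

# Result of the adist recoding, one code per row of df
coded <- c("male", "male", "female", "female")

# Named lookup vector: old code -> new label
lookup <- c(male = "man", female = "woman")

# Replace Gender in place with a factor whose levels are man, woman
df <- transform(df, Gender = factor(unname(lookup[coded]),
                                    levels = c("man", "woman")))
print(df)
```

Indexing lookup by the coded values maps each row by name, so the result does not depend on the order in which the codes appear.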