Buckeye14Guy - 1 year ago 84
R Question

# Using a string distance technique to create a factor variable in R

I am a new R enthusiast working on expanding my knowledge. I am reading An Introduction to Data Cleaning with R by Edwin de Jonge and Mark van der Loo. I am working on exercise 2.4 and would appreciate it if someone could confirm my approach to this specific problem.
This is the original data:

``````
// Survey data. Created : 21 May 2013
// Field 1: Gender
// Field 2: Age (in years)
// Field 3: Weight (in kg)
M;28;81.3
male;45;
Female;17;57,2
fem.;64;62.8
``````

This is a cleaner version that I was able to construct:

``````
df:
  Gender Age..in.years. Weight..in.kg.
1      M             28           81.3
2   male             45           <NA>
3 Female             17           57,2
4   fem.             64           62.8
``````

Now this is what I get from recoding using `adist`:

``````
D:
  rawtext  coded
1       M   male
2    male   male
3  Female female
4    fem. female
``````
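The recoding step itself is not shown above. A minimal sketch of how such a table could be produced, assuming each raw value is assigned the code with the smallest edit distance (the names `rawtext`, `codes`, `dist_mat` and `nearest` are illustrative, not taken from the original post):

```r
# Raw gender values from the survey and the target codes (assumed).
rawtext <- c("M", "male", "Female", "fem.")
codes <- c("male", "female")

# Edit-distance matrix: one row per raw value, one column per code.
dist_mat <- adist(rawtext, codes)

# For each raw value, take the index of the closest code.
nearest <- apply(dist_mat, 1, which.min)

D <- data.frame(rawtext = rawtext, coded = codes[nearest])
print(D)
```

With these four values the nearest code is unambiguous, but ties or very short abbreviations could be matched arbitrarily, so the result is worth checking by eye.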

Now I have to transform the Gender column into a factor variable with the labels man and woman.
I am not sure how to proceed; I am thinking of replacing the Gender column of the data with the following factor:

``````
f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))
``````

which returns:

``````
[1] man   man   woman woman
Levels: man woman
``````

Am I correct or plain wrong? Is there a way to use `transform` to change the Gender variable in df directly? Or is it better to do:

``````
df$Gender <- plyr::revalue(D$coded, c(male = "man", female = "woman"))
``````

Or is there another way to change the observations of the Gender variable to "man" or "woman" without using multiple `ifelse` calls?

I am trying to get answers by learning more about factors but nothing quite similar to this pops up anywhere.
Thanks.

Your line

``````
f <- factor(D$coded, levels = c("male", "female"), labels = c("man", "woman"))
``````

did work, and not by luck: because you pass both `levels` and `labels`, R pairs them by position, so "male" always becomes "man" and "female" always becomes "woman", whatever order the levels of `D$coded` are in. The fragile variant is the one without `levels`, i.e. `factor(D$coded, labels = c("man", "woman"))`. There the labels are applied to the existing levels in their current order (alphabetical by default, so "female" comes first), and the man and woman labels would be transposed in your new factor.

When revaluing levels of a factor, it's simpler and more explicit to use the `revalue` function from the plyr package:

``````
f <- plyr::revalue(D$coded, c(male = "man", female = "woman"))
``````
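To see how level order interacts with `labels`, here is a small base-R sketch. The vector `coded` is a hypothetical stand-in for `D$coded`, and `df` holds only the Gender column from the question; the last two steps show a lookup-vector alternative to repeated `ifelse` calls and the use of `transform` to write the result back.

```r
# Hypothetical stand-in for D$coded from the question.
coded <- factor(c("male", "male", "female", "female"))
levels(coded)  # alphabetical by default: "female" "male"

# Explicit pairs: "male" -> "man", "female" -> "woman".
f_explicit <- factor(coded, levels = c("male", "female"),
                     labels = c("man", "woman"))

# Labels only: applied to the existing level order,
# so "female" is relabelled "man" here.
f_implicit <- factor(coded, labels = c("man", "woman"))

as.character(f_explicit)  # "man" "man" "woman" "woman"
as.character(f_implicit)  # "woman" "woman" "man" "man"

# Named lookup vector: maps by name, no ifelse chain needed.
lookup <- c(male = "man", female = "woman")
f_lookup <- factor(unname(lookup[as.character(coded)]),
                   levels = c("man", "woman"))

# Replace the Gender column in place with transform().
df <- data.frame(Gender = c("M", "male", "Female", "fem."))
df <- transform(df, Gender = f_explicit)
str(df)
```

Both the explicit `levels`/`labels` call and the lookup vector state the mapping by name, which is the same property that makes `revalue` convenient.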