Travis Heeter - 1 month ago
R Question

r - How to classify with non-numeric data?

I have a data frame like this:

---------------------------------------------------------------------
|   | Keywords          | Paragraph         | Date       | Decision |
|===+===================+===================+============+==========|
| 1 | a; b              | A lot. of words.  | 12/15/2015 | TRUE     |
|---+-------------------+-------------------+------------+----------|
| 2 | c; d              | more. words. many | 01/23/2015 | FALSE    |
|---+-------------------+-------------------+------------+----------|
| 3 | a; d; c; foo; bar | words, words, etc | 12/13/2015 | FALSE    |
---------------------------------------------------------------------


The real data frame has about 1,500 records.

I'm trying to find the most common characteristics of a Decision. For instance:

Group 1: Keywords: "a", Paragraph words: ["trouble", "abhorrent"], Date: "12/12/2015",
Outcome: FALSE, odds of FALSE Decision: 60%
Group 2: Keywords: "b", Paragraph words: ["good", "maximum"], Date: "02/02/2015",
Outcome: TRUE, odds of TRUE Decision: 30%


Also, it would be nice if I could plot the odds on a graph like this:

     |   -----
 60% |   |///|
     |   |///|      -----
 30% |   |///|      |\\\|
     |   |///|      |\\\|
  0% +---|---|------|---|---
        Group 1    Group 2


I think I'm looking for regression modeling, but all the examples seem to deal with purely numeric data. How can I accomplish this using non-numeric data?

Edit: Here's a link to the dput file on Google Drive: https://drive.google.com/open?id=0BwrbzZiF0KGtVVZ4Tk1kdDdBZXM

Answer

Using the data you uploaded, here's a simple example that fits a logistic regression of Decision on Keywords:

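# Logistic regression: Keywords is categorical, so glm() dummy-codes its levels
mod <- glm(Decision ~ Keywords, data = df1, family = "binomial")

# Fitted probability of Decision == TRUE for each row
predictions <- predict(mod, newdata = df1, type = "response")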

predictions 
  1   2   3   4   5   6 
0.6 0.6 0.6 0.6 0.6 1.0
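
One caveat with the model above: it treats each distinct Keywords string (for example "a; b") as a single category. If you'd rather have each individual keyword act as its own predictor, a rough sketch is to split the strings into one TRUE/FALSE column per keyword. The toy data frame and the kw_ column names below are made up for illustration and are not taken from your file:

```r
# Toy data modelled on the table in the question; values are illustrative only.
toy <- data.frame(
  Keywords = c("a; b", "c; d", "a; d; c; foo; bar"),
  Decision = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Split each "a; b" style string into a vector of trimmed keywords.
kw_list <- lapply(strsplit(toy$Keywords, ";", fixed = TRUE), trimws)

# Every distinct keyword, sorted so the column order is stable.
all_kw <- sort(unique(unlist(kw_list)))

# One row per record, one logical column per keyword:
# TRUE when the record carries that keyword.
indicators <- do.call(rbind, lapply(kw_list, function(k) all_kw %in% k))
colnames(indicators) <- paste0("kw_", all_kw)  # hypothetical naming scheme

# Attach the indicator columns to the original data.
toy_wide <- cbind(toy, indicators)
print(toy_wide)
```

With the full data you could then fit something like glm(Decision ~ kw_a + kw_b, data = toy_wide, family = "binomial"). On three toy rows that fit would not be meaningful, so it is left out here.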

Here's the plot you wanted, where the groups are defined by Keywords:

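# Mean predicted probability within each Keywords group
res <- aggregate(predictions, by = list(df1$Keywords), mean)

# One bar per group; labels are generated so they match the number of groups
barplot(res$x, names.arg = paste("Group", seq_len(nrow(res))))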
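
With a single categorical predictor, the fitted probabilities are essentially the observed share of TRUE decisions in each group, so you can also get the bars without a model. A minimal sketch, assuming a made-up data frame (toy) shaped to mirror the 0.6 and 1.0 values above, with a percent axis as in your drawing:

```r
# Made-up data: five records with keyword "a" (three TRUE), one with "b" (TRUE).
toy <- data.frame(
  Keywords = c("a", "a", "a", "a", "a", "b"),
  Decision = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Share of TRUE decisions per keyword group, expressed as a percentage.
# mean() of a logical vector is the proportion of TRUE values.
pct_true <- 100 * tapply(toy$Decision, toy$Keywords, mean)

# Bar chart with a fixed 0-100 axis so groups are comparable at a glance.
barplot(
  pct_true,
  names.arg = paste("Group", seq_along(pct_true)),
  ylim = c(0, 100),
  ylab = "Decision = TRUE (%)"
)
```

Once more predictors are added to the model, the fitted values no longer reduce to simple group shares, and averaging the predictions per group as above is the way to go.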

