whiteTea - 1 month ago
Python Question

What representation of chat text data should I use for user classification?

I'm trying to train a classifier on text from a chat between two users, so that later I can predict which of the two is more likely to say a given sentence or word. To get there, I mined the text from the chat log and ended up with two arrays of words, UserA_words and UserB_words.

Into which format do I have to transform these arrays to pass them to a classifier like Naive Bayes or an SVM? How do I pass, e.g., a bag-of-words representation to a classifier?

Answer

You're asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations generally used in text processing. However, classifying chat by user is not the usual text-processing task: we are looking for telltale features that indicate a specific user. Here are some:

  • character length, word length, sentence length of each comment
  • typing speed (esp. if you have timestamps in seconds)
  • ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
  • ratio of capitalization
  • ratio of numerals
  • ratio of whitespace
  • character n-grams (and notice these can pick up e.g. l0ser, f##k, :-) )
  • use of Unicode (emojis, symbols e.g. stars)
  • ratio of specific punctuation (e.g. how many '.', '!', '?', '*', '#' )
  • word-counts, esp. anything statistically anomalous
  • anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
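
As a rough illustration of how several of the features above could be turned into numbers, here is a minimal sketch using only the Python standard library. The function names (style_features, char_ngrams), the feature names, and the two sample messages are made up for this example and are not from the question. It assumes you keep whole messages per user rather than flat word arrays, since length and ratio features need the original text. Typing speed is left out because it needs timestamps.

```python
import string
from collections import Counter


def style_features(message: str) -> dict:
    """Map one chat message to a dict of numeric style features."""
    n_chars = len(message)
    denom = max(n_chars, 1)  # avoid division by zero on empty messages
    words = message.split()
    return {
        "char_len": n_chars,
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "punct_ratio": sum(c in string.punctuation for c in message) / denom,
        "upper_ratio": sum(c.isupper() for c in message) / denom,
        "digit_ratio": sum(c.isdigit() for c in message) / denom,
        "space_ratio": sum(c.isspace() for c in message) / denom,
        "non_ascii_ratio": sum(ord(c) > 127 for c in message) / denom,
        "exclaim_count": message.count("!"),
        "question_count": message.count("?"),
    }


def char_ngrams(message: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, e.g. 'l0s', '0se', 'ser'."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


# Hypothetical data: one labelled row per message (not per word).
messages = [("A", "OMG!!! no way :-)"), ("B", "ok. see you at 5")]

# Fix the column order once so every row lines up with the same features.
feature_names = sorted(style_features(""))
X = [[style_features(text)[name] for name in feature_names] for _, text in messages]
y = [user for user, _ in messages]
```

Here X is a list of equal-length numeric rows and y holds the matching user labels, which is the general shape most classifiers expect. If you use scikit-learn, rows like these could be passed to the fit(X, y) method of something like MultinomialNB or LinearSVC, and CountVectorizer with analyzer="char" can build the character n-gram counts for you instead of the hand-rolled Counter above. Treat the exact feature set as a starting point to tune for your two users.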