I'm working with Twitter feeds and want to pick out the words, names, hashtags and phrases in various tweets.
I'm assuming names are several words together that start with capital letters, hashtags are # followed by everything but spaces, phrases are things within quotes, and words are words.
It would be nice to pull out any links as well, but that is not necessary.
I would like to use a regex, but if there is a better solution, I would like to know.
An example Twitter post:
You know you watch a lot of Wes Anderson films when you see his new trailer and think, "Wait, where's the Futura font?" #MoviesILike http://bit.ly/HklUk
For example, the quoted phrase I would expect to get from this post is:
Wait, where's the Futura font?
Regex _wordRegex = new Regex(@"(?:\""(?<Item>.*?)\"")|(?<Item>(?:[A-Z][a-z]*?[.\s])+)|(?<Item>#\S+)|(?<Item>\w+)");
I've dealt with my fair share of Twitter data. I've found that the best approach is to tokenize the message string by whitespace, then analyze each token. This works pretty well. Let's look at the cases:
@bobjones let's go watch the game at @hooters #nfl #broncos #tebow
For the @ and # tokens, you just have to check the first character. For URLs, you might want to use a regex. So basically:
if token starts with '@' then mention
else if token starts with '#' then hashtag
else if token looks like a url then url
else word
No need to complicate things with regex in this case, in my opinion. Especially since you are looking to extract different types of things from the same string.
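The per-token check above can be sketched in a few lines. This is only an illustration in Python (it ports directly to C#); the names `classify`, `tokenize` and `URL_RE` are made up for this example, and the URL pattern is a deliberately loose assumption rather than a complete URL matcher.

```python
import re

# Assumption: a loose "looks like a URL" test; tighten it for real data.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)

def classify(token):
    """Label one whitespace-delimited token by its first character or shape."""
    if token.startswith("@") and len(token) > 1:
        return "mention"
    if token.startswith("#") and len(token) > 1:
        return "hashtag"
    if URL_RE.match(token):
        return "url"
    return "word"

def tokenize(tweet):
    """Split on whitespace and pair each token with its label."""
    return [(token, classify(token)) for token in tweet.split()]

for token, kind in tokenize("@bobjones let's go watch the game at @hooters #nfl #broncos #tebow"):
    print(kind, token)
```

Note that trailing punctuation stays attached to a token (for example "think,"), so you may want to strip it before treating the token as a word.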
You mention things within quotes... you might want to handle that as a corner case in the tokenization.
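One possible way to handle the quoted case is to let the tokenizer grab a whole quoted span before falling back to whitespace-delimited tokens. The sketch below is in Python and uses a small regex just for the splitting step; `split_with_phrases` and `TOKEN_RE` are hypothetical names, and it assumes straight double quotes only (curly quotes and nested quotes are not handled).

```python
import re

# Either a "double-quoted span" (captured without the quotes)
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')

def split_with_phrases(tweet):
    """Return ("phrase", text) for quoted spans and ("token", text) otherwise."""
    pieces = []
    for phrase, token in TOKEN_RE.findall(tweet):
        if token:
            pieces.append(("token", token))
        else:
            pieces.append(("phrase", phrase))
    return pieces

tweet = 'think, "Wait, where\'s the Futura font?" #MoviesILike http://bit.ly/HklUk'
for kind, text in split_with_phrases(tweet):
    print(kind, text)
```

Each piece tagged "token" can then go through the same first-character check as before, while a "phrase" is kept whole.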