joe_coolish - 1 month ago
C# Question

Regular Expression to extract words, names, hashtags, and phrases from tweets

I'm working with Twitter feeds and want to pull the words, names, hashtags, and phrases out of individual tweets.

I'm assuming names are runs of consecutive words that each start with a capital letter, hashtags are # followed by any non-space characters, phrases are anything within quotes, and words are ordinary words.

It would also be nice to pull out any links too, but that is not necessary.

I would like to use Regex, but if there is a better solution, I would like to know.

An example Twitter post:


You know you watch a lot of Wes Anderson films when you see his new trailer and think, "Wait, where's the Futura font?" #MoviesILike http://bit.ly/HklUk


would be split into:

Wes Anderson
Wait, where's the Futura font?
#MoviesILike

plus all of the individual words.

The Regex I'm playing with right now is:

Regex _wordRegex = new Regex(@"(?:\""(?<Item>.*?)\"")|(?<Item>(?:[A-Z][a-z]*?[.\s])+)|(?<Item>#\S+)|(?<Item>\w+)");

Answer

I've dealt with my fair share of Twitter data, and I've found that the best approach is to tokenize the message string on whitespace and then analyze each token. This works pretty well. Take this example:

@bobjones let's go watch the game at @hooters #nfl #broncos #tebow

For the @ and # tokens, you just have to check the first character. URLs are the one place where a regex is probably worth using. So basically:

if token[0] == '@' then mention
else if token[0] == '#' then hashtag
else if token looks like a url then url
else then word
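
The pseudocode above could be sketched in C# roughly as follows. This is only an illustration: the names TweetTokenizer, TokenKind, Classify and LooksLikeUrl are made up for this example, and the URL check assumes that "looks like a url" means "parses as an absolute http or https URI", which may be stricter or looser than what you need.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Mention, Hashtag, Url, Word }

public static class TweetTokenizer
{
    // Hypothetical helper: a token counts as a URL only if it parses
    // as an absolute URI with an http or https scheme.
    private static bool LooksLikeUrl(string token)
    {
        return Uri.TryCreate(token, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Mirrors the if/else chain: check the first character, then the URL test.
    public static TokenKind Classify(string token)
    {
        if (token[0] == '@') return TokenKind.Mention;
        if (token[0] == '#') return TokenKind.Hashtag;
        if (LooksLikeUrl(token)) return TokenKind.Url;
        return TokenKind.Word;
    }

    public static List<KeyValuePair<string, TokenKind>> Tokenize(string tweet)
    {
        var result = new List<KeyValuePair<string, TokenKind>>();
        // A null separator splits on any whitespace; RemoveEmptyEntries
        // guarantees every token has at least one character.
        string[] tokens = tweet.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            result.Add(new KeyValuePair<string, TokenKind>(token, Classify(token)));
        }
        return result;
    }
}
```

Note that word tokens keep any attached punctuation (for example a trailing comma), so you would probably want to trim that in a later step.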

No need to complicate things with regex in this case, in my opinion. Especially since you are looking to extract different types of things from the same string.

You mention things within quotes... splitting on whitespace breaks a quoted phrase into separate tokens, so you might want to handle that as a corner case in the tokenization.
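
One possible way to handle the quoted-phrase case is to pull the phrases out before splitting on whitespace. The sketch below is an assumption about how that could look, not a tested production approach: QuotedPhrases and Extract are hypothetical names, and it only recognizes straight double quotes, so curly quotes or unbalanced quotes would need extra handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedPhrases
{
    // Matches a straight double quote, any run of non-quote characters,
    // and the closing double quote. The inner text is captured as "Item".
    private static readonly Regex Quoted = new Regex("\"(?<Item>[^\"]*)\"");

    // Returns the quoted phrases; 'remainder' receives the tweet with each
    // quoted phrase replaced by a space, ready for whitespace tokenization.
    public static List<string> Extract(string tweet, out string remainder)
    {
        var phrases = new List<string>();
        foreach (Match m in Quoted.Matches(tweet))
        {
            phrases.Add(m.Groups["Item"].Value);
        }
        remainder = Quoted.Replace(tweet, " ");
        return phrases;
    }
}
```

The remainder can then be split on whitespace and each token checked by its first character as described above.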
