匿名柴棍 - 17 days ago
Java Question

How to split words on commas, spaces, periods (.), tabs (\t), parentheses (), brackets [], and curly braces {} in Hadoop WordCount?

I am practicing MapReduce with the Cloudera tutorial here. However, the tutorial currently only splits words on whitespace, using this regex in Java:

private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");


However, in addition to whitespace ("\\s*"), I also want to split words on commas, periods (.), tabs (\t), parentheses (), brackets [], and curly braces {}. In other words, I define a word as a string of one or more alphanumeric characters bounded by two non-alphanumeric characters. For example:


  • (cece54) has one word "cece54" bounded by ( and )

  • {dwd] has one word "dwd" bounded by { and ]

  • xxx) has one word "xxx" bounded by <space> and )

  • and so on.



So how should my regex be written to meet this requirement?

Answer

If you define a word as one or more consecutive alphanumeric characters, then split on one or more consecutive non-alphanumeric characters, i.e. "\\P{Alnum}+" or "[^a-zA-Z0-9]+".

See regex101 for example.
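
As a minimal sketch of how that split behaves on the examples from the question (the class name WordSplitSketch and the helper words are hypothetical, not part of the tutorial): when the input starts with a delimiter such as "(", Pattern.split returns an empty first token, so the caller should skip empty strings before counting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WordSplitSketch {
    // Split on runs of one or more non-alphanumeric (ASCII) characters.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\P{Alnum}+");

    // Hypothetical helper: returns the non-empty tokens of a line.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String word : WORD_BOUNDARY.split(line)) {
            if (word.isEmpty()) {
                continue; // a leading delimiter produces an empty first token
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [cece54, dwd, xxx]
        System.out.println(words("(cece54) {dwd] xxx)"));
    }
}
```

In a mapper, the same loop would emit each non-empty word instead of collecting it into a list.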

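You can prefix the first one with (?U), i.e. "(?U)\\P{Alnum}+", for full Unicode support. Without the flag, \p{Alnum} matches ASCII letters and digits only, so accented or non-Latin letters are treated as separators.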
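
To illustrate the difference the (?U) flag makes, here is a small sketch comparing the two patterns on a string with accented letters (the class name UnicodeSplitSketch and the sample text are assumptions chosen for illustration; the letters are written as Unicode escapes so that the source file encoding does not matter):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UnicodeSplitSketch {
    // ASCII-only: any character outside [a-zA-Z0-9] is a separator.
    static final Pattern ASCII_SPLIT = Pattern.compile("\\P{Alnum}+");
    // (?U) turns on UNICODE_CHARACTER_CLASS, so \p{Alnum} covers Unicode letters and digits.
    static final Pattern UNICODE_SPLIT = Pattern.compile("(?U)\\P{Alnum}+");

    static List<String> split(Pattern pattern, String text) {
        return Arrays.asList(pattern.split(text));
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // "naive cafe" with i-diaeresis and e-acute
        System.out.println(split(ASCII_SPLIT, text));   // accented letters act as separators
        System.out.println(split(UNICODE_SPLIT, text)); // accented letters stay inside words
    }
}
```

With the ASCII pattern the sample splits into "na", "ve", and "caf", while the (?U) pattern keeps both words whole.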