Scott - 9 months ago
Javascript Question

Regex: parsing GitHub usernames (JavaScript)

I'm trying to parse GitHub usernames (that start with @) from a paragraph of text in order to link them to their associated profiles.

The GitHub username constraints are:


  • Alphanumeric with single hyphens (no consecutive hyphens)

  • Cannot begin or end with a hyphen (if it ends with a hyphen, just match everything up until there)

  • Max length of 39 characters.






For example, the following text:


Example @valid hello @valid-username: @another-valid-username, @-invalid @in--valid @ignore-last-dash- an@email.com @another-valid?


The script...

Should match:


  • @valid

  • @valid-username

  • @another-valid-username

  • @in

  • @ignore-last-dash

  • @another-valid



Should ignore:


  • @-invalid

  • an@email.com






I'm getting reasonably close with JavaScript by using:

/\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig


But this isn't matching usernames with a single character (such as @a).
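
A quick way to see the problem, as a minimal sketch: the pattern above requires one alphanumeric character at the start and another at the end, so the shortest name it can accept is two characters long. The sample strings here are made up for illustration.

```javascript
// The pattern from the question, unchanged.
const original = /\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig;

// Illustrative inputs (not from the original tests).
const oneChar = 'ping @a please';
const twoChars = 'ping @ab please';

// Both the leading and the trailing [a-z0-9] are mandatory,
// so a one-character name can never satisfy the pattern.
console.log(oneChar.match(original));  // null
console.log(twoChars.match(original)); // [ '@ab' ]
```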

Here are my tests so far: https://regex101.com/r/rZ5eW1/2

Is the current regex efficient? And how can I match a single non-hyphen character?

Answer Source
/\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi
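
As a rough check, here is the regex run against the example text from the question. The variable names are only for illustration, and `matchAll` assumes a reasonably modern runtime (ES2020 or later).

```javascript
const usernamePattern = /\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi;

// The example paragraph from the question.
const sample = 'Example @valid hello @valid-username: @another-valid-username, ' +
  '@-invalid @in--valid @ignore-last-dash- an@email.com @another-valid?';

// matchAll yields one match object per hit; index 0 is the full match (with the @).
const mentions = Array.from(sample.matchAll(usernamePattern), (m) => m[0]);

console.log(mentions);
// [ '@valid', '@valid-username', '@another-valid-username',
//   '@in', '@ignore-last-dash', '@another-valid' ]
```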

Note: When this regex runs into a character or sequence that can't be part of a username (e.g. . or --), it matches from @ up to that stopping point. OP says that's fine, so I'm rolling with it. In each of the examples below, the matched area (NOT the captured area) is @abc:

@abc.123
@abc--123
@abc-
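
The difference between the matched area and the captured area can be sketched like this. The regex is the one from this answer, minus the g flag so that `exec` always starts from the beginning of each string; the variable names are illustrative.

```javascript
const pattern = /\B@([a-z0-9](?:-?[a-z0-9]){0,38})/i;

const results = ['@abc.123', '@abc--123', '@abc-'].map((text) => {
  // Index 0 of the exec result is the whole match, index 1 is capture group 1.
  const [matched, captured] = pattern.exec(text);
  return { text, matched, captured };
});

// Every input stops at the first character that can't continue a username.
for (const { text, matched, captured } of results) {
  console.log(`${text} -> matched "${matched}", captured "${captured}"`);
}
// @abc.123 -> matched "@abc", captured "abc"
// @abc--123 -> matched "@abc", captured "abc"
// @abc- -> matched "@abc", captured "abc"
```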

This works by using lots of nested groups. Regex101 has a fantastic breakdown, but here's mine anyway:

  1. \B: This is a builtin that means 'not a word boundary', which seems to do the trick, though it may be problematic if something like someones.@email.com is a valid email address. At that point, though, it's indistinguishable from the text of someone who doesn't put spaces after punctuation[1] when they start a sentence with an @reference.

    Thanks to Honore Doktorr for pointing out that negative lookbehinds didn't exist in JS when this was written (they were later added in ES2018).

  2. @: Just the literal @ symbol. One of the few places where a character means what it is.

  3. (...): The capturing group. The way it's placed means that it won't capture the @ symbol, it'll just match it, so it's easier to get the username -- no need to get a substring.
  4. [a-z0-9]: A character class to match any letter or number. Because of the i flag, this also matches capital letters. Because it's the first letter, it must be present.
  5. (?:...): This is a noncapturing group. It wraps a block of regex in a group without capturing it as a result.
  6. -?[a-z0-9]: An optional hyphen (-?) followed by a character class like the one before. This section is what makes consecutive hyphens invalid, and what keeps a trailing hyphen out of the match: if there is a -, it has to be followed by something that matches [a-z0-9].
  7. {0,38}: Match the noncapturing group between 0 and 38 times, inclusive. Combined with #4, this gives us 39 letters or digits maximum. Anything beyond that will be ignored. Note that the optional hyphens aren't counted toward that limit, so a hyphenated name can run past 39 characters in total.
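
Putting the pieces together for the original goal of linking mentions to profiles, here is a minimal sketch. `linkMentions` is a hypothetical helper name, the anchor markup is an assumption about the desired output, and the input is assumed to be HTML-escaped already.

```javascript
const mentionPattern = /\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi;

// Hypothetical helper: wraps each @mention in a link to the matching profile.
// `match` is the full match (with the @); `username` is capture group 1.
function linkMentions(text) {
  return text.replace(mentionPattern, (match, username) =>
    `<a href="https://github.com/${username}">${match}</a>`);
}

console.log(linkMentions('Thanks @valid-username and @a!'));
// Thanks <a href="https://github.com/valid-username">@valid-username</a> and <a href="https://github.com/a">@a</a>!
```

Because the capturing group leaves out the @, the username can go straight into the URL without taking a substring.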