Satwik, 4 years ago
Ruby Question

Convert ruby regular expression definition to python regex

I have the following regexes defined for capturing gem names in a Gemfile.

GEM_NAME = /[a-zA-Z0-9\-_\.]+/

QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/

I want to convert these into a regex that can be used in python and other languages.

I tried
based on substitution and several similar combinations but none of them worked. Here's the regexr link

Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that can be used by Python?

Answer Source

To define a named group in Python, use (?P<name>...). Refer back to it with the (?P=name) named inline backreference inside the pattern, and with the \g<name> named backreference in the replacement pattern.

If you can afford a 3rd party library, you may use the PyPI regex module and keep the approach you had in Ruby (regex supports multiple identically named capturing groups):

import regex  # third-party module from PyPI, not the standard library re

s = """%q<Some-name1> "some-name2" 'some-name3'"""

GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)

# "name" is defined in both alternatives; take whichever one matched
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
# => ['Some-name1', 'some-name2', 'some-name3']
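
As a minimal sketch of the named-group syntax with the standard re module, the snippet below uses a named group, its inline backreference, and the replacement-pattern backreference together. The sample string, the group name quote, and the angle-bracket output format are made up for illustration.

```python
import re

# (?P<quote>...) defines a named group; (?P=quote) is the inline
# backreference that requires the same quote character to close the name.
pattern = re.compile(r"""(?P<quote>["'])(?P<name>[a-zA-Z0-9_.-]+)(?P=quote)""")

# Hypothetical sample input, not taken from a real Gemfile.
sample = """gem "rails" and gem 'nokogiri'"""

# \g<name> refers to the named group inside the replacement string.
normalized = pattern.sub(r"<\g<name>>", sample)
print(normalized)  # gem <rails> and gem <nokogiri>
```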

See this Python demo.

If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
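
To see that limitation in practice, here is a small sketch that tries to compile the Ruby-style pattern with the standard re module. The variable names DUPLICATE_NAMES, compiled_ok, and error_message are only illustrative.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Same shape as the Ruby pattern: the group "name" appears in both alternatives.
DUPLICATE_NAMES = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)

try:
    re.compile(DUPLICATE_NAMES)
    compiled_ok = True
except re.error as exc:
    # re refuses to define the same group name twice in one pattern.
    compiled_ok = False
    error_message = str(exc)
    print(error_message)
```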

You can discard the named groups altogether and use numbered ones, then use re.finditer to iterate over all the matches and a list comprehension to grab the right capture from each match.

Example Python code:

import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
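# Group 1 is set only when the quoted alternative matched: take Group 2 then, else Group 3
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]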
# => ['Some-name1', 'some-name2', 'some-name3']

So, ([\"'])({0})\1|%q<({0})> has 3 capturing groups: if Group 1 matched, the first alternative matched, so Group 2 is taken; otherwise the second alternative matched, and the Group 3 value is grabbed in the comprehension.

Pattern details

  • ([\"']) - Group 1: a " or '
  • ({0}) - Group 2: GEM_NAME pattern
  • \1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
  • | - or
  • %q< - a literal substring
  • ({0}) - Group 3: GEM_NAME pattern
  • > - a literal >.
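
If named groups are still preferred with re, one possible variant is to give each alternative its own group name and pick whichever one matched. This is only a sketch: the group names quoted and percent_q and the helper gem_names are hypothetical and not part of the original patterns.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Each alternative gets a distinct group name, so re accepts the pattern.
QUOTED_GEM_NAME = re.compile(
    r'(?P<gq>["\'])(?P<quoted>{0})(?P=gq)|%q<(?P<percent_q>{0})>'.format(GEM_NAME)
)

def gem_names(text):
    """Return every gem name found in text, whichever alternative matched."""
    # A group that did not take part in the match is None, so "or" falls
    # through to the other alternative (GEM_NAME never matches an empty string).
    return [m.group("quoted") or m.group("percent_q")
            for m in QUOTED_GEM_NAME.finditer(text)]

s = """%q<Some-name1> "some-name2" 'some-name3'"""
print(gem_names(s))  # ['Some-name1', 'some-name2', 'some-name3']
```

The trade-off against the numbered-group version is a slightly longer pattern in exchange for captures that are selected by name rather than by position.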