Ohad Benita - 3 months ago
Java Question

Regular expression for extracting instance ID, AMI ID, Volume ID

Given the following string


Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305


I want to be able to extract the following using a regular expression


i-b9b4ffaa
ami-dbcf88b1
vol-e97db305


This is the regular expression I came up with, but it currently doesn't do what I need:

Pattern p = Pattern.compile("Created by CreateImage([a-z]+[0.9]+)([a-z]+[0.9]+)([a-z]+[0.9]+)",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305");
System.out.println(m.matches()); // => false

Answer

You may match all words that start with letters, followed by a hyphen and then alphanumeric chars:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
Pattern pattern = Pattern.compile("(?i)\\b[a-z]+-[a-z0-9]+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // the whole match
}
// => i-b9b4ffaa, ami-dbcf88b1, vol-e97db305 (one per line)


Pattern details:

  • (?i) - a case insensitive modifier (embedded flag option)
  • \\b - a word boundary
  • [a-z]+ - 1 or more ASCII letters
  • - - a hyphen
  • [a-z0-9]+ - 1 or more alphanumerics.
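If the three values need to be told apart rather than just printed, one option is to capture the prefix in its own group and key the results by it. The following is only a sketch: the class name AwsIdsByPrefix and the method idsByPrefix are made up for illustration, and it assumes each prefix (i, ami, vol) occurs at most once in the string.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class AwsIdsByPrefix {
    // Same shape as the pattern above, with the letters before the hyphen captured in group 1.
    private static final Pattern ID = Pattern.compile("(?i)\\b([a-z]+)-[a-z0-9]+");

    // Returns prefix -> full ID in order of appearance.
    // Assumption: a repeated prefix simply overwrites the earlier value.
    static Map<String, String> idsByPrefix(String text) {
        Map<String, String> ids = new LinkedHashMap<>();
        Matcher m = ID.matcher(text);
        while (m.find()) {
            ids.put(m.group(1).toLowerCase(Locale.ROOT), m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
        Map<String, String> ids = idsByPrefix(s);
        System.out.println("instance: " + ids.get("i"));
        System.out.println("image:    " + ids.get("ami"));
        System.out.println("volume:   " + ids.get("vol"));
    }
}
```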

To make sure these values appear on the same line after Created by CreateImage, use a \G-based regex:

String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
Pattern pattern = Pattern.compile("(?i)(?:Created by CreateImage|(?!\\A)\\G)(?:(?!\\b[a-z]+-[a-z0-9]+).)*\\b([a-z]+-[a-z0-9]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1)); // the ID captured by group 1
}


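Note that the above pattern is based on the \G operator, which matches at the end of the last successful match. In Java it also matches at the start of the input on the first attempt, which is why it is guarded with (?!\\A). As a result, a match can only begin right after Created by CreateImage or right after the previous ID. The tempered greedy token (?:(?!\\b[a-z]+-[a-z0-9]+).)* matches any symbol other than a newline that does not start a sequence of word boundary + letters + hyphen + letters or digits. Because the lookahead is re-checked at every position, this construct is very resource consuming.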
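To see what the \G anchoring buys, the pattern can be run against a few inputs. The sketch below wraps it in a hypothetical helper (the class name ContiguousIdDemo and the method extract are made up for illustration, as are the second and third test strings) and collects group 1 into a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class ContiguousIdDemo {
    private static final Pattern P = Pattern.compile(
        "(?i)(?:Created by CreateImage|(?!\\A)\\G)(?:(?!\\b[a-z]+-[a-z0-9]+).)*\\b([a-z]+-[a-z0-9]+)");

    // Collects every ID that follows "Created by CreateImage" on the same line.
    static List<String> extract(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // All three IDs follow the prefix on one line.
        System.out.println(extract("Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305"));
        // No prefix: neither alternative can start a match, so nothing is found.
        System.out.println(extract("Snapshot of ami-dbcf88b1 from vol-e97db305"));
        // The dot does not cross the line break, so the chain stops after the first ID.
        System.out.println(extract("Created by CreateImage(i-b9b4ffaa)\nsnapshot of vol-e97db305"));
    }
}
```

The second and third calls are the cases where the plain find() loop from the first snippet would still report IDs, while this pattern does not.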

You should consider using a two-step approach instead: first check whether the string starts with Created by CreateImage, and only then extract the IDs from it:

String s = "Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305";
if (s.startsWith("Created by CreateImage")) {
    Matcher n = Pattern.compile("(?i)\\b[a-z]+-[a-z0-9]+").matcher(s);
    while (n.find()) {
        System.out.println(n.group(0));
    }
}

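For completeness, the pattern from the question can also be repaired so that matches() succeeds and each ID lands in its own group. The original attempt fails for several reasons: the literal parentheses are not escaped, so they act as capturing groups; [0.9] is a character class containing 0, a dot and 9, not the range [0-9]; the hyphen and the " for " and " from " text are missing; and the IDs mix letters and digits rather than being letters followed by digits. The sketch below assumes the description always has exactly this shape; the class name CreateImageDescription and the method parse are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class CreateImageDescription {
    // Escaped parentheses match literally; each unescaped pair captures one ID.
    private static final Pattern DESCRIPTION = Pattern.compile(
        "Created by CreateImage\\((i-[a-z0-9]+)\\) for (ami-[a-z0-9]+) from (vol-[a-z0-9]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns {instanceId, amiId, volumeId}, or null when the text has a different shape.
    static String[] parse(String text) {
        Matcher m = DESCRIPTION.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] ids = parse("Created by CreateImage(i-b9b4ffaa) for ami-dbcf88b1 from vol-e97db305");
        if (ids != null) {
            System.out.println("instance: " + ids[0]);
            System.out.println("image:    " + ids[1]);
            System.out.println("volume:   " + ids[2]);
        }
    }
}
```

Unlike the find() loop, this version rejects any string that deviates from the expected wording, which may or may not be what you want.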