aProgger - 1 month ago
Java Question

Why does my regex not work in Java

I have to parse custom (German) address strings to extract the street, house number, zip code and city. I have a regex for this that works in RegExr and in Java Visual Regex Tester.

This is the regex (it was given to me, but I am allowed to edit it):

^([^0-9]+)([0-9]+.*?)?(?:\w)?([0-9]{5})(?:\w)?(.*)$


This is the string:

NEUE BÜHNE Senftenberg, Theaterpassage 1, 01968 Senftenberg


This is my code:

String regex = "^([^0-9]+)([0-9]+\\.*?)?(?:\\w)?([0-9]{5})(?:\\w)?(\\.*)$";
String address = "NEUE BÜHNE Senftenberg, Theaterpassage 1, 01968 Senftenberg";
Pattern pattern = Pattern.compile(regex);
String[] addrFromRegex;

// gives an array (length 1) with [0] == address
addrFromRegex = address.split(regex);

// gives an array (length 1) with [0] == address
addrFromRegex = pattern.split(address);


For String.split(), the problem may be faulty escaping, but I thought I would not have to worry about that when using Pattern. What am I doing wrong?

Update:

The comma in the string is not always present. Other possible address strings are:

NEUE BÜHNE Senftenberg; Theaterpassage 1; 01968 Senftenberg
NEUE BÜHNE Senftenberg Theaterpassage 1 01968 Senftenberg
NEUE BÜHNE Senftenberg|Theaterpassage|1|01968|Senftenberg
NEUE BÜHNE Senftenberg|Theaterpassage_1_01968_Senftenberg
...


I get the addresses via XML and have no influence on the data provided. By the way, the address shown here is an example of a faulty one. I have to deal with those too.

Answer

The main point is that your pattern is written to match the whole string, not to describe a delimiter. So, instead of split(), you need to use Matcher#matches() and collect the captured values into a list/array/etc.
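To make the split() behaviour concrete, here is a minimal sketch (the class and constant names are made up for illustration). It assumes the two regex strings from the question: the one typed in the Java code, where `\\.` means a literal dot, and the RegExr original, where `.` is left unescaped. split() treats the regex as a delimiter, so a pattern that never matches returns the input unchanged, and a pattern that matches the whole string leaves nothing to return.

```java
import java.util.regex.Pattern;

class SplitVersusMatch {
    static final String ADDRESS =
        "NEUE BÜHNE Senftenberg, Theaterpassage 1, 01968 Senftenberg";

    // The regex as typed in the question's Java code: "\\." is a literal dot.
    static final String ESCAPED_DOTS =
        "^([^0-9]+)([0-9]+\\.*?)?(?:\\w)?([0-9]{5})(?:\\w)?(\\.*)$";

    // The regex as tested in RegExr: "." stays unescaped, only "\w" is doubled.
    static final String PLAIN_DOTS =
        "^([^0-9]+)([0-9]+.*?)?(?:\\w)?([0-9]{5})(?:\\w)?(.*)$";

    public static void main(String[] args) {
        // No literal dots in the address, so the first pattern never matches.
        System.out.println(ADDRESS.matches(ESCAPED_DOTS));      // false
        System.out.println(ADDRESS.split(ESCAPED_DOTS).length); // 1 (the input itself)

        // The second pattern matches the whole string, so as a delimiter
        // it consumes everything and split() returns an empty array.
        System.out.println(ADDRESS.matches(PLAIN_DOTS));        // true
        System.out.println(ADDRESS.split(PLAIN_DOTS).length);   // 0

        // Pattern.split() behaves the same way as String.split().
        System.out.println(Pattern.compile(PLAIN_DOTS).split(ADDRESS).length); // 0
    }
}
```

Either way, split() never exposes the capture groups, which is why a Matcher is needed.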

The fixed regex is

"^([^0-9]+?)\\s*([0-9]+)[\\W_]+([0-9]{5})\\s*(.*)$"


Details:

  • ^ - start of string (not necessary in matches())
  • ([^0-9]+?) - Group 1: one or more chars other than digits but as few as possible
  • \\s* - 0+ whitespaces
  • ([0-9]+) - Group 2 capturing 1+ digits
  • [\\W_]+ - 1 or more chars that are either non-word or _
  • ([0-9]{5}) - Group 3 capturing 5 digits
  • \\s* - zero or more whitespaces
  • (.*) - Group 4 capturing the rest of the line
  • $ - end of string (not necessary in matches()).
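The "as few as possible" part of Group 1 matters for the whitespace before the house number. The following sketch compares the lazy quantifier with a greedy one on a shortened sample string; the class name, the helper `firstGroup` and the shortened input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyGroupDemo {
    // Returns group 1 in brackets so trailing whitespace is visible,
    // or null if the regex does not match the whole input.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? "[" + m.group(1) + "]" : null;
    }

    public static void main(String[] args) {
        String s = "Theaterpassage 1, 01968 Senftenberg";
        String tail = "\\s*([0-9]+)[\\W_]+([0-9]{5})\\s*(.*)";

        // Lazy: group 1 stops as early as it can, \s* takes the space.
        System.out.println(firstGroup("([^0-9]+?)" + tail, s)); // [Theaterpassage]

        // Greedy: group 1 swallows the space, \s* matches nothing.
        System.out.println(firstGroup("([^0-9]+)" + tail, s));  // [Theaterpassage ]
    }
}
```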

Java demo:

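import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        List<String> lst = new ArrayList<>();
        String s = "NEUE BÜHNE Senftenberg, Theaterpassage 1, 01968 Senftenberg";
        Pattern pattern = Pattern.compile("([^0-9]+?)\\s*([0-9]+)[\\W_]+([0-9]{5})\\s*(.*)");
        Matcher matcher = pattern.matcher(s);
        if (matcher.matches()) { // the whole string must match
            lst.add(matcher.group(1)); // everything before the house number
            lst.add(matcher.group(2)); // house number
            lst.add(matcher.group(3)); // zip code
            lst.add(matcher.group(4)); // city
        }
        // Four elements; the first one contains a comma itself.
        System.out.println(lst); // => [NEUE BÜHNE Senftenberg, Theaterpassage, 1, 01968, Senftenberg]
    }
}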
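As a rough check against the other formats listed in the question's update, the same pattern can be wrapped in a small helper and run over each variant. This is only a sketch: the class name `AddressVariants` and the helper `parse` are hypothetical, and the inputs are the sample strings from the question, not real feed data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressVariants {
    static final Pattern ADDRESS = Pattern.compile(
        "([^0-9]+?)\\s*([0-9]+)[\\W_]+([0-9]{5})\\s*(.*)");

    // Hypothetical helper: the four captured groups, or an empty list if no match.
    static List<String> parse(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = ADDRESS.matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                parts.add(m.group(i));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
            "NEUE BÜHNE Senftenberg; Theaterpassage 1; 01968 Senftenberg",
            "NEUE BÜHNE Senftenberg Theaterpassage 1 01968 Senftenberg",
            "NEUE BÜHNE Senftenberg|Theaterpassage|1|01968|Senftenberg",
            "NEUE BÜHNE Senftenberg|Theaterpassage_1_01968_Senftenberg");
        for (String input : inputs) {
            // One element per line, so commas inside a group are not misleading.
            System.out.println(String.join(" / ", parse(input)));
        }
    }
}
```

All four variants match. In the `|` and `_` variants, though, groups 1 and 4 keep the delimiter next to them (group 4 comes out as `|Senftenberg` or `_Senftenberg`), because only whitespace is skipped at those two positions. Replacing the two `\\s*` with `[\\W_]*` may be one way to handle that, but treat it as an assumption to verify against the real data.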