Angie94 - 5 months ago
Java Question

StackOverflowError when splitting a string using regex

I'm working on a MapReduce project on Amazon Web Services and I'm getting this error:


FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child :
java.lang.StackOverflowError at
java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)


I read a few other questions to understand why this happened and it seems my regex has repetitive alternative paths. This is the regex:

\\s+(?=(?:(?<=[a-zA-Z])\"(?=[A-Za-z])|\"[^\"]*\"|[^\"])*$)


It splits on spaces, except when the spaces are inside < > or inside " ". So basically it extracts the strings that are inside those two kinds of symbols. I have tried many other versions but none of them work, so I am far from an optimal one. I am kind of lost, and it's the first time I'm using regexes this complicated. Can someone please suggest a better option for my regex?

I would truly appreciate any feedback on this!

EDIT:

This string with URLs inside <> and text inside "" and spaces:

<\janhaeussler.com/?sioc_type=user&sioc_id=1/> "HEY" <.org/1999/02/22-rdf-syntax-ns#type/>

should produce these 3 Strings:

1. <\janhaeussler.com/?sioc_type=user&sioc_id=1/> (with or without <>)

2. "HEY"

3. <.org/1999/02/22-rdf-syntax-ns#type/>

EDIT 2:

I think the <> symbols are confusing. I am trying to find a regex that splits on one or more spaces while ignoring the spaces inside " ", since the URLs do not contain spaces.

Answer Source

Try this:

\s+(?=(?:(?:[^"]*"){2})*[^"]*$)
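How it works: a run of whitespace counts as a split point only if the lookahead can reach the end of the input while consuming double quotes in pairs, i.e. only if an even number of quotes remains to the right. Whitespace inside a quoted string always has an odd number of quotes after it, so it is left alone. Below is a minimal sketch that applies the pattern to the sample line from the question. The class name QuoteAwareSplit and the helper split are made up for illustration, and compiling the pattern once is just a suggestion for code that runs once per record, as a mapper would.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class QuoteAwareSplit {
    // Whitespace that is followed by an even number of double quotes
    // (pairs of quotes, then no more quotes) up to the end of the input.
    private static final Pattern SPLITTER =
            Pattern.compile("\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");

    static String[] split(String line) {
        return SPLITTER.split(line);
    }

    public static void main(String[] args) {
        // Sample line from the question.
        String line = "<\\janhaeussler.com/?sioc_type=user&sioc_id=1/> \"HEY\" "
                + "<.org/1999/02/22-rdf-syntax-ns#type/>";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

For the sample line this yields the three strings asked for in the question, with the <> and the quotes kept.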

Demo

    // Needs: import java.util.Arrays;
    String string = "abc d<\\janhaeussler.com/?sioc_type=user &sioc_id=1/> \"HEY 1\" 2 3 <.org/1999/02/22-rdf-syntax-ns#type/> \"tra la\" <asdfadsf sadfasdf/> 4    \"sdf sdf\" 5 6";
    // Split on whitespace that has an even number of double quotes after it
    String[] res = string.split("\\s+(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");
    System.out.println(Arrays.toString(res));

Will output:

[abc, d<\janhaeussler.com/?sioc_type=user, &sioc_id=1/>, "HEY 1", 2, 3, <.org/1999/02/22-rdf-syntax-ns#type/>, "tra la", <asdfadsf, sadfasdf/>, 4, "sdf sdf", 5, 6]
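One caveat: the lookahead rescans the rest of the line at every run of whitespace, so the work grows quadratically with line length, and since it still repeats a group once per pair of quotes, a line with a very large number of quotes may still put pressure on the stack. If that becomes a problem, an alternative is to match the tokens instead of splitting on the gaps between them. The sketch below is an assumption on my part, not something from the question: it treats a token as either a complete "..." string or a run of non-whitespace characters, which fits the samples above where quoted text always stands alone between spaces. The class name QuoteAwareTokenizer and the method tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, shown only to illustrate the matching approach.
class QuoteAwareTokenizer {
    // Either a complete quoted string, or a run of non-whitespace characters.
    // The quoted alternative is listed first so it wins at an opening quote.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        // find() scans left to right and never revisits consumed input.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "abc \"HEY 1\" 2 <.org/1999/02/22-rdf-syntax-ns#type/> \"tra la\"";
        System.out.println(tokenize(line));
    }
}
```

Note that the two approaches differ when a quote appears in the middle of a word (for example abc"d e"f): the split keeps that as one piece, while this tokenizer breaks it at the space. An unbalanced quote simply falls through to the non-whitespace alternative.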