firecatcher - 1 month ago
HTML Question

How to Remove Empty Lines in an HTML output file using Regex in Java

The input HTML is:

<div>TODO write content</div>

<span class="test"></span>
<ruby>text1<rp>(</rp><rt>textA</rt><rp>)</rp></ruby>
<ruby>
text1<rp>(</rp><rt>textA</rt><rp>)</rp>
text2<rp>(</rp><rt>textB</rt><rp>)</rp>
text3<rp>(</rp><rt>textC</rt><rp>)</rp>
</ruby>
<img src="images/aaaaa.jpg">
<img src="./audio/bbbbb.mp3">


It needs to be converted to this format:

<div>TODO write content</div>

<span class="test"></span>
<font class="ruby" title="textA">text1</font>
<font class="ruby" title="textA">text1</font>
<font class="ruby" title="textB">text2</font>
<font class="ruby" title="textC">text3</font>
<img src="images/aaaaa.jpg">
<img src="./audio/bbbbb.mp3">


So I applied the following code, using regular expressions and a while loop:

final String REPLACE = "";

final String REGEX_RUBY_1 = "<ruby>";
final String REGEX__RUBY_2 = "</ruby>";
Pattern rubyP_1 = Pattern.compile(REGEX_RUBY_1);
Matcher rubyM_1 = rubyP_1.matcher(text);
text = rubyM_1.replaceAll(REPLACE);

Pattern rubyP_2 = Pattern.compile(REGEX__RUBY_2);
Matcher rubyM_2 = rubyP_2.matcher(text);
text = rubyM_2.replaceAll(REPLACE);

final Pattern pattern = Pattern.compile("<rt>(.+?)</rt>",Pattern.MULTILINE);
final Pattern pattern2 = Pattern.compile("(?=(\\b(\\w*\\S)\\b)<rp>)",Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(text);
final Matcher matcher2=pattern2.matcher(text);

while (matcher.find()) {
    matcher2.find();
    text = "<font class=\"ruby\" title=\"" + matcher.group(1) + "\"" + ">" + matcher2.group(1) + "</font>";
    break;
}


But the output was:

<div>TODO write content</div>

<span class="test"></span>

<font class="ruby" title="textA">text1</font>
<font class="ruby" title="textA">text1</font>
<font class="ruby" title="textB">text2</font>
<font class="ruby" title="textC">text3</font>

<img src="images/aaaaa.jpg">
<img src="./audio/bbbbb.mp3">


The replacement itself worked, but the formatting is off: empty lines are left where the <ruby> tags were, and the replaced lines have lost their indentation and are aligned to the left. I have tried modifying the code and searched for alternatives, but nothing has worked so far.

Answer

To keep the indentation, change the first pattern to this:

final Pattern pattern = Pattern.compile("^( +).+<rt>(.+?)</rt>",Pattern.MULTILINE);

and then change the text assignment like this:

text=matcher.group(1)+"<font class=\"ruby\" title=\""+matcher.group(2)+"\""+">"+matcher2.group(1)+"</font>";
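
The idea of capturing the leading whitespace and putting it back in front of the replacement can be sketched in isolation. This is only a sketch under a few assumptions, not the exact code above: the class name RubyLineDemo and the method convertLines are made up for illustration, the pattern uses [ \t]* instead of " +" so that unindented and tab-indented lines also match, and the base text is taken from the same match instead of a second matcher.

```java
import java.util.regex.Pattern;

final class RubyLineDemo {
    // group 1: leading whitespace, group 2: base text, group 3: text inside <rt>
    private static final Pattern RUBY_LINE = Pattern.compile(
        "^([ \\t]*)(.+?)<rp>.*?</rp><rt>(.+?)</rt><rp>.*?</rp>[ \\t]*$",
        Pattern.MULTILINE);

    // Rewrites every matching line and keeps its original indentation ($1).
    static String convertLines(String text) {
        return RUBY_LINE.matcher(text)
            .replaceAll("$1<font class=\"ruby\" title=\"$3\">$2</font>");
    }

    public static void main(String[] args) {
        String html = "    text1<rp>(</rp><rt>textA</rt><rp>)</rp>\n"
                    + "    <img src=\"images/aaaaa.jpg\">\n";
        System.out.print(convertLines(html));
    }
}
```

Lines that contain no <rp>/<rt> markup, such as the <img> line, do not match and are left untouched.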

To get rid of the blank lines, try this:

final String REGEX_RUBY_1 = "<ruby> *\n?";
final String REGEX__RUBY_2 = "</ruby> *\n?";
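
One thing to watch for: if the <ruby> or </ruby> tag is itself indented, the patterns above remove the tag and the line break but leave the spaces in front of the tag, which then end up in front of the next line. A possible variant, sketched below, removes the whole line when the tag stands alone on it and only the tag otherwise. The class name RubyTagStripper and the method stripRubyTags are hypothetical, and \R (any line break) assumes Java 8 or later.

```java
import java.util.regex.Pattern;

final class RubyTagStripper {
    // A <ruby> or </ruby> tag alone on its line: drop indentation, tag and line break.
    private static final Pattern TAG_ONLY_LINE = Pattern.compile(
        "^[ \\t]*</?ruby>[ \\t]*(?:\\R|\\z)", Pattern.MULTILINE);

    // A tag that shares its line with other content: drop only the tag.
    private static final Pattern INLINE_TAG = Pattern.compile("</?ruby>");

    static String stripRubyTags(String text) {
        String withoutTagLines = TAG_ONLY_LINE.matcher(text).replaceAll("");
        return INLINE_TAG.matcher(withoutTagLines).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "  <ruby>\n"
                    + "  text1<rp>(</rp><rt>textA</rt><rp>)</rp>\n"
                    + "  </ruby>\n"
                    + "  <img src=\"images/aaaaa.jpg\">\n";
        System.out.print(stripRubyTags(html));
    }
}
```

Running this step before the <rt> replacement means no empty lines are left behind and the indentation of the following lines is not doubled.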