pstenstrm - 10 days ago
Ruby Question

Regex remove multiple whitespace and line-break inside HTML tag

Some background: we're adding a styleguide to our Middleman project. It's for other developers to use, so we want our code examples to be readable. However, we don't want to have to update code in multiple places when we change a component.

We use Redcarpet for Markdown parsing and for creating code examples.

<%= partial '../partials/component' %>

```html
<%= partial '../partials/component' %>
```


However, this leaves the code examples messy and hard to read. We can clean them up fairly well with htmlbeautifier, but we still have a problem with repeated whitespace and line breaks inside HTML tags.

It often looks like this:

<article class="default-s-sans teaser-media"

data-item-ratio="16x9"


data-background-color="d-blue"

>


We want to remove the extra whitespace and line breaks inside the tag, that is, between < and >, but not between elements, so it should leave this unchanged:

<div>
<span class="price">$100</span>
<span>
Word word
</span>
</div>


I have gotten this far:

html.gsub(/(?<=<)(\s{2,})(?=>)/, ' ')


But it will only match whitespace between < and > if there's nothing else between them.
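
A minimal reproduction of that behaviour, assuming the closing group is meant as the lookahead (?=>). The two sample strings are made up for illustration:

```ruby
# Lookbehind for "<", two or more whitespace characters, lookahead for ">".
pattern = /(?<=<)(\s{2,})(?=>)/

# Whitespace sits directly between "<" and ">", so it is collapsed.
only_whitespace = "<  >".gsub(pattern, ' ')

# The whitespace is preceded by "a" rather than "<", so nothing matches.
with_attribute = "<a  b>".gsub(pattern, ' ')

p only_whitespace
p with_attribute
```

The first string is collapsed and the second is returned unchanged, because the lookbehind requires the whitespace to start immediately after <.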

How can I match whitespace between < and > while allowing other characters as well?

Answer

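You can pass a block to gsub. The block receives each matched string (not a MatchData object), so you can match each whole tag and collapse the whitespace inside it. The character class [^<>]+ keeps every match within a single tag, so whitespace between elements is left alone:

html.gsub(/<[^<>]+>/) { |tag| tag.gsub(/\s+/, ' ') }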
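
As a fuller sketch of the same idea, the snippet below wraps a per-tag substitution in a helper and runs it on the two samples from the question. The method name collapse_tag_whitespace is hypothetical, and the pattern assumes that no attribute value contains a literal < or >. For markup where that can happen, an HTML parser would probably be the safer choice.

```ruby
# Hypothetical helper: collapse whitespace runs inside each HTML tag,
# leaving text and whitespace between tags untouched.
# Assumes attribute values never contain a literal "<" or ">".
def collapse_tag_whitespace(html)
  html.gsub(/<[^<>]+>/) do |tag|
    # Newlines and runs of whitespace become a single space.
    collapsed = tag.gsub(/\s+/, ' ')
    # Drop the single space left just before the closing ">".
    collapsed.sub(/\s>\z/, '>')
  end
end

messy_tag = <<~HTML
  <article class="default-s-sans teaser-media"

  data-item-ratio="16x9"


  data-background-color="d-blue"

  >
HTML

nested = <<~HTML
  <div>
  <span class="price">$100</span>
  <span>
  Word word
  </span>
  </div>
HTML

puts collapse_tag_whitespace(messy_tag)
puts collapse_tag_whitespace(nested) == nested
```

The second step only removes the space before a bare >, so a self-closing tag written as <br /> keeps its spacing.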