tommy.bonderenka - 5 months ago
HTML Question

Regex to find URL parameters in HTML (Ruby)

I am attempting to replace embedded YouTube videos with thumbnails in dynamically created email templates. I am attempting to find each YouTube ID from each embedded URL, then replace the entire block with custom HTML. I have it working if there is only one embedded video with the following RegEx:

<span contenteditable="false" draggable="true" fr-original-class="fr-video\sfr-dvb\sfr-draggable"\s.*\ssrc="[a-z:]*?\/\/w{3}?.?youtube.com\/embed\/([a-zA-Z\d\-]*).*<\/iframe><\/span>


The problem is, if there is more than one video, it will only find the ID from the last video. I feel like I may be over-complicating this.

Note that the attributes of the span that the embedded video is in will always be the same (contenteditable="false" draggable="true" fr-original-class="fr-video).

A sample email template is below. The above RegEx only pulls the second ID from it, not the first; I would like to pull both.

This is being done in Ruby.

EDIT: I realize the RegEx I am using is probably overkill, but I need a complex RegEx for the gsub replace so that I only replace the video and its container, not anything surrounding it.

<!DOCTYPE html>
<html>
<head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'>
</head>
<body style='margin: 0px; font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 18px;'>
<table border='0' cellpadding='0' cellspacing='0' style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; width: 600px;' width='600'>
<tr>
<td>
FooBar
<br>
<br>
<span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
<iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed&amp;url=http://www.youtube.com/watch?v=e7zCqsjK1Vg&amp;image=https://i.ytimg.com/vi/e7zCqsjK1Vg/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
</span>
<br>
Foo Bar
<br>
<br>
<span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" fr-original-style="-webkit-user-select: none;" style="-webkit-user-select: none; text-align: center; position: relative; display: block; clear: both;">
<iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed&amp;url=http://www.youtube.com/watch?v=skLz87ixE48&amp;image=https://i.ytimg.com/vi/skLz87ixE48/hqdefault.jpg&amp;key=2aa3c4d5f3de4f5b9120b660ad850dc9&amp;type=text/html&amp;schema=youtube" width="600" height="338" scrolling="no" frameborder="0" allowfullscreen="" style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-style="box-sizing: content-box; max-width: 100%; border: 0px;" fr-original-class="embedly-embed"></iframe>
</span>
<br>
</td>
</tr>
<tr style='font-family: Helvetica Neue,Helvetica,Arial,sans-serif; font-size: 12px; color: #656565; text-align: center;'>
<td style='padding: 10px 0px;'>
</td>
</tr>
</table>
</body>
</html>

Answer

To grab the YouTube IDs, I think the best way would be to use look-arounds. The following should work.

(?<=embed\/)(.+?)(?=\?)

Here's a link to a demonstration on regex101.com

Turn on the "global" flag so that the regex engine doesn't stop after finding the first match. Ruby has no such flag; String#scan and String#gsub already visit every match. This regex uses a look-behind, (?<=embed\/); followed by a capturing group that lazily matches any characters, (.+?); followed by a look-ahead that asserts a literal question mark, (?=\?).

This should suffice in grabbing the video IDs.
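Since the question is about Ruby, here is a minimal sketch of the extraction using String#scan. The variable names and the shortened sample markup are illustrative assumptions, not part of the original template.

```ruby
# Shortened stand-in for the email template (assumption: only the iframe
# src attributes matter for this step).
html = <<~HTML
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"></iframe>
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed"></iframe>
HTML

# With one capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of IDs.
video_ids = html.scan(/(?<=embed\/)(.+?)(?=\?)/).flatten

puts video_ids.inspect
```

Note that "embedly" does not trigger the look-behind, because the pattern requires a slash immediately after "embed".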

As for replacing the HTML, here's a regex that will match the <span>...</span> blocks:

<span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span>

For this to work, apply the s flag so that the . wildcard can match \n newline characters; in Ruby the equivalent is the m flag. Also apply the g flag for the same reason as before; in Ruby, gsub already replaces every match.
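A rough Ruby sketch of the replacement step, combining both patterns with the block form of gsub. The thumbnail markup is a hypothetical placeholder for your custom HTML, the sample is a trimmed version of the template, and the thumbnail and watch URLs simply follow the shapes that already appear in the iframe src.

```ruby
# Trimmed stand-in for the email template (assumption: attributes shortened).
html = <<~HTML
  FooBar
  <br>
  <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable">
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed" width="600"></iframe>
  </span>
  <br>
  Foo Bar
  <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable">
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/skLz87ixE48?feature=oembed" width="600"></iframe>
  </span>
HTML

# The m flag is Ruby's way of letting . match newlines.
span_block = /<span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span>/m
video_id   = /(?<=embed\/)(.+?)(?=\?)/

# gsub with a block runs once per matched span and substitutes the block's
# return value, so each video gets its own thumbnail.
result = html.gsub(span_block) do |block|
  id = block[video_id, 1]
  next block if id.nil? # no recognisable ID: leave the markup untouched

  # Hypothetical replacement HTML; swap in your own.
  %(<a href="https://www.youtube.com/watch?v=#{id}"><img src="https://i.ytimg.com/vi/#{id}/hqdefault.jpg" alt="Video thumbnail"></a>)
end

puts result
```

Extracting the ID inside the block keeps each ID paired with the span it came from, which avoids having to line up two separate match lists.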

NOTE: this will capture any <span> groups that have <iframe>s as direct children. Depending on the content with which you are working, you may need to add more specificity to the regex to scan the attributes on those <iframe>s. For the content you provided to this question, however, it appears to work.
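One way to add that specificity, sketched in Ruby: anchor on the span attributes the question says are always present and use [^>]* instead of .*? so a match cannot run past the end of a tag. The unrelated span with class "note" is a made-up example, and the strict pattern assumes no attribute value contains a literal > character.

```ruby
# Made-up sample: an unrelated span followed by one video span.
html = <<~HTML
  <span class="note">Intro text</span>
  <span contenteditable="false" draggable="true" fr-original-class="fr-video fr-dvb fr-draggable" style="display: block;">
  <iframe src="//cdn.embedly.com/widgets/media.html?src=https://www.youtube.com/embed/e7zCqsjK1Vg?feature=oembed"></iframe>
  </span>
HTML

# Loose pattern: with the m flag, the lazy .*? can stretch from an earlier,
# unrelated <span> all the way to the tag that precedes the iframe.
loose = /<span.*?>\s*<iframe.+?>.*?<\/iframe>\s*<\/span>/m

# Stricter pattern: fixed span attributes, and [^>]* stays inside one tag.
strict = /<span contenteditable="false" draggable="true" fr-original-class="fr-video[^>]*>\s*<iframe[^>]*>\s*<\/iframe>\s*<\/span>/

loose_match  = html[loose]
strict_match = html[strict]

puts loose_match.start_with?('<span class="note">')
puts strict_match.start_with?('<span contenteditable="false"')
```

For markup that varies more than this, an HTML parser is usually a safer tool than a regex; the pattern above only relies on the fixed attributes described in the question.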

Let me know if you'd like any clarification or additional functionality.

Here's a link to a demonstration on regex101.com.
