Chris Rockwell - 2 months ago
HTML Question

Replace double slash with single slash in import.io XPath selector

I am using import.io to scrape some pages. I came across a page that uses internal hrefs like this:

http://domain.com//Event

Notice the double slash after the domain name. From my research, this is done for SEO purposes, but I need the URL without the double slash, so that it returns:

http://domain.com/Event

I am trying to use XPath (which I'm very new to), and I can get the link fine with:

//a[contains(@class, 'event-info-btn')]//@href

My next step was to try fn:replace() with this:

fn:replace(//a[contains(@class, 'event-info-btn')]//@href, 'http://domain.com//', 'http://domain.com/')

This isn't working: nothing is returned.

I'm not sure if my implementation is bad, or if import.io just doesn't support this.


  • I'll also note the reason why I'm trying to do this: import.io is failing on all of the URLs. If I manually remove the extra slash and try again, it works fine.


Answer

Note that import.io claims to support XPath 2.0.

Problem

You probably mean /@href rather than //@href, but that's not the real problem.

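Your XPath returns a sequence of href attributes, but replace() expects a single string as its first argument. With more than one matching link, the call is a type error rather than a replacement applied to each item.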

Solution

For this HTML,

<div>
  <a class="event-info-btn" href="http://domain.com//1">one</a>
  <a class="event-info-btn" href="http://domain.com//2">one</a>
  <a class="event-info-btn" href="http://domain.com//3">one</a>
</div>

this XPath,

for $href in //a[contains(@class, 'event-info-btn')]/@href 
    return replace($href, 'http://domain.com//', 'http://domain.com/')

will return

http://domain.com/1
http://domain.com/2
http://domain.com/3

as requested.
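
If you want to check the expected result outside import.io, the same per-item idea can be sketched in Python using only the standard library. This is an illustration, not part of the XPath solution: the `HrefCollector` class and `fixed_hrefs` function are hypothetical names, and the substring test on the class attribute is meant to mimic `contains(@class, 'event-info-btn')`.

```python
from html.parser import HTMLParser

# Sample markup copied from the example above.
HTML = """
<div>
  <a class="event-info-btn" href="http://domain.com//1">one</a>
  <a class="event-info-btn" href="http://domain.com//2">one</a>
  <a class="event-info-btn" href="http://domain.com//3">one</a>
</div>
"""


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags whose class contains a marker."""

    def __init__(self, class_marker):
        super().__init__()
        self.class_marker = class_marker
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        class_value = attr_map.get("class") or ""
        href = attr_map.get("href")
        # Substring match, like contains(@class, '...') in the XPath.
        if self.class_marker in class_value and href:
            self.hrefs.append(href)


def fixed_hrefs(html):
    """Return every matching href with the doubled slash removed."""
    collector = HrefCollector("event-info-btn")
    collector.feed(html)
    # One replace call per href, mirroring the "for ... return" expression.
    return [
        href.replace("http://domain.com//", "http://domain.com/")
        for href in collector.hrefs
    ]


if __name__ == "__main__":
    for url in fixed_hrefs(HTML):
        print(url)
```

The list comprehension plays the role of the `for $href in ... return ...` expression: the replacement is applied to one string at a time instead of to the whole sequence.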


Update

This doesn't work in import.io and I'm having trouble finding a fiddle-like site to test it.

You can see this working here.

Import.io, it seems, only allows you to input one line of XPath.

You might try putting the XPath on a single line, then:

for $href in //a[contains(@class, 'event-info-btn')]/@href return replace($href, 'http://domain.com//', 'http://domain.com/')

If that doesn't work, then import.io's claim that they support XPath 2.0 is not correct.
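
If the replacement cannot be done inside import.io at all, one fallback is to export the hrefs unchanged and clean them afterwards. The following is a minimal Python sketch of that post-processing step, assuming you can run a script over the scraped URLs; `collapse_path_slashes` is a hypothetical helper name, not something import.io provides.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse runs of '/' in the path only, leaving 'http://' intact."""
    parts = urlsplit(url)
    # Only the path is rewritten; scheme, host, query and fragment are kept.
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, clean_path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(collapse_path_slashes("http://domain.com//Event"))
```

Splitting the URL first avoids hard-coding the domain and keeps the double slash after the scheme untouched, which a plain string replace of `//` would break.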
