YellowPillow - 6 months ago
Swift Question

Parsing HTML in Swift other than using regex

Below is the HTML code that I want to parse through in Swift:

<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>


I have read that regex is not a good way to parse HTML, but I have nevertheless written an expression that captures what I want (the text inside each span): yī and yǎn


Regular expression:

/pinyin.+<span.+>(.+)<\/.+<span.+>(.+)<\//Us


I was wondering how to implement this in Swift so that I can capture both yī and yǎn at the same time and save them into an array. I was also wondering whether there is a way to do this without regex.
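
For reference, a regex-based version might look roughly like the sketch below. It is written in current Swift syntax and uses Foundation's NSRegularExpression. The helper name spanTexts(in:) is made up for illustration, and the pattern is deliberately simpler than the one above: it grabs the text of every span and does not check that the span sits inside the td with class "pinyin". As far as I know, NSRegularExpression has no direct equivalent of the PCRE U (ungreedy) flag, which is one reason the original expression is not reused as-is.

```swift
import Foundation

/// Returns the text inside every <span>…</span> in `html`.
/// Hypothetical helper for illustration; it does not check the enclosing <td>.
func spanTexts(in html: String) -> [String] {
    // One capture group: everything between the opening and closing span tags.
    let pattern = "<span[^>]*>([^<]*)</span>"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }

    // NSRegularExpression works with NSRange, so convert the whole string once.
    let whole = NSRange(html.startIndex..., in: html)

    return regex.matches(in: html, range: whole).compactMap { match in
        // Convert capture group 1 back into a Swift range and slice it out.
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

// The fragment from the question.
let html = """
<td class="pinyin">
<a href="rsc/audio/voice_pinyin_pz/yi1.mp3">
<span class="mpt1">yī</span></a>
<a href="rsc/audio/voice_pinyin_pz/yan3.mp3">
<span class="mpt3">yǎn</span>
</a>
</td>
"""

let syllables = spanTexts(in: html)
print(syllables)
```

With the fragment above, syllables should end up holding the two span texts in document order. The usual caveat still applies: a pattern like this breaks as soon as a span contains nested markup.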

EDIT:

I ended up using TFHpple as suggested by Rob. It took me a long time to figure out how to import it into a Swift project, so I thought it would be helpful to post the steps here:

1. Open your project and drag the TFHpple files into it

2. At this point Xcode will probably prompt you to create a bridging header file if you haven't included any Obj-C code in your current project. In this bridging header file you should add:

#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"


3. Select the target and, under General, find Linked Frameworks and Libraries (just scroll down in the General tab and you will see it). Add libxml2.2.dylib and libxml2.dylib

4. Under Build Settings, in Header Search Paths, add $(SDKROOT)/usr/include/libxml2
WARNING: make sure you edit Header Search Paths and not User Header Search Paths, as they are not the same setting

5. Under Build Settings, in Other Linker Flags, add -lxml2

Enjoy!

Rob
Answer

You can use the typical iOS HTML parser, TFHpple:

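// Load the HTML file (NSData(contentsOfFile:) gives nil if the file can't be read)
let data = NSData(contentsOfFile: path)
let doc = TFHpple(HTMLData: data)
// Select every <span> inside an <a> inside <td class="pinyin">
if let elements = doc.searchWithXPathQuery("//td[@class='pinyin']/a/span") as? [TFHppleElement] {
    for element in elements {
        // content is the text of the span, e.g. one pinyin syllable
        println(element.content)
    }
}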

Or you can use NDHpple:

let data = NSData(contentsOfFile: path)!
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//td/a/span") {
    for element in elements {
        println(element.children?.first?.content)
    }
}

I have more experience with TFHpple, so I'm personally more comfortable with that. NDHpple looks like it could be an alternative, though I'm not as keen on it (e.g. why does the HTMLData parameter take a string and not NSData? why do I have to navigate through children to get the contents of the //td/a/span results? the [@class='pinyin'] qualifier doesn't appear to work, etc.). But try both and see which you prefer.

Both require a bridging header: TFHpple requires TFHpple.h in the bridging header, and NDHpple requires the libxml headers there. See the documentation for each for more information.
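
If pulling in a third-party library is not an option, a fragment as small and tidy as the one in the question could probably also be handled with Foundation's own XMLParser. The sketch below is only that, written in current Swift syntax, and it rests on a big assumption: the markup has to be well-formed XML, which real-world HTML often is not. SpanCollector and spanTextsViaXMLParser are names invented for this example.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Collects the text of every <span> element it sees (hypothetical helper).
final class SpanCollector: NSObject, XMLParserDelegate {
    private(set) var texts: [String] = []
    private var current: String?  // non-nil only while inside a <span>

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String]) {
        if elementName == "span" { current = "" }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // Text may arrive in several pieces, so append rather than assign.
        current?.append(string)
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        if elementName == "span", let text = current {
            texts.append(text)
            current = nil
        }
    }
}

/// Assumes `fragment` is well-formed XML with a single root element.
func spanTextsViaXMLParser(_ fragment: String) -> [String] {
    let parser = XMLParser(data: Data(fragment.utf8))
    let collector = SpanCollector()
    parser.delegate = collector
    parser.parse()  // returns false on malformed input; texts then holds whatever was read
    return collector.texts
}

let fragment = "<td class=\"pinyin\"><a href=\"rsc/audio/voice_pinyin_pz/yi1.mp3\"><span class=\"mpt1\">yī</span></a></td>"
print(spanTextsViaXMLParser(fragment))
```

Unlike the XPath queries above, this version does not filter on the td's class; that check would have to be added by tracking the enclosing elements in the delegate. For anything beyond trivially clean markup, TFHpple remains the safer route.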