lfalin lfalin - 4 months ago 12
Objective-C Question

Remove HTML Tags from an NSString on the iPhone

There are a couple of different ways to remove HTML tags from an NSString in Cocoa.

One way is to render the string into an NSAttributedString and then grab the rendered text.
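The rendering approach can be sketched roughly as follows for macOS, where AppKit adds HTML import to NSAttributedString. The helper name PlainTextFromHTML is hypothetical, and the sketch assumes the input is UTF-8 text and that the call is made on the main thread, since the HTML importer relies on WebKit.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper: renders HTML into an attributed string and returns
// only the rendered characters. macOS (AppKit) only.
static NSString *PlainTextFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }
    // Tell the importer that the bytes are UTF-8 so non-ASCII text survives.
    NSDictionary *options = @{ NSCharacterEncodingDocumentOption : @(NSUTF8StringEncoding) };
    NSAttributedString *rendered =
        [[NSAttributedString alloc] initWithHTML:data
                                         options:options
                              documentAttributes:NULL];
    // -string drops fonts, colors, and links and keeps the visible text.
    return [rendered string];
}
```

Because the text is laid out as a document, the result may carry extra whitespace or trailing newlines where block elements ended, so trimming the output is usually worthwhile.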

Another way is to use NSXMLDocument's -objectByApplyingXSLTString:arguments:error: method to apply an XSLT transform that does it.
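A minimal sketch of the XSLT approach on macOS might look like this. The helper name PlainTextViaXSLT and the stylesheet are assumptions for illustration; the stylesheet simply emits the string value of the whole document, which also includes the text inside script and style elements.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: parses HTML with NSXMLDocument (macOS only) and
// applies a text-output XSLT stylesheet to drop the markup.
static NSString *PlainTextViaXSLT(NSString *html, NSError **error) {
    // NSXMLDocumentTidyHTML asks the parser to repair HTML into well-formed XML.
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:html
                                                          options:NSXMLDocumentTidyHTML
                                                            error:error];
    if (doc == nil) {
        return nil;
    }
    NSString *xslt =
        @"<?xml version='1.0' encoding='UTF-8'?>"
        @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        @"<xsl:output method='text' encoding='UTF-8'/>"
        @"<xsl:template match='/'><xsl:value-of select='.'/></xsl:template>"
        @"</xsl:stylesheet>";
    id result = [doc objectByApplyingXSLTString:xslt arguments:nil error:error];
    // With the text output method the result is expected to be NSData;
    // an XML node is handled as a fallback.
    if ([result isKindOfClass:[NSData class]]) {
        return [[NSString alloc] initWithData:result encoding:NSUTF8StringEncoding];
    }
    if ([result isKindOfClass:[NSXMLNode class]]) {
        return [(NSXMLNode *)result stringValue];
    }
    return nil;
}
```

How well this copes with badly malformed input depends on what the tidy step manages to repair, so the error out-parameter is worth checking.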

Unfortunately, the iPhone doesn't support NSAttributedString or NSXMLDocument. There are too many edge cases and malformed HTML documents for me to feel comfortable using regex or NSScanner. Does anyone have a solution to this?

One suggestion has been to simply look for opening and closing tag characters, but this method won't work except for very trivial cases.

For example, these cases (from the Perl Cookbook chapter on the same subject) would break this method:

<IMG SRC = "foo.gif" ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

Answer

A quick and "dirty" solution (it removes everything between < and >) that works with iOS >= 3.2:

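// Category method on NSString: repeatedly deletes the first "<...>" match until none remain.
// Written for manual reference counting; under ARC, drop the autorelease call.
- (NSString *)stringByStrippingHTML {
  NSRange r;
  NSString *s = [[self copy] autorelease];
  while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
    s = [s stringByReplacingCharactersInRange:r withString:@""];
  return s;
}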

I have this declared as a category on NSString.
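For reference, here is a sketch of how the category declaration and a few calls might look. The category name HTMLStripping and the function DemonstrateStripping are hypothetical, and this version assumes ARC, so the autorelease call is omitted. The last two inputs come from the question's list of edge cases and show where the "dirty" part bites.

```objc
#import <Foundation/Foundation.h>

// Hypothetical category name; the method body is the same regex loop, minus autorelease.
@interface NSString (HTMLStripping)
- (NSString *)stringByStrippingHTML;
@end

@implementation NSString (HTMLStripping)
- (NSString *)stringByStrippingHTML {
    NSString *s = [self copy];
    NSRange r;
    // Delete the first "<...>" match, then search again from the start.
    while ((r = [s rangeOfString:@"<[^>]+>"
                         options:NSRegularExpressionSearch]).location != NSNotFound) {
        s = [s stringByReplacingCharactersInRange:r withString:@""];
    }
    return s;
}
@end

static void DemonstrateStripping(void) {
    // Simple markup: only the text content remains.
    NSLog(@"[%@]", [@"<p>Hello <b>world</b></p>" stringByStrippingHTML]);
    // A '>' inside an attribute value ends the match early.
    NSLog(@"[%@]", [@"<IMG SRC = \"foo.gif\" ALT = \"A > B\">" stringByStrippingHTML]);
    // Comparison operators inside a script are mistaken for a tag.
    NSLog(@"[%@]", [@"<script>if (a<b && a>c)</script>" stringByStrippingHTML]);
}
```

The first call leaves "Hello world". The second leaves the fragment ` B">` behind, and the third returns "if (ac)", which silently eats part of the script text. For input you control this may be acceptable; for arbitrary HTML it is not a substitute for a real parser.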