user299791 - 24 days ago
Scala Question

Remove all text from string after a sequence of words in Scala

I am trying to assemble a UDF in Scala that takes a column from a data frame and manipulates it to remove HTML and other useless pieces of text.

The column I need to modify is very messy: sometimes it contains HTML, sometimes it does not. Searching SO, I found a regex solution to remove HTML.

What I'd like to accomplish now is a regex that finds a specific word in the text and deletes all the text after that word.

I think I understand from this SO answer that the regex should be something like \).* if you want to remove everything after ), so I am trying to adapt this to my case, unsuccessfully due to my lack of knowledge about regex.

I have strings like:

I am interested to hear from you, thanks Sent from iPhone other stuff I want to delete....


I'd like to retain the first part of the string up to, but not including, "Sent from", so a perfect output would be:

I am interested to hear from you, thanks


What I have so far is something like:

val toStringNoHTML = udf[String, String](_.toString
// code from SO as linked above
.replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*?>""", " ")
// delete all text after key word
.replaceAll("""'Sent from'.*""", "")
// remove all punctuation
.replaceAll("""[\p{Punct}\n]""", " ")
)


While the HTML gets removed, the "Sent from" and all the text after it do not. Any hints on how to adjust the regex to make it work?

EDIT
As pointed out in the comments, a small typo prevented my code from working. Thanks for the help:

.replaceAll("""'Sent from'.*""", "")


should be

.replaceAll("""Sent from.*""", "")
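
To illustrate why the quotes mattered, here is a minimal sketch in plain Scala (no Spark needed). The object and method names are made up for this example. Inside a regex the single quotes are ordinary literal characters, so the first pattern only matches text that actually contains 'Sent from' wrapped in quotes. The last variant is an assumption about multi-line input: by default . does not match line breaks, and the (?s) flag changes that.

```scala
object SentFromDemo {
  // Sample string from the question.
  val sample: String =
    "I am interested to hear from you, thanks Sent from iPhone other stuff I want to delete...."

  // Original pattern: the single quotes are matched literally,
  // so nothing is removed unless the text contains 'Sent from' in quotes.
  def withQuotes(s: String): String =
    s.replaceAll("""'Sent from'.*""", "")

  // Corrected pattern: removes "Sent from" and the rest of that line.
  def withoutQuotes(s: String): String =
    s.replaceAll("""Sent from.*""", "")

  // Assumed variant for multi-line text: (?s) lets . match line breaks,
  // so everything after "Sent from" is removed, not just the current line.
  def acrossLines(s: String): String =
    s.replaceAll("""(?s)Sent from.*""", "")
}
```

Note that withoutQuotes leaves the space before "Sent from" in place, so a .trim afterwards may be wanted.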

Answer

Instead of chaining multiple replaceAll(pattern, blank) calls, I'd be tempted to start with an extraction.

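// Group 1 optionally consumes leading text through a '>' (e.g. HTML tags);
// group 2 captures the message text that precedes "Sent from".
val msgRE = "(.*>)?(.*)Sent from.*".r

// udfStr stands for the input string (a single column value).
val result = udfStr match {
  case msgRE(_, msg) => Some(msg.trim) // .replaceAll() can be added here
  case _ => None
}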

Here the result is an Option[String], but the right type really depends on how you want to handle non-matching input.

If more cleaning is needed after the extraction then replaceAll() can be added where indicated (or the extraction pattern can be better refined).
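
As a rough sketch of how the extraction could be packaged, the snippet below wraps the same pattern in a small object. The names MessageExtractor, extract, extractAndClean and extractOrOriginal are hypothetical, and falling back to the trimmed original text is just one possible way to treat non-matching input.

```scala
object MessageExtractor {
  // Same pattern as above: optional leading text through a '>',
  // then the message, then "Sent from" and whatever follows it.
  private val msgRE = "(.*>)?(.*)Sent from.*".r

  // Returns the text before "Sent from", or None if the marker is absent.
  def extract(input: String): Option[String] = input match {
    case msgRE(_, msg) => Some(msg.trim)
    case _             => None
  }

  // Optional extra cleaning after the extraction: punctuation and newlines
  // become spaces, then runs of whitespace are collapsed to one space.
  def extractAndClean(input: String): Option[String] =
    extract(input).map(
      _.replaceAll("""[\p{Punct}\n]""", " ")
        .replaceAll("""\s+""", " ")
        .trim
    )

  // One possible policy for non-matching input: keep the original text.
  def extractOrOriginal(input: String): String =
    extract(input).getOrElse(input.trim)
}
```

Two caveats with this pattern: group 1 is greedy, so if a closing tag sits between the message and "Sent from" (as in <p>Hello</p> Sent from iPhone), the message itself ends up in group 1 and the extracted text is empty. Also, . does not match line breaks by default, so multi-line input will not match unless the pattern is prefixed with (?s).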