Marvin Schopf Marvin Schopf - 10 months ago 42
R Question

Extracting unknown dates from txt/HTML files using R

I want to extract Dates from txt(or HTML) documents using a Pattern which I identified in the text using the R tm package. I have newspaper articles on my pc in the folders data_X_txt and data_X (in HTML). Each of the folders contains documents named after a company which contains all newspaper articles in one txt or html document. I downloaded these documents in HTML from lexis nexis.

For each document i want to know the Upload dates from the contained articles. I identified that the Uploaddate is given for each article following the word UPDATE:.

So i found this question which is similar to my problem
Extract unknown words from a recurrent pattern

But i have several problems getting to the solution.
First off i dont know how to correctly upload my Data from the single documents into R for further processing with a regex formula.

Secondly I have problems with understanding and applying the sub formula myself. See this formula what i found:

> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

i have difficulties adapting the pattern part of sub (the first part i assume) to my problem.
Also i dont know what the second part means. For the third part i know that this is the source of the text but i dont know what [,5] means.

Here the code in full:

> tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

also a txt file i use:

My knowledge of R is currently Swirl courses and specifically on text mining

Answer Source

The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.

To achieve specifically what you asked for, try gregexpr w/ regmatches:

fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
mytxt <- readChar(fileName,$size)
regmatches(mytxt, regexec("UPDATE:",mytxt))

regmatches(mytxt, gregexpr(
"UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z]{1}[a-z|ä]{2,8} [0-9]{4}", 

It says, in English: look for the literal UPDATE: followed by a space, followed by an optional set of 0 to 10 characters corresponding to the (optional) day of the week in german, an optional space, a 1 to 2 digit number, a period (escaped by a \\, because reasons) a capital letter, all lowercase letters of the english alphabet and ä, in a sequence of 2 to 9 letters, followed by a space, followed by a 4 digit number.

You get:

[1] "UPDATE: 18. März 2008"      "UPDATE: 14. März 2008"     
[3] "UPDATE: 13. März 2008"      "UPDATE: 14. März 2008"     
[5] "UPDATE: 28. Februar 2008"   "UPDATE: 20. Februar 2008" 
[189] "UPDATE: 31. Dezember 2004"      "UPDATE: 3. Januar 2005"        
[191] "UPDATE: 9. Dezember 2004"       "UPDATE: 23. November 2004"