Marvin Schopf - 8 months ago
R Question

Extracting unknown dates from txt/HTML files using R

I want to extract dates from txt (or HTML) documents using a pattern that I identified in the text, with the R tm package. I have newspaper articles on my PC in the folders data_X_txt and data_X (in HTML). Each folder contains documents named after a company, and each document holds all newspaper articles for that company in a single txt or HTML file. I downloaded these documents in HTML from LexisNexis.

For each document I want to know the upload dates of the articles it contains. I found that the upload date of each article follows the word UPDATE:.

I found this question, which is similar to my problem:
Extract unknown words from a recurrent pattern

But I have several problems getting to a solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regex.

Secondly, I have trouble understanding and applying the sub formula myself. This is the formula I found:

> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

I have difficulty adapting the pattern part of sub (the first argument, I assume) to my problem.
I also don't know what the second argument means. I know the third argument is the source of the text, but I don't know what [,5] means.

Here is the code in full:

> tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])

Also, a txt file I use:

My knowledge of R currently comes from swirl courses, specifically on text mining.


The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.
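On the sub call quoted in the question, here is a rough sketch of what each argument does. The data frame below is a made-up stand-in for the CSV (its column names and URLs are assumptions, not the real data), and perl = TRUE is added only as a precaution, because (?: ) is Perl-style syntax.

```r
# Hypothetical stand-in for the CSV from the question: a data frame whose
# fifth column holds URLs. tmp[, 5] means "all rows, column 5".
tmp <- data.frame(a = 1:2, b = 1:2, c = 1:2, d = 1:2,
                  link = c("https://www.example.com/news/2016/story.html",
                           "example.org/sport/article"),
                  stringsAsFactors = FALSE)

# sub(pattern, replacement, x):
#  - pattern: optional "http(s)://", then the host, a slash, and the first
#    path segment inside plain parentheses (capture group 1), then the rest.
#  - replacement: "\\1" stands for whatever capture group 1 matched.
#  - x: the character vector to work on, here column 5 of tmp.
# Because the pattern spans the whole string (^ ... $), each element is
# replaced by its first path segment only.
first_segment <- sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1",
                     tmp[, 5], perl = TRUE)
print(first_segment)
```

So the second argument is not a second pattern but the text that replaces each match, and [,5] is ordinary data frame indexing.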

To achieve specifically what you asked for, try gregexpr with regmatches:

fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
mytxt <- readChar(fileName, file.info(fileName)$size)
regmatches(mytxt, regexec("UPDATE:",mytxt))

regmatches(mytxt, gregexpr(
"UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z]{1}[a-zä]{2,8} [0-9]{4}",
mytxt))

It says, in English: look for the literal UPDATE: followed by a space, then an optional run of 0 to 10 letters for the (optional) day of the week in German, an optional space, a 1 to 2 digit number, a period (written \\. because an unescaped . matches any character, and the backslash must itself be doubled inside an R string), a space, a capital letter, then 2 to 8 lowercase letters of the English alphabet or ä, followed by a space and a 4 digit number.
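To see the pattern at work without the original file, here is a small sketch on a made-up string. The sample text is an assumption, and ä is written as the escape \u00e4 so the script does not depend on the file encoding.

```r
# Hypothetical sample text standing in for the contents of one article file.
# "\u00e4" is the letter ä.
mytxt <- paste(
  "Some article text. UPDATE: 18. M\u00e4rz 2008",
  "More text. UPDATE: Montag 3. Januar 2005",
  "No date here. UPDATE: soon",
  sep = "\n"
)

pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-z\u00e4]{2,8} [0-9]{4}"

# gregexpr() locates every match; regmatches() returns a list with one
# element per input string, so [[1]] gives the character vector of matches.
hits <- regmatches(mytxt, gregexpr(pattern, mytxt))[[1]]
print(hits)
```

The third line of the sample contains UPDATE: but no date, so it is skipped; only the two dated entries come back.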

You get:

[1] "UPDATE: 18. März 2008"      "UPDATE: 14. März 2008"     
[3] "UPDATE: 13. März 2008"      "UPDATE: 14. März 2008"     
[5] "UPDATE: 28. Februar 2008"   "UPDATE: 20. Februar 2008" 
...
[189] "UPDATE: 31. Dezember 2004"      "UPDATE: 3. Januar 2005"        
[191] "UPDATE: 9. Dezember 2004"       "UPDATE: 23. November 2004"
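If you want the same thing for every file in a folder, and real Date objects instead of strings, something along these lines might work. It is only a sketch: extract_update_dates and german_months are names made up here, the demo folder is a throwaway stand-in for data_X_txt, and the month lookup is done by hand to avoid depending on a German locale.

```r
# German month names in calendar order ("\u00e4" is the letter ä).
german_months <- c("Januar", "Februar", "M\u00e4rz", "April", "Mai", "Juni",
                   "Juli", "August", "September", "Oktober", "November",
                   "Dezember")

# Hypothetical helper: all UPDATE dates found in one file, as Date objects.
extract_update_dates <- function(path) {
  txt  <- readChar(path, file.info(path)$size)
  hits <- regmatches(txt, gregexpr(
    "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-z\u00e4]{2,8} [0-9]{4}",
    txt))[[1]]
  # Pull day, month name and year out of each hit with capture groups.
  parts <- regmatches(hits, regexec(
    "([0-9]{1,2})\\. ([A-Z][a-z\u00e4]{2,8}) ([0-9]{4})$", hits))
  day   <- as.integer(vapply(parts, `[`, character(1), 2))
  month <- match(vapply(parts, `[`, character(1), 3), german_months)
  year  <- as.integer(vapply(parts, `[`, character(1), 4))
  # Unknown month names give NA rather than an error.
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

# Demo with a throwaway folder; in practice this would be e.g. "data_X_txt".
demo_dir <- file.path(tempdir(), "data_demo_txt")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("text UPDATE: 31. Dezember 2004", "text UPDATE: 3. Januar 2005"),
           file.path(demo_dir, "CompanyA.txt"))
writeLines("no dates here", file.path(demo_dir, "CompanyB.txt"))

# One element per file, named after the file (and so after the company).
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
dates_by_file <- lapply(files, extract_update_dates)
names(dates_by_file) <- basename(files)
print(dates_by_file)
```

With real files, the text encoding has to agree with the pattern for ä to match; if März goes missing from the results, that is the first thing to check.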