Marvin Schopf - 1 month ago
R Question

Extracting unknown dates from txt/HTML files using R

I want to extract dates from txt (or HTML) documents using a pattern which I identified in the text with the R tm package. I have newspaper articles on my PC in the folders data_X_txt (plain text) and data_X (HTML). Each folder contains one document per company, named after that company, which holds all of its newspaper articles in a single txt or HTML file. I downloaded these documents as HTML from LexisNexis.

For each document I want to know the upload dates of the articles it contains. I found that the upload date of each article follows the word UPDATE:.

I found this question, which is similar to my problem:
Extract unknown words from a recurrent pattern

But I have several problems getting to the solution.
First, I don't know how to correctly load the data from the individual documents into R for further processing with a regex.

Secondly, I have trouble understanding and applying the sub formula myself. This is the formula I found:

> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])


I have difficulty adapting the pattern part of sub (the first argument, I assume) to my problem.
I also don't know what the second argument means. I know that the third argument is the source of the text, but I don't know what [,5] means.

Here is the code in full:

> tmp <- read.csv("LaVanguardia_facebook_statuses.csv")
> sub("^(?:https?:\\/\\/)?[^\\/]+\\/([^\\/]+).*$", "\\1", tmp[,5])


Here is also a txt file I use:
https://www.dropbox.com/s/e24ywni8z3s8wqk/SolarWorldAG_25.03.2008_1.HTML.txt?dl=0

My knowledge of R currently comes from Swirl courses and, specifically on text mining, from https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html

Answer

The text mining package will not help much if all you need are the dates, but the regular expression capabilities of R are pretty useful.

To achieve specifically what you asked for, try gregexpr with regmatches:

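fileName <- "~/Downloads/SolarWorldAG_25.03.2008_1.HTML.txt"
# Read the whole file into a single string
mytxt <- readChar(fileName, file.info(fileName)$size)
# Sanity check: regexec returns only the first match of the literal UPDATE:
regmatches(mytxt, regexec("UPDATE:", mytxt))

# gregexpr finds every match; regmatches extracts the matched text
regmatches(mytxt, gregexpr(
"UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z]{1}[a-zä]{2,8} [0-9]{4}",
mytxt))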

It says, in English: look for the literal UPDATE: followed by a space, then an optional run of 0 to 10 letters for the (optional) day of the week in German, an optional space, a 1 to 2 digit number, a period (a bare . matches any character in a regex, so it is escaped with a backslash, and that backslash must itself be doubled inside an R string), a space, a capital letter, then 2 to 8 lowercase letters drawn from a-z and ä, followed by a space and a 4 digit number.
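To see the pattern at work without the original file, here is a minimal sketch on a made-up string. It also touches the sub() call from the question: in sub(pattern, replacement, x), "\\1" in the replacement stands for whatever the first parenthesised group in the pattern matched, and x is the character vector to work on (tmp[,5] there is simply the fifth column of the data frame returned by read.csv). The names sample_txt, update_pattern, hits and dates_only are my own, the sample text is invented, and ä is written as \u00e4 inside a plain [a-z\u00e4] class only to keep the source ASCII.

```r
# Hypothetical sample text standing in for one downloaded document.
sample_txt <- paste(
  "Some article text. UPDATE: 18. M\u00e4rz 2008",
  "Another article. UPDATE: Dienstag 3. Januar 2005",
  "A paragraph without any date line.",
  sep = "\n"
)

# Same structure as the pattern explained above (\u00e4 is the letter ä).
update_pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-z\u00e4]{2,8} [0-9]{4}"

# gregexpr locates every match; regmatches pulls out the matched text.
# [[1]] is needed because the result is a list with one element per input string.
hits <- regmatches(sample_txt, gregexpr(update_pattern, sample_txt))[[1]]

# sub() with a capture group: the part in ( ) is kept via "\\1",
# everything else in the match (UPDATE: and the optional weekday) is dropped.
dates_only <- sub("^UPDATE: [A-Za-z]* ?([0-9]{1,2}\\. .*)$", "\\1", hits)

print(hits)
print(dates_only)
```

The second step is optional. It is only there to show how the pattern and "\\1" parts of sub fit together on your own data.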

You get:

[1] "UPDATE: 18. März 2008"      "UPDATE: 14. März 2008"     
[3] "UPDATE: 13. März 2008"      "UPDATE: 14. März 2008"     
[5] "UPDATE: 28. Februar 2008"   "UPDATE: 20. Februar 2008" 
...
[189] "UPDATE: 31. Dezember 2004"      "UPDATE: 3. Januar 2005"        
[191] "UPDATE: 9. Dezember 2004"       "UPDATE: 23. November 2004"      
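For the first part of the question (getting every document of a folder into R), a rough sketch could look like the following. The helper names extract_update_dates and extract_folder, as well as the german_months lookup vector, are hypothetical and not from any package. I am also assuming the files are UTF-8 encoded and end in .txt, so adjust both if your export differs. The month lookup is used instead of as.Date with "%B" so that the result does not depend on the locale of your machine.

```r
# Pattern from the answer above (\u00e4 is the letter ä, kept ASCII in source).
update_pattern <- "UPDATE: [A-Za-z]{0,10} ?[0-9]{1,2}\\. [A-Z][a-z\u00e4]{2,8} [0-9]{4}"

# German month names in calendar order; the position is the month number.
german_months <- c("Januar", "Februar", "M\u00e4rz", "April", "Mai", "Juni",
                   "Juli", "August", "September", "Oktober", "November",
                   "Dezember")

# Hypothetical helper: read one file and return its UPDATE dates as Date objects.
# Assumes the file is UTF-8 encoded.
extract_update_dates <- function(path) {
  txt <- paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  hits <- regmatches(txt, gregexpr(update_pattern, txt))[[1]]
  if (length(hits) == 0) return(structure(numeric(0), class = "Date"))
  # Split the tail of each hit into day, month name and year.
  parts <- regmatches(hits, regexec("([0-9]{1,2})\\. ([^ ]+) ([0-9]{4})$", hits))
  day   <- as.integer(vapply(parts, `[`, "", 2))
  month <- match(vapply(parts, `[`, "", 3), german_months)
  year  <- as.integer(vapply(parts, `[`, "", 4))
  # Unknown month names end up as NA instead of raising an error.
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

# Hypothetical helper: run the extraction over every .txt file in a folder
# and return a list of Date vectors named after the files.
extract_folder <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  dates <- lapply(files, extract_update_dates)
  names(dates) <- basename(files)
  dates
}

# Usage, assuming the folder from the question is in the working directory:
# all_dates <- extract_folder("data_X_txt")
```

Each list element then holds the dates of one company document, so something like range(all_dates[[1]]) gives the earliest and latest upload date for that file.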