Antoine Antoine - 3 months ago 6
R Question

parsing file with r

I'm trying to parse this unstructured file using R:

ftp://ftp.fu-berlin.de/pub/misc/movies/database/ratings.list.gz

0000001322 175300 8.3 The Sting (1973)
0000001123 426445 8.3 2001: A Space Odyssey (1968)
0000001222 94315 8.3 Ladri di biciclette (1948)
0000001222 149759 8.3 Singin' in the Rain (1952)
0000001322 622326 8.3 Toy Story (1995)
0000001222 599957 8.3 Snatch (2000)


The file has (among other junk) lines like these above. I have tried
fread
and
read.table
, all fail at some point, and I was not able to find a solution. I need a way to parse the lines by dividing them into 4 variables. With regex I would do it like this:

^\s+(\S+)\s+(\d+)\s+(\S{3,4})\s+(.*)$


Any tips/pointers on how should I tackle this within R?

P.S. The first match has to be
\S
and not
\d
as some will look like this
.0..002212
and the third match will rarely be
10.0
hence I specified 3-4x non-whitespace. Anything after the score (ex
8.3
) is the movie title.

Answer

Firstly look at your data; the first 27 lines are prose information, and the first dataset runs from lines 28 to 278. The readr package's read_table function is a good bit smarter than read.table, and can handle the missing data well:

df <- readr::read_table('ratings.list.gz', skip = 27, n_max = 250)

df
## # A tibble: 250 x 5
##      New Distribution   Votes  Rank                                                Title
##    <chr>        <chr>   <int> <dbl>                                                <chr>
## 1          0000000125 1686502   9.2                      The Shawshank Redemption (1994)
## 2          0000000125 1153698   9.2                                 The Godfather (1972)
## 3          0000000124  789387   9.0                        The Godfather: Part II (1974)
## 4          0000000124 1671708   8.9                               The Dark Knight (2008)
## 5          0000000133  863309   8.9                              Schindler's List (1993)
## 6          0000000133  446671   8.9                                  12 Angry Men (1957)
## 7          0000000123 1322033   8.9                                  Pulp Fiction (1994)
## 8          0000000124 1213467   8.9 The Lord of the Rings: The Return of the King (2003)
## 9          0000000123  502576   8.9               Il buono, il brutto, il cattivo (1966)
## 10         0000000133 1344643   8.8                                    Fight Club (1999)
## # ... with 240 more rows

There is then another 10-line dataset, and then the main one, which seems to contain the rest. read_table isn't good at seeing the splits, but as long as you tell it where to start and if it should stop, it will do the rest.