Kári Gunnarsson - 4 months ago
JSON Question

Unable to use jsonlite::stream_in with certain JSON formats

I'm attempting to stream the rather large (65 GB) JSON file from the YASP data dump (https://github.com/yasp-dota/yasp/wiki/JSON-Data-Dump), but it seems that the way the JSON file is formatted means that I cannot read the file; it gives this error:

Error: parse error: premature EOF
(right here) ------^

I've created this small sample JSON file using the same format so everyone else can recreate it easily:

{"match_id": 2000594819,"match_seq_num": 1764515493}
{"match_id": 2000594820,"match_seq_num": 1764515494}
{"match_id": 2000594821,"match_seq_num": 1764515495}

I've saved this file as test.json and attempted to load it through the jsonlite::stream_in function:

con <- file('~/yasp/test.json')
jsonStream <- stream_in(con)

I get the same "premature EOF" error as shown above.

However, if the format of the file is all in one single chunk like so:

[{"match_id": 2000594819,"match_seq_num": 1764515493},{"match_id": 2000594820,"match_seq_num": 1764515494},{"match_id": 2000594821,"match_seq_num": 1764515495}]

Then there are no issues, and the stream_in works fine.

I've played around with using readLines, and collapsing the frame before reading it in:

initialJSON <- readLines('~/yasp/test.json')
collapsedJSON <- paste(initialJSON, collapse="")

While this does work and creates a character string I can read into fromJSON, it is not a scalable solution for me, since I can only read a few thousand lines at a time this way (I'd also love to be able to stream directly from the gz file).
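
For reference, a rough sketch of that full round trip on the test file (note that I join the lines with commas and wrap the result in brackets so fromJSON sees a single valid JSON array):

initialJSON <- readLines('~/yasp/test.json')
# join the individual objects with commas and wrap them in [] so the
# result parses as one JSON array
fixedJSON <- sprintf("[%s]", paste(initialJSON, collapse = ","))
matches <- jsonlite::fromJSON(fixedJSON)
# matches is now a data frame with match_id and match_seq_num columns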

Does anyone know how I can get stream_in to accept this file format, or some alternative way to do this in R? They show examples of how to do it in Java, but I would love to be able to do this without jumping into a language I don't really know.


Still haven't got stream_in to work, but I wrote my own streamer (of sorts) that seems to perform decently for my purposes.

fileCon <- file('~/yasp/test.json', open = "r")

# Initialize everything
numMatches <- 5
lineCount <- 0
matchCount <- 0
matchIDList <- NULL

# Stream using readLines and only look at even-numbered lines
while (matchCount < numMatches) {

  next_line <- readLines(fileCon, n = 1)
  lineCount <- lineCount + 1

  if (lineCount %% 2 == 0) {

    # read the line into JSON format
    readJSON <- jsonlite::fromJSON(next_line)

    # up the match counter
    matchCount <- matchCount + 1

    # whatever operations you want, for example get match_id
    matchIDList <- c(matchIDList, readJSON$match_id)
  }
}

close(fileCon)



Well, I never got the stream_in function to work for me, but I created my own streamer that works well and has a low footprint.

streamJSON <- function(con, pagesize, numMatches){

  ## "con" is the file connection
  ## "pagesize" is the number of lines streamed each iteration.
  ## "numMatches" is the number of games we want to output.

  outputFile <- 0
  matchCount <- 0
  print("Starting parsing games...")
  print(paste("Number of games parsed:", matchCount))

  # Stream in using readLines until we reach the number of matches we want.
  while (matchCount < numMatches) {

    initialJSON <- readLines(con, n = pagesize)

    # Drop the first line of the page, collapse the rest into one string,
    # and wrap it in brackets so it parses as a JSON array.
    collapsedJSON <- paste(initialJSON[2:pagesize], collapse = "")
    fixedJSON <- sprintf("[%s]", collapsedJSON)
    readJSON <- jsonlite::fromJSON(fixedJSON)

    finalList <- 0
    ## Run through the parsed page
    for (i in 1:length(readJSON$match_id)) {
      ## Some work with the JSON to return whatever it is I wanted to return.
      ## In this example: match_id, who won, and the duration.
      matchList <- as.data.frame(cbind(readJSON$match_id[[i]],
                                       readJSON$radiant_win[[i]],
                                       readJSON$duration[[i]]))
      colnames(matchList) <- c("match_id", "radiant_win", "duration")

      ## Assign to output
      if (length(finalList) == 1) {
        finalList <- matchList
      } else {
        finalList <- plyr::rbind.fill(finalList, matchList)
      }
    }

    matchCount <- matchCount + length(unique(finalList[, 1]))

    if (length(outputFile) == 1) {
      outputFile <- finalList
    } else {
      outputFile <- plyr::rbind.fill(outputFile, finalList)
    }

    print(paste("Number of games parsed:", matchCount))
  }

  return(outputFile)
}

Not sure if this is of help to others, since it may be a bit specific to the YASP data dump, but I can now call the function like this:

fileCon <- gzfile('~/yasp/yasp-dump-2015-12-18.json.gz', open="rb")
streamJSON(fileCon, 100, 500)

This will output a data frame of 500 rows with the specified data; I then just modify the part within the while loop for whatever it is that I'm looking to extract from the JSON data.
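
For example, here is roughly what that swap looks like if I instead want match_id and match_seq_num (the only other field shown above; anything beyond that depends on the dump's schema):

matchList <- as.data.frame(cbind(readJSON$match_id[[i]],
                                 readJSON$match_seq_num[[i]]))
colnames(matchList) <- c("match_id", "match_seq_num")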

I've been able to stream in 50,000 matches (with a fairly complex JSON function) pretty easily, and it seems to run in a comparable time (per match) to what the stream_in function did.