Martin Preusse - 3 months ago

Question

Process large JSON stream with jq

I get a very large JSON stream (several GB) from curl and try to process it with jq.

The relevant output I want to parse with jq is packed in a document representing the result structure:

{
  "results": [
    {
      "columns": ["n"],

      // get this
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
        // ... millions of rows
      ]
    }
  ],
  "errors": []
}


I want to extract the row data with jq. This is simple:

curl XYZ | jq -r -c '.results[0].data[].row[]'


Result:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}


However, this always waits until curl is completed.

I played with the --stream option, which is made for dealing with this. I tried the following command, but it also waits until the full object is returned from curl:

curl XYZ | jq -n --stream 'fromstream(1|truncate_stream(inputs)) | .[].data[].row[]'


Is there a way to 'jump' to the data field and start parsing the row entries one by one, without waiting for closing tags?

Answer

(1) The vanilla filter you would use would be as follows:

jq -r -c '.results[0].data[].row'
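
For the question's sample input (with the // comments removed so that it is valid JSON), this filter emits one array per element of data:

[{"key1":"row1","key2":"row1"}]
[{"key1":"row2","key2":"row2"}]

Appending [] to the filter (.results[0].data[].row[]) yields the bare objects shown in the question instead. Either way, this non-streaming filter has to read the entire input before it prints anything.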

(2) One way to use the streaming parser here would be to apply it to the output of .results[0].data, but the combination of the two steps will probably be slower than the vanilla approach.
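
One possible reading of that suggestion (the pipeline below is an illustration, not part of the original answer) is to let a first jq invocation extract .results[0].data and a second, streaming invocation consume that array element by element:

curl XYZ | jq -c '.results[0].data' | jq -cn --stream 'fromstream(1 | truncate_stream(inputs)) | .row[]'

The first jq still has to parse the complete response before it can emit the data array, so the wait described in the question remains, and the two-step pipeline will probably be slower overall than the vanilla filter.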

(3) You may wish to try something along these lines:

jq -n --stream 'inputs
      | select(length==2)
      | select( .[0]|[.[0],.[2],.[4]] == ["results", "data", "row"])
      | [ .[0][6], .[1]] '

For the illustrative input (modified to make it valid JSON, i.e. with the // comments removed), the output, shown here in compact form, would be:

[ "key", "value1" ] [ "key", "value2" ]