JSON Question

How to format the TSV file in Druid

I am trying to load a TSV file into Druid using this ingestion spec:

{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "sample_data",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "tsv",
"timestampSpec" : {
"column" : "date_time",
"format" : "yyyy-MM-dd hh:mm:ss"
},
"dimensionsSpec" : {
"dimensions": ["name", "email", "age"],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
}
},
"metricsSpec" : [
{
"type" : "count",
"name" : "count"
},
{
"type" : "doubleSum",
"name" : "age",
"fieldName" : "age"
}
],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE",
"intervals" : [ "2016-06-18/2016-06-22" ]
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "quickstart/sample_data",
"filter" : "people_json_file.json"
}
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : -1,
"rowFlushBoundary" : 0,
"numShards": 1
}
}
}


New Questions:

1) For the `dataSource` key in the `dataSchema` object, does that have to be `local` if you are ingesting from your local machine, or can it be any phrase like `sample_data`, the one I put?

2) Also, in the `metricsSpec` array, is the first object, `{ "type" : "count", "name" : "count" }`, a generic metric for every single dimension in the dataset? I ask because I do not have any metric column named `count` in my JSON schema.

3) Lastly, from what I understand, each entry in `metricsSpec` has a `type`, `name`, and `fieldName` key. Do the `name` and `fieldName` keys have to have the same value, like `age` and `age`?

BOLDED PART BELOW IS ANSWERED BY zlosim:

If my schema looks like this:

Schema: name email age


And actual dataset looks like this:

name email age
Bob Jones 23
Billy Jones 45


Is this how the columns should be formatted in the above dataset for a TSV? That is, should `name email age` (the column names) come first, followed by the actual data? I am confused about how Druid will know how to map the columns to the actual data in TSV format.

Answer

TSV stands for tab-separated values, so it looks the same as CSV but uses tabs instead of commas, e.g.:

Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St

Use the first line as a header to define your column names, so that you can then use "name", "age", or "email" as dimensions in your spec file.

As for GMT and UTC, they are basically the same:

There is no time difference between Greenwich Mean Time and Coordinated Universal Time.

The first is a time zone; the other is a time standard.
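
To make that concrete, here is a small Python sketch (standard library only, not part of Druid) that takes a timestamp with a +01:00 offset, like the ones in the sample file further down, and expresses it in UTC and in a zero-offset zone that I label "GMT" purely for illustration. Both give the same clock time.

```python
from datetime import datetime, timedelta, timezone

# Zero-offset zone labelled "GMT"; the label is only for illustration.
gmt = timezone(timedelta(0), "GMT")

# A timestamp carrying an explicit +01:00 offset.
local = datetime.fromisoformat("2016-07-16T19:20:30+01:00")

as_utc = local.astimezone(timezone.utc)
as_gmt = local.astimezone(gmt)

print(as_utc.isoformat())
print(as_gmt.isoformat())
print(as_utc == as_gmt)  # both name the same instant
```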

By the way, don't forget to include a column with a time value in your TSV file!
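
If you generate the file yourself, a minimal Python sketch like the one below writes a header line followed by tab-separated rows, including a time column. The `date_time` column name is borrowed from the spec in the question, and the timestamp values are made up for illustration.

```python
import csv
import io

# Header and rows; the date_time values are hypothetical sample data.
header = ["name", "age", "date_time"]
rows = [
    ["Paul", "23", "2016-06-18 10:15:00"],
    ["Zeke", "45", "2016-06-19 16:30:00"],
]

def to_tsv(header, rows):
    """Return the rows as tab-separated text with the header on the first line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

tsv_text = to_tsv(header, rows)
print(tsv_text)
```

To write to disk instead, open the target file with `newline=""` and pass it to `csv.writer` in place of the in-memory buffer.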

So, for example, if you have a TSV file that looks like this:

"name"  "position"  "office"    "age"   "start_date"    "salary"
"Airi Satou"    "Accountant"    "Tokyo" "33"    "2016-07-16T19:20:30+01:00" "162700"
"Angelica Ramos"    "Chief Executive Officer (CEO)" "London"    "47"    "2016-07-16T19:20:30+01:00" "1200000"

Your spec file should look like this:

{
    "spec" : {
        "ioConfig" : {
            "inputSpec" : {
                "type": "local",
                "baseDir": "path_to_folder",
                "filter": "name_of_the_file(s)"
            }
        },
        "dataSchema" : {
            "dataSource" : "local",
            "granularitySpec" : {
                "type" : "uniform",
                "segmentGranularity" : "hour",
                "queryGranularity" : "none",
                "intervals" : ["2016-07-01/2016-07-28"]
            },
            "parser" : {
                "type" : "string",
                "parseSpec" : {
                    "format" : "tsv",
                    "dimensionsSpec" : {
                        "dimensions" : [
                            "position",
                            "age",
                            "office"
                        ]
                    },
                    "timestampSpec" : {
                        "format" : "auto",
                         "column" : "start_date"
                    }
                }
            },
            "metricsSpec" : [
                {
                    "name" : "count",
                    "type" : "count"
                },
                {
                    "name" : "sum_sallary",
                    "type" : "longSum",
                    "fieldName" : "salary"
                }
            ]
        }
    }
}
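
In the spec above, the `longSum` metric reads the `salary` column (`fieldName`) and stores the result under `sum_sallary` (`name`), so the two do not have to match. The `count` metric has no `fieldName` at all because it simply counts ingested rows. The Python sketch below is only a rough illustration of that idea, not Druid's actual implementation; the rows, including the second Tokyo salary, are hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed rows: two share the same timestamp and dimension value.
rows = [
    {"start_date": "2016-07-16T18:00:00Z", "office": "Tokyo", "salary": 162700},
    {"start_date": "2016-07-16T18:00:00Z", "office": "Tokyo", "salary": 100000},
    {"start_date": "2016-07-16T18:00:00Z", "office": "London", "salary": 1200000},
]

def roll_up(rows, dimensions, field_name, output_name):
    """Group rows by timestamp and dimensions, then count and sum each group."""
    groups = defaultdict(lambda: {"count": 0, output_name: 0})
    for row in rows:
        key = (row["start_date"],) + tuple(row[d] for d in dimensions)
        groups[key]["count"] += 1                    # count: number of rows, no input column
        groups[key][output_name] += row[field_name]  # sum: reads field_name, stored as output_name
    return dict(groups)

result = roll_up(rows, ["office"], field_name="salary", output_name="sum_sallary")
for key, metrics in result.items():
    print(key, metrics)
```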