I am trying to load a TSV file into Druid using this ingestion spec.
Most recent spec below:
{
  "type" : "index",
  "spec" : {
    "ioConfig" : {
      "type" : "index",
      "inputSpec" : {
        "type" : "local",
        "baseDir" : "quickstart",
        "filter" : "test_data.json"
      }
    },
    "dataSchema" : {
      "dataSource" : "local",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-18/2016-07-22"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : ["name", "email", "age"]
          },
          "timestampSpec" : {
            "format" : "yyyy-MM-dd HH:mm:ss",
            "column" : "date"
          }
        }
      },
      "metricsSpec" : [
        { "name" : "count", "type" : "count" },
        { "type" : "doubleSum", "name" : "age", "fieldName" : "age" }
      ]
    }
  }
}
If my schema looks like this:
Schema: name email age
and the actual dataset looks like this:
name email age
bob jones 23
billy jones 45
Is that how the columns should be formatted in the above dataset in a TSV? Should the first line be the column names (name email age), followed by the actual data? I am confused about how Druid will know how to map the columns to the actual dataset in TSV format.
TSV stands for tab-separated values. It looks the same as CSV, but it uses tabs instead of commas, e.g.:
name<tab>age<tab>address
paul<tab>23<tab>1115 w franklin
bessy cow<tab>5<tab>big farm way
zeke<tab>45<tab>w main st
You use the first line as a header to define your column names, so you can use "name", "age" or "email" as dimensions in your spec file.
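To make the header idea concrete, here is a minimal Python sketch of how a header row ties column names to the values in each data row. It uses Python's standard csv module, not Druid's parser, and the helper name read_tsv is made up for illustration; the sample rows are the ones from the example above.

```python
import csv
import io

# Sample data from the example above; "\t" is the tab separator.
TSV_TEXT = (
    "name\tage\taddress\n"
    "paul\t23\t1115 w franklin\n"
    "bessy cow\t5\tbig farm way\n"
    "zeke\t45\tw main st\n"
)


def read_tsv(text):
    """Return one dict per data row, keyed by the header-row column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(TSV_TEXT)
for row in rows:
    # Each value is looked up by column name, not by position.
    print(row["name"], row["age"], row["address"])
```

Because the separator is a tab, a value such as "bessy cow" can contain spaces and still stay in a single column.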
As for GMT and UTC, they are the same.
There is no time difference between Greenwich Mean Time and Coordinated Universal Time.
The first one is a time zone, the other one is a time standard.
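As a small illustration of the UTC point, the Python sketch below converts an ISO 8601 timestamp that carries a +01:00 offset into UTC, which is the same wall-clock time as GMT. The sample timestamp and the helper name to_utc are chosen for illustration only, and this is not how Druid itself parses timestamps.

```python
from datetime import datetime, timezone


def to_utc(iso_text):
    """Parse an ISO 8601 timestamp with an offset and return it in UTC."""
    return datetime.fromisoformat(iso_text).astimezone(timezone.utc)


utc_time = to_utc("2016-07-16T19:20:30+01:00")
print(utc_time.isoformat())  # 2016-07-16T18:20:30+00:00
```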
By the way, don't forget to include a column with a time value in your TSV file!
So, for example, if you have a TSV file that looks like this:
"name"	"position"	"office"	"age"	"start_date"	"salary"
"airi satou"	"accountant"	"tokyo"	"33"	"2016-07-16T19:20:30+01:00"	"162700"
"angelica ramos"	"chief executive officer (ceo)"	"london"	"47"	"2016-07-16T19:20:30+01:00"	"1200000"
your spec file should look like this:
{
  "spec" : {
    "ioConfig" : {
      "inputSpec" : {
        "type" : "local",
        "baseDir" : "path_to_folder",
        "filter" : "name_of_the_file(s)"
      }
    },
    "dataSchema" : {
      "dataSource" : "local",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-01/2016-07-28"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "dimensionsSpec" : {
            "dimensions" : [ "position", "age", "office" ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "start_date"
          }
        }
      },
      "metricsSpec" : [
        { "name" : "count", "type" : "count" },
        { "name" : "sum_sallary", "type" : "longSum", "fieldName" : "salary" }
      ]
    }
  }
}
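Before submitting a spec like this, it can help to check that every column it refers to actually appears in the file's header row. The Python sketch below is one way to do that under some assumptions: it is a standalone helper, not part of Druid, the function names referenced_columns and missing_columns are made up, and SPEC is a trimmed copy of the spec above with its keys written in camelCase.

```python
import csv
import io

# Trimmed copy of the spec above: only the parts that name columns.
SPEC = {
    "dataSchema": {
        "parser": {
            "parseSpec": {
                "format": "tsv",
                "dimensionsSpec": {"dimensions": ["position", "age", "office"]},
                "timestampSpec": {"format": "auto", "column": "start_date"},
            }
        },
        "metricsSpec": [
            {"name": "count", "type": "count"},
            {"name": "sum_sallary", "type": "longSum", "fieldName": "salary"},
        ],
    }
}

# Header row of the sample TSV file above (tab-separated, quoted).
HEADER_LINE = '"name"\t"position"\t"office"\t"age"\t"start_date"\t"salary"'


def referenced_columns(spec):
    """Collect the dimension, timestamp and metric input columns of a spec."""
    schema = spec["dataSchema"]
    parse_spec = schema["parser"]["parseSpec"]
    columns = list(parse_spec["dimensionsSpec"]["dimensions"])
    columns.append(parse_spec["timestampSpec"]["column"])
    # Metrics such as "count" have no input column, so skip those.
    columns.extend(m["fieldName"] for m in schema["metricsSpec"] if "fieldName" in m)
    return columns


def missing_columns(spec, header_line):
    """Return the columns the spec refers to that the header does not contain."""
    header = next(csv.reader(io.StringIO(header_line), delimiter="\t"))
    return [c for c in referenced_columns(spec) if c not in header]


missing = missing_columns(SPEC, HEADER_LINE)
print("missing columns:", missing)
```

An empty list means every dimension, the timestamp column and every metric input can be found in the header. Running the same check against a header of name, email and age would report position, office, start_date and salary as missing.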