jeff jeff - 3 months ago 21
MySQL Question

Partitions in hive

I have dataset on CDH cluster and it is partitioned by by yyyymm.

When I run the below query on hive:

select actvydt, cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos where yyyymm=201601 and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;


It is hitting the right partition of 201601 from the dataset.

Below are the results:

actvydt yyyymm
2016-01-02 201601
2016-01-02 201601
2016-01-02 201601


But when I run the below query: (just passing the parameter for yyyymm through subst and concat function)

select actvydt,cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos.pos_sales_weekly where yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;


It is hitting the entire dataset. So the value for yyyymm is not getting passed correctly. There is some problem with this function:

cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int)


But the value of the function is passed as a column and can be seen in the results above. It is showing the correct parameter 201601.
Any help would be highly appreciable.

Below is the table schema:
CREATE EXTERNAL TABLE IF NOT EXISTS pos (nid bigint, actvydt date, upc string, tchid string, posfileid string, yssk bigint) PARTITIONED BY (yyyymm int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/data/' TBLPROPERTIES ( 'avro.output.codec'='snappy' );

Answer

Partition key values must be known before query execution for partition pruning to work. You are using WHERE clause: yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09'

Optimizer unfortunately does not have such intelligence to infer yyyymm values from your rather complex function before query execution. Try to additionally add explicit condition: yyyymm='201601' This will work. You can pass it as a variable.

Comments