user3292373 user3292373 - 1 month ago 15
SQL Question

Hive join understanding issue

I have created two tables as below in hive

create table test1(id string)
;

create table test2(id string);


test1 has values as given below

1

1

test2 has values as given below

1

1

When I am joining these two tables I am getting output with

1

1

1

1

This is the query used :

select a.id from test1 a,test2 b where a.id=b.id;


Please help I expected the output to be as

1

1

I am using cloudera distribution

Answer Source

Better use ANSI join syntax:

select a.id 
  from test1 a 
       inner join test2 b on a.id=b.id

The expected output cannot be the result of your join because for each a.id all matching rows from a and b are selected. For the first row from a it will be two matching rows in b. For the second row from a it will be also two matching rows from b. So it will be four rows totally.

You can apply distinct to the second table before join for example.

select a.id 
  from test1 a 
       inner join (select distinct b.id from test2 b) b on a.id=b.id

In this case for each row in table a it will be single matching row in table b.

See this lesson to understand JOINS better: https://www.coursera.org/learn/analytics-mysql/lecture/kydcf/joins-with-many-to-many-relationships-and-duplicates