MySQL Question

What's the difference between a Sqoop import to HDFS and a Sqoop import to Hive?

I was able to use Sqoop to import a MySQL table "titles" into HDFS with a command like this:

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table=titles --target-dir=titles --m=1


Now I want to import it into Hive. If I use the following command:

sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table titles --hive-import


I get the following error:


Output directory hdfs://localhost:9000/user/root/titles already exists


In Hive, if I do a show tables, I get the following:

hive> show tables;
OK
dept_emp
emp
myfirsthivetable
parted1emp
partitionedemp


You can see there is no table called titles in Hive.

I am confused by this: for Sqoop-imported data, is there a one-to-one relationship between HDFS and Hive? And what does that error message mean?

Thank you for enlightening me.

Answer

As Amit has pointed out, your first command already created the HDFS directory titles, so Sqoop refuses to overwrite that folder because it already contains data.

In your second command, you are telling Sqoop to import the whole table (which was already imported by the first command) into Hive. Since you are not specifying a --target-dir with the HDFS destination, Sqoop tries to create the folder titles under /user/root/. Since that folder already exists, an error is raised.
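
If you want to double-check what is sitting in that directory, or clear it out so a retry can succeed, something along these lines should work (the path is the one shown in the error message above):

hdfs dfs -ls /user/root/titles     # list the files written by the first import
hdfs dfs -rm -r /user/root/titles  # remove the directory so Sqoop can recreate it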

When you tell Hive to show the tables, titles doesn't appear because the second command (with --hive-import) was not successful, so Hive doesn't know anything about the data. When you add the --hive-import flag, what Sqoop does under the hood is update the Hive metastore, which is a database that holds the metadata of the Hive tables, partitions and HDFS locations.
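
To give a rough idea, once the HDFS copy has finished, the --hive-import step behaves roughly like running the Hive statements below. This is only a sketch: the real DDL is generated by Sqoop from the source schema, and the column list here is an assumption based on the standard MySQL employees sample database.

# rough sketch of what Sqoop's --hive-import step does after the HDFS copy;
# column names/types are assumptions, Sqoop derives them from the MySQL table
hive -e "CREATE TABLE IF NOT EXISTS titles (emp_no INT, title STRING, from_date STRING, to_date STRING);
         LOAD DATA INPATH 'hdfs://localhost:9000/user/root/titles' INTO TABLE titles;"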

You could do the data import with just one Sqoop command instead of two. If you delete the titles HDFS folder and run something like this:

sqoop import --connect jdbc:mysql://localhost/employees --username=root \
  -P --table=titles --target-dir=titles --hive-import --m=1

This way, you are pulling the data from MySQL, creating the /user/root/titles HDFS directory and updating the metastore, so that Hive knows where the table (and the data) is.
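
Afterwards you can verify that Hive now knows about the table and where its data lives, for example like this:

hive -e "show tables;"                # titles should now appear in the list
hive -e "describe formatted titles;"  # the Location field should point at /user/root/titles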

But what if you don't want to delete the folder with the data you already imported? In that case, you can create a new Hive table titles and point it at the existing data with something like this:

CREATE [TEMPORARY] [EXTERNAL] TABLE titles
  [(col_name data_type [COMMENT col_comment], ...)]
  (...)
  LOCATION '/user/root/titles'

This way, you wouldn't need to re-import the whole data again, since it's already in HDFS.
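
For example, a concrete version of that statement could look like the sketch below. The column names and types are assumptions based on the standard MySQL employees sample database (check your actual schema), and the ',' delimiter matches Sqoop's default for plain HDFS text imports. Using EXTERNAL also means that dropping the table later won't delete the files in /user/root/titles.

# illustrative only; column names/types are assumptions based on the
# employees sample schema, and ',' is Sqoop's default text delimiter
hive -e "CREATE EXTERNAL TABLE titles (
           emp_no INT,
           title STRING,
           from_date STRING,
           to_date STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/root/titles';"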