
Making a storage plugin for HDFS on Apache Drill

I'm trying to make a storage plugin for Hadoop (HDFS) in Apache Drill.
Honestly, I'm confused: I don't know what port to use for the hdfs:// connection, or what to set for the location.
This is my plugin:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:54310",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [ "tbl" ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [ "csv" ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [ "tsv" ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}


So, is it correct to set localhost:54310? I got that address with the command:

    hdfs getconf -nnRpcAddresses

or should it be :8020?
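
For reference, this is what the command printed for me (a single-node setup, so your address will likely differ):

    $ hdfs getconf -nnRpcAddresses
    localhost:54310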

Second question: what do I need to set for location? My Hadoop folder is at:

    /usr/local/hadoop

and in there you can find /etc, /bin, /lib, /log and so on. Do I need to point the location at my datanode, or somewhere else?

Third question. When I connect to Drill, I go through sqlline and then connect to my ZooKeeper like this:

    !connect jdbc:drill:zk=localhost:2181

My question here is: after I make the storage plugin and connect to Drill with zk, can I query HDFS files?

I'm very sorry if this is a noob question, but I haven't found anything useful on the internet, or at least it hasn't helped me.
If you can explain some of this to me, I'll be very grateful.

Answer Source

As per the Drill docs:

  {
    "type" : "file",
    "enabled" : true,
    "connection" : "hdfs://10.10.30.156:8020/",
    "workspaces" : {
      "root" : {
        "location" : "/user/root/drill",
        "writable" : true,
        "defaultInputFormat" : null
      }
    },
    "formats" : {
      "json" : {
        "type" : "json"
      }
    }
  }

In "connection",

put namenode server address.

If you are not sure about this address. Check fs.default.name or fs.defaultFS properties in core-site.xml.
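
For example, you can grep the property straight out of the config file (a sketch assuming the standard layout under /usr/local/hadoop mentioned in the question, and that the <value> element sits on the line after <name>; the value shown is just an illustration matching the question's setup):

    $ grep -A1 'fs.defaultFS' /usr/local/hadoop/etc/hadoop/core-site.xml
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:54310</value>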

Coming to "workspaces", you can define workspaces here. In the above example, there is a workspace named root whose location is /user/root/drill. That location is an HDFS path.

If you have files under the /user/root/drill HDFS directory, you can query them using this workspace name.

Example: abc.csv is under this directory.

    select * from dfs.root.`abc.csv`
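
If you are going to run several queries against the same workspace, you can also make it the default first and then query by bare filename (standard Drill syntax, shown with the same example file):

    use dfs.root;
    select * from `abc.csv`;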

After successfully creating the plugin, you can start Drill and start querying.
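
As a side note, the Web UI is not the only way to create the plugin: Drill also exposes a REST endpoint for storage plugin configuration. A minimal sketch, assuming the Web UI runs on the default port 8047, you want the plugin named hdfs, and hdfs_plugin.json is a hypothetical local file wrapping the config shown above as {"name": "hdfs", "config": { ... }}:

    curl -X POST -H "Content-Type: application/json" \
         -d @hdfs_plugin.json \
         http://localhost:8047/storage/hdfs.json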

You can query any directory irrespective of workspaces.

Say you want to query employee.json in the /tmp/data HDFS directory.

The query is:

    select * from dfs.`/tmp/data/employee.json`
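
The same path syntax also works on a directory: point the query at /tmp/data itself and Drill will read every file in it (assuming the files share one format):

    select * from dfs.`/tmp/data`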