Noobie Noobie - 2 months ago 11
Python Question

How to load/export CSV/TSV files from Pig to Pandas?

I am using Apache Pig to process some data, and at the end of my script I use:

store data into '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge /mypath/tempp2 /localpath/data.tsv;


That way I have a tsv file that is readable with read_csv(header=0) in Pandas.

The problem is that the tsv file now contains the headers on the first row (which is nice), but also the schema concatenated to the first observation in the second row, such as:

col1 col2 col3
{pigschema}0 1 2


assuming the first row is [0,1,2]. So unless I use skiprows=1 in read_csv (losing that row), I get this weird observation in my data.

So I wonder if there is a better way to export my data, while getting the headers.

Many thanks!

Answer

First of all, you want to use the -nl parameter for -getmerge:

store data into  '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge -nl /mypath/tempp2  /localpath/data.tsv;

Docs:

Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

Then /localpath/data.tsv will have the following structure (line numbers are 0-based):

0 - header line
1 - empty line
2 - Pig schema
3 - empty line
4 - 1st line of data
5 - 2nd line of data
...

So now you can easily read it in pandas, keeping the header on line 0 and skipping lines 1 to 3:

import pandas as pd
df = pd.read_csv('/localpath/data.tsv', sep='\t', skiprows=[1, 2, 3])
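
To check the skiprows indices without running Pig, here is a minimal sketch that reads an in-memory sample laid out like the merged file described above. The sample text is made up for illustration (the schema line is just the placeholder {pigschema} from the question, not real Pig output), and io.StringIO stands in for /localpath/data.tsv.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the merged layout:
# header, empty line, Pig schema, empty line, then the data rows.
sample = (
    "col1\tcol2\tcol3\n"
    "\n"
    "{pigschema}\n"
    "\n"
    "0\t1\t2\n"
    "3\t4\t5\n"
)

# skiprows takes 0-based line indices, so line 0 (the header) is kept
# and lines 1, 2 and 3 are dropped before parsing.
df = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=[1, 2, 3])

print(df.columns.tolist())
print(df.shape)
```

With this sample the first observation [0, 1, 2] should survive as the first row of df, which is the row that skiprows=1 on the original file would have lost.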