Noobie Noobie - 1 year ago 83
Python Question

How to load/export CSV/TSV files from Pig to Pandas?

I am using Apache Pig to process some data, and at the end of my script I use:

store data into '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge /mypath/tempp2 /localpath/data.tsv;

That way I have a TSV file that I can read in Pandas.

The problem is that the TSV file now contains the headers on the first row (which is nice), but also the schema concatenated to the first observation on the second row, such as:

col1 col2 col3
{pigschema}0 1 2

assuming the first row of data is 0 1 2. So unless I skip that row when reading the file (losing that observation), I get this weird observation in my data.

So I wonder if there is a better way to export my data, while getting the headers.

Many thanks!

Answer Source

First of all, you want to pass the -nl parameter to -getmerge:

store data into '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge -nl /mypath/tempp2 /localpath/data.tsv;


The optional -nl parameter adds a newline character (LF) at the end of each file being merged.

Your /localpath/data.tsv will then have the following structure (line numbers are zero-based):

0 - header line
1 - empty line
2 - Pig schema
3 - empty line
4 - 1st line of data
5 - 2nd line of data
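
Before relying on fixed row numbers, it can help to confirm that the merged file really has this layout. The following is a minimal sketch using only the standard library; the helper name numbered_head and the in-memory sample are made up for illustration, and {pigschema} is the same placeholder used in the question, not real Pig output.

```python
from itertools import islice


def numbered_head(lines, n=6):
    """Return the first n lines as (zero-based index, text) pairs."""
    return [(i, line.rstrip("\r\n")) for i, line in enumerate(islice(lines, n))]


# In-memory sample that mimics the layout listed above (hypothetical data).
# For a real file, pass an open file object instead, e.g.
#   with open('/localpath/data.tsv') as fh:
#       numbered_head(fh)
sample = ["col1\tcol2\tcol3\n", "\n", "{pigschema}\n", "\n", "0\t1\t2\n"]

for index, text in numbered_head(sample):
    # repr() makes empty lines and tab characters visible.
    print(index, repr(text))
```

If the printed indices of the empty lines and the schema line differ from the list above, the rows to skip when reading the file need to be adjusted accordingly.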

Now you can easily read it in pandas, skipping lines 1 to 3:

import pandas as pd

# Keep line 0 as the header; skip the empty lines and the Pig schema line.
df = pd.read_csv('/localpath/data.tsv', sep='\t', skiprows=[1,2,3])
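
If the exact positions of the extra lines are uncertain, one possible alternative is to filter them out by content instead of by row number. This is only a sketch under a strong assumption: that the schema line is the only line starting with '{', which would not hold if the first column of the data can itself start with '{' (a Pig bag, for example). The function name drop_pig_schema and the sample are hypothetical.

```python
def drop_pig_schema(lines):
    """Yield header and data lines, skipping empty lines and the schema line.

    Assumption (not guaranteed): the schema line is the only line that
    starts with '{'. Data rows starting with '{' would be dropped as well.
    """
    for line in lines:
        # Strip only the line ending, so rows made of tabs (all-null rows)
        # are not mistaken for empty lines.
        content = line.rstrip("\r\n")
        if content == "" or content.startswith("{"):
            continue
        yield line


# In-memory sample with the layout described above (hypothetical data).
sample = ["col1\tcol2\tcol3\n", "\n", "{pigschema}\n", "\n", "0\t1\t2\n"]
cleaned = list(drop_pig_schema(sample))
print("".join(cleaned), end="")
```

The cleaned lines can then be joined into one string, wrapped in io.StringIO, and passed to pd.read_csv with sep='\t' and no skiprows argument.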