Hussain Sultan - 3 months ago
Python Question

How to map results of a `dask.DataFrame` back to their source CSVs

I create a dataframe with `df = dask.dataframe.read_csv('s3://bucket/*.csv')`. When I execute a `df[df.a.isnull()].compute()` operation, I get back the rows that match the filter criteria. I would like to know which files these rows belong to, so that I can investigate why such records have null values. The `DataFrame` has billions of rows, while the records with missing values number in the single digits. Is there an efficient way to do this?

Answer

If your CSV files are small, then I recommend creating one partition per file:

import dask.dataframe as dd

df = dd.read_csv('s3://bucket/*.csv', blocksize=None)  # blocksize=None: one whole file per partition

And then computing the number of null elements per partition:

counts = df.a.isnull().map_partitions(sum).compute()  # one null count per partition, in file order
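Because `blocksize=None` puts each file in its own partition, any partition with a non-zero count can also be pulled into memory on its own for a closer look. A minimal sketch using Dask's `get_partition` (the column name `a` comes from the question):

for i, n in enumerate(counts):
    if n > 0:
        part = df.get_partition(i).compute()  # load just this one file into pandas
        print(i, part[part.a.isnull()])       # show the offending rows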

You could then find the filenames:

from s3fs import S3FileSystem
s3 = S3FileSystem()
filenames = s3.glob('s3://bucket/*.csv')  # sorted, so it lines up with the partition order above

And compare the two:

dict(zip(filenames, counts))
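Since the bad records are in the single digits, most counts will be zero; keeping only the non-zero entries (a small variation on the line above, with `bad_files` as an illustrative name) points you straight at the files worth investigating:

bad_files = {name: n for name, n in zip(filenames, counts) if n > 0}
print(bad_files)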