Imagine you are loading a large dataset via SparkContext and Hive, so the dataset is distributed across your Spark cluster: for instance, observations (values + timestamps) for thousands of variables.
Now you would use some map/reduce methods or aggregations to organize/analyze your data, for instance grouping by variable name.
Once grouped, you could get all observations (values) for each variable as a time-series DataFrame. What happens if you now use DataFrame.toPandas?
df = sc.load....
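For concreteness, here is a small local sketch (plain pandas, not Spark) of the grouping step described above; the variable names, columns, and values are invented for illustration:

```python
import pandas as pd

# Hypothetical observations: (variable, timestamp, value) records.
obs = pd.DataFrame({
    "variable": ["temp", "pressure", "temp", "pressure"],
    "timestamp": [0, 0, 1, 1],
    "value": [21.5, 1013.2, 21.7, 1012.8],
})

# Group by variable name and pull out each variable's time series.
series = {
    name: group.set_index("timestamp")["value"]
    for name, group in obs.groupby("variable")
}
```

On a cluster the grouping would be done by Spark (e.g. groupBy on a Spark DataFrame), but the per-variable result has the same shape.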
There is nothing special about a Pandas DataFrame in this context.

If a Pandas DataFrame is created by calling toPandas on a pyspark.sql.dataframe.DataFrame, this collects the data and creates a local Python object on the driver.
If a pandas.core.frame.DataFrame is created inside an executor process (for example in mapPartitions), you simply get an RDD[pandas.core.frame.DataFrame]. There is no distinction between Pandas objects and, let's say, a DataFrame (I assume this is what you mean by _.toDF) inside an executor thread.
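To illustrate the second case, here is a sketch of the function each executor would run inside mapPartitions; calling it on a plain iterator below only simulates a single partition locally, and the column names are invented:

```python
import pandas as pd

def to_pandas_partition(rows):
    # Inside rdd.mapPartitions(to_pandas_partition) this runs once per
    # partition, in the executor process, and yields one pandas DataFrame
    # per partition -- the overall result is an
    # RDD[pandas.core.frame.DataFrame], not a collected object.
    yield pd.DataFrame(list(rows), columns=["variable", "timestamp", "value"])

# Local simulation of one partition's iterator of rows:
partition = iter([("temp", 0, 21.5), ("temp", 1, 21.7)])
pdf = next(to_pandas_partition(partition))
```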