I am using Spark to do exploratory data analysis on a user log file. One of the analyses I am doing is the average number of requests per host on a daily basis. To compute the average, I need to divide the total-requests count column of one DataFrame by the unique-hosts count column of another DataFrame.
total_req_per_day_df = logs_df.select('host',dayofmonth('time').alias('day')).groupby('day').count()
avg_daily_req_per_host_df = total_req_per_day_df.select("day",(total_req_per_day_df["count"] / daily_hosts_df["count"]).alias("count"))
AnalysisException: u'resolved attribute(s) count#1993L missing from day#3628,count#3629L in operator !Project [day#3628,(cast(count#3629L as double) / cast(count#1993L as double)) AS count#3630];
It is not possible to reference a column from another DataFrame in a `select`. If you want to combine data from two DataFrames, you'll have to join them first, using something similar to this:
from pyspark.sql.functions import col

avg_daily_req_per_host_df = (
    total_req_per_day_df.alias("total")
    .join(daily_hosts_df.alias("host"), ["day"])
    .select(col("day"), (col("total.count") / col("host.count")).alias("count"))
)
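To see why the join is needed, here is the same join-then-divide logic sketched in plain Python with hypothetical toy data (day numbers and counts are made up for illustration; this is not Spark code, just a model of what the inner join on "day" produces):

```python
# Toy per-day totals and unique-host counts (hypothetical data,
# standing in for total_req_per_day_df and daily_hosts_df).
total_req_per_day = {1: 100, 2: 250, 3: 90}
daily_hosts = {1: 10, 2: 25, 3: 30}

# Inner join on "day", then divide the matched counts -- the same
# thing the DataFrame join + select above does row by row.
avg_daily_req_per_host = {
    day: total_req_per_day[day] / daily_hosts[day]
    for day in total_req_per_day.keys() & daily_hosts.keys()
}

print(avg_daily_req_per_host)  # {1: 10.0, 2: 10.0, 3: 3.0}
```

The key point is that the division only makes sense once each day's total and its host count sit in the same row, which is exactly what the join provides.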