Tyler Durden Tyler Durden - 5 months ago 11
SQL Question

SQL: Appending number of records from this month as a column/field

I'm working with HiveQL trying to query Hadoop. My problem is this.

Let's say I have a query (and resulting table) like this:

SELECT CustomerID, PurchaseID, DateOfPurchase
FROM MyTableName;

+------------+------------+----------------+
| CustomerID | PurchaseID | DateOfPurchase |
+------------+------------+----------------+
| 101 | 501 | 2014-01-01 |
| 101 | 502 | 2014-01-15 |
| 101 | 503 | 2014-01-20 |
| 101 | 504 | 2015-01-19 |
| 101 | 505 | 2015-08-25 |
| 102 | 506 | 2014-01-02 |
| 102 | 507 | 2014-01-03 |
| 103 | 508 | 2016-05-05 |
+------------+------------+----------------+


I want to add another column that represents the number of orders a customer has made per month, corresponding with the date in each row. Here is my idea of the resulting table:

+------------+------------+----------------+--------------------+
| CustomerID | PurchaseID | DateOfPurchase | PurchasesThisMonth |
+------------+------------+----------------+--------------------+
| 101 | 501 | 2014-01-01 | 3 |
| 101 | 502 | 2014-01-15 | 3 |
| 101 | 503 | 2014-01-20 | 3 |
| 101 | 504 | 2015-01-19 | 1 |
| 101 | 505 | 2015-08-25 | 1 |
| 102 | 506 | 2014-01-02 | 2 |
| 102 | 507 | 2014-01-03 | 2 |
| 103 | 508 | 2016-05-05 | 1 |
+------------+------------+----------------+--------------------+


That is, for each row, the PurchasesThisMonth column represents how many purchases were made by that customer in that month. Customer 101 made 3 purchases in January of 2014, so each of the January 2014 rows have a 3 in PurchasesThisMonth.

I was able to get this to work using an INNER JOIN on a subquery on the same table. But this takes quite a while considering my dataset is quite large. However, is there a better/faster way of doing this?

Here is my bruteforce way of solving it.

SELECT CustomerID, PurchaseID, DateOfPurchase, Sub.PurchasesThisMonth
FROM MyTableName
INNER JOIN (
SELECT
CustomerID,
COUNT(Inner.PurchaseID) as PurchasesThisMonth
MONTH(Inner.DateOfPurchase) as month,
YEAR(Inner.DateOfPurchase) as year

FROM MyTableName Inner
GROUP BY Inner.CustomerID,
MONTH(Inner.DateOfPurchase),
YEAR(Inner.DateOfPurchase)
) Sub
ON CustomerID=Sub.CustomerID AND
MONTH(DateOfPurchase)=Sub.month AND
YEAR(DateOfPurchase)=Sub.year


This query makes an entire subquery on the exact same table. Is this necessary? If not, what is the best practice here?

Thanks!

vkp vkp
Answer

I think you can use window function count() over().

SELECT CustomerID, PurchaseID, DateOfPurchase
,count(*) over(partition by
               customerid,
               MONTH(DateOfPurchase), 
               YEAR(DateOfPurchase) order by customerid)
FROM MyTableName;