Codeguard Codeguard - 1 month ago 5
SQL Question

COUNT(1) + COUNT(DISTINCT()) much slower then doing 2 queries separately

Query explanation:


  • Person
    (identified by
    PersonID
    ) may have or not have a corresponding
    Job
    (identified by
    JobID
    ).

  • If there's a corresponding
    Job
    , binding is stored in table
    PersonJob
    (
    PersonID
    <=>
    JobID
    ).

  • Person
    without
    Job
    are ignored.

  • Job
    also has
    CityID
    .

  • Per every
    Job.CityID
    , query wants to know total count of
    Person
    and also count of unique
    Person.HouseID



Query:

SELECT
Job.CityID, COUNT(1) NumTotal, COUNT(DISTINCT(Person.HouseID)) NumDistinct
FROM
Job
INNER JOIN PersonJob ON (PersonJob.JobID = Job.JobID)
INNER JOIN Person ON (Person.PersonID = PersonJob.PersonID)
GROUP BY
Job.CityID


Stats:


  • SELECT COUNT(1) FROM PersonJob
    ~ 600.000

  • SELECT COUNT(1) FROM Person
    ~ 800.000

  • SELECT COUNT(DISTINCT(Person.HouseID)) FROM Person
    ~ 10.000

  • SELECT COUNT(1) FROM Job
    ~ 500

  • MS SQL Server 10.50



Problem:


  • COUNT(1)
    part of query, when run separately, runs in 0.25sec.

    SELECT
    Job.CityID, COUNT(1) NumTotal
    FROM
    Job
    INNER JOIN PersonJob ON (PersonJob.JobID = Job.JobID)
    INNER JOIN Person ON (Person.PersonID = PersonJob.PersonID)
    GROUP BY
    Job.CityID

  • COUNT(DISTINCT(Person.HouseID))
    part of query, when run separately, runs in 0.80sec.

    SELECT
    Job.CityID, COUNT(DISTINCT(Person.HouseID)) NumDistinct
    FROM
    Job
    INNER JOIN PersonJob ON (PersonJob.JobID = Job.JobID)
    INNER JOIN Person ON (Person.PersonID = PersonJob.PersonID)
    GROUP BY
    Job.CityID

  • Entire query runs in 3.10sec - 3 times slower, why?



Execution plans:


  • I'm no expert in reading those, sorry.

  • As far as I can tell, the problem lies within COUNT(DISTINCT)

  • In partial query:


    • 25% Hash match (Aggregate) (output
      Job.CityID
      )

    • 15% Hash match (Inner Join) (output
      Job.CityID
      ,
      Person.HouseID
      )


      • 30% Index scan (output
        Person.PersonID
        ,
        Person.HouseID
        )

      • 14% Index seek (output
        PersonJob.PersonID
        )



  • In full query:


    • 03% Hash match (Partial aggregate) (output
      Job.CityID
      ,
      COUNT(*)
      )

    • 31% Hash match (Aggregate) (output
      Job.CityID
      )

    • 29% Table Spool (output
      Job.CityID
      ,
      Person.HouseID
      )



Answer

This is a known issue in versions of SQL Server prior to 2012.

You could try this rewrite based on the code here.

WITH T1
     AS (SELECT Job.CityID,
                Person.HouseID
         FROM   Job
                INNER JOIN PersonJob
                        ON ( PersonJob.JobID = Job.JobID )
                INNER JOIN Person
                        ON ( Person.PersonID = PersonJob.PersonID )),
     PartialSums
     AS (SELECT COUNT(*) AS CountStarPartialCount,
                HouseID,
                CityID
         FROM   T1
         GROUP  BY CityID,
                   HouseID)
SELECT CityID,
       SUM(CountStarPartialCount) AS NumTotal,
       COUNT(HouseID)             AS NumDistinct
FROM   PartialSums
GROUP  BY CityID 

SQL Server 2012 has some imrovements in this area. See Is Distinct Aggregation Still Considered Harmful?

Comments