Marc Gravell Marc Gravell - 2 months ago 8
SQL Question

How can I normalize the capitalization of a group-by column?

On SQL Server configured as case-insensitive,

group by
can have interesting results when the
[n][var]char
column is not the first
group by
column. Essentially, it looks like whatever row it encounters "first" (where "first" is undefined in the absence of an order): wins for that grouping. For example:

select x.[day], x.[name], count(1) as [count]
from (
select 1 as [day], 'a' as [name]
union all select 1, 'A'
union all select 2, 'A'
union all select 2, 'a'
) x group by x.[day], x.[name]


which returns, for me:

day name count
----------- ---- -----------
1 A 2
2 a 2


Using
min(x.[name])
has no effect, since the grouping already happened.

I can't add an
order by
before the
group by
, as that is illegal; and adding the
order by
after the
group by
just defines the output order after the grouping - it still gives
a
and
A
.

So: is there a sane way of doing this where the capitalization will at least be consistent for all the groupings? (I'll leave for another day the problem of being consistent for separate runs)

Desired output, either:

day name count
----------- ---- -----------
1 A 2
2 A 2


or:

day name count
----------- ---- -----------
1 a 2
2 a 2





Edit: without destroying capitalisation when it is consistent between groups. So no upper/lower. So if one of the groups consistently has the value
BcDeF
, I want the result of that row to be
BcDeF
, not
bcdef
or
BCDEF
.

Answer

I would use windowing functions for this. By using ROW_NUMBER and partitioning using a case insensitive collation, but ordering by a case sensitive one, we will choose consistently one result with the original capitalisation, but it will group them as if they are the same:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT *
FROM CTE
WHERE RN = 1;

It returns:

╔═════╦═══════╦════╦═══╗
║ day ║ name  ║ RN ║ N ║
╠═════╬═══════╬════╬═══╣
║   1 ║ A     ║  1 ║ 2 ║
║   2 ║ A     ║  1 ║ 2 ║
║   3 ║ BcDeF ║  1 ║ 2 ║
╚═════╩═══════╩════╩═══╝

Following @AndriyM's comment, if you want the same capitalisation over the whole result set, and not just the same day, you can use:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT  [day],
        MAX([name] COLLATE SQL_Latin1_General_Cp1_CS_AS) OVER (PARTITION BY [name]) [name],
        N
FROM CTE
WHERE RN = 1;