Lee Greco - 1 year ago 63

SQL Question

We have a table of values that has been expanded into a de-normalized set and I need to re-normalize it, finding the fewest number of reference sets.

A simplified version of the source data looks something like this:

`Period Group Item Seq`

------ ----- ---- ---

1 A 1 1

1 A 2 2

1 A 3 3

1 B 1 1

1 B 2 2

1 B 3 3

1 C 1 1

1 C 4 2

1 C 5 3

1 D 2 1

1 D 1 2

1 D 3 3

1 E 1 1

1 E 2 2

1 F 2 1

1 F 1 2

1 F 3 3

I want to extract the minimum number of lists defined in the data and assign a reference to the list based on period and group. A list consists of an ordered sequence of items. Here are the 4 lists defined in the above data:

`List Item Seq`

---- ---- ---

1 2 1

1 1 2

1 3 3

2 1 1

2 2 2

2 3 3

3 1 1

3 4 2

3 5 3

4 1 1

4 2 2

And the output which I want to achieve:

`Period Group List`

------ ----- ----

1 A 2

1 B 2

1 C 3

1 D 1

1 E 4

1 F 1

I have a solution that works using ORA_HASH and LIST_AGG to generate a hash over the items of the group, but it fails when the number of items in a group is greater than 400. The resulting error is ORA-01489: result of string concatenation is too long.

I'm looking for a general solution that would work regardless of the number of items in a group in any given period.

Items are identified by an integer value less than 100,000.

Realistically, we'll never see more than 4000 items in a group.

This is logically similar to what works for up to 400 group item records:

`WITH`

the_source_data as (

select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union

select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union

select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union

select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union

select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union

select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union

select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual

),

cte_list_hash as (

select

the_period,

the_group,

ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order by the_seq)) as list_hash

from

the_source_data

group by

the_period,

the_group

),

cte_unique_lists as

(

select

list_hash,

min(the_period) keep (dense_rank first order by the_period, the_group) as the_period,

min(the_group) keep (dense_rank first order by the_period, the_group) as the_group

from

cte_list_hash

group by

list_hash

),

cte_list_base as

(

select

the_period,

the_group,

list_hash,

rownum as the_list

from

cte_unique_lists

)

select

A.the_period,

A.the_group,

B.the_list

from

cte_list_hash A

inner join

cte_list_base B

on A.list_hash = B.list_hash;

Any help finding the right direction to take this would be greatly appreciated.

Answer Source

Here is a way to get your results without using `LISTAGG`

and without getting `ORA-01489`

errors.

The main caveat is that it numbers the lists differently that what you had in your example, but that numbering seemed arbitrary to me. This version numbers them based on the ordinal position of the first period/group that uses the list. That is, for example, the list used by group A in period 1 would be "list #1".

I threw in some sample data for period 2, just to make sure that was happening correctly, too.

Hopefully the comments in the SQL below explain the approach clearly enough.

Finally... I have no idea how long this will run on a large data set. The cross join may be problematic.

```
WITH
the_source_data as (
select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 1 as the_item, 1 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 4 as the_item, 2 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 5 as the_item, 3 as the_seq from dual
),
-- this CTE counts the number of rows in each period, group. We need this to avoid matching a long list to a shorter list that
-- happens to share the same order, as far is it goes.
sd2 as (
select sd.*, count(*) over ( partition by sd.the_period, sd.the_group) cnt from the_source_data sd ),
-- this CTE joins every row to every other rows and then filters based on matches of item#, seq, and list length
-- it then counts the number of matches by period and group (cnt3)
sd3 as (
select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt,
sd2b.the_period the_period2, sd2b.the_group the_group2, sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2
, count(*) over ( partition by sd2a.the_period, sd2a.the_group, sd2b.the_period, sd2b.the_group) cnt3
from sd2 sd2a cross join sd2 sd2b
where sd2b.the_item= sd2a.the_item
and sd2b.the_seq = sd2a.the_seq
and sd2a.cnt = sd2b.cnt ),
-- This CTE filters to period, groups that had the same number of matches as elements in the original period, group. I.e., it
-- filters to perfect list matches: all elements the same, in the same order, and the list lengths are the same.
-- for each, it gets the first period and group # that share the list
sd4 as (
select the_period, the_group, --min(the_group2) over ( partition by the_period, the_group ) first_in_group
min(the_period2) keep ( DENSE_RANK FIRST ORDER BY the_period2, the_group2 ) OVER ( partition by the_period, the_group) first_period,
min(the_group2) keep ( DENSE_RANK FIRST ORDER BY the_period2, the_group2 ) OVER ( partition by the_period, the_group) first_group
from sd3 where cnt = cnt3 )
-- We'll arbitrarily name the lists based on the ordinal position of the first period and group that uses the list.
select distinct the_period, the_group, dense_rank() over ( partition by null order by first_period, first_group ) list
from sd4
order by 1,2
```