Shlomi Schwartz Shlomi Schwartz - 2 months ago 15
SQL Question

Aggregate text based on a criteria

On a previous question I asked a similar question that relied on a helper table to be a part of the criteria for splitting the data. It seems that my current goal is easier, but I couldn't figure it out.

Given the table:

CREATE TABLE conversations (id int, record_id int, is_response bool, text text);
INSERT INTO conversations VALUES
(1, 1, false, 'in text 1')
, (2, 1, true , 'response text 1')
, (3, 1, false, 'in text 2')
, (4, 1, true , 'response text 2')
, (5, 1, true , 'response text 3')
, (6, 2, false, 'in text 1')
, (7, 2, true , 'response text 1')
, (8, 2, false, 'in text 2')
, (9, 2, true , 'response text 2')
, (10, 2, true , 'response text 3');


I would like to aggregate the text based on the
is_response
value and output the following:

record_id | aggregated_text |
----------+---------------------------------------------------+
1 |in text 1 response text 1 |
----------+---------------------------------------------------+
1 |in text 2 response text 2 response text 3 |
----------+---------------------------------------------------+
2 |in text 1 response text 1 |
----------+---------------------------------------------------+
2 |in text 2 response text 2 response text 3 |


I've tried the following query, but it fails to aggregate two responses in a row, IE :is_response is true in a sequence.

SELECT
record_id,
string_agg(text, ' ' ORDER BY id) AS aggregated_text
FROM (
SELECT
*,
coalesce(sum(incl::integer) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),0) AS grp
FROM (
SELECT *, is_response as incl
FROM conversations
) c
) c1
GROUP BY record_id, grp
HAVING bool_or(incl)
ORDER BY max(id);


The output of my query just adds another line for the following is_response row like so:

record_id | aggregated_text |
----------+---------------------------------------------------+
1 |in text 1 response text 1 |
----------+---------------------------------------------------+
1 |in text 2 response text 2 |
----------+---------------------------------------------------+
1 |response text 3 |
----------+---------------------------------------------------+
2 |in text 1 response text 1 |
----------+---------------------------------------------------+
2 |in text 2 response text 2 |
----------+---------------------------------------------------+
2 | response text 3 |
----------+---------------------------------------------------+


How can I fix it?

Answer

Here's a variation of the answer I gave in your previous question:

SELECT record_id, string_agg(text, ' ')
FROM (
    SELECT *, coalesce(sum(incl::integer) OVER w,0) AS subgrp
    FROM (
        SELECT *, is_response AND NOT coalesce(lead(is_response) OVER w,false) AS incl
        FROM conversations
        WINDOW w AS (PARTITION BY record_id ORDER BY id)
    ) t
    WINDOW w AS (PARTITION BY record_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
) t1
GROUP BY record_id, subgrp
HAVING bool_or(incl)
ORDER BY min(id);

The idea is that for each row we look at the next row of the same record with the help of the lead window function. If there is not such row, or if there is one and its is_response is false while the current is_response is true, then we select that row, aggregating all previous unused values of text.

This query also ensures that if the last conversation if incomplete (which doesn't happen in you sample data), it will be omitted.