Lightness Races in Orbit Lightness Races in Orbit - 4 months ago 8
SQL Question

Compacting tables after normalisation

I recently increased the level of normalisation in my database, going from something like this:

+--------------------------------------+
| state_changes |
+----+-------+-----------+------+------+
| ID | Name | Timestamp | Val1 | Val2 |
+----+-------+-----------+------+------+
| 0 | John | 17:19:01 | A | X |
| 1 | Bob | 17:19:02 | E | W |
| 2 | John | 17:19:05 | E | Y |
| 3 | John | 17:19:06 | B | Y |
| 4 | John | 17:19:12 | C | Z |
| 5 | John | 17:19:15 | A | Z |
+----+-------+-----------+------+------+


To something more like this:

+-------------------------------+ +-------------------------------+
| state_changes_1 | | state_changes_2 |
+----+-------+-----------+------+ +----+-------------------+------+
| ID | Name | Timestamp | Val1 | | ID | Name | Timestamp | Val2 |
+----+-------+-----------+------+ +----+-------+-----------+------+
| 0 | John | 17:19:01 | A | | 0 | John | 17:19:01 | X |
| 1 | Bob | 17:19:02 | E | | 1 | Bob | 17:19:02 | W |
| 2 | John | 17:19:05 | E | | 2 | John | 17:19:05 | Y |
| 3 | John | 17:19:06 | B | | 3 | John | 17:19:06 | Y |
| 4 | John | 17:19:12 | C | | 4 | John | 17:19:12 | Z |
| 5 | John | 17:19:15 | A | | 5 | John | 17:19:15 | Z |
+----+-------+-----------+------+ +----+-------+-----------+------+


How could I now write a query to "compact" the two resulting tables where values are duplicated?


  • I want to ignore the
    ID
    field when considering row uniqueness;

  • I want to ignore the
    Timestamp
    when considering row uniqueness;

  • But fields must be sequential (under a
    Name,Timestamp
    ordering) to be considered duplicates.



The result, in this example, should be:

+-------------------------------+ +-------------------------------+
| state_changes_1 | | state_changes_2 |
+----+-------+-----------+------+ +----+-------+-----------+------+
| ID | Name | Timestamp | Val1 | | ID | Name | Timestamp | Val2 |
+----+-------+-----------+------+ +----+-------+-----------+------+
| 0 | John | 17:19:01 | A | | 0 | John | 17:19:01 | X |
| 1 | Bob | 17:19:02 | E | | 1 | Bob | 17:19:02 | W |
| 3 | John | 17:19:06 | B | | 2 | John | 17:19:05 | Y |
| 4 | John | 17:19:12 | C | | 4 | John | 17:19:12 | Z |
| 5 | John | 17:19:15 | A | +----+-------+-----------+------+
+----+-------+-----------+------+


My tables have several billion rows so I'm looking for something that takes efficiency into consideration; that said, I'm a realistic sort of person so I'm happy for the query to take an hour or two to run (including index rebuilds) if needs be.

Answer

I tried this on MySQL 5.1.58 and it seems to work with your test data.

SET @name = NULL;
SET @val1 = NULL;

UPDATE state_changes_1
SET Val1 = IF(Name=@name AND Val1=@val1, NULL, (@val1:=Val1)),
    Name = (@name:=Name)
ORDER BY Name, `Timestamp`;

DELETE FROM state_changes_1 WHERE Val1 IS NULL;