RkdL RkdL - 4 months ago 3x
SQL Question

Iterate through a table and replace values in another table using a stored procedure or function

Hopelessly stuck at the following and up until now none of my programming speed dial buddies has been able to help out (most of them not MySQL experts):

I have different tables where the column names and datatypes are auto generated from the 'import table data wizard' using a CSV file, and the table does not contain an AUTO INCREMENT column (yet). This particular table consists of approx: 30.000 rows It starts at row=id(1) from a table that looks like this:

I am trying to correct values in one column that are

comma delimited
using one 'corrections' table. And to do this I am writing a stored procedure containing a WHILE loop to interate through the corrections table row for row, and check wheter or not an Alias is found in the table that was imported.

| id | material | alias01 | alias02 | alias03 | *up to 12
1 Katoen Cotton Supima Pima
2 Polyester Polyster
3 Lyocell Lycocell Lyocel
4 Linnen Linen
5 Viscose Visose Viskose Viscoe Voscose
6 Scheerwol
7 Polyamide
8 Nylon
9 Leer Leder Lamsleder Varkensleder
10 Polyurethaan Polyurethan PU Polyuretaan

For testing purposes to test any kind of results i am only using alias01 for now ( it needs to check alias01, then 02 etc... but i'll try to solve that at a later time).

It needs to compare `Length' ( alias_string_length = found_string_length) to make sure that a string that consist of 'wo' is not found in 'wool' or 'wol'.

The values from the column that need corrections look like this (the comma's dont need to be there it's just what i was given to work with):

| material |



Thanks to Drew's tip i changed the procedure. I added a tmp table that holds materials AND a unique id for each row, and iterate through each one with the alias01. It takes around 11 seconds to do 9000 rows but
0 row(s) affected,
. Any tips on increasing speed are most welcome, but insight in what might be the issue would help alot more.

CREATE DEFINER=`root`@`localhost` PROCEDURE `replace_materials`()
set @rownumber = 1;
set @totalrows = 28;
set @um ='';
set @cm ='';
set @corrected ='';
set @correctme ='';

INSERT INTO tmp (material) SELECT material FROM vantilburgonline.productinfo;

WHILE (@rownumber < @totalrows) DO

SET @um = (SELECT alias01 FROM vantilburgonline.materials WHERE id=@rownumber);
-- gives 'um' value from column alias01, from table materials, row(X)
SET @cm = (SELECT material FROM vantilburgonline.materials WHERE id=@rownumber);
-- gives 'cm' value from column material, from table materials, row(X)

set @tmprow = 1;
set @totaltmprow =9000;

WHILE (@tmprow < @totaltmprow) DO

SET @correctme = (SELECT material FROM vantilburgonline.tmp WHERE id = @tmprow);
-- gives the value from column material from table tmp to correctme(X).

SET @correctme = REPLACE(@correctme,@um,@cm);
-- should run through column material from table productinfo and replace 'alias01' with correct 'material'.
SET @tmprow = @tmprow +1;


SET @rownumber = @rownumber +1;


though i'm certain alias01 contains strings it should've found in the materials. Also Workbench was using 9GB at this point and i was only able to counter that by restarting..


I would recommend an alteration from your materials table which is unwieldy with multiple columns (alias01 .. alias12). A transition to a normalized, extensible system. It would have a materials table and a materials_alias table. As it sits alongside your current table that you created, I named them with a 2.


drop table if exists materials2;
create table materials2
(   material varchar(100) primary key, -- let's go with a natural key
    active bool not null -- turn it LIVE and ON for string replacement of alias back to material name
    -- so active is TRUE for ones to do replacement, or FALSE for skip
    -- facilitates your testing of your synonyms, translations, slangs, etc

insert materials2 (material,active) values
-- 21 rows
-- a few rows were skipped. The intent of them read as gibberish to me. Please review.

-- we need to restructure the materials2_alias table (after the first attempt)
-- 1. it might need special handling when `alias` is a legitimate substring of `material` (those 2 columns)
-- 2. it needs a unique composite index
drop table if exists materials2_alias;
create table materials2_alias
(   id int auto_increment primary key,
    material varchar(100) not null,
    alias varchar(100) not null,
    ais bool not null, -- Alias is Substring (alias is a legitimate substring of material, like Wo and Wol, respectively)
    unique key(material,alias), -- Composite Index, do not allow dupe combos (only 1 row per combo)
    foreign key `m2alias_m2` (material) references materials2(material)

insert materials2_alias (material,alias,ais) values

-- this cleans up the above, where false should have been true
update materials2_alias
set ais=true 
where instr(material,alias)>0;
-- 4 rows

There are several alter table statements and other things. I will try to document them or link to them. I am merely trying to capture something to share considering it is several hundred lines of code from you. But mine comes down to a simple chunk of code you would put in a loop.

The Update put in a loop:

UPDATE productinfo pi
join materials2_alias ma 
on instr(  pi.material,  concat(',',ma.alias,',')  )>0 
join materials2 m
on m.material=ma.material and m.active=true
set pi.material=replace(lower(pi.material),lower(ma.alias),lower(ma.material)),

A few notes on the update:

-- Note, pi.material starts and ends with a comma.
-- I forced that during the ETL. But `ma.alias` does not contain commas.
-- So add the commas with a concat() within the "Update with a Join" pattern shown
-- Note that the commas solved the problem with the Wol - Wo 

Well, the following 4 in particular.

select * from materials2_alias 
where ais=true 
order by material,alias;
| id | material   | alias    | ais |
|  6 | Lyocell    | Lyocel   |   1 |
| 33 | Polyamide  | Polyamid |   1 |
| 28 | Wol        | WO       |   1 |
| 35 | Wol-Merino | Merino   |   1 |

-- instr() is not case sensitive except for binary strings
-- REPLACE(str,from_str,to_str); -- case sensitive
-- http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_replace
-- so the update uses lower() or this won't work due to replace() case sensitivity

Stored Procedure:

DROP PROCEDURE if exists touchCounts;
    select touchCount,count(*) as rowCount 
    from productinfo 
    group by touchCount 
    order by touchCount;
END $$

When that stored procedure returns the same count of rows on a successive call (the next call), you are done modifying the material column via the update.

That stored procedure could naturally return an out parameter for the rowcount. But it is late and time to sleep.

For your last data set from your side, the update statement would need to be called 4 times. That is like 13 seconds on my mediocre laptop. The idea is naturally flexible, for hundreds of aliases per material if you want.

I parked it up on github as it is too much otherwise.