Tom Tom - 4 months ago 8
SQL Question

Querying substrings against a list of values

I'm reading from a dataset that I unfortunately don't have the access to modify. It has concatenated strings of values, and I want to select records for which any of those substrings (as split by a given character) matches any of the values in a specific list. I'll be passing the queries in via Python, so it won't be compared against a static list.

For example, the table looks like:

CrappyColumn
-----------
1;2
4
1
2;1
1;3
2


And I might want to return anything that has 2 or 4 in it. So, my result should be:

1;2
4
2
2;1


I have played with regexp_substr and gotten something that actually works; however, it just runs indefinitely (as much as 10 minutes before I give up) when I run it on the full dataset (which only includes about three thousand records with values that are often a couple hundred characters long). I need something that works in a reasonable amount of time for repeated execution.

I realize that--even with a variable comparison list--I could just write my Python code to parse the list and construct multiple LIKE statements, but that seems inefficient, and I assume that there is a better way.

And here's what I've done that takes too long:

SELECT DISTINCT CrappyColumn
FROM
(SELECT DISTINCT CrappyColumn, regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) as UGH
FROM CrappyTable
CONNECT BY regexp_substr(CrappyColumn, '[^;]+', 1, LEVEL) IS NOT NULL)
WHERE UGH IN ('2', '4')


Is there a better, faster, cleaner way to accomplish this?

vkp vkp
Answer

A simpler method would be to use regexp_like to check if the list has 2 or 4 in it.

select *
from tablename
where regexp_like(crappycolumn,'^[2|4][^0-9]|[^0-9][2|4][^0-9]|[^0-9][2|4]$|^[2|4]$')
  • ^[2|4][^0-9] - Starts with 2 or 4 not followed by a digit.
  • [^0-9][2|4][^0-9] - 2 or 4 not succeeded or preceded by a digit.
  • [^0-9][2|4]$ - Ends with 2 or 4 not preceded by a digit.
  • ^[2|4]$ - 2 or 4 is the only character in the string.
Comments