pg. pg. - 7 months ago 26
SQL Question

Fastest way to do MySQL query select with WHERE clauses on an unindexed table

I have a very big unindexed table called table with rows like this:

IP entrypoint timestamp
171.128.123.179 /page-title/?kw=abc 2016-04-14 11:59:52
170.45.121.111 /another-page/?kw=123 2016-04-12 04:13:20
169.70.121.101 /a-third-page/ 2016-05-12 09:43:30


I want to make the fastest query that, given 30 IPs and one date, will search rows as far back a week before that date and return the most recent row that contains "?kw=" for each IP. So I want DISTINCT entrypoints but only the most recent one.

I'm stuck by this I know it's a relatively simple INNER JOIN but I don't know the fastest way to do it.

By the way: I can't add the index right now because it's very big and on a db that serves a website. I'm going to replace it with an indexed table don't worry.

Answer

Rows from the table

SELECT ...
  FROM very_big_unindexed_table t

only within the past week...

 WHERE t.timestamp >= NOW() + INTERVAL - 1 WEEK

that contains '?kw=' in the entry point

   AND t.entrypoint LIKE '%?kw=%'

only the latest row for each IP. There's a couple of approaches to that. A correlated subquery on a very big unindexed table is going to eat your lunch and your lunch box. And without an index, there's no getting around a full scan of the table and a "Using filesort" operation.

Given the unfortunate circumstances, our best bet for performance is likely going to be getting the set whittled down as small as we can, and then perform the sort, and avoid any join operations (back to that table) and avoid correlated subqueries.

So, let's start with something like this, to return all of the rows from the past week with '?kw=' in entry point. This is going to be full scan of the table, and a sort operation...

         SELECT t.ip
              , t.timestamp
              , t.entry_point
           FROM very_big_unindexed_table t
          WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
            AND t.entrypoint LIKE '%?kw=%'
          ORDER BY t.ip DESC, t.timestamp DESC

We can use an unsupported trick with user-defined variables. (The MySQL Reference Manual specifically warns against using a pattern like this, because the behavior is (officially) undefined. Unofficially, the optimizer in MySQL 5.1 and 5.5 (at least) is very predictable.

I think this is going to be about as good as you are going to get, if the number of rows from the past week are significant subset of the entire table. This is going to create a sizable intermediate resultset (derived table), if there are lot of rows that satisfy the predicates.

SELECT q.ip
     , q.entrypoint
     , q.timestamp
  FROM (
         SELECT IF(t.ip = @prev_ip, 0, 1) AS new_ip
              , @prev_ip  := t.ip         AS ip
              , t.timestamp               AS timestamp
              , t.entrypoint              AS entrypoint
           FROM (SELECT @prev_ip := NULL) i
          CROSS
           JOIN very_big_unindexed_table t
          WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
            AND t.entrypoint LIKE '%?kw=%'
          ORDER BY t.ip DESC, t.timestamp DESC
       ) q
 WHERE q.new_ip

Execution of that query will require (in terms of what's going to take the time)

  • a full scan of the table (there's no way to get around that)
  • a sort operation (again, there's no way around that)
  • materializing a derived table containing all of the rows that satisfy the predicates
  • a pass through the derived table to pull out the "latest" row for each IP