OstrichProjects - 2 months ago

Why does Redshift need to do a full table scan to find the max value of the DIST/SORT key?

I'm running simple tests on Redshift to try to speed up the insertion of data into a Redshift table. One thing I noticed today is that after doing something like this

INSERT INTO a (x) VALUES (1), (2), (3), (4);

the plan for finding the maximum still shows a full sequential scan:

EXPLAIN SELECT MAX(x) FROM a;

XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)

I know this is only 4 rows, but it still shouldn't need a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?

And just as a sanity check, the corresponding query only scans 2 rows instead of the whole table.

Edit: I inserted 1,000,000 more rows into the table, with random values from 1 to 10,000, then ran VACUUM and ANALYZE. The query plan still says it has to scan all 1,000,004 rows.


Analyzing query plans on a tiny data set does not yield any practical insight into how the database would perform a query.

The optimizer has thresholds, and when the cost difference between candidate plans is small enough it stops considering alternatives. The idea is that for simple queries, the time spent searching for the "perfect" execution plan could exceed the total execution time of a less optimal plan.

Redshift was developed from the code base of ParAccel DB. ParAccel has literally hundreds of parameters that can be adjusted to tune the database for different workloads and situations.

Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.

In general, Redshift and ParAccel are not that great for single-slice queries. These queries tend to be run on all slices anyway, even if the data they need lives in a single slice.

Once a query is executing on a slice, the minimum amount of data read is one block. Depending on the block size, this can mean hundreds of thousands of rows.

Remember, Redshift does not have indexes, so you will never get a simple record lookup that reads a few entries from an index and then zeroes in on a single page on disk. A query always reads at least one entire block for that table, and it does so on every slice.
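To put "at least an entire block" into numbers, here is a quick sketch (plain Python arithmetic, assuming the 1 MB block size Redshift uses, described further below, and no compression) of how many fixed-width values fit in one block:

```python
# Sketch: values per Redshift block, assuming 1 MB blocks, no compression.
BLOCK_SIZE_BYTES = 1024 * 1024  # 1 MB

type_widths = {"smallint": 2, "integer": 4, "bigint": 8}  # bytes per value

for type_name, width in type_widths.items():
    print(f"{type_name}: {BLOCK_SIZE_BYTES // width:,} values per block")

# An uncompressed integer column packs 262,144 values into a single
# block, so even the minimum read covers a lot of rows per slice.
```

So for a 4-byte integer column, reading "one block" already means scanning on the order of a quarter of a million values.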

How to have a meaningful data set to be able to evaluate a query plan?

The short answer is that your table needs a "large number" of data blocks per slice.

How many blocks per slice will my table require? The answer depends on several factors:

  1. Number of nodes in your cluster.
  2. Type of node in the cluster, which determines the number of slices per node.
  3. Data type: how many bytes each value requires.
  4. The compression encoding of the column involved in the query (the optimal encoding depends on the data's demographics).
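As a rough illustration of how these factors combine, here is a back-of-the-envelope helper (my own sketch, not a Redshift API) that estimates blocks per slice for a single column, assuming rows are distributed evenly across slices:

```python
import math

def blocks_per_slice(num_rows, bytes_per_value, num_slices,
                     compression_ratio=1.0, block_size=1024 * 1024):
    """Rough estimate of blocks per slice for one column.

    compression_ratio is stored size / raw size (1.0 = no compression).
    Assumes rows are spread evenly across the slices.
    """
    rows_per_slice = num_rows / num_slices
    bytes_per_slice = rows_per_slice * bytes_per_value * compression_ratio
    # Redshift allocates at least one block per column per slice.
    return max(1, math.ceil(bytes_per_slice / block_size))

# The 1,000,004-row integer column from the question, on 32 slices:
# about 125 KB per slice, still inside the single minimum block.
print(blocks_per_slice(1_000_004, 4, 32))  # 1
```

Plugging in the question's numbers shows why the million-row test changed nothing: the column still fits in the minimum one-block-per-slice allocation.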

So let's start at the top.

Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture documentation.

Each node is further subdivided into slices: dedicated data partitions with corresponding hardware resources to process queries on that partition of the data.

When a table is created in Redshift and data is inserted, Redshift allocates a minimum of one block per slice for each column.

Here is a simple example:

If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.

Let's say we are querying, and the column in the WHERE clause is something like "ITEM_COUNT", an integer. An integer consumes 4 bytes.

Redshift uses a block size of 1MB.

So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks times the 1 MB block size, which equates to 32 MB of storage.

With 32 MB of storage and each entry consuming only 4 bytes, you can hold more than 8 million entries, and they would all still fit within that minimum allocation of one block per slice.
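The arithmetic behind that claim, as a short sketch:

```python
slices = 32                # 2 ds1.8xlarge nodes x 16 slices each
block_size = 1024 * 1024   # 1 MB per block
bytes_per_value = 4        # integer column

minimum_allocation = slices * block_size        # 32 MB for this column
max_entries = minimum_allocation // bytes_per_value
print(f"{max_entries:,}")  # 8,388,608: "more than 8 million"
```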

In this example in the Amazon Redshift documentation, they load close to 40 million rows to evaluate and compare different encoding techniques.

But wait.....

There is also compression. If you have a 75% compression rate, even 32 million records would still fit within that same minimum allocation.
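The compression arithmetic, sketched the same way (assuming a 75% compression rate means the stored size is 25% of the raw 4 bytes):

```python
slices = 32
block_size = 1024 * 1024        # 1 MB
raw_bytes_per_value = 4         # integer
compression_rate = 0.75         # 75% of the raw size is saved

stored_bytes = raw_bytes_per_value * (1 - compression_rate)  # 1 byte
entries = int(slices * block_size / stored_bytes)
print(f"{entries:,}")           # 33,554,432: comfortably over 32 million
```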

What is the bottom line?

In order to analyze your query plan, you need tables whose columns span several blocks. In our example above, even 32 million rows would still occupy only a single block per slice.

This means that, in the configuration above and with all the assumptions made, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.

If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:

How many rows per slice:

select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like 'category%'
order by slice;

How to count how many blocks:

select trim(name) as table_name, col,  b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
  and a.slice = b.slice
  and name like 'category%'
group by 1,2,3,4
order by col, slice;