ckozl - 4 months ago
SQL Question

Odd SQL Server (TSQL) query results with NEWID() in the "WHERE" clause

This is an odd question to be sure, but I'm a bit flummoxed for an explanation of this behavior:

Background: (not required to know)

So to start, I was writing a quick query and was pasting a list of `UNIQUEIDENTIFIER`s and wanted them to be uniform inside of a `WHERE X IN (...)` clause. In the past I used an empty `UNIQUEIDENTIFIER` (all zeros) at the top of the list so that I could paste a uniform set of `UNIQUEIDENTIFIER`s that look like: `,'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'`. This time, to avoid typing the zeros, I inserted a `NEWID()`, thinking that the odds of a collision were nearly impossible. To my surprise, that resulted in thousands of additional results, like 50+% of the table.

Begin question: (part you do need to know)

This query:

-- SETUP: (i boiled this down to the bare minimum)
-- just creating a table with 500 PK UNIQUEIDENTIFIERs
IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable;

CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY);

INSERT INTO #wtfTable
SELECT TOP(500) NEWID()
FROM master.sys.all_objects o1 (NOLOCK)
CROSS JOIN master.sys.all_objects o2 (NOLOCK);

-- ACTUAL QUERY:
SELECT *
FROM #wtfTable
WHERE [WtfId] IN ('00000000-0000-0000-0000-000000000000', NEWID());


... should statistically produce bupkis. But if you run it ten times or so you will sometimes get massive selections. For instance, in this last run I received 465/500 rows, meaning over 93% of the rows were returned.

(screenshot: query results showing 465 of 500 rows returned)

While I understand that `NEWID()` will be recalculated on a per-row basis, there's not a statistical chance in hell that it hits that much. Everything I wrote here is required to produce the nuanced `SELECT`; removing anything will prevent it from happening. Incidentally, you can replace the `IN` with `WHERE WtfId = '...' OR WtfId = NEWID()` and still receive the same results. I am using SQL Server 2014 Standard, patched up to date, with no odd settings activated that I know of.

So does anyone out there know what's up with this? Thanks in advance.




Edit:

The `'00000000-0000-0000-0000-000000000000'` is a red herring; here's a version that works with integers. (Interesting note: I needed to raise the table size to 1000 with integers to produce the problematic query plan...)

IF (OBJECT_ID('tempdb..#wtfTable') IS NOT NULL) DROP TABLE #wtfTable;

CREATE TABLE #wtfTable (WtfId INT PRIMARY KEY);

INSERT INTO #wtfTable
SELECT DISTINCT TOP(1000) CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT)
FROM sys.tables o1 (NOLOCK)
CROSS JOIN sys.tables o2 (NOLOCK);

SELECT *
FROM #wtfTable
WHERE [WtfId] IN (0, CAST(CAST('0x' + LEFT(NEWID(), 8) AS VARBINARY) AS INT));


or you could just replace the literal `UNIQUEIDENTIFIER` and do this:


DECLARE @someId UNIQUEIDENTIFIER = NEWID();

SELECT *
FROM #wtfTable
WHERE [WtfId] IN (@someId, NEWID());


both produce the same results... the question is: why does this happen?

Answer

Let's have a look at the execution plan.

(screenshot: execution plan)

(screenshot: Index Seek operator properties)

In this particular run of the query the Seek returned 51 rows instead of the estimated 1.


The following query produces a plan with the same shape, but it is easier to analyse because we have two variables, @ID1 and @ID2, which can be tracked in the plan.

CREATE TABLE #wtfTable (WtfId UNIQUEIDENTIFIER PRIMARY KEY);

INSERT INTO #wtfTable 
SELECT TOP(500) NEWID()
FROM master.sys.all_objects o1 (NOLOCK)
CROSS JOIN master.sys.all_objects o2 (NOLOCK);

DECLARE @ID1 UNIQUEIDENTIFIER;
DECLARE @ID2 UNIQUEIDENTIFIER;

SELECT TOP(1) @ID1 = WtfId
FROM #wtfTable
ORDER BY WtfId;

SELECT TOP(1) @ID2 = WtfId
FROM #wtfTable
ORDER BY WtfId DESC;

-- ACTUAL QUERY:
SELECT *
FROM #wtfTable
WHERE WtfId IN (@ID1, @ID2);

DROP TABLE #wtfTable;
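If you want to inspect these internals yourself, one way is to request the actual plan as XML for the query (run this before the `DROP TABLE` above, while the temp table still exists); the Constant Scan, Concatenation and Merge Interval operators discussed below are all visible in it:

```sql
-- Capture the actual execution plan as XML so the internal
-- helper-table operators (Constant Scan, Concatenation,
-- Merge Interval) can be inspected.
SET STATISTICS XML ON;

SELECT *
FROM #wtfTable
WHERE WtfId IN (@ID1, @ID2);

SET STATISTICS XML OFF;
```

Clicking the returned plan XML in SSMS opens the graphical plan, where each operator's properties show the `ExpFrom`/`ExpTo`/`ExpFlags` expressions.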

If you examine the operators in this plan closely, you'll see that the `IN` part of the query is converted into a table with two rows and three columns, which the Concatenation operator returns. Each row in this helper table defines a range for seeking in the index.

ExpFrom    ExpTo    ExpFlags
@ID1       @ID1     62
@ID2       @ID2     62

The internal `ExpFlags` column specifies what kind of range seek is needed (`<`, `<=`, `>`, `>=`). If you add more variables to the `IN` clause, you'll see them in the plan concatenated onto this helper table.

The Sort and Merge Interval operators make sure that any possibly overlapping ranges are merged. See the detailed post about the Merge Interval operator by Fabiano Amorim, which examines plans with this shape. Here is another good post about this plan shape by Paul White.

In the end, the helper table with two rows is joined with the main table: for each row in the helper table there is a range seek in the clustered index from `ExpFrom` to `ExpTo`, which is shown in the Index Seek operator. The Seek operator displays `<` and `>`, but that is misleading; the actual comparison is defined internally by the `ExpFlags` value.
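Logically, the join driven by the helper table can be sketched as the following hypothetical T-SQL rewrite. This is only an illustration of what the plan does internally, not something the engine literally executes, and it assumes the flags value 62 means an inclusive range (`>=` and `<=`), which is consistent with the point-equality seeks here:

```sql
-- Sketch: each row of the internal helper table drives one
-- range seek on the clustered index, from ExpFrom to ExpTo.
SELECT t.*
FROM (VALUES (@ID1, @ID1),
             (@ID2, @ID2)) AS r (ExpFrom, ExpTo)
JOIN #wtfTable AS t
  ON t.WtfId >= r.ExpFrom
 AND t.WtfId <= r.ExpTo;
```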

If you had a different set of ranges, for example:

WHERE 
    ([WtfId] >= @ID1 AND [WtfId] < @ID2)
    OR [WtfId] = @ID3

, you would still see the same plan shape with the same seek predicate, but different `ExpFlags` values.

So, there are two seeks:

from @ID1 to @ID1, which returns one row
from @ID2 to @ID2, which returns one row

In the query with variables, the internal expressions read values from the variables when needed. A variable's value doesn't change during query execution, so everything behaves correctly, as expected.

How NEWID() affects it

When we use `NEWID()` as in your example:

SELECT *
FROM #wtfTable
WHERE WtfId IN ('00000000-0000-0000-0000-000000000000', NEWID());

the plan and all the internal processing are the same as for variables.

The difference is that this internal table effectively becomes:

ExpFrom    ExpTo    ExpFlags
0...0      0...0    62
NEWID()    NEWID()  62

`NEWID()` is called two times: once for `ExpFrom` and once for `ExpTo`. Naturally, each call produces a different value, so the seek range is no longer a single point but an interval between two random GUIDs, which can cover many existing values in the table.

There are two range seeks of the clustered index, with ranges

from `0...0`     to `0...0`
from `some_id_1` to `some_id_2`

Now it is easy to see how such a query can return many rows, even though the chance of a `NEWID()` collision is vanishingly small.
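Because the two `NEWID()` calls are independent, the second seek behaves roughly like this sketch (the real endpoints come from the internal helper table; this literal rewrite might itself be optimised differently):

```sql
-- Sketch of the effective second seek: two independent NEWID()
-- calls produce the two endpoints of the range. When the first
-- value happens to sort before the second, the seek returns a
-- random slice of the table; when it sorts after, the range is
-- empty, which is why the query only misbehaves on some runs.
SELECT *
FROM #wtfTable
WHERE WtfId >= NEWID()   -- first call becomes ExpFrom
  AND WtfId <= NEWID();  -- second call becomes ExpTo
```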


Apparently, the optimiser thinks it can call `NEWID()` twice instead of remembering the first generated random value and reusing it later in the query. There have been other cases where the optimiser called `NEWID()` more times than expected, producing similarly impossible-looking results.

For example:

Is it legal for SQL Server to fill PERSISTED columns with data that does not match the definition?

Inconsistent results with NEWID() and PERSISTED computed column

The optimiser should know that `NEWID()` is non-deterministic. Overall, it feels like a bug.

I don't know anything about SQL Server internals, but my wild guess looks like this: there are runtime-constant functions like `RAND()`, and `NEWID()` was put into this category by mistake. Then somebody noticed that people do not expect it to return the same ID the way `RAND()` returns the same random number for each invocation, and patched it by actually regenerating a new ID each time `NEWID()` appears in an expression. But the overall rules for the optimiser remained the same as for `RAND()`, so the higher-level optimiser thinks that all invocations of `NEWID()` return the same value and freely rearranges expressions containing `NEWID()`, which leads to unexpected results.
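Whatever the internal reason, the practical defence follows from the variable experiment above: evaluate `NEWID()` into a variable first, so the query itself only sees a value that cannot change during execution. A minimal sketch:

```sql
-- Workaround: materialise the random GUID into a variable so it
-- is generated exactly once, before the query runs.
DECLARE @newId UNIQUEIDENTIFIER = NEWID();

SELECT *
FROM #wtfTable
WHERE WtfId IN ('00000000-0000-0000-0000-000000000000', @newId);
```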


There is another question about similarly strange behaviour of `NEWID()`:

NEWID() In Joined Virtual Table Causes Unintended Cross Apply Behavior

The answer says that there is a Connect bug report, which was closed as "Won't fix". The comments from Microsoft essentially say that this behaviour is by design.

The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution.