FaithInMotion FaithInMotion - 6 months ago 15
MySQL Question

SQL lower() on one side of WHERE statement works for part of result but not all?

I will happily take an edit on the title - I have no clue what to call this. Basically, I'm looking for an explanation to something I just came across in my work environment.

Consider the user with the user name

Test User randomlyGeneratedString
(this is a test-generated user). This user sold a product called
TEST PRODUCT (DO NOT BUY)
.

For work, I had to create a search on the user name that included using "lower()" on a join. In my naivety, this is what I came up with:

WHERE....
AND (...
OR products.product_name LIKE %$search_terms%
OR users.display_name LIKE lower(%$search_terms%)
)


(obviously, terms was cleaned up before hitting the database - I'm just trying to lay out the example)

This created some other fun things that I'm not going to get into for the sake of this question (like the conflict between TEST in both the product and the user name, and the fact that a second lower() is needed around the column).

Here is what I'm trying to understand.


  • TEST, test, and Test obviosly all returned Product Name results.

  • User, user, and User returned NOTHING.

  • randomlyGeneratedString, randomlyGeneratedStrinG returned the CORRECT RESULTS



My boss said that it shouldn't return ANY results (with the exception of the product names issue) because the users table requires the lower search to match case. However, neither of us can figure out why a partial search on only the randomly generated string returned correct results while a partial search on the middle of the name returned absolutely nothing.

EDIT - CLARIFICATION 1: users.display_name is a binary, nullable, varchar column. products.product_name is also a varchar but neither binary nor nullable.

EDIT - CLARIFICATION 2: To answer a provided below question - the reason we are using a LIKE search on a case sensitive table is because its an old old table that may or may not have been designed properly to begin with, and the search had to accept partial matches. LIKE and LOWER() seems like the only answer to this.

Adding a note to say that yes, I understand that the LOWER() should have been on both sides of the LIKE, and it is now. The question is centered around the fact that it was on ONE SIDE and behaving in an unexpected manner (that got me in trouble for not properly testing, as what I tested was the part that was returning as expected).

Sample data that I was actually specifically working with...

Here is the list of possible user names to be returned:


  • Test User ztr9ejsj6a

  • Test User 2i0jltuwrl

  • Test User csmr0e16ry

  • Test User 57lhpp4bfi

  • Test User 6pvz9w7m6z



Searching
ztr9ejsj6a
returns 1 result:


  • Test User ztr9ejsj6a



Searching
ztr9ejsj6A
(note the last letter capitalized) returns 1 result:


  • Test User ztr9ejsj6a



Searching 'User', 'user', and 'USER' returns 0 results

Searching any variation of TEST is irrelevant (products have TEST in their name and therefore are searched before the user name).

Answer

TD;DR

SQL lower() on one side of WHERE statement works for part of result but not all?

Yes, when one side is case sensitive/binary and there are values in the column with upper case characters in them. LOWER() will match the substring in lower case only.

Detailed

In MySQL there is no difference between the values tEsT and TeSt when you are using LIKE and both of the operands are case insensitive.

This is why you get results for the products.product_name LIKE %$search_terms% regardless of the character cases in the search term or in the stored values.

The string comparison becomes more strict when one of the operands are case sensitive or binary. To have a match all characters should be in the correct case.

users.display_name LIKE lower(%$search_terms%) This solves nothing, unless all the values are stored in lower case in the display_name column. (As the OP stated, the display_name column has a binary collation.) As a side effect, if you have a UNIQUE key on this column, the values tEsT and TeSt are treated as different values and you can store both of them.

To make the above condition work as expected, you have to convert both sides to the same case: lower(users.display_name) LIKE lower(%$search_terms%), or you can convert the collation of the column using the COLLATE keyword: users.display_name COLLATE utf8_general_ci = %$search_terms%

Please note that this may render two different values equal, (tEsT and TeSt will become equal). Also note, that using any functions/transformations on a column will prevent MySQL to use indexes on that column. This includes the COLLATE keyword.

Only one question left: Why do you store those values in a case sensitive (or binary) column and want to search using case insensitive comparison?

You can check the statements above using a batch like this:

SET
  @bin = 'tEsT' COLLATE utf8_bin,
  @ci = 'tEsT' COLLATE utf8_general_ci
;

SELECT
  @bin, LOWER(@bin),
  @bin LIKE '%TeSt%',              -- 0
  LOWER(@bin) LIKE '%TeSt%',       -- 0
  LOWER(@bin) LIKE LOWER('%TeSt%') -- 1
;

SELECT
  @ci, LOWER(@ci),
  @ci LIKE '%TeSt%',              -- 1
  LOWER(@ci) LIKE '%TeSt%',       -- 1
  LOWER(@ci) LIKE LOWER('%TeSt%') -- 1
;

Disclaimer

You mentioned, that randomlyGeneratedString, randomlyGeneratedStrinG returned the CORRECT RESULTS. The correct result is the users with randomlygeneratedstring in their user names at any position. Unfortunatelly you did not provide sample data what we can use to explain how string comparison works in MySQL.

Sample data

Here is the list of possible user names to be returned:

Test User ztr9ejsj6a
Test User 2i0jltuwrl
Test User csmr0e16ry
Test User 57lhpp4bfi
Test User 6pvz9w7m6z

Searching ztr9ejsj6a returns 1 result: Test User ztr9ejsj6a Searching ztr9ejsj6A (note the last letter capitalized) returns 1 result: Test User ztr9ejsj6a.

In both cases, the search string will be converted to it's lower case variant, which is 'ztr9ejsj6a', which is available in the database as the part of one display name.

Searching 'User', 'user', and 'USER' returns 0 results

This is because, you have no values with lower case user in the display name.

Searching any variation of TEST is irrelevant (products have TEST in their name and therefore are searched before the user name).

This is not true, you will have no results from the user table, since you have no records with the lower case test in them.

Applying the LOWER() function on the column will render all values lower case and allows the results you may want.

Your test data is a bit wrong tho. For the random string you have only lower case values and the 'Test' and 'User' parts are always in the mentioned form. In short, your test data does not contain all possible variations and with this, you may miss to find some bugs in your logic. Consider to use this (or something like this):

Test User ztr9ejsj6a -- no match for 'user' nor for 'test'
Test user 2i0jltuwrl -- match for 'user', but not for 'test'
test user csmr0e16ry -- match for both 'test' and 'user'
test User 98sdvg09s8 -- match for 'test' but not for 'user'
Test User 57lhPP4bFi -- no match for '57lhpp4bfi'
Test User 6Pvtestm6Z -- match for 'test' due to the random text!

The right side of the table above stands for all mentioned values in all case combinations (since they will be converted to their lower case variants).

As you can see, when you transform your search string to lower case, you will include some of the records, and exclude others, since the comparison is done using a case sensitive method. To prevent this either apply the same transformations to both sides or force both sides into their case insensitive form.