I have been researching and reading about SQL data types for a few days now (I know... I know, that's not very long) and one of the things that is hard for me to grasp is how to pick the best data type for expandability, efficiency, and ease of access.
I think it's pretty straight forward how to choose basic Data Types (i.e. int vs varchar) but how do you choose between things like blob and text type.
The MySQL man pages are great but they are not what we computer geeks love... efficient.
I think it would be awesome if we could compile a list of the MySQL data types, the general advantages/disadvantages of each and when it would make sense to choose that data type.
MySQL string types come in two variants: One without a character set label and one with a character set label.
A fixed length string, padded with spaces at the end, is CHAR(n). The matching type that has no character set label is BINARY(n). Storing the string "hello" in a
CHAR(255) CHARSET utf8 will take 765 bytes (string padded with spaces to full length, stored as utf8 which as a worst case space usage of 3 bytes/character allocates 3*255 bytes).
A variable length string with one or two length bytes and no padding is VARCHAR((n). The matching type that has no character set label is VARBINARY(n). Storing the string "hello" in a
VARCHAR(255) CHARSET utf8 will take 6 bytes (1 length byte plus 5 bytes for the actual text). Storing the string クリス in the same type will take 10 bytes (1 length byte plus 3 characters using 3 bytes per character to represent them).
mysql> select hex('クリス'), length(hex('クリス'))/2 as bytes; +--------------------+--------+ | hex('クリス') | bytes | +--------------------+--------+ | E382AFE383AAE382B9 | 9.0000 | +--------------------+--------+ 1 row in set (0.02 sec)
A variable length string with one, two, three or four length bytes is TINYTEXT, TEXT, MEDIUMTEXT and LARGETEXT. The matching types that have no character set label are TINYBLOB, BLOB, MEDIUMBLOB and LARGEBLOB.
A TEXT/BLOB-like type differs from a VARCHAR/VARBINARY-like type in how and where the data is stored, see http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/ for details on how TEXT/BLOB-like types are stored in InnoDB depending on version and ROW_FORMAT settings. For performance reasons, you want a recent version of InnoDB and "Barracuda"-Format tables.
MySQL is incapable of working with any data that is larger than max_allowed_packet (default: 1M) in size, unless you construct complicated and memory intense workarounds at the server side. This further restricts what can be done with TEXT/BLOB-like types, and generally makes the LARGETEXT/LARGEBLOB type useless in a default configuration.
For types without a character set label (BINARY, VARBINARY and %BLOB%), MySQL will accept the data as received and write it to disk. For types with a character set label, MySQL will look at what you announced as your clients character set to the server with
SET NAMES and what is the columns defined character set label. It will then convert from the connection character set to the column character set and write the converted data. You can check that with the HEX() function, e.g.
SELECT HEX(str) FROM t WHERE id = ....
On retrieval, the connections announced character set with
SET NAMES can be different from what it had been at write time. MySQL again will check the columns character set label against the character set announced for this connection, and if necessary, will convert to the connection character set.
The performance penalty for that conversion either way is negligible compared to the time taken for the disk I/O incurred for such data anyway, to performancewise it hardly matters which type you choose. The rule is instead: Chose a type with a character set label, if you are working with textual data, and a type without if you are not.
A related question often asked: Shall I choose CHAR or VARCHAR (BINARY or VARBINARY respectively)?
For InnoDB, the answer is always: Chose the variable length data type. There is no performance benefit from fixed length data types in InnoDB, ever, but there is a huge size penalty if you are choosing a fixed length data type and then are not using all space in it. Plus fixed length SQL string types have really weird rules regarding padding and trimming with spaces at the end which you probably can't be bothered to learn. For MySQL the case can be different, but almost never is.
Another related question: Shall I choose VARCHAR or TEXT for my strings (VARBINARY or BLOB, respectively)?
The answer for that is using a recent version of InnoDB, Barracuda format tables and then TEXT/BLOB. The reason for that is explained at length in http://www.mysqlperformanceblog.com/2011/04/07/innodb-row-size-limitation/. The upshot of that is: With either VARCHAR or TEXT/BLOB in pre-Barracuda format you run the risk of overflowing the InnoDB row size limitation if you have too many of them in a single row.
And finally: Shall I store files/images/other large blob or text data in the database?
The answer for that is: Usually not. Serving files from a database (http://mysqldump.azundris.com/archives/36-Serving-Images-From-A-Database.html) is an expensive operation compared to serving files from a filesystem. If at all possible, you'd want to do that instead. There is a way around that, http://www.blobstreaming.org/, but that is advanced technology that requires that you have total control of your execution environment, which is never the case in a hosted environment.
To round this off: There are no variable length data types in MEMORY engine tables. So if you are seeing "using temporary" in an
EXPLAIN output, this means
If the temporary table by this process becomes larger than tmp_table_size OR max_heap_table_size, it is on the fly converted to MyISAM format and goes to disk.
Example: You are defining a Ruby Active Record class
User with contains ten fields tagged as
:string. Each of these end up being
VARCHAR(255) CHARSET utf8 in your
Elsewhere in your codebase,
Users is being used in a way that involves a plan
using temporary. You are instantly dying in disk operations under load, because each row of the
Users table is now using 7650 bytes in MEMORY at least, most of them being spaces used as padding. This forces the temporary table to be converted to MyISAM and written to disk.
This implies that any query with a TEXT or BLOB type and a plan with "using temporary" needs to be rewritten in order to avoid temp tables hitting the disk.