Thursday, July 28, 2011

MySQL Group By is a little too indulgent

After 30 years of Oracle, I've found myself using MySQL recently. I came across a little thing that surprised me. I'm by no means the first to trip over this - I found this 2006 post from Peter Zaitsey on the same topic.

MySQL lets you write a group by statement that references columns that aren't in the group by, and aren't aggregates. For example:

mysql> select table_name, column_name, count(*)
-> from information_schema.columns
-> where table_schema = 'information_schema'
-> group by table_name
-> limit 5;
+---------------------------------------+--------------------+----------+
| table_name | column_name | count(*) |
+---------------------------------------+--------------------+----------+
| CHARACTER_SETS | CHARACTER_SET_NAME | 4 |
| COLLATIONS | COLLATION_NAME | 6 |
| COLLATION_CHARACTER_SET_APPLICABILITY | COLLATION_NAME | 2 |
| COLUMNS | TABLE_CATALOG | 19 |
| COLUMN_PRIVILEGES | GRANTEE | 7 |
+---------------------------------------+--------------------+----------+
5 rows in set (0.07 sec)


A similar query from any version of Oracle would fail:

SQL> select table_name, column_name, count(*)
2 from dba_tab_columns
3 group by table_name;
select table_name, column_name, count(*)
*
ERROR at line 1:
ORA-00979: not a GROUP BY expression


In effect MYSQL is doing the GROUP BY as requested, and giving you the first value it comes across for the un-aggregated columns (COLUMN_NAME in this example). A near equivalent Oracle query would be:

SQL> select table_name, min(column_name), count(*)
2 from dba_tab_columns
3* group by table_name


TABLE_NAME MIN(COLUMN_NAME) COUNT(*)
------------------------------ ------------------------------ ----------
ICOL$ BO# 14
PROXY_ROLE_DATA$ CLIENT# 3
TS$ AFFSTRENGTH 32
ARGUMENT$ ARGUMENT 25
IDL_CHAR$ LENGTH 6
TRIGGER$ ACTION# 19
TRIGGERCOL$ COL# 5


But in the Oracle case we are explicitly selecting the MIN(column_name), whereas MySQL's laxer behaviour is just picking the first column name at random (or rather, dependent on the execution plan).

So: when grouping in MySQL, make double certain that your SQL is really returning the number of rows you expected. In our example it is possible that the intention was actually the very different:

mysql> select table_name, column_name, count(*)
-> from information_schema.columns
-> where table_schema = 'information_schema'
-> group by table_name, column_name
-> limit 20;
+---------------------------------------+--------------------------+----------+
| table_name | column_name | count(*) |
+---------------------------------------+--------------------------+----------+
| CHARACTER_SETS | CHARACTER_SET_NAME | 1 |
| CHARACTER_SETS | DEFAULT_COLLATE_NAME | 1 |
| CHARACTER_SETS | DESCRIPTION | 1 |
| CHARACTER_SETS | MAXLEN | 1 |
| COLLATIONS | CHARACTER_SET_NAME | 1 |
| COLLATIONS | COLLATION_NAME | 1 |
| COLLATIONS | ID | 1 |
| COLLATIONS | IS_COMPILED | 1 |
| COLLATIONS | IS_DEFAULT | 1 |
| COLLATIONS | SORTLEN | 1 |
| COLLATION_CHARACTER_SET_APPLICABILITY | CHARACTER_SET_NAME | 1 |
| COLLATION_CHARACTER_SET_APPLICABILITY | COLLATION_NAME | 1 |
| COLUMNS | CHARACTER_MAXIMUM_LENGTH | 1 |
| COLUMNS | CHARACTER_OCTET_LENGTH | 1 |
| COLUMNS | CHARACTER_SET_NAME | 1 |
| COLUMNS | COLLATION_NAME | 1 |
| COLUMNS | COLUMN_COMMENT | 1 |
| COLUMNS | COLUMN_DEFAULT | 1 |
| COLUMNS | COLUMN_KEY | 1 |
| COLUMNS | COLUMN_NAME | 1 |
+---------------------------------------+--------------------------+----------+
20 rows in set (0.06 sec)

Happy debugging everyone!

No comments: