Today we had an interesting situation where the same query was executed significantly slower when it was written with GROUP BY instead of DISTINCT and I saw many people still had the assumption that these two types of queries are actually equivalent which is simply not true. Although DISTINCT queries can be implemented using GROUP BY but not every GROUP BY query can be translated to DISTINCT. Depending on the brand and the optimizer the database server may actually use group by internally for the execution of distinct but that won’t make them equivalent. Let’s see why…
GROUP BY as the name suggest groups the result by some set of parameters and evaluate the whole result set. In most databases group by is implemented based on sorting and the same rules applies to it as well.
DISTINCT will make sure that the same row won’t be returned in the result set twice. Distinct doesn’t necessary need sorting. It can be implemented with a temporary table where during execution rows can be checked if they are in already or not and act in that regard.
The temporary table method can be very expensive for group by as it usually means counting, calculating, aggregating something while distinct doesn’t do any of these and can be much faster than sorting especially if not everything (filters and sorts) is covered by indexes. Smart developers programmed this assumption in the query optimizers as well.
Real situation in MySQL
The queries are actual queries from one of our system but I changed them in order to hide sensitive informations. The table has many indexes unfortunately none of them were entirely optimal for the group by but the order could be done by one of them. Here’s what happened:
Execution
GROUP BY:
MySQL preferred to use the index and read it in the t1.some_id order as it assumes (99% of the times correctly) that it is faster for the GROUP BY than doing random read. In this case though it wasn’t true. It had to do read_next 410429 times because the index wasn’t perfect for the filtering and sorting. Query time was 200 milliseconds. Not bad but …
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
root@somehost [test_db]> flush status; Query OK, 0 rows affected (0.00 sec) root@somehost [test_db]> SELECT SQL_NO_CACHE t1.some_id, t1.slug FROM Table1 t1 JOIN Table2 t2 ON (h.slug = t1.slug) WHERE t2.deleted = 0 group by t1.some_id; 192 rows in set (0.20 sec) root@somehost [test_db]> show session status like 'handler%'; +----------------------------+--------+ | Variable_name | Value | +----------------------------+--------+ | Handler_commit | 1 | | Handler_delete | 0 | | Handler_discover | 0 | | Handler_prepare | 0 | | Handler_read_first | 1 | | Handler_read_key | 260 | | Handler_read_last | 0 | | Handler_read_next | 410429 | | Handler_read_prev | 0 | | Handler_read_rnd | 0 | | Handler_read_rnd_next | 0 | | Handler_rollback | 0 | | Handler_savepoint | 0 | | Handler_savepoint_rollback | 0 | | Handler_update | 0 | | Handler_write | 0 | +----------------------------+--------+ |
DISTINCT
MySQL figured it out that it doesn’t have to do any kind of sorting so why bother reading in any type of order. You can see it had the same 260 read_key and then it did 254 read_next instead of the previous ~400k and 194 read_rnd_next. Of course it had to write too to the temporary table which resulted in 192 handler_write (the size of the result set). Query time was 0.00 sec.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
root@somehost [test_db]> flush status; Query OK, 0 rows affected (0.00 sec) root@somehost [test_db]> SELECT SQL_NO_CACHE DISTINCT t1.some_id, t1.slug FROM Table1 t1 JOIN Table2 t2 ON (h.slug = t1.slug) WHERE t2.deleted = 0; 192 rows in set (0.00 sec) root@somehost [test_db]> show session status like 'handler%'; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 1 | | Handler_delete | 0 | | Handler_discover | 0 | | Handler_prepare | 0 | | Handler_read_first | 1 | | Handler_read_key | 260 | | Handler_read_last | 0 | | Handler_read_next | 254 | | Handler_read_prev | 0 | | Handler_read_rnd | 0 | | Handler_read_rnd_next | 194 | | Handler_rollback | 0 | | Handler_savepoint | 0 | | Handler_savepoint_rollback | 0 | | Handler_update | 0 | | Handler_write | 192 | +----------------------------+-------+ |
Profiles
Turning profile on is always useful for troubleshooting query performance to see what was the bottleneck.
1 2 3 4 5 6 7 8 |
root@somehost [test_db]> show profiles; +----------+------------+----------------------------------------+ | Query_ID | Duration | Query | +----------+------------+----------------------------------------+ | 1 | 0.00292700 | SELECT SQL_NO_CACHE DISTINCT ... | | 2 | 0.19719600 | SELECT SQL_NO_CACHE ... group by ... | +----------+------------+----------------------------------------+ 2 rows in set (0.00 sec) |
1 2 3 4 5 6 7 8 9 |
root@somehost [test_db]> show profile for query 1; +----------------------+----------+ | Status | Duration | +----------------------+----------+ ... | Copying to tmp table | 0.002321 | ... +----------------------+----------+ 21 rows in set (0.00 sec) |
1 2 3 4 5 6 7 8 |
root@somehost [test_db]> show profile for query 2; +----------------------+----------+ | Status | Duration | +----------------------+----------+ ... | Sending data | 0.196726 | ... +----------------------+----------+ |
The significant time for group by was to talk to the storage engine (sending data) and for the distinct it was creating the temporary table (Copying to tmp table). 197 milliseconds vs 2 milliseconds. This is where the speed difference came from. So it was a tread off between memory and time.
Does this mean distinct is better than group by? NO! It only means that in this particular situation it was a better choice. If you have proper indexes group by actually runs faster because the you can avoid the temporary table writes and read only the rows you actually need for the result set.
Conclusion
- DISTINCT and GROUP BY are not equivalent and interchangeable even though DISTINCT can be rewritten by using GROUP BY.
- Both have their own purpose and should be used when appropriate.
- Always test your queries, the query execution. That tells you much more than explain can which is a good start but not comprehensive debug and/or profiling tool.
Recent comments