Difference between DISTINCT and GROUP BY

Today we had an interesting situation where the same query was executed significantly slower when it was written with GROUP BY instead of DISTINCT and I saw many people still had the assumption that these two types of queries are actually equivalent which is simply not true. Although DISTINCT queries can be implemented using GROUP BY but not every GROUP BY query can be translated to DISTINCT. Depending on the brand and the optimizer the database server may actually use group by internally for the execution of distinct but that won’t make them equivalent. Let’s see why…

GROUP BY as the name suggest groups the result by some set of parameters and evaluate the whole result set. In most databases group by is implemented based on sorting and the same rules applies to it as well.

DISTINCT will make sure that the same row won’t be returned in the result set twice. Distinct doesn’t necessary need sorting. It can be implemented with a temporary table where during execution rows can be checked if they are in already or not and act in that regard.

The temporary table method can be very expensive for group by as it usually means counting, calculating, aggregating something while distinct doesn’t do any of these and can be much faster than sorting especially if not everything (filters and sorts) is covered by indexes. Smart developers programmed this assumption in the query optimizers as well.

Real situation in MySQL

The queries are actual queries from one of our system but I changed them in order to hide sensitive informations. The table has many indexes unfortunately none of them were entirely optimal for the group by but the order could be done by one of them. Here’s what happened:

Execution

GROUP BY:

MySQL preferred to use the index and read it in the t1.some_id order as it assumes (99% of the times correctly) that it is faster for the GROUP BY than doing random read. In this case though it wasn’t true. It had to do read_next 410429 times because the index wasn’t perfect for the filtering and sorting. Query time was 200 milliseconds. Not bad but …

DISTINCT

MySQL figured it out that it doesn’t have to do any kind of sorting so why bother reading in any type of order. You can see it had the same 260 read_key and then it did 254 read_next instead of the previous ~400k and 194 read_rnd_next. Of course it had to write too to the temporary table which resulted in 192 handler_write (the size of the result set). Query time was 0.00 sec.

Profiles

Turning profile on is always useful for troubleshooting query performance to see what was the bottleneck.

The significant time for group by was to talk to the storage engine (sending data) and for the distinct it was creating the temporary table (Copying to tmp table). 197 milliseconds vs 2 milliseconds. This is where the speed difference came from. So it was a tread off between memory and time.

Does this mean distinct is better than group by? NO! It only means that in this particular situation it was a better choice. If you have proper indexes group by actually runs faster because the you can avoid the temporary table writes and read only the rows you actually need for the result set.

Conclusion

  1. DISTINCT and GROUP BY are not equivalent and interchangeable even though DISTINCT can be rewritten by using GROUP BY.
  2. Both have their own purpose and should be used when appropriate.
  3. Always test your queries, the query execution. That tells you much more than explain can which is a good start but not comprehensive debug and/or profiling tool.

You might like these too

How to decrease IOPS when running MySQL on ZFS Recordsize on data volume This is the first thing you read when it comes to running MySQL on ZFS. InnoDB is using 16k pages so in theory this makes a...
MySQL 5.6 mysql_install_db script problem We're always testing the latest versions of MySQL with most of the environments to make sure that we can find the critical issues before it goes to pr...
How to troubleshoot MySQL replication issues? In MySQL a big portion of the problems you're facing is different replication delays so there's no surprise this is one of the most common interview q...