Group by limit per group in PostgreSQL

In web applications it’s very common to try to limit the results by group. For example showing all the new posts with the the two latest comments on them. Or have the best selling categories in an e-commerce website showing the 3 most popular products in those categories.

In MySQL you only have one option to iterate through your parent model (Post, Category) and do a query for every subsequent parent. This may or may not be what you want. If you have a large system you can even run them parallel and merge the results in the application which might lead to better overall query time but it costs you complexity and more resources being used concurrently making the database reaching its limits sooner.

In PostgreSQL you can do that in a single query. And actually two very different ways. Window functions are there for quite some time by now in PG and since 9.3 LATERAL JOINs are also available. Both enables us to get the results in a single query very efficiently. And we will also see how they compare in terms of performance.

Using window function

Prior to lateral joins this was the only option to have the limit per group functionality so it became sort of a de fact default for it.

It’s pretty simple once you get familiar with the syntax of window functions. It works the same way how we use group by but here we use partition by and gather the extra information row_number() which we can use in an outer query to filter out results.

Using LATERAL JOIN

The lateral join enables us to refer the outer query’s tables in the joined subselect which makes it a very powerful tool. It is a bit easier to understand and gives a bit of more flexibility with the joins comparing to the previous example.

But which one runs faster?

Benchmark

I will run the queries on a sample database with ~14000 products and ~350 categories first querying for random categories than for the largest ones only.

Random run

AVG Std dev Min Max
Window 5,197 ms 2,059 ms 2,562 ms 18,273 ms
Lateral 4,223 ms 1,358 ms 1,866 ms 10,528 ms
scatter plot

Randomly chosen categories

Large categories only

AVG Std dev Min Max
Window 20,746 ms 5,114 ms 13,535 ms 31,894 ms
Lateral 10,813 ms 1,392 ms 8,495 ms 15,875 ms
scatter plot

Results for querying large categories

Conclusion

The results are not so surprising looking at the product distribution in the top 10 categories which shows that some categories are stuffed while the others have at least an order of magnitude less products.

Querying the large categories obviously takes longer times but the impact is much more noticeable using the window functions because it actually has to fetch every row in the categories mark them with the row number and only filter that in the outer query while the lateral join limits the results for every category.

Are there any other ways to have the same results? Feel free to comment if you have further ideas to benchmark.

You might like these too

Split or leave frequently updated column in Postgr... I have a database migrated from MySQL to PostgreSQL (I had my good reasons but this post is not about that). In MySQL because of MVCC behaviour it mak...
Postgresql server fails to start in recovery with ... I ran into an issue while trying to setup a simple test system for myself to move from CentOS 6 to 7. When I was about to start a standby slave system...
Postgresql full text search vs Solr Postgresql has really come a long way from being the standard but relatively slow database to the feature rich and extremely fast database what it is ...

About charlesnagy

I'm out of many things mostly automation expert, database specialist, system engineer and software architect with passion towards data, searching it, analyze it, learn from it. I learn by experimenting and this blog is a result of these experiments and some other random thought I have time to time.
Bookmark the permalink.
  • Andriy Tatarchuk

    Thank you Charles for the post. I am trying to understand how it works. Could you please expand more on this part of the query: “where public and category_id = category.id”? public is a column name?

    • Владимир Аюпов

      It’s clearly a boolean-ish column from a table “product”

      • Hi, thanks for the comment! Indeed public is a boolean marking that product is published aka can be shown on the frontend to users and category_id is a foreign key to the category table.

        The query itself is a simplified version of an actual webshop’s most popular N products implementation. It was originally implemented with window functions and when the database was upgraded there was a question if it’s worth to change it to LATERAL JOINs.