Split or leave frequently updated column in PostgreSQL

I have a database migrated from MySQL to PostgreSQL (I had my good reasons but this post is not about that). In MySQL because of MVCC behaviour it makes sense and it’s actually a recommendation to split frequently updated columns from large tables especially if using a web framework like Django which always updates the full row so even if MySQL stores the TEXT column  off-page (look at InnoDB Blob storage for more details) the TEXT columns as well will be read from the separate extent and written to the Doublewrite buffer and to the Undo space (and binary logs if it’s set). To overcome this situation it’s beneficial to have the counters in a separate table.

or alternatively if we want to sort on popularity for example we want those metrics in the main table so we can split the blob to it’s separate table. It has it’s own benefit of being able to list all the posts with comment and read counters by using one table only. And we join the details table when someone read an individual post.

Having said all these I was wondering if the same thing is true for PostgreSQL. So I have setup a benchmark.

I’ve been running the test with over 1000 transactions / sec with 3 different ways:

  • Normal design where everything is in one table
  • Split counters into their own table
  • Alternate design where counters are in the main table and blobs are separate

I used the open source shakespeare database for that. Every query was run against the DB 100 times to have a large enough sample.

The results are in milliseconds.

Updating the counter column only

The queries

Results (update times)

Although the average query time for split was the lowest but taking the standard deviation into account you can see there are no significant difference between the different designs.

Avg StdDev Min Max 95% lower 95% upper
Normal 0,658 0,160 0,459 1,886 0,345 0,970
Split 0,613 0,121 0,444 1,200 0,377 0,850
Alternate 0,675 0,144 0,479 1,551 0,393 0,956

Updating the full row

What if we mimic what Django would do and update the full row and not just the counters.

Results (update times)

Avg StdDev Min Max 95% Lower 95% Upper
Normal 0,860 0,181 0,606 1,695 0,504 1,215
Split 0,645 0,164 0,458 1,697 0,324 0,966
Alternate 0,748 0,137 0,541 1,477 0,480 1,017

Here you definitely can notice a slight difference between the split design and the other two. The split design yields 25% better average performance comparing to the large table. But the confidence intervals shows that it is still statistically irrelevant. Although it’s unlikely but possible that the normal is actually the same performant as the split design.

The alternate design gives roughly 13% better average performance comparing to the large table. The same also has to be noted as above. Statistically still irrelevant.

Select performance

Obviously it has it’s own penalty for your selects if you need to join two tables instead of querying one. So let’s see an example where I fetch the top 10 most read posts.

After creating indexes on the read_counter column on every table.

Queries:

Please note that using the alternative design I didn’t need to join my details table unless I want the full body of the post.

Results (query times)

Avg StdDev Min Max 95% Lower 95% Upper
Normal 1,011 0,216 0,862 3,044 0,587 1,435
Split 1,536 0,109 1,405 2,380 1,321 1,751
Alternate 0,890 0,083 0,766 1,499 0,728 1,052

With the split design the query time increased significantly for the split design by almost 52% (this is statistically relevant as well) while the alternative design even improved the performance by 12% (this is statistically not relevant though).

Disclaimer

You can pretty much achieve the same query time with limiting the columns in your select with the normal large table design. I intended to mimic a web framework’s behaviour so I haven’t done that but that’s an absolutely viable option too.

Conclusion

PostgreSQL handles the large blobs quite well so unless you have updates on your tables in the order of magnitude of 1000 or more per seconds you won’t see a significant benefit of splitting your tables based on the update frequency but you’re going to lose some on the retrieval.

If you like to squeeze the last drop of performance out of your system and don’t mind to pay the complexity price try the alternatively split design which boosts both the read and the write performance.

Taking the normal (large table) design as the baseline here are the statistics about what each improves.

Normal Split Alternative
Update 100% 75% 88%
Select 100% 152% 87%

Only bold numbers are statistically relevant.

You might like these too

Group by limit per group in PostgreSQL In web applications it's very common to try to limit the results by group. For example showing all the new posts with the the two latest comments on t...
Postgresql full text search vs Solr Postgresql has really come a long way from being the standard but relatively slow database to the feature rich and extremely fast database what it is ...
Postgresql server fails to start in recovery with ... I ran into an issue while trying to setup a simple test system for myself to move from CentOS 6 to 7. When I was about to start a standby slave system...

About charlesnagy

I'm out of many things mostly automation expert, database specialist, system engineer and software architect with passion towards data, searching it, analyze it, learn from it. I learn by experimenting and this blog is a result of these experiments and some other random thought I have time to time.
Bookmark the permalink.
  • András Váczi

    Nice one! I think your second sentence is somehow a mixture of two. It is not very clear there where you refer to PostgreSQL and where to MySQL.

    The more important remark is that with those frequently updated columns, there is a real issue in PostgreSQL: if the updates are frequent enough (this could be specified a bit more in detail, I admit), autovacuum cannot keep up. This is also the case, when there is chance for certain tuples on all physical pages never being updated – they will sit there alone in the middle of an 8 kB page, bloating the table by an enormous factor. We used to have, for example, a table where the average row count was around 1 at any given time, while the physical table size was seldom less than 60 MB (!).

    • You’re absolutely right. Thank you for the feedback! Reading it back indeed it wasn’t clear. The first part was about MySQL and design decisions in the MySQL database which got migrated to PG. I updated the post so hopefully it got clarified.

      Very good point! Fortunately my database is not there yet. Only having updates in the order of magnitude of hundreds per day max per row on popular items so vacuuming is not an issue yet. I have a bunch of order by popularity, last_viewed, last_commented queries and those really benefit from having the two tables merged back into one. I will keep an eye on the bloating. Thanks!