Postgresql full text search vs Solr

Postgresql has really come a long way from being the standard but relatively slow database to the feature-rich and extremely fast database it is today. With support for unstructured data (HSTORE, JSON[B]), geo functions (PostGIS) and full text search (TSVECTOR), it represents real competition to dedicated search engines.

Looking at all the analytical features that were added to PG lately (window functions, lateral joins, etc.), it also works pretty well for data analysis and BI.

Having said that, I was wondering whether it still makes sense for the majority of websites to rely on a separate search engine to provide quick search results over several possible filters, with facets and all the other nice features.

Disclaimer: I’m a big fan of Solr. I maintain a fork of the sunburnt Python client library for Solr on GitHub, and if you read my posts you can see I have always been interested in getting the most out of Solr.

Hypothesis

Even though Solr is a great tool, there are some clear drawbacks to a MySQL + Solr setup.

  • Maintenance: two pieces of software to configure and monitor
  • Continuous integration: you have to maintain two schemas and two data sources, which makes continuous deployment hard and slows down development
  • Testing: tests have to cover both sources to handle every case
  • Competing for resources: if MySQL and Solr run on the same server, they compete for the same resources (memory, disk)
  • Complexity: data retrieval and processing algorithms have to be aware of where the data originated
  • Indexing: the data in MySQL needs to be indexed to keep Solr in sync. It isn’t particularly hard, but what if we could get rid of that step?

I want to see whether replacing the RDBMS + search engine combination with Postgresql alone is a viable option. The decision should weigh these downsides against the possible performance penalties.

So let’s see the results.

Benchmark

I’ve set up a fairly simple benchmark involving a product catalog of 9000 products with brand, category and tags. I want my full text search to look for matches in the product name, brand, category, tags and short_description.
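As a sketch of what such a search might look like on the Postgresql side (the table and column names here are my assumptions, not necessarily the benchmark's actual schema), the match condition can be built with to_tsvector and plainto_tsquery over the concatenated fields:

```python
# Hypothetical sketch of the full text match; table/column names are assumptions.
SEARCH_SQL = """
SELECT id, name, brand, category, short_description
FROM products
WHERE to_tsvector('english',
        coalesce(name, '') || ' ' || coalesce(brand, '') || ' ' ||
        coalesce(category, '') || ' ' || coalesce(tags, '') || ' ' ||
        coalesce(short_description, ''))
      @@ plainto_tsquery('english', %s)
"""

def search(conn, term):
    """Run the search on an open psycopg2 connection."""
    cur = conn.cursor()
    cur.execute(SEARCH_SQL, (term,))
    return cur.fetchall()
```

In practice you would back this with a GIN index on the same expression (or a dedicated tsvector column) so the query doesn't have to scan the whole table.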

The data comes from a running e-commerce website that currently uses MySQL and Solr.

Setup

Neither Solr nor Postgresql uses a heavily tuned config. Solr got a bump in available memory to hold the full index in memory and uses HTTP caching. Postgresql runs on the stock configuration except for an increased shared_buffers.
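For reference, that kind of bump is a one-line change in postgresql.conf; the value below is purely illustrative, not the benchmark's actual setting:

```
shared_buffers = 1GB    # stock default is much lower (typically 128MB)
```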

I used Python with urllib for Solr and psycopg2 for Postgresql.

Criteria

Each test has to return a list of dictionaries so that we compare apples with apples. Strictly comparing the raw responses wouldn’t be fair, because Solr’s JSON or XML output still needs to be parsed to be usable, while a cursor.execute wouldn’t (necessarily) result in the data actually being fetched.
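A minimal sketch of how the Postgresql side can be normalized to a list of dictionaries (the helper name is mine, not from the original benchmark code):

```python
def rows_to_dicts(cursor):
    """Turn fetched rows into a list of dicts keyed by column name,
    mirroring the parsed JSON documents Solr returns."""
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

This forces both sides to pay the cost of materializing usable Python objects, which is the point of the criterion above.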

Reconnect: we know Postgresql is quite heavy on connect time, so I used two different functions, pg and pg_recon. One uses an existing connection while the other reconnects before every query.

Functions
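The two variants might be sketched roughly like this (a simplified sketch, not the original benchmark code; the DSN and query are placeholders, and psycopg2 is imported lazily so the snippet loads without it):

```python
def pg(conn, sql, params=()):
    """Run one query on an already open connection."""
    cur = conn.cursor()
    cur.execute(sql, params)
    return cur.fetchall()

def pg_recon(dsn, sql, params=()):
    """Open a fresh connection for every query, so the per-query
    connect cost is included in the measurement."""
    import psycopg2  # assumed available in the benchmark environment
    conn = psycopg2.connect(dsn)
    try:
        return pg(conn, sql, params)
    finally:
        conn.close()
```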

Results

Query / Second

First I ran queries for 10 seconds over multiple rounds to see how many queries can run in a second.

Higher is better

Postgresql with a reconnect for every query is significantly slower, but that was expected. What is surprising is that, using an existing connection, Postgresql can do 29.6% more queries per second than Solr. That is not just on par with Solr but a significant performance difference.
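The throughput loop described above can be sketched like this (my simplified harness, not the original benchmark code):

```python
import time

def queries_per_second(run_query, duration=10.0):
    """Call run_query in a tight loop for `duration` seconds and
    return the achieved throughput in queries per second."""
    start = time.monotonic()
    count = 0
    while time.monotonic() - start < duration:
        run_query()
        count += 1
    return count / duration
```

The same loop works for both backends: pass in a closure that either hits Solr over HTTP or runs the Postgresql query, and compare the returned rates.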

Querytime

After the results above, the query times won’t be a big surprise. Below are the 95th percentile query times.

Lower is better

Roughly the same picture: Postgresql is 27% faster than Solr in terms of query time.
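For completeness, a 95th percentile like the one reported above can be computed from the per-query timings with a simple nearest-rank helper (my sketch, not the original code):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: pct=95 gives the p95 latency."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```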

Conclusion

Not only is Postgresql a viable replacement for a traditional RDBMS + search engine setup, simplifying the infrastructure, it is actually faster and able to serve more queries in the same time.


About charlesnagy

I'm, among many other things, an automation expert, database specialist, system engineer and software architect with a passion for data: searching it, analyzing it, learning from it. I learn by experimenting, and this blog is the result of those experiments and some other random thoughts I have from time to time.
  • pietrop

    Very interesting! Can you also add an INSERT-like test in order to compare write speeds? Are you using indexes on the tsvector attribute? What hardware/software configuration are you running on?
    Many thanks