Full text search in MySQL

Full text is a critical point when it comes to mysql. It used to have that feature in MyISAM but that’s not really maintained anymore nor it is advised to use unless you have a very specific use case in which it might make sense. There are 3rd party solution which takes the problem away (Lucene, Sphinx, Solr, ElasticSearch) but all bring extra complexity to your setup which has its own cost. So do you need to give up on fulltext search if you’re using MySQL + InnoDB? No! Definitely not.

Alternatives for full-text search in MySQL

I have an live database with 9000+ products with category, brand and short description which is perfect to test my searches on. I’m going to run the queries many times and use profile information to collect more granular and accurate timing information.

LIKE

Well… It’s not a real alternative but just for a sec see how well it behaves.

Avg execution time: 105 ms
Extra space required: 0

MySQL 5.6 comes with integrated FTS on InnoDB

Innodb now support full text search with the MATCH .. AGAINST commands. Unfortunately if you want good performance you need to materialize your columns into a single column what you create your index on. To do this you can do the following change:

With this you now have a single column with all the text you wanted to index providing one single index you can search against:

Avg execution time: 36 ms

Extra space: In my case 38%
it depends heavily on your requirements (what you want to be indexed) but can easily double your table size especially if you want to include relational fields like brand name and category name which make sense, you don’t want one separate FTS index on every column of yours

Disclaimer:

There are solution which could work without materializing but since the primary goal for this test to compare performance I didn’t consider these as viable options.

Can we do better? Yes, we can … implement our own Inverted Index

What is inverted index? An index data structure that maps content such as words or numbers to their location in a file, document or sets of documents. It’s widely used in full text searches. Basically you trade disk space and extra computation on document insertion for speed on the search. Actually InnoDB FTS index is using inverted index as well internally but it’s only in 5.6 and it’s a bit rigid of what and how can you use it.

With these tables we can rewrite our previous query to something similar:

Avg execution time: 9.9 ms
Extra space: In my case it was 7.6% extra

Comparision

Query times

Query time comparison (smaller the better)

extra space

Extra space required by the indexing strategy comparison
(smaller the better)

Advanced features: Ngrams, stemming and tokenising

With this setup you have a flexible solution which you can apply many things on for example stemming and/or tokenisers. Instead of writing the term table with the original text you can do something like this small python code does and write the stemmed words only. (You’re only using that table for searching so storing the stemmed version is totally enough)

Also if you want to store n-grams that’s simple too:

Don’t forget to also change your query to search for the stemmed term.

And that’s it. You have a fully functional feature rich full-text search which is performing very well and can be implemented in any database. Getting this idea and take it further you can also do tokenizing by removing accents, replace different characters to whitespace or anything else you can possible think.

You might like these too

Difference between DISTINCT and GROUP BY Today we had an interesting situation where the same query was executed significantly slower when it was written with GROUP BY instead of DISTINCT and...
How to store boolean in MySQL I always hear newer and newer exotic way to store different types of data in MySQL. People are trying to solve the problem of storing complex and not ...
How to troubleshoot MySQL replication issues? In MySQL a big portion of the problems you're facing is different replication delays so there's no surprise this is one of the most common interview q...