5 steps to self-managing server infrastructure

Managing servers is tedious work we all have to do to some extent. But it doesn’t have to fill our whole day. What if I told you that you can build a self-managing system with some discipline and effort? I went through implementing a self-managing database infrastructure of thousands of MySQL servers, and I’ll walk you through the milestones so you can build yours too.

I’m going to use a MySQL upgrade as the running example.

1. Script it

You’re probably tired of typing the same repetitive commands over and over again. Why not script it? Whatever you can copy-paste into a console you can also paste into a bash file or a Perl/Python script, and run that instead of a series of individual commands. You still need to make sure your server is out of production traffic and your customers won’t be impacted when you do it, but running a single command that probably starts something like the one below saves you a lot of time, which you can spend on taking this to the next level.
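
On a Debian/Ubuntu host, for instance, the core of such a script might look roughly like this (a sketch only; package names, service names, and credentials will differ in your environment):

    #!/bin/bash
    # Rough sketch of a scripted MySQL upgrade on a Debian-like host.
    # Assumes the host has already been taken out of production traffic.
    set -euo pipefail

    sudo systemctl stop mysql
    sudo apt-get update
    sudo apt-get install -y --only-upgrade mysql-server
    sudo systemctl start mysql
    mysql_upgrade -u root -p    # check and fix system tables after the upgrade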

Pro tips

  • Keep it simple

You’re four steps away from total automation, so do not overengineer. Make it work first, then progress in small iterations: make it a little better every time you use it or touch it.

  • Choice of language

I prefer Python, but it’s a very personal decision. I do recommend choosing a language you can build upon later. The tooling will only become more and more complex, so pick a language that can support you on this journey.

2. Run it in parallel and remotely

SSHing into every single box works for a while, but it can certainly be improved. Being able to execute the script you built remotely, and even better to run it on multiple hosts at once, will take you to the next milestone.
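
With Ansible, for example, a parallel remote run boils down to a single command (a sketch; -f controls how many hosts are touched at once):

    ansible-playbook -i hosts -f 10 update_mysql.yml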

Here hosts is your inventory file (you can limit execution to certain hosts with the -l option) and update_mysql.yml is the playbook you’re going to run. The playbook needs to contain only one simple task.
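
A sketch of such a playbook, assuming the step-1 script was saved as /usr/local/bin/update_mysql.sh (a hypothetical path):

    # update_mysql.yml
    - hosts: all
      become: yes
      tasks:
        - name: Run the MySQL upgrade script
          # the script module copies the local file to each host and runs it there
          script: /usr/local/bin/update_mysql.sh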

Pro tips

  • Don’t reinvent the wheel

There are many existing solutions you can build your automation on: Ansible, SaltStack, and Chef, to mention just a few of the gazillion options. Find what works best for you, preferably something that matches your scripting language (noticed that I put Ansible and Salt first because they are both Python based? 🙂).

3. Unattended execution

After you have gained some trust in your automation you’re ready to run it unattended. Whether you do it with scheduled tasks, cron jobs, SaltStack, Chef, or Puppet, you’ve just taken a major step towards freeing yourself from daily operations so you can focus on innovation instead.
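
If plain cron is your starting point, unattended execution can begin with a single entry like this (a sketch; the schedule, user, and paths are assumptions):

    # /etc/cron.d/mysql-upgrade: run the playbook every night at 02:00 and keep the output
    0 2 * * * automation ansible-playbook -i /etc/ansible/hosts /opt/playbooks/update_mysql.yml >> /var/log/update_mysql.log 2>&1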

This is also the time to start thinking about how you manage connections to your servers and whether you can start disabling them programmatically. Load-balancer options like HAProxy, or a coordination service like ZooKeeper, can come in very handy in the next phase of automation.
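
With HAProxy, for instance, a server can be pulled out of and put back into traffic through its runtime admin socket (a sketch; the socket path and the mysql_pool/db042 backend/server names are assumptions, and the socket must be configured at admin level in haproxy.cfg):

    # take the server out of traffic before maintenance
    echo "disable server mysql_pool/db042" | socat stdio /var/run/haproxy.sock

    # ...do the upgrade...

    # put it back once everything looks healthy
    echo "enable server mysql_pool/db042" | socat stdio /var/run/haproxy.sock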

Pro tips

  • Murphy’s law

Whatever can fail will fail. Don’t worry about failure, but make sure you minimize the impact and can learn from it. Detailed (central) logging comes in very useful in those situations. My personal favourite is fluentd for pushing logs to a central repository, but you can also use the Elastic stack (Logstash, Elasticsearch, Kibana).
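
A minimal fluentd setup for that could tail the automation log on each host and forward it to a central aggregator (a sketch; the paths, tag, and aggregator host are assumptions):

    # td-agent.conf on the managed host
    <source>
      @type tail
      path /var/log/update_mysql.log
      pos_file /var/log/td-agent/update_mysql.pos
      tag automation.mysql_upgrade
      format none
    </source>

    <match automation.**>
      @type forward
      <server>
        host log-aggregator.example.com
        port 24224
      </server>
    </match>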

4. Automatic batch job execution

Once your tasks are stable you can group them into batches and execute them when it’s appropriate. I call those batches jobs. For example, you can have a job to upgrade every slave in a certain replication chain. The executor then works roughly like this (a rough code sketch follows the list):

  1. The executor picks up the next task in the queue
  2. Runs the task:
    1. Disable your server (see the previous point)
    2. Stop MySQL
    3. Do the upgrade
    4. Start MySQL
    5. Warm up if necessary, or do any other post-work
    6. Enable the server if everything looks good
  3. If the task
    1. succeeded, go to the next task
    2. failed, report the failure and stop execution (since the server is disabled, nothing should be impacted)
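
A minimal executor for such a job, sketched in Python (the commands, script paths, and host names are assumptions, and the disable/enable helpers are placeholders for your real load-balancer integration):

    # Rough sketch of the batch executor described above.
    import subprocess

    def ssh(host, command):
        """Run one command on the host over SSH; raise if it fails."""
        subprocess.run(["ssh", host, command], check=True)

    def disable_server(host):
        ssh(host, "echo 'disable me in the load balancer'")    # placeholder

    def enable_server(host):
        ssh(host, "echo 'enable me in the load balancer'")     # placeholder

    def run_job(queue):
        for host in queue:                                     # 1. next task in the queue
            disable_server(host)                               # 2.1 out of traffic
            try:
                ssh(host, "systemctl stop mysql")              # 2.2
                ssh(host, "/usr/local/bin/update_mysql.sh")    # 2.3 hypothetical script path
                ssh(host, "systemctl start mysql")             # 2.4
                # 2.5 warm-up or other post-work would go here
            except subprocess.CalledProcessError as exc:
                print(f"job stopped: {host} failed ({exc}); server left disabled")
                return                                         # 3.2 report and stop
            enable_server(host)                                # 2.6 back into traffic
            # 3.1 success: the loop continues with the next host

    if __name__ == "__main__":
        run_job(["db001.example.com", "db002.example.com"])    # hypothetical hosts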

Pro tips

  • Plan for maintenance

Once you’ve reached this point you probably have a large enough infrastructure to afford running under capacity, leaving room for maintenance, i.e. bringing servers down or keeping unused servers around as clone sources for MySQL, for example.

  • Trashable servers

In general you shouldn’t have many single points of failure in your system, but it becomes more and more important not to care about individual servers and to treat your machines as a pool of workers providing a certain service. As long as the pool has enough members, your service should stay intact.

5. Self-managing servers

Now that you have tasks performing certain operations, it’s time to move beyond the job queue you just built and take it a step further by plugging it into your monitoring/trending/event system. Noticing a certain condition then results in a certain action. It doesn’t matter whether you have a cron job scanning your Graphite or Nagios servers for data points, or whether you implement a check-action system using something like Monit.
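
As a sketch of the check-action idea, a small cron-driven Python script could poll Graphite for a data point and kick off a job when it crosses a threshold (the metric name, URL, threshold, and the way the job is queued are all assumptions):

    # Hypothetical check-action bridge: poll Graphite, trigger a job on a match.
    import json
    import urllib.request

    GRAPHITE_URL = ("http://graphite.example.com/render?"
                    "target=servers.db001.mysql.seconds_behind_master"
                    "&format=json&from=-5min")
    THRESHOLD = 600   # hypothetical: act when replication lag exceeds 10 minutes

    def latest_value(url):
        with urllib.request.urlopen(url) as resp:
            series = json.load(resp)
        # Graphite's JSON output: [{"target": ..., "datapoints": [[value, ts], ...]}]
        points = [v for v, _ in series[0]["datapoints"] if v is not None]
        return points[-1] if points else None

    if __name__ == "__main__":
        value = latest_value(GRAPHITE_URL)
        if value is not None and value > THRESHOLD:
            # here you would enqueue the appropriate job from step 4
            print("condition detected, queueing a remediation job")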

Word of advice

You already have almost everything set up so you don’t need my tips, but a word of advice if you’ll let me: operations like this can really take down your entire infrastructure, so make sure you have done everything to minimize the impact if that happens, to recognize it as fast as possible, and to be able to react (roll back or terminate). That last one might sound obvious, but it’s often the trickiest bit. Our solution is a single central mutex that can prevent every automated task from running. Tasks only proceed after making sure the mutex is not in place; a rough sketch is below.
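
One way to sketch that mutex, using the ZooKeeper mentioned earlier with the kazoo Python client (the znode path and ensemble address are assumptions):

    # Hypothetical "big red button": if the kill-switch znode exists, no task runs.
    from kazoo.client import KazooClient

    KILL_SWITCH = "/automation/stop_all"           # hypothetical znode path

    def automation_allowed(hosts="zk1.example.com:2181"):
        zk = KazooClient(hosts=hosts)
        zk.start()
        try:
            return zk.exists(KILL_SWITCH) is None  # proceed only if the node is absent
        finally:
            zk.stop()

    if __name__ == "__main__":
        if automation_allowed():
            print("mutex not set, safe to run the task")
        else:
            print("central mutex is in place, refusing to run")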

Don’t forget the lazy engineer is the best engineer. Happy automation!