How cold is it in your data center? Is it so cold that you have some precious, unique and fragile snowflakes in the corners? You know, those servers that no one dares to touch anymore. Jokes aside, those same servers are most likely running some ancient version of an operating system with missing updates, because no one knows whether the operating system can be upgraded or patched. And when the server needs to be rebooted, there are candles and black magic spells involved. This was the “normal way” back in the day, when servers were mostly bare metal and all different from each other. But there’s no room for such servers in the current DevOps era.
What’s wrong with unique servers?
The first problem is that it's hard, or nearly impossible, to reproduce the server and its configuration. The next problem appears when something breaks in production: you cannot troubleshoot properly, because the development environment differs from production. And the problem isn’t limited to server configuration; it isn’t unusual to find that only 50% of the source code running in production matches the code running in the test and dev environments. The biggest problem comes with configuration changes. Changing one thing can break another, and fixing that might break something else.
I remember a story where security updates caused a Ruby upgrade on a server that was running an application requiring an older version of Ruby. In fact, the required version was so old that it wasn’t possible to downgrade to it. This forced the dev team to start fixing the dependencies between the code and the Ruby version while running the application in a disaster recovery environment. Luckily, they had a working disaster recovery procedure in place. Which reminds me, when was the last time you tested your disaster recovery plan? You do have one, right?
Avoiding the situation
One word: automation. A good way to avoid problems with unique servers is to automate the provisioning and configuration of the environment with a fully automated and script-driven DevOps pipeline. Huh, wasn’t that a mouthful? In the same context we should also mention infrastructure as code, which enables stable, reliable, secure and consistent environments. If you are running your environments in the cloud, you can use, for example, Terraform, AWS CloudFormation or Azure Resource Manager templates to spin up your environments. If you need any guidance, our cloud specialists can help you create templates for your environment following best practices.
Many organizations have automated tools for server provisioning, but still need someone to do some manual configuration during or after provisioning. This results in quicker provisioning of servers, but each environment will still be unique, only this time with partial automation. Because manual work is needed, the servers will most likely come out of the process slightly different from each other, depending on who does the manual work. We need to try to remove the manual steps.
There are loads of good tools for configuring a server automatically. Widely used configuration management tools include Chef, Puppet, Ansible, and SaltStack, all of which have their pros and cons. I’m not going to go through them in this blog post; I’m sure you can find more information using a search engine. And in case you are interested in server templating tools, some common options include Docker and Packer.
The point of using automation tools is not just to enable easier server rebuilds, but also to make the server configuration easier to understand and modify. One thing to note is that all the configuration files need to be stored in version control. Some wise man once said that if the code is not in version control, it does not exist. This holds true for all scripts, config files and templates. They all belong in source control. Once everything is stored there, version control becomes the single source of truth, which contains the intended state of the different systems.
“It worked on my machine”
Ever heard a developer say “It worked on my machine” when code is not running as expected? The most likely reason is inconsistency between environments. To ensure consistency, all changes made to production need to be replicated to all other environments. Instead of making changes and tweaks by manually logging into servers, these changes should be made through an automated mechanism that also ensures the changes are recorded in version control. If a change is missing from version control, the configuration has drifted, and the next time someone pushes another change through version control, the earlier change might be overwritten, possibly causing a breakage. Not good.
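To make configuration drift a bit more concrete, here is a minimal Python sketch that compares the intended settings (as they would be stored in version control) with what is actually found on a server. The function name find_drift and the example settings are made up for illustration; real configuration management tools do this in far more depth.

```python
from typing import Dict, Optional, Tuple

Settings = Dict[str, str]


def find_drift(
    intended: Settings, actual: Settings
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {setting: (intended_value, actual_value)} for every difference.

    A value of None means the setting is absent on that side.
    """
    drift = {}
    for key in sorted(set(intended) | set(actual)):
        want = intended.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical values: what version control says vs. what the server has.
    intended = {"ruby_version": "2.7", "max_connections": "100"}
    actual = {"ruby_version": "2.7", "max_connections": "250", "debug": "on"}

    for key, (want, have) in find_drift(intended, actual).items():
        print(f"{key}: version control says {want!r}, server has {have!r}")
```

In this made-up example someone has raised max_connections and switched on debug by hand, so both show up as drift, while ruby_version matches and is not reported.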
Infrastructure automation in a controlled manner makes reproducible environments possible. We can even use the same automation scripts for deploying exact copies of production to all environments (dev, test, prod, you name it). And when developers are deploying their code to identically provisioned environments, “It worked on my machine” becomes irrelevant.
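As a rough illustration of using the same automation for every environment, the Python sketch below renders one shared template for dev, test and prod, and lets only a few explicitly listed parameters differ. The template contents, environment names and values are assumptions made up for this example, not taken from any particular tool.

```python
from string import Template

# One template for every environment, kept in version control.
APP_CONFIG = Template(
    "server_name=${env}-web\n"
    "instance_count=${instances}\n"
    "ruby_version=${ruby_version}\n"
)

# Settings that are identical everywhere (hypothetical value).
SHARED = {"ruby_version": "3.2"}

# The only things allowed to differ between environments (hypothetical values).
ENVIRONMENTS = {
    "dev": {"instances": 1},
    "test": {"instances": 1},
    "prod": {"instances": 4},
}


def render(env: str) -> str:
    """Render the configuration for one environment from the shared template."""
    return APP_CONFIG.substitute(env=env, **SHARED, **ENVIRONMENTS[env])


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(f"--- {env} ---")
        print(render(env), end="")
```

Because every environment comes out of the same template, any difference between dev and prod has to be written down in ENVIRONMENTS, where it is visible and reviewable.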
Picture: From "It worked on my machine" to automation.
Phoenix is reborn from the ashes
Many organizations have prevented uncontrolled configuration changes by disabling remote logins to production servers or by routinely killing and replacing production instances. This method motivates everyone to use version control in a correct way.
These routinely “burned” servers are sometimes called Phoenix servers (a name popularized by Martin Fowler); according to the legend, the phoenix dies in a big blaze of fire and is reborn from the ashes. It’s good to acknowledge that using phoenix servers is not a fix-all-problems kind of solution, but they do help with managing configuration drift. By using code for provisioning, we can make the environment immutable, so that re-provisioning a server becomes faster than repairing it. It’s also more reliable: we should aim to make infrastructure easier to rebuild than to repair.
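The routine “burning” can be sketched as a simple age check. The Python example below only decides which servers are due for a rebuild; the Server class, the server names and the seven-day limit are hypothetical, and the actual re-provisioning would be done by whatever provisioning tooling you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Server:
    name: str
    provisioned_at: datetime


def due_for_rebuild(
    servers: List[Server], max_age: timedelta, now: datetime
) -> List[str]:
    """Return the names of servers that have lived longer than max_age."""
    return [s.name for s in servers if now - s.provisioned_at > max_age]


if __name__ == "__main__":
    now = datetime(2024, 3, 15, 12, 0)
    fleet = [
        Server("web-1", datetime(2024, 3, 14, 9, 0)),  # about a day old
        Server("web-2", datetime(2024, 2, 1, 9, 0)),   # over six weeks old
    ]
    # Hypothetical policy: nothing lives longer than seven days.
    for name in due_for_rebuild(fleet, timedelta(days=7), now):
        print(f"{name} is due to be burned and re-provisioned from code")
```

Passing the current time in as a parameter keeps the check easy to test and to run as a scheduled job.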
Did you know that the average age of a Netflix AWS instance is twenty-four days, with 60% of them being less than a week old? Netflix also has a Chaos Monkey that randomly burns down servers to test resiliency. This does not mean that your company must be Netflix to benefit from DevOps practices. It just means that they started the work sooner.
Uniqueness among people needs to be cherished, but uniqueness among servers should be avoided. Scripted infrastructure and automation are the key first steps to getting rid of servers that need “special” treatment. During the winter I challenge you to get familiar with the options you have on your way from fragile and unique environments to stable and consistent ones. Yes, there is most likely a learning curve with the tools and it will take time, but I can assure you that it’s worth it. You will gain speed and simplicity, configuration consistency, increased efficiency and cost savings, while minimizing risks.
So why don’t we agree that during this spring you will start working to get rid of manual server configuration? If you need help with your journey or with picking the right tools, we at Cybercom are happy to help you.