Cybercom Cloud is an IaaS platform based on OpenStack technology, from which both we and our customers consume capacity. This blog post describes how we upgraded our OpenStack installation and walks through the procedure we used. The OpenStack upgrade process is notoriously complicated, and to date there is no single, well-known way to upgrade without reinstalling.
We have run OpenStack since 2013. Our first installation ran the Folsom release, and since then the setup has been reinstalled twice: first at the end of 2013, when it was replaced by a new Grizzly cloud with Neutron networking, and again in 2014, when we moved to the Icehouse release by building a new cloud. In 2015 we decided to make another upgrade.
OpenStack can be divided into a Control Plane and a Data Plane (the terminology comes from networking). The Control Plane manages resources that live on the Data Plane. See the diagram below, where LDAP, KVM, etc. are the resources that the Control Plane controls or uses.
This time our goal from the start was not to reinstall the Cloud, but to do a non-disruptive in-place upgrade. We had learned in practice that the Control Plane / Data Plane division holds true, and this time we were confident we could make the upgrade in a non-disruptive way. The upgrade mostly touches the Control Plane, and resources running in the Data Plane are unaffected.
Broadly speaking, in an OpenStack upgrade you replace the Control Plane components with newer versions. To make the upgrade testable, we built a new Control Plane beside the old one. This is an adaptation of an approach that Symantec presented at OpenStack Summits a year or two ago.
The high-level upgrade steps we used were:
- Create a new Control Plane.
- Clone and migrate the database to the new Control Plane.
- Clone Cinder storage back-ends from the old to the new Control Plane.
- Hijack a compute node with test load from the old Control Plane.
- Test the new setup.
- When the new Control Plane is validated, roll out the upgrade during a service break.
- Reinstall the old Control Plane nodes with the new release and join them to the new Control Plane.
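Hijacking a compute node essentially means repointing the node's nova.conf at the new Control Plane's message queue and database, then restarting nova-compute. A minimal sketch of that rewrite using Python's configparser; the hostnames, credentials and exact option names here are illustrative (roughly Icehouse/Kilo-era nova options), not our actual configuration:

```python
# Sketch: repoint a compute node's nova.conf at the new Control Plane.
# All hostnames and credentials below are made up for illustration.
import configparser
import io

OLD_CONF = """\
[DEFAULT]
rabbit_hosts = old-ctl1:5672,old-ctl2:5672
[database]
connection = mysql://nova:secret@old-ctl-vip/nova
"""

def repoint(conf_text, rabbit_hosts, db_connection):
    """Return conf_text with the message queue and DB URLs replaced."""
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    cp["DEFAULT"]["rabbit_hosts"] = rabbit_hosts
    cp["database"]["connection"] = db_connection
    out = io.StringIO()
    cp.write(out)
    return out.getvalue()

new_conf = repoint(
    OLD_CONF,
    "new-ctl1:5672,new-ctl2:5672,new-ctl3:5672",
    "mysql://nova:secret@new-ctl-vip/nova",
)
```

After a rewrite like this, the node's instances keep running (they live on the Data Plane), while its nova-compute service reports to the new Control Plane.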
When we rolled out the upgrade, we did the database clone and migration operations once more, switched the public API endpoints to point to the new Control Plane, and moved the rest of the compute nodes to the new side.
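Conceptually, the endpoint switch changes each service's public URL in Keystone from the old Control Plane's VIP to the new one. The toy sketch below models just that URL rewrite; the service names and VIP hostnames are invented, and in reality the change is made through the Keystone API or its database:

```python
# Toy model of a Keystone endpoint catalog: service name -> public URL.
# VIP hostnames are invented; only the host part changes in the cutover.
from urllib.parse import urlsplit, urlunsplit

catalog = {
    "nova": "https://old-vip.example.com:8774/v2/%(tenant_id)s",
    "cinder": "https://old-vip.example.com:8776/v1/%(tenant_id)s",
    "glance": "https://old-vip.example.com:9292",
}

def switch_vip(url, new_host):
    """Replace the host in url, keeping scheme, port and path intact."""
    parts = urlsplit(url)
    netloc = new_host + (":%d" % parts.port if parts.port else "")
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))

catalog = {svc: switch_vip(url, "new-vip.example.com")
           for svc, url in catalog.items()}
```

Because clients resolve endpoints through the Keystone catalog, this one change redirects all API traffic to the new Control Plane at once.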
The method is illustrated in the diagram below. In the diagram, the upgrade is in the testing phase and we are running both the new and the old versions simultaneously.
We built the new Control Plane using one of the old Controller nodes and by starting the rest of the required Controller nodes as instances in the very same Cloud we were upgrading. This was reasonable, as the Data Plane is unaffected by the upgrade.
Actually, we ended up with this "build a new Control Plane in the same Cloud" approach because we needed a sandbox environment to test a new architecture. In addition to the OpenStack upgrade we made two other major changes: we transitioned to a fully highly available Control Plane, and we switched from a custom Puppet module to the community-supported Puppet OpenStack modules. We had created a Vagrantfile that spun up a personal OpenStack environment on top of our current Cloud. It worked great, and we adapted the same idea to production using Terraform.
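As a sketch of what "Controller nodes as instances managed by Terraform" can look like, here is a hypothetical resource definition for the Terraform OpenStack provider; the name, image, flavor and network are all made up:

```hcl
# Hypothetical Controller nodes booted on the existing cloud.
resource "openstack_compute_instance_v2" "controller" {
  count       = 3
  name        = "new-ctl${count.index + 1}"
  image_name  = "ubuntu-14.04"
  flavor_name = "m1.large"
  key_pair    = "ops"

  network {
    name = "control-plane-net"
  }
}
```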
Of course there were some issues during the upgrade; in particular, we broke our RabbitMQ cluster when the nodes in the Cloud lost networking. This can be avoided by not upgrading the compute nodes hosting the RabbitMQ cluster members simultaneously.
However, the Terraform-managed Control Plane was an inspiring idea and a great exercise for us. We rolled out the upgrade to our Finnish and Swedish regions using this procedure, and we are going to apply the same approach to the Liberty upgrade too. There are projects that aim to help with upgrades, either by providing tools (e.g. Kolla, TripleO) or by trying to solve the whole problem (e.g. Fuel), and we should try to leverage them in the future.