Almost-live OpenShift Upgrade by Reinstalling


2019-09-30, 13:04 Posted by: Antti Rahikainen

Case introduction

OpenShift is a Kubernetes-based container platform supported by Red Hat. Red Hat provides frequent security patches and roughly bi-annual version upgrades. Each new version brings the platform closer to upstream Kubernetes and adds new features. Unfortunately, applying upgrades is a bit challenging in OpenShift 3 because of the complexity of the tooling, platform, and dependencies.

Cybercom installed an OpenShift cluster for a customer a couple of years ago. The Cybercom team helped the customer upgrade the cluster for the first time two years ago, from OpenShift 3.4 to OpenShift 3.6. Now it was time to bring the cluster to the latest minor version, 3.11.

Upgrade method and planning

We started planning the upgrade together with the customer. The previous upgrades had been done as in-place upgrades with the Red Hat-provided OpenShift Ansible playbooks. This is the recommended and well-supported way to do OpenShift upgrades.

When updating OpenShift, the cluster has to be upgraded step by step: since the customer was running OpenShift 3.6 and the target version was 3.11, we could not upgrade straight to 3.11 but would have needed to go 3.6 > 3.7 > 3.9 and so forth. Each step requires its own modifications to the installation configuration due to new and deprecated features in each version.

The customer has a number of different OpenShift environments for different purposes and development stages, just under ten clusters in total. To meet the customer's requirements for development and acceptance testing, the traditional version-by-version upgrade seemed impossible to schedule around the application development sprints: if the development cluster had been upgraded first and tested against OpenShift 3.11, the production cluster would still have been on 3.6, which is against the customer's policies. Upgrading all clusters to 3.7 first and then starting a new round for the next version would have taken so much calendar time that this approach also seemed impractical. So, we needed a different approach.

Cybercom had previously done a project where one OpenShift cluster was split into two separate development and production clusters. In that approach, we install a new cluster, migrate applications and data to it, and then use the external load balancer to forward traffic to the new cluster. This seemed the way to go for this upgrade as well: install completely fresh OpenShift 3.11 clusters to replace the existing old ones, migrate applications and data, and finally just route the traffic to the new clusters.

The strict schedule was not the only reason for reinstallation. The customer has strict requirements for testing and documentation, and each release needs to be thoroughly tested. By reinstalling, we were able to create a clean, well-defined environment and test it before the takeover. This approach also limited the technical risks involved in multiple successive upgrades. From past experience, upgrading OpenShift clusters can be more or less frustrating: new features are added, old features are removed, and the originally generated Ansible inventory files are not necessarily compatible with the new version.

There were a couple of open questions with this approach, mostly regarding the DNS names of the cluster. The plan was to install the new cluster using the same external-facing DNS names as the old one, so that external integrations and CI/CD pipelines would not require any changes.

Upgrade

OpenShift installations have two different attributes for the cluster hostname: internal and external. Internal communication between the cluster nodes and masters uses the internal hostname, and external communication uses the public hostname.

For the installation we created a new internal load balancer with the DNS name openshift-int-new.example.com, used that for the installation, and kept the existing external hostname the same as in the old cluster.
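In the openshift-ansible inventory this maps roughly to the following host variables; openshift-int-new.example.com is the internal name mentioned above, while the public hostname here is a placeholder:

[OSEv3:vars]
# Internal hostname used by nodes and masters, pointing to the new internal load balancer
openshift_master_cluster_hostname=openshift-int-new.example.com
# Public hostname kept identical to the old cluster so integrations and pipelines stay unchanged
openshift_master_cluster_public_hostname=openshift.example.com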

During the installation the new cluster was able to communicate internally using the new internal load balancer. When external access was required for testing purposes, the developers and administrators could use the operating system's hosts file to point the external load balancer hostname at the new cluster's internal load balancer.
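In practice a single hosts file entry was enough to test through the familiar public name (the IP address and public hostname below are placeholders):

# /etc/hosts on a developer workstation during pre-cut-over testing
10.0.20.15   openshift.example.com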

Illustration of the network is shown below.

Internal and External Load Balancing in OpenShift

After the network was configured, OpenShift 3.11 was installed using the installation playbooks provided by Red Hat. There are plenty of blog posts and examples on how an OpenShift installation is done, so nothing "special" here. OpenShift installation is quite a straightforward process with the Red Hat-provided Ansible playbooks; all we had to do was define the new OpenShift nodes in the Ansible inventory file for each cluster and run the installation playbooks.
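With the openshift-ansible package installed on the bastion host, the installation run boils down roughly to the following two commands (the inventory path is a placeholder for our per-cluster inventory files):

# Verify prerequisites on all hosts defined in the inventory
ansible-playbook -i inventory/new-cluster.ini \
  /usr/share/ansible/openshift-ansible/playbooks/prerequisites.yml

# Deploy the OpenShift 3.11 cluster
ansible-playbook -i inventory/new-cluster.ini \
  /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml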

Migration

Now the new cluster was installed and the OpenShift infrastructure components were in place. After some testing, it was time to move to the actual migration phase.

Application deployments are defined in DeploymentConfig resources in OpenShift. These DeploymentConfig resources can be defined using YAML (or JSON) files:

kind: "DeploymentConfig"
apiVersion: "v1"
metadata:
  name: "frontend"
spec:
  template:
    metadata:
      labels:
        name: "frontend"
    spec:
      containers:
        - name: "helloworld"
          image: "openshift/origin-ruby-sample"
          ports:
          - containerPort: 8080
            protocol: "TCP"
  replicas: 5
  triggers:
    - type: "ConfigChange"
    - type: "ImageChange"
      imageChangeParams:
        automatic: true
        containerNames:
          - "helloworld"
        from:
          kind: "ImageStreamTag"
          name: "origin-ruby-sample:latest"
  strategy:
    type: "Rolling"
  paused: false
  revisionHistoryLimit: 2
  minReadySeconds: 0

Applications usually also have route, service, persistent volume, and other resources that are described in a similar way in YAML, like the DeploymentConfig example above.

These resources can be exported in YAML format from the existing installation and imported into the new installation. We wrote a small script that exported all the application-specific resources from the old cluster into YAML files, and used these files to import the application definitions into the new cluster.
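A minimal sketch of that kind of export/import, assuming a plain text list of project names and a handful of resource types (the actual script covered more resource types and did some cleanup):

# Export application resources from the old cluster, one file per project
for project in $(cat projects.txt); do
  oc --context=old-cluster get dc,svc,route,pvc,configmap,secret \
    -n "$project" -o yaml --export > "export/$project.yaml"
done

# Import the definitions into the new cluster once the projects exist there
for project in $(cat projects.txt); do
  oc --context=new-cluster apply -n "$project" -f "export/$project.yaml"
done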

The clusters use basic NFS for storage, so the volume data migration was quite a simple task. During the new installation we created new NFS shares for the new cluster, and then with a simple rsync script we synced the data from the old NFS server to the new one.
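The sync was essentially a single rsync run executed on the new NFS server, along these lines (hostname and path are placeholders):

# Pull all volume data from the old NFS server, preserving permissions, ownership and timestamps
rsync -aHv old-nfs:/var/nfs/ /var/nfs/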

There was a little catch, though. OpenShift uses projects (namespaces in Kubernetes) to isolate resources, and projects are isolated within the cluster. Each project is assigned its own unique UID range, and when a project gets a persistent volume from the NFS server, the NFS export is chowned (with the Linux chown command) to the project-specific UID. When we created the projects on the new cluster, each project got a new UID that differed from the one in the old cluster. So, we needed to scrape these new UIDs and chown the project-specific NFS shares on the new NFS server to match the new UIDs.

Overview of NFS ownership in migration

A shell sketch of the ownership fix (the project list file, NFS host, and export path are placeholders along the same lines as above):

# For each project, read the UID assigned to it on the new cluster and
# chown the matching NFS exports on the new NFS server to that UID
for project in $(cat projects.txt); do
  uid=$(oc get ns "$project" \
    -o go-template='{{index .metadata.annotations "openshift.io/sa.scc.uid-range"}}' | cut -d/ -f1)
  for pv in $(oc get pvc -n "$project" -o jsonpath='{.items[*].spec.volumeName}'); do
    ssh new-nfs "chown -R nfsnobody:$uid /var/nfs/$pv"
  done
done

Now the data had been initially migrated with correct ownership and the DeploymentConfigs were in place on the new cluster, so it was time to move the traffic over.

At the scheduled time we started the cut-over by scaling all the application pods down on the old cluster. When that was done and it was confirmed that there were no connections to external databases, we ran the rsync one last time to pick up the latest changes. When the rsync had finished, we scaled the application pods up on the new cluster. After we had confirmed that the pods started without problems and the applications responded correctly inside the cluster, we changed the backends on the external load balancers to point to the new cluster. This was done on both the application load balancer and the master load balancer.
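The scaling itself was plain oc scale against each DeploymentConfig; a sketch of the idea (project name and replica count are placeholders):

# Old cluster: scale every application DeploymentConfig in a project down to zero
for dc in $(oc --context=old-cluster get dc -n myproject -o name); do
  oc --context=old-cluster scale "$dc" -n myproject --replicas=0
done

# New cluster, after the final rsync: scale the applications back up
oc --context=new-cluster scale dc/frontend -n myproject --replicas=5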

The cut-over caused downtime, as the applications had to be scaled down for the final data migration, but in total the cut-over was done within an hour. And since the applications had a monthly service break, the migration fit within that service break.

Conclusion

The migration project went very well overall and the customer was happy with the result. This upgrade method required far less planning and fewer checks than doing the upgrade one version at a time. The cluster had also been running for a couple of years already and was on an old RHEL (Red Hat Enterprise Linux) version; with this approach we got a fresh, pristine installation of RHEL and OpenShift with no legacy configurations.

