For many years I have been developing and supporting microservice environments, both at customer sites and in our own environments. The provisioning of servers and services must complete within a few minutes so that we can react quickly to requests. This document is intended to give a brief overview of our tools and our procedures.
I provision servers with the help of Terraform1. Terraform is a CLI tool that can, among other things2, build servers on platforms such as AWS, Azure, Google Cloud, and VMware vSphere3. Here we describe the servers per group. That means all database servers (or cluster nodes, or web servers) within a group share the same hardware configuration. In addition to the description of the servers, the network, storage, and DNS are also configured here.
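As an illustration, such a group description might look like the following Terraform sketch. The resource names, counts, and vSphere attributes here are hypothetical assumptions for illustration, not our actual configuration:

```hcl
# Hypothetical sketch: one definition stamps out an identical group of database servers.
variable "db_server_count" {
  default = 3
}

resource "vsphere_virtual_machine" "db" {
  count            = var.db_server_count
  name             = "db-${count.index}"
  resource_pool_id = data.vsphere_resource_pool.pool.id

  # Every server in the group gets the same hardware configuration.
  num_cpus = 4
  memory   = 8192

  network_interface {
    network_id = data.vsphere_network.net.id
  }
}
```

Because the whole group comes from one definition, changing the hardware configuration in one place changes it for every server in the group.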
After a server is deployed, it is preconfigured according to a standard that applies to all servers. This configuration is done via Ansible4. After the preconfiguration, the installation and configuration of the actual service starts. This is also done in a standardized way via Ansible.
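A standardized base configuration of this kind might look like the following playbook sketch. The role names and group names are hypothetical, chosen only to show the structure:

```yaml
# Hypothetical base playbook applied to every newly provisioned server.
- hosts: all
  become: true
  roles:
    - common_packages   # base tooling, identical on every server
    - users             # admin accounts and SSH keys
    - monitoring_agent  # standard monitoring setup

# Service installation follows, standardized per server group.
- hosts: databases
  become: true
  roles:
    - postgresql
```

The same playbook runs against every server, so "preconfigured according to a standard" is enforced by construction rather than by discipline.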
Every server configuration, every ansible and terraform script, every script change is in GIT to track who did what and when and to see what the current valid configuration should look like.
Any automation script, whether Ansible or Terraform, can be re-applied without interfering with the already existing and running service. That means we execute, for example, every Ansible script regularly and automatically to make sure that the servers and services still look as if they were fresh out of the box. The data of the services is not touched or changed. Any manually made change to an application is removed if necessary. This is intended behavior, because every change should be included in the Ansible script.
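The key to safe re-runs is that every task describes a target state rather than an action, so applying it again changes nothing if the state already matches. A hypothetical example (the nginx service and template names are illustrative):

```yaml
# Declarative tasks: re-running them is a no-op if the state already matches.
- name: Ensure nginx is installed
  ansible.builtin.package:
    name: nginx
    state: present

- name: Ensure the managed config is in place (overwrites manual edits)
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: restart nginx
```

The second task is also what removes manual changes: whatever someone edited by hand on the server is replaced by the version from the script on the next run.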
I do not patch servers! I always rebuild them so that a clean system is always available. This works as follows: in Terraform, we give a server or a whole server group a new, updated operating system image. When Terraform runs, it checks the existing infrastructure, detects the difference in the operating system, destroys the affected server, and then rebuilds it. The data storage is not destroyed and is reattached to the server by Terraform after the rebuild. Subsequently, the steps from "Server Provisioning" are carried out. The Ansible scripts, however, recognize the already existing data and skip its initial creation. Downtime starts at around 10 minutes, depending on the provider and the complexity of the installation.
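Sketched in Terraform terms (a hypothetical AWS example; the names and sizes are illustrative): changing the image reference forces the instance to be replaced, while the separately declared data volume survives and is reattached.

```hcl
# Bumping the image variable forces Terraform to destroy and recreate the instance.
resource "aws_instance" "db" {
  ami           = var.db_image   # new image ID => rebuild on next apply
  instance_type = "m5.large"
}

# The data volume is a separate resource and is not touched by the rebuild.
resource "aws_ebs_volume" "db_data" {
  availability_zone = aws_instance.db.availability_zone
  size              = 500
}

# The attachment references the new instance, so the volume is reattached after the rebuild.
resource "aws_volume_attachment" "db_data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.db_data.id
  instance_id = aws_instance.db.id
}
```

Separating the instance from its data volume is what makes "rebuild instead of patch" cheap: only the disposable part is recreated.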
Here we distinguish between legacy, container and CloudNative systems.
Legacy systems are services that can neither run as containers nor be CloudNative. These services are updated via Ansible. Either the server is rebuilt as described under "Server Update", or you decide to update only the service, which is likewise done via Ansible.
Containers provide a very clean separation between service and server. An update of the service in the container is done by redeploying the container. The downtime here is in the range of seconds (provided the container image is not first built on the fly). The data of the service is stored on persistent external storage and is not lost. The persistent storage does not depend on the server and can be mounted on any other server prepared for containers. This approach can significantly reduce the downtime of the service during a server update.
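For example, in a hypothetical Docker Compose fragment (image and volume names are illustrative), the container is disposable while the named volume holds the data and survives every redeploy:

```yaml
# The service container can be destroyed and recreated at will;
# the named volume keeps the data across redeploys.
services:
  db:
    image: postgres:15
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data: {}
```

Redeploying with a newer image tag replaces the container in seconds; the volume is simply remounted.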
CloudNative services run in an Apache Mesos5 resource cluster. Mesos provides resources such as storage, network, CPU, RAM, and even GPUs to various services. It does not matter whether it is a container, a script (e.g. Python), or complex software such as Elasticsearch. Mesos is designed for large and complex service environments and can be distributed across multiple global data centers.
An update of a CloudNative service is done by adjusting the version number and restarting the service. This can happen automatically or manually. The restart follows the Blue/Green principle: the old service is only terminated once the new service is functional. There is no downtime.
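In a Mesos framework such as Marathon, this can be expressed declaratively. A hypothetical app definition (IDs, image, and values are illustrative), where bumping the image tag triggers the rolling Blue/Green style replacement:

```json
{
  "id": "/shop/api",
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "registry.example.com/shop-api:2.4.1" }
  },
  "upgradeStrategy": {
    "minimumHealthCapacity": 1.0,
    "maximumOverCapacity": 0.5
  }
}
```

With `minimumHealthCapacity` at 1.0, Marathon keeps the full old capacity healthy while extra new instances are started, and only then retires the old ones.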
Since we run most of our services in Mesos5, we can automatically update or redeploy entire data center environments. Thanks to the Blue/Green process, this happens with almost no downtime. We build a new environment in parallel to the existing one. Once all components (servers and Mesos services) are functional, the services from the old environment are started in the new one without being deactivated in the old environment. Once all services are running there as well, the new environment is activated in the load balancer. As soon as there is no more traffic on the old environment, it is terminated.
Our servers are always up to date, our services are always up to date, and yet a breach can always happen. Here, container technology has a great advantage over legacy systems: the attacker is inside the container and not directly on the server, after a container restart the attacker is gone, and an update of the container is done in seconds. But this depends very much on how the container (not only the service inside it) and the environment are built. Even many official container images are built insecurely and allow hijacking of the host system. I have had many bad experiences with freelancers (containers are hip, so they must be in the CV), but also with companies that misunderstood containers or never ran or maintained a real production environment with them. You cannot recognize the quality of a service provider without a deeper understanding of the topic.
Of course, our approach can prevent neither human nor technical failure. Therefore we map many of our processes in pipelines (pipelines are known from software development, but many administrative tasks can be mapped this way as well) to rule out "forgetting" scripts or "forgetting" the verification. Nevertheless, nothing is perfect. Cloud providers like AWS and Azure may not have resources available. vSphere may have API problems, or Ansible may drop functions whose deprecation message was ignored months ago. It can also happen that a newer Linux version no longer ships certain commands by default. In other words, reviewing the scripts and the procedures is indispensable. This form of automation therefore rarely reduces the amount of work; it mainly shifts it. Nevertheless, the advantage is obvious. All servers are clean, standardized, always up to date, and still highly flexible towards the customer. Support for the end customer is reduced to a minimum and downtime is minimized. There is no uncontrolled growth, and I can very quickly roll out new environments or expand existing ones.
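Such an administrative pipeline can be sketched, for example, in a CI system. The stage names, file names, and manual gate below are hypothetical, meant only to show how "forgetting" a step becomes impossible:

```yaml
# Hypothetical pipeline: every change runs through the same checks before rollout.
stages:
  - lint
  - plan
  - apply
  - verify

lint:
  stage: lint
  script:
    - terraform validate
    - ansible-lint playbooks/

plan:
  stage: plan
  script:
    - terraform plan -out=tfplan

apply:
  stage: apply
  when: manual   # a human confirms the rollout, but the steps themselves cannot be skipped
  script:
    - terraform apply tfplan
    - ansible-playbook site.yml

verify:
  stage: verify
  script:
    - ansible-playbook verify.yml --check
```

The verification stage is part of the pipeline itself, so it runs on every change rather than only when someone remembers it.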
But you have to be fair about everything. I am a young entrepreneur, and I was able to migrate my "grown" environment to the new processes very quickly a few years ago. For others this is harder, so it is important to look for an external partner with the appropriate know-how and experience, and to start a slow migration together.