High Availability Concepts for OpenStack Cloud
A production OpenStack deployment must meet the following requirements in order to serve business needs with good performance and efficiency:
1. High availability of the service
2. Scalability
3. Automated business and management operations
Several deployment approaches can deliver these qualities. This article examines one of them.
Any High Availability (HA) approach has to address the individual OpenStack services, which fall into several groups: the API services, the compute services, the scheduler and queue server, and the database.
The HTTP/REST services provided by the API servers (nova-api, glance-api, glance-registry and keystone) can be made highly available by distributing load across several instances. A load balancer that supports health checking of the API servers is suited to this: it spreads requests evenly and sends them only to servers that are healthy.
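
The idea of health-checked load distribution can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular load balancer is implemented; the class name, the addresses and the `is_healthy` callback are assumptions made for the example.

```python
from typing import Callable, List


class HealthCheckedBalancer:
    """Round-robin balancer that skips backends failing a health check."""

    def __init__(self, backends: List[str], is_healthy: Callable[[str], bool]):
        self.backends = list(backends)
        self.is_healthy = is_healthy
        self._next = 0  # index of the backend to try first on the next pick

    def pick(self) -> str:
        """Return the next healthy backend, or raise if none is healthy."""
        for offset in range(len(self.backends)):
            index = (self._next + offset) % len(self.backends)
            backend = self.backends[index]
            if self.is_healthy(backend):
                # Resume after the chosen backend so load stays evenly spread.
                self._next = (index + 1) % len(self.backends)
                return backend
        raise RuntimeError("no healthy backend available")


# Hypothetical API server addresses; the second one is marked as down.
down = {"10.0.0.12:9292"}
balancer = HealthCheckedBalancer(
    ["10.0.0.11:9292", "10.0.0.12:9292", "10.0.0.13:9292"],
    is_healthy=lambda backend: backend not in down,
)
picks = [balancer.pick() for _ in range(4)]
print(picks)
```

While the second server is down, requests alternate between the two remaining servers. Once it passes its health check again, it rejoins the rotation without any further change.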
The Compute services are responsible for provisioning and managing the VMs and the resources they need. These are nova-compute, nova-network and nova-volume, the fundamental management services that keep OpenStack running. At the most basic level, these services must be up and running all the time. Since all three have to work together, the failure of any one of them can bring the system down, even if only temporarily. The main approach to HA here is therefore to avoid a single point of failure in the coordinated working of these services. This can be achieved by using an external monitoring system to watch the services and to handle common failure situations through recovery scenarios implemented as failure event handlers. At a minimum, these event handlers can notify the administrator to restart the system, or restart it automatically.
For nova-network, which provides the networking service, the task of routing project traffic can be separated out and handed over to an external hardware router. Such a multi-host setup removes a single point of failure, since nova-network then deals only with DHCP functions.
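
The monitoring approach described above can be sketched as follows. This is a simplified, in-memory illustration: the `ServiceMonitor` class, the `probe` callback and the two handlers are hypothetical names, and a real deployment would use an existing monitoring system rather than this code.

```python
from typing import Callable, Dict, List

Handler = Callable[[str], None]


class ServiceMonitor:
    """Probes services and runs the registered handlers for each failure."""

    def __init__(self, probe: Callable[[str], bool]):
        self.probe = probe  # returns True when the service is up
        self.handlers: Dict[str, List[Handler]] = {}

    def on_failure(self, service: str, handler: Handler) -> None:
        """Register a failure event handler for one service."""
        self.handlers.setdefault(service, []).append(handler)

    def check_once(self) -> List[str]:
        """Probe every registered service; return the ones found down."""
        failed = []
        for service, handlers in self.handlers.items():
            if not self.probe(service):
                failed.append(service)
                for handler in handlers:
                    handler(service)
        return failed


# Simulated service state (an assumption for the example): nova-network is down.
running = {"nova-compute": True, "nova-network": False, "nova-volume": True}
log: List[str] = []


def notify_admin(service: str) -> None:
    log.append(f"notify admin: {service} is down")


def restart(service: str) -> None:
    running[service] = True  # stand-in for a real restart command
    log.append(f"restarted {service}")


monitor = ServiceMonitor(probe=lambda service: running[service])
for name in running:
    monitor.on_failure(name, notify_admin)
    monitor.on_failure(name, restart)

first = monitor.check_once()   # detects and recovers nova-network
second = monitor.check_once()  # nothing left to recover
print(first, second, log)
```

Registering the notification handler before the restart handler means the administrator is informed even if the automatic restart itself fails.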
The Scheduler and Queue Server
The RabbitMQ message broker is the queue server that facilitates communication between the nova services. While queue mirroring is built into RabbitMQ, a load balancer can be used to distribute connections between RabbitMQ servers that are set up as a cluster. The nova-scheduler service consumes messages from the scheduler queue on the RabbitMQ server. New scheduler instances are started as needed, and all of them consume from the same queue, which provides redundancy and HA.
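
The following sketch shows why several scheduler instances reading from one queue give redundancy: each request is handled by exactly one worker, and the remaining workers keep draining the queue if one stops. It uses Python's in-process `queue.Queue` and threads purely as a stand-in for RabbitMQ and nova-scheduler; the worker names and message strings are made up for the example.

```python
import queue
import threading

scheduler_queue: queue.Queue = queue.Queue()
handled = []  # (worker name, request) pairs, in completion order
handled_lock = threading.Lock()


def scheduler_worker(name: str) -> None:
    """Consume requests from the shared queue until a None sentinel arrives."""
    while True:
        request = scheduler_queue.get()
        try:
            if request is None:
                return
            with handled_lock:
                handled.append((name, request))
        finally:
            scheduler_queue.task_done()


workers = [
    threading.Thread(target=scheduler_worker, args=(f"scheduler-{i}",))
    for i in range(3)
]
for worker in workers:
    worker.start()

for i in range(10):
    scheduler_queue.put(f"run_instance-{i}")
scheduler_queue.join()  # wait until every request has been handled

for _ in workers:
    scheduler_queue.put(None)  # one stop sentinel per worker
for worker in workers:
    worker.join()

print(len(handled), "requests handled")
```

Which worker handles which request varies from run to run; what stays constant is that every request is processed once.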
OpenStack Database High Availability
The Multi-Master Replication Manager for MySQL (MySQL-MMM) is a commonly used database solution with OpenStack. It provides high availability and scalability for the OpenStack database service. Alongside it, Galera Cluster, which is based on the wsrep API, provides clustering and scalability for the entire OpenStack database service. The synchronous multi-master topology of Galera Cluster for MySQL is well known for providing high availability, and it fits well with OpenStack.
Scalability through Node Roles
Horizontal scalability is typically achieved in a cloud by adding instances and including them in the load balancer configuration. For OpenStack this process is simple and flexible. One issue with this kind of scaling, however, is the role of the node. In an OpenStack cloud every node has a set of services associated with it, so the services assigned to a node, known as its role, must be defined when the node is instantiated. A development OpenStack deployment needs only 2 types of nodes: compute nodes, which perform computational services and host VMs, and controller nodes, which run all other management services. A production deployment, however, also needs the API service to run on compute nodes. The following node roles are to be defined in a production deployment:
Endpoint node: responsible for load balancing and other high availability services, so it contains the required load balancing and clustering software or firmware. Instead of a single node, a dedicated load balancer network can also perform this task. Any cluster needs at least 2 such nodes for redundancy and failover.
Controller node: responsible for managing control and communication between the various services of the cloud, such as the queue server, database and dashboard. It can also host the nova-scheduler service and the API servers. To achieve redundancy, the API servers must be load balanced by an endpoint node, and at least 2 controller nodes are needed.
Compute node: responsible for hosting the hypervisor and the VM instances, and for providing the resources those instances need. With a multi-host network scheme, it can also act as network controller for the instances it hosts.
Volume node: associated with the nova-volume service, which handles storage volume management.
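
One way to make these roles explicit is a simple mapping from role to services, plus a check for the minimum node counts stated above. The sketch below is illustrative: the exact grouping of services per role, the generic names such as "load-balancer" and "hypervisor", and the host names are assumptions, not a prescribed layout.

```python
from collections import Counter
from typing import Dict, List

# Services per role, read from the role descriptions above (illustrative).
ROLE_SERVICES: Dict[str, List[str]] = {
    "endpoint": ["load-balancer"],
    "controller": ["rabbitmq", "mysql", "horizon", "nova-scheduler", "nova-api"],
    "compute": ["hypervisor", "nova-compute", "nova-network"],
    "volume": ["nova-volume"],
}

# Minimum node counts needed for redundancy, as stated for these two roles.
MIN_NODES: Dict[str, int] = {"endpoint": 2, "controller": 2}


def redundancy_gaps(inventory: Dict[str, str]) -> Dict[str, int]:
    """Return, per role, how many nodes are missing to reach the minimum."""
    counts = Counter(inventory.values())
    return {
        role: minimum - counts[role]
        for role, minimum in MIN_NODES.items()
        if counts[role] < minimum
    }


# Hypothetical inventory mapping host name to role.
inventory = {
    "lb1": "endpoint",
    "lb2": "endpoint",
    "ctl1": "controller",
    "cmp1": "compute",
    "cmp2": "compute",
    "vol1": "volume",
}
gaps = redundancy_gaps(inventory)
print(gaps)  # one more controller node is needed
```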
To achieve high availability, configuration management is needed in the following scenarios.
1. When a controller node is added to the cluster as part of scaling out, configuration changes are needed in several places: node deployment, service start and, finally, a load balancer configuration update to include the new node. Compute nodes need far less configuration than controller nodes, but changes may still be required at every level from bare metal up to services. Fine-grained automation of this configuration change management is needed for HA.
2. If the controller node and endpoint node are combined on a single host, configuration changes are needed for the nova services and the load balancer. This also has to be automated to ensure HA.
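
The ordering in the first scenario matters: a new node should only receive traffic once it has been deployed and its services are running. A minimal sketch of that sequencing is shown below; the `scale_out` function, the step names and the lambda actions are hypothetical placeholders for real provisioning tooling.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], bool]]  # (step name, action returning success)


def scale_out(node: str, steps: List[Step], pool: List[str]) -> List[str]:
    """Run the steps in order and add the node to the load balancer pool
    only if every step succeeded. Returns the names of completed steps."""
    completed = []
    for name, action in steps:
        if not action(node):
            return completed  # stop early; the node never enters the pool
        completed.append(name)
    pool.append(node)
    completed.append("update load balancer")
    return completed


pool = ["controller-1", "controller-2"]

ok_steps: List[Step] = [
    ("deploy node", lambda node: True),
    ("start services", lambda node: True),
]
done = scale_out("controller-3", ok_steps, pool)

failing_steps: List[Step] = [
    ("deploy node", lambda node: True),
    ("start services", lambda node: False),  # simulated failure
]
partial = scale_out("controller-4", failing_steps, pool)

print(done, partial, pool)
```

Because the pool is updated last, a node whose services failed to start is never exposed to clients.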
Configuration Management and Automation
Since configuration changes are needed at the level of nodes, load balancers, replication and so on, automated configuration management and scripts that implement scaling are necessary for any OpenStack cloud platform. Two well-known projects for accomplishing this automation are DevStack and Crowbar. Automation tools such as Puppet and Chef also contribute significantly to deploying automated, scalable and highly available OpenStack cloud platforms. The scaling process can also be automated with orchestration engines. These engines apply configuration templates every time a new node or instance is allocated, which automates scaling.
Topologies for High Availability
There are different topologies that utilize different approaches to achieve high availability in OpenStack clouds. Some of these topologies are:
1. With a hardware load balancer: here a physical load balancer appliance is used. It provides connection endpoints for the OpenStack services deployed on different nodes. Compute nodes host the API servers and nova-scheduler instances. Controller nodes host glance-registry instances and the Horizon dashboard.
2. With a dedicated endpoint node: here, instead of a hardware load balancer, an endpoint node is used to distribute traffic between the different OpenStack services. The API services are deployed on the controller nodes instead of the compute nodes.
3. With simple controller redundancy: here the endpoint nodes and controller nodes are combined. Controller nodes host the API services and nova-scheduler instances. These nodes can be scaled easily by instantiating new nodes and reconfiguring HAProxy, which provides high availability, load balancing and proxying for TCP and HTTP communications.
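
Reconfiguring HAProxy when a controller is added amounts to regenerating the list of `server` lines for each service. The sketch below renders one `listen` section from a list of controllers. The function name, virtual IP, port and host addresses are assumptions for illustration, and a real configuration would carry more settings (timeouts, global and defaults sections).

```python
from typing import List, Tuple


def render_haproxy_listen(
    name: str, vip: str, port: int, controllers: List[Tuple[str, str]]
) -> str:
    """Render an HAProxy 'listen' section that balances one API service
    across the given (host name, address) controller nodes."""
    lines = [
        f"listen {name}",
        f"    bind {vip}:{port}",
        "    balance roundrobin",
        "    option httpchk",  # HTTP health check on each server
    ]
    for host, address in controllers:
        lines.append(f"    server {host} {address}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical controller nodes and virtual IP.
controllers = [("controller-1", "10.0.0.11"), ("controller-2", "10.0.0.12")]
before = render_haproxy_listen("nova-api", "10.0.0.100", 8774, controllers)

# Scaling out: add a node and render the section again.
controllers.append(("controller-3", "10.0.0.13"))
after = render_haproxy_listen("nova-api", "10.0.0.100", 8774, controllers)
print(after)
```

Generating the section from the node list, rather than editing it by hand, keeps the load balancer configuration consistent with the actual set of controllers.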
The OpenStack instances use static IPs for internal communication on the private network, which is isolated from the outside environment by the firewall implemented by the nova-network component. The outside world communicates with the public network part of the deployment, which uses dynamic or floating IPs and provides access to the cloud instances. Separate networks can be used for management, storage (for nova-volume) and so on. Clients connect to the OpenStack service APIs through the virtual IPs of the endpoint nodes.
High availability can be achieved in many ways, and load balancing and monitoring are the keys to it. A central control node is not mandatory for an HA deployment, and a workable baseline can be reached by experimenting with configurations. What matters most is how evenly traffic and load are distributed across the multiple instances of a service, and how far replication can be provided for stateful resources such as MySQL and RabbitMQ.