Are there management puppies in your datacenter?

You may have heard of the pets vs. cattle discussion, a reference to the way application servers are deployed in the cloud native world. If an application server goes down, it can simply be dropped from the mix and a new server added in its place. So far, the practice has mostly been applied to application deployments.


Management software, on the other hand, is treated in a very special manner. Dedicated resources are set aside to run the management software components, and several alerting systems are deployed to watch the health of those components. Administrators spend hours each day managing the management infrastructure.

VMware’s vCenter Server and the other management components in the vCloud Suite are good examples of this deployment style. Windows Azure Pack and System Center Virtual Machine Manager (SCVMM) from Microsoft’s cloud platform are similar.

Drawbacks of the traditional deployment architecture:

  1. Initial Sizing: Management systems are sized at deployment time using an estimate of the overall host and VM count. If workloads grow significantly, a lot of manual intervention is needed to either augment the controller nodes or replace the existing controller setup with a completely new set of larger nodes. If the customer wants high availability, the resources that need to be upgraded double.

  1. High Availability: Building highly available management software has been an afterthought for most vendors. Almost all HA configuration guides have dozens of steps and complex system requirements. To make matters worse, most cloud suites have multiple components, each with its own HA configuration.

  1. Persistent State Management: Most of the configuration, inventory, and stats information is stored in traditional SQL databases. These platforms generate a lot of data, not all of which is transactional in nature. Out of the millions of stats that are generated, does it really matter if a few are dropped? Customers end up spending a lot of time monitoring, tuning, and adding capacity to these databases.

  1. Auto Scaling: As the infrastructure grows, customers have to figure out when the management software can no longer keep up and then configure new management nodes. Adding new management nodes can take several weeks, because it’s not enough to spin up another instance. The administrator has to configure everything on the new server from scratch: notifications, roles, events, stats, and alarm levels. The stricter the IT policies of a datacenter, the more time IT administrators spend setting up repeatable instances of the management platform.

    Another huge problem with these systems is that they handle environment variations poorly. A given customer may have a large number of clients (developers or operators) making API or CLI calls to the system. The actual number of workloads under management may be low, but the sheer volume of API calls can bog down the management systems. In situations like this, it would be great to be able to scale just the API layer. However, that is only possible if the API layer is exposed as a separate service that can be scaled independently of the other management services.

  1. Upgrade: Upgrading from one version to another often becomes a multi-month project involving professional services (there go some of your ELA dollars!).
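
To make the auto scaling point concrete, here is a minimal Python sketch of what sizing the API layer separately from the rest of the management plane could look like. Everything in it is an assumption for illustration: the tier names, the `TierLoad` fields, and the per-replica capacity figures are hypothetical and do not come from any vendor's product.

```python
import math
from dataclasses import dataclass


@dataclass
class TierLoad:
    api_requests_per_sec: float  # API/CLI calls from developers and operators
    managed_workloads: int       # hosts and VMs under management


# Hypothetical capacity of a single replica of each tier (illustrative only).
API_RPS_PER_REPLICA = 200.0
WORKLOADS_PER_INVENTORY_REPLICA = 5000


def replicas_needed(load: TierLoad, min_replicas: int = 2) -> dict:
    """Size each tier from its own demand signal, never below min_replicas."""
    api = math.ceil(load.api_requests_per_sec / API_RPS_PER_REPLICA)
    inventory = math.ceil(load.managed_workloads / WORKLOADS_PER_INVENTORY_REPLICA)
    return {
        "api": max(min_replicas, api),
        "inventory": max(min_replicas, inventory),
    }


# Few workloads, but many clients hammering the API.
small_but_chatty = TierLoad(api_requests_per_sec=1500, managed_workloads=800)
print(replicas_needed(small_but_chatty))  # {'api': 8, 'inventory': 2}
```

In this sketch the API tier grows with call volume while the inventory tier stays at its minimum, which is only possible when the two are separate services.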

All of these management tasks consume a big chunk of an IT organization’s time and resources, and make the task of running a private cloud appear quite daunting.

Aren’t management services the real puppies in the datacenter requiring a lot of love, care, time and resources? Why can’t the lessons from cloud native apps be applied to management software?