Self-driving Clouds: From Vision to Reality

Imagine you are in New York for customer meetings and catch up with your brother while you are in town. He bought a Tesla last year and offers to drop you off at the airport so the two of you have more time to chat. During the drive he puts the car on Autopilot, and it moves with the traffic without any human intervention. It is a surreal experience: a 4,500-pound object traveling at 65 mph, tracking the objects and movements around it and reacting to them in real time.

Even at home, we are enjoying a new set of intelligent devices like Google Home, Alexa, Siri and Cortana. Some of these personal assistants mostly make shopping easier (as if we needed it to be any easier; impulse buys are already on the rise), but these devices are getting smarter every day and taking on a more meaningful role in our lives.

Behind all of these experiences are big datacenters and cloud environments running the applications that handle these simple tasks: helping you park the car, telling you when to leave for the airport and reminding you of your meetings. In fact, U.S. datacenters consumed about 70 billion kWh of electricity in 2014, and the trend is on the rise:

http://www.datacenterknowledge.com/archives/2016/06/27/heres-how-much-energy-all-us-data-centers-consume/

The question is, “Have we made the clouds to be self-driving and self-optimizing?”

We use Alexa to play Adele, but we still use Excel worksheets to make infrastructure buying decisions worth hundreds of thousands of dollars. We still carry pagers to deal with failures, and we still deploy physical servers by hand when the need arises.

So what would it take to build self-driving clouds?

As with any other technology in this space, several systems need to work well together: they must monitor themselves, heal, learn and build models for self-optimization. Here is a list of capabilities that need to be present for self-driving clouds:

1. No Day 0 — Automatic install and configuration:

The first step is an install process that does not require much human intervention. The building blocks of a cloud are servers, storage and networking. With hyper-converged systems, servers and storage are combined, and software-defined networking minimizes the reliance on physical network changes. So the first requirement is a server-plus-storage building block with all the software pre-installed and baked into the operating system image. You just image a few servers and power them on. Once that is done, the cloud should come up automatically, without admins needing to know anything about the various services and their persistent stores. The imaged software should pool servers, storage and networking resources into a highly resilient cloud. This is like a car leaving the dealership: you don't install any new software on it, because everything is pre-installed and built in.
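To make this concrete, here is a minimal, hypothetical sketch of what such a first-boot bootstrap might look like: each pre-imaged node discovers its peers on the local network and, once enough nodes are present, forms the cluster without an admin touching anything. The port number, message format and function names are illustrative assumptions, not actual ZeroStack internals.

```python
# Hypothetical first-boot bootstrap for a pre-imaged node. Port, message format
# and function names are illustrative assumptions, not ZeroStack internals.
import socket
import time

DISCOVERY_PORT = 8301      # assumed peer-discovery port baked into the image
CLUSTER_MIN_SIZE = 3       # wait for a quorum before forming the control plane

def discover_peers(timeout=30):
    """Broadcast a hello on the local subnet and collect replies from other nodes."""
    peers = set()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(2)
    deadline = time.time() + timeout
    while time.time() < deadline:
        sock.sendto(b"cloud-node-hello", ("255.255.255.255", DISCOVERY_PORT))
        try:
            _, (addr, _port) = sock.recvfrom(1024)
            peers.add(addr)
        except socket.timeout:
            pass
    return peers

def bootstrap():
    peers = discover_peers()
    if len(peers) + 1 >= CLUSTER_MIN_SIZE:
        # Enough nodes: pool compute, storage and network resources and start
        # the control-plane services without any admin input.
        print(f"forming cluster with peers: {sorted(peers)}")
    else:
        print("waiting for more nodes to be imaged and powered on")

if __name__ == "__main__":
    bootstrap()
```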

2. Integration with other clouds and internal systems:

A cloud is not supposed to work in isolation, so one should be able to quickly connect it with existing virtualized infrastructure and other public clouds. Even better is to add your existing storage systems and make them part of this cloud. This is an optional step, but it's critical if you want to leverage your existing investments in storage and servers. Similarly, most customers want to integrate with AD/LDAP to have a single source of users and authentication. This is similar to integrating your phone, garage door and music subscriptions with your car; many new cars support Apple CarPlay and Android Auto, which show your phone's apps on the car's dashboard.
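As a concrete illustration of the AD/LDAP piece, the sketch below checks a user's credentials against an existing Active Directory using the Python ldap3 library, so the cloud can share the organization's single source of users. The server address, domain and base DN are placeholders, and this is not ZeroStack's actual integration code.

```python
# Illustrative AD/LDAP check using the ldap3 library. The server address, domain
# and base DN are placeholders; this is not ZeroStack's actual integration code.
from ldap3 import Server, Connection, ALL

AD_SERVER = "ldaps://ad.example.com"   # assumed corporate directory
BASE_DN = "dc=example,dc=com"

def authenticate(username: str, password: str) -> bool:
    """Bind as the user; a successful bind means the credentials are valid."""
    server = Server(AD_SERVER, get_info=ALL)
    try:
        conn = Connection(server, user=f"EXAMPLE\\{username}",
                          password=password, auto_bind=True)
    except Exception:
        return False
    # Pull group membership so AD groups can map to cloud projects and roles.
    conn.search(BASE_DN, f"(sAMAccountName={username})", attributes=["memberOf"])
    groups = conn.entries[0].memberOf if conn.entries else []
    print(f"{username} belongs to {len(groups)} AD groups")
    conn.unbind()
    return True
```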

3. Deploy applications in a self-service manner:
The ultimate goal for any cloud is to provide an IaaS and PaaS platform that various teams can consume in a self-service manner. For example, developers can use it for application development and CI/CD; support teams can bring up replicas of customer environments to troubleshoot issues; sales can spin up quick PoCs for trials; and IT can bring up staging or production deployments of various applications. These steps need to be fully automated so that anyone can repeat them without spending much time. Any cloud solution should provide a self-service interface with application templates for quick deployment. This is similar to a car being available for anyone to drive: it does not require a specific driver all the time.
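A self-service deployment then boils down to picking a template and submitting it; the hypothetical API call below shows the shape of such a request. The endpoint, payload fields and token handling are invented for illustration rather than taken from any documented ZeroStack API.

```python
# Hypothetical self-service deployment request. The endpoint and payload shape
# are invented for illustration; they are not a documented ZeroStack API.
import requests

TEMPLATE = {
    "name": "jenkins-ci",        # an application template from the app store
    "project": "dev-team-a",     # the self-service project/tenant
    "parameters": {"vcpus": 4, "memory_gb": 8, "disk_gb": 100},
}

def deploy(api_url: str, token: str, template: dict) -> str:
    """Submit a template-based deployment and return the new application's ID."""
    resp = requests.post(
        f"{api_url}/v1/applications",
        json=template,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["application_id"]

# e.g. deploy("https://cloud.example.com/api", "<api-token>", TEMPLATE)
```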

4. Real-time monitoring for events, stats, logging, auditing:
Since a cloud is a shared environment, one needs to be able to monitor the various events, stats and dashboards in real time. This is similar to the dashboard in a car that tells you if a door is open, tire pressure is low or the engine is overheating. Real-time monitoring is also needed to know the state of applications and what actions other users have performed. IT should be able to get logs and audit the actions of all users. For example, if a service has been down since 10pm last night, it is good to know whether a user or a script shut down the VM providing that service by mistake.
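The 10pm example above is, at its core, a query over the audit trail. The small sketch below shows the idea with an assumed event schema: filter the audit events for the stop action on the affected VM and see which user or script issued it.

```python
# Audit-trail query sketch with an assumed event schema: who stopped this VM?
from datetime import datetime

audit_events = [
    {"time": "2017-03-01T21:58:00", "user": "cleanup-script",
     "action": "vm.stop", "target": "billing-db-01"},
    {"time": "2017-03-01T22:10:00", "user": "alice",
     "action": "vm.resize", "target": "web-03"},
]

def who_stopped(vm_name: str, since: str):
    """Return the audit events that stopped the given VM after a point in time."""
    cutoff = datetime.fromisoformat(since)
    return [e for e in audit_events
            if e["action"] == "vm.stop"
            and e["target"] == vm_name
            and datetime.fromisoformat(e["time"]) >= cutoff]

print(who_stopped("billing-db-01", "2017-03-01T20:00:00"))
# -> the cleanup-script event, so IT knows what took the service down around 10pm
```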

5. Self-monitoring & self-healing:
Any system as complex as a cloud needs to monitor all of its critical services and also help monitor the workloads. If any hardware component or software service fails, the system should detect and fix the situation, then alert the admin as to which component failed. If it was a hardware component like a server, hard disk, SSD or NIC, the admin can take corrective action to restore the capacity of the system. This is similar to run-flat tires that handle a puncture on the fly so the driver can take corrective action later, or to an airplane where the remaining engines take over the load when one engine fails. This is an absolute minimum requirement for a self-driving cloud.
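A bare-bones version of that self-healing loop might look like the sketch below: poll a list of critical services, restart anything that has failed and record the event so an admin can be alerted. The service names and the use of systemd as the restart mechanism are assumptions made for illustration.

```python
# Bare-bones self-healing loop. Service names and the use of systemd as the
# restart mechanism are assumptions for illustration.
import subprocess
import time

CRITICAL_SERVICES = ["cloud-api", "cloud-scheduler", "cloud-storage"]  # assumed

def is_active(service: str) -> bool:
    """Ask systemd whether the service is currently running."""
    return subprocess.run(["systemctl", "is-active", "--quiet", service]).returncode == 0

def heal_loop(interval_seconds: int = 30):
    while True:
        for svc in CRITICAL_SERVICES:
            if not is_active(svc):
                subprocess.run(["systemctl", "restart", svc])
                # A real system would also raise an alert and report upstream
                # so the admin knows which component failed and was restarted.
                print(f"restarted failed service: {svc}")
        time.sleep(interval_seconds)
```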

6. Machine learning for long-term decision making:
Since the self-healing layer takes care of short-term decisions, we need another layer of automation that observes the cloud and its applications over a longer period to help optimize the cloud, improve efficiency and plan for the future. A self-driving cloud platform collects telemetry and operational data and applies machine learning to model this behavior; the resulting algorithms help customers make decisions. This layer should observe usage to do predictive capacity modeling and tell you when to order new servers. It should also determine what sort of servers to add in terms of their CPU, memory and IO ratio; for instance, if the applications are CPU-intensive, one should order servers with more cores and less storage. Another area is optimizing the size of VMs based on utilization. Customers pay for peak capacity on public clouds, but average utilization is less than 15 percent in most cases. At that point, you are paying roughly 5x the cost you would pay in a private environment with consolidated workloads, and those savings can be passed on to you instead of staying with the cloud vendor.

A learning system can also detect anomalies in your environment. For one customer, we noticed that a VM suddenly started sending a lot of data to external public IPs; the machine had been compromised by a bot, and such security risks can be detected with a smart anomaly-detection system. These learning systems are going to be needed for our cars and phones as well in the near future. The list of learning-based algorithms can get long, but the key is to have a platform where they can be added easily over time.
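As one small example of the anomaly detection described above, the sketch below flags a VM whose outbound traffic suddenly deviates far from its own recent history, using a simple standard-deviation test. A production system would use richer models and more signals; the data and threshold here are illustrative only.

```python
# Simple egress-anomaly check: flag a VM whose latest outbound traffic is far
# outside its own recent history. Data and threshold are illustrative only.
import statistics

def is_egress_anomaly(history_mb, latest_mb, threshold=4.0):
    """Return True if the latest sample is more than `threshold` stdevs above the mean."""
    mean = statistics.mean(history_mb)
    stdev = statistics.stdev(history_mb) or 1.0   # avoid division by zero
    return (latest_mb - mean) / stdev > threshold

# Hourly egress (MB) for one VM over a quiet period vs. two new samples
normal_hours = [120, 135, 110, 128, 140, 122, 131, 118, 125, 133]
print(is_egress_anomaly(normal_hours, 20500))   # True: possible compromise / exfiltration
print(is_egress_anomaly(normal_hours, 142))     # False: within normal variation
```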

7. Hands-Free Upgrades!
Upgrading a cloud is like changing the tires on a moving car. Admins spend a lot of time dreading it before finally doing it. With a live cloud running a variety of workloads, it is critical that the upgrade process be handled completely by an intelligent software layer, not by humans reading release notes from vendors to figure out the right upgrade path for their environment. In many cases, it means reading release notes from multiple vendors! Who wants to sign up for that?
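Conceptually, the upgrade controller is a state machine that walks one node at a time through drain, upgrade and validation so the cloud never loses capacity or quorum. The sketch below shows the skeleton of such a state machine; the states and node names are simplified assumptions, not ZeroStack's actual implementation.

```python
# Skeleton of a rolling-upgrade state machine: one node at a time is drained,
# upgraded and validated so capacity and quorum are never lost. States and node
# names are simplified assumptions, not ZeroStack's implementation.
from enum import Enum

class NodeState(Enum):
    DRAINING = "draining"       # live-migrate workloads off the node
    UPGRADING = "upgrading"     # apply the new software image
    VALIDATING = "validating"   # health checks before rejoining the cluster
    DONE = "done"

def upgrade_cluster(nodes):
    for node in nodes:
        for state in NodeState:
            print(f"{node}: {state.value}")
            # A real controller would block here until the step succeeds,
            # and roll back automatically if validation fails.

upgrade_cluster(["node-1", "node-2", "node-3"])
```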

At ZeroStack, we have built a self-driving cloud from the ground up using all of these key principles. Here is the technical architecture of the solution:

  • Z-COS: ZeroStack’s cloud operating system converts bare-metal servers into a resilient cloud cluster, providing a hypervisor, SDN, clustered storage and a self-healing control plane running across all servers.
  • Z-Brain: The on-premises cloud is monitored and consumed via a SaaS portal. Z-Brain collects telemetry for long-term analysis and uses machine learning to provide insights for decision making and optimization.
  • Z-AppStore: The application store allows single-click deployment of applications using templates. It already contains around 50 templates for some of the most popular applications for CI/CD, NoSQL stores, application stacks and big data analytics. Customers can also add their own custom application templates to automate tasks for developer, support, sales and IT teams.

Figure 1: Overall architecture of ZeroStack self-driving cloud

As for the self-driving nature of the cloud, the different ZeroStack components work in tandem to provide all seven key elements of a self-driving cloud. Table I provides a summarized view.

Self-driving requirement | Z-COS | Z-Brain and Z-AppStore
Automatic install and configuration | All software components built in | Performs the initial configuration
Integration with other systems | Integrates locally with vSphere, external storage, AD/LDAP | Works with AWS; adds new features at a fast pace
Deploy apps using self-service | Runs all applications locally | Provides self-service workflows and application templates
Real-time monitoring | Sends telemetry data | Stores it in a big data cluster; performs real-time analysis
Self-monitoring, self-healing | Monitors hardware and software components; heals failures and reports to Z-Brain | Shows actions taken by Z-COS and any replacements needed from the admin
Machine learning, long-term decision making | Creates per-VM consumption models | Does the learning and provides insights for optimization and planning
Upgrades | Executed by Z-COS in a clustered manner | Driven by a state machine that controls the upgrade process

Table I: Self-driving needs and the role played by ZeroStack components

If you want to learn more, or need a cloud solution that replaces VMware or lowers AWS costs while delivering all the benefits of self-driving, please reach out at www.zerostack.com.

Download the white paper here!

 
