Networking is often considered a major pain point in any cloud. There are many options to weigh when selecting the right architecture for the physical network infrastructure as well as the Software Defined Networking (SDN) layer. This is the first of several blog posts on how we reason about networking at ZeroStack. The intent is to share our learnings and best practices for running cloud networks in medium-sized clusters. In this post we begin the series by describing the problems we are solving.
Ease of insertion and inter-operability of the ZeroStack solution
State-of-the-art large-scale data center networks use Layer 3 leaf-spine topologies. That architecture and investment make sense when the compute/storage scale grows beyond a few racks. New customer deployments, however, start small (one ZeroStack Cloud Building Block, or ZBlock) and are expected to grow incrementally. Moreover, a large majority of our customers have legacy Layer 2 networks, where they would be willing to deploy their cloud in a VLAN or a set of VLANs. Many of the switches in use are also a few years old, support only 1 Gbps, and are Layer 2 only. We therefore support several different types of physical network environments:
- Layer 2 and Layer 3
- Port speeds of 10 Gbps and 1 Gbps
- Bonded and non-bonded links to the Top-of-Rack (ToR) switch
- Connectivity to a single ToR switch or two ToR switches serving as Virtual Port Channel (VPC) or Multichassis Link Aggregation (MLAG) peers
- Trunk and access ports for Layer 2 environments
- Different VLANs for each service, e.g., management, tunnel, storage, etc.
Ensuring high availability (HA) of the network
High availability is a multi-dimensional problem. Our solution covers high availability of ZeroStack services, OpenStack services, and the host itself, as well as redundancy in the physical network. The goal is to maximize the uptime of the cloud.
Physical Network HA
We need to deal with scenarios where the network and cloud administrators may be different people or teams. Thus, deploying a ZBlock does not require the cloud administrator to configure the ToR switch. However, for production environments of up to about 3-4 racks of ZBlocks, we recommend the Layer 2 leaf-spine topology shown below. Each host in a ZBlock has two 10GBASE-T NICs, and each NIC is connected to a different ToR switch in an active-active configuration using the Link Aggregation Control Protocol (LACP). A reboot or software upgrade of a single ToR switch then results in temporary performance degradation instead of an outage.
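To make the "degradation instead of an outage" point concrete, here is a minimal sketch of how a host-side check could classify the state of an LACP bond. It assumes the host uses the standard Linux bonding driver, which exposes its status as text under /proc/net/bonding/. The function names, the interface names (eth0, eth1) and the embedded sample text are illustrative assumptions, not our production code.

```python
# Sample status text in the style of /proc/net/bonding/bond0 (illustrative values).
SAMPLE_BOND_STATUS = """\
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
MII Polling Interval (ms): 100

802.3ad info
LACP rate: slow

Slave Interface: eth0
MII Status: up
Speed: 10000 Mbps
Permanent HW addr: 00:1b:21:aa:bb:01

Slave Interface: eth1
MII Status: down
Speed: Unknown
Permanent HW addr: 00:1b:21:aa:bb:02
"""


def parse_bond_status(text):
    """Return (bonding_mode, {slave_name: mii_status}) from bonding status text."""
    mode = None
    slaves = {}
    current = None  # slave section we are currently inside, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # section titles and blank lines carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            current = value
            slaves[current] = None
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave and is skipped.
            slaves[current] = value
    return mode, slaves


def bond_health(text):
    """Classify the bond as 'healthy', 'degraded' (some links up) or 'down'."""
    _, slaves = parse_bond_status(text)
    up = [name for name, status in slaves.items() if status == "up"]
    if not up:
        return "down"
    return "healthy" if len(up) == len(slaves) else "degraded"


if __name__ == "__main__":
    print(parse_bond_status(SAMPLE_BOND_STATUS))
    print(bond_health(SAMPLE_BOND_STATUS))
```

With one of the two ToR switches rebooting, such a check would report the bond as degraded while traffic continues to flow over the surviving link.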
HA for the SDN
We use the term SDN to refer to the OpenStack Neutron-driven networking as well as all logical networking constructs in our solution. Our constructs include the following:
- Interfaces and subnets used by the ZeroStack distributed services to communicate with each other
- The OpenStack API network – this is the network used by OpenStack services to communicate with each other and with infrastructure services like MySQL and RabbitMQ
- The tunnel network used to encapsulate tenant VM-to-VM traffic across hosts using Generic Routing Encapsulation (GRE)
- The storage network used by our distributed file system
On each bare-metal host in a ZBlock, we have a controller VM which runs our services as well as most OpenStack services. There are a few services that run on the host too – these are typically the ones that need to run on every host. Our software performs periodic health checks of both the underlying hardware and the OpenStack services. The health checks have varying degrees of detail and also capture dependencies across services. Upon detecting hardware or service failures, we replicate, migrate and balance services across the available hosts and controller VMs as appropriate. Here are a few salient features of service networking HA:
- Each service has a virtual IP – this helps us deal with service migrations as the API endpoint is not tied to the physical network
- Each service interface has a deterministic MAC address – this ensures that there are no inconsistent ARP cache entries when services migrate
- Use of static ARPs when appropriate – when we control all the logical interfaces and assign their IP and MAC addresses, it makes sense to use static ARPs. This is especially true where the management overhead of maintaining them is low.
- Increase the ARP cache timeout for situations where the management overhead of static ARPs is prohibitive
- The use of ARP filters and per-interface source-based routing – this makes the Linux networking stack behave in a more interface-centric way, as opposed to its default behavior, which is host-centric.
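As an illustration of the deterministic MAC address idea in the list above, the sketch below derives a stable MAC from a service name. The derivation scheme (SHA-256 over a cluster identifier plus service name), the function name and the example identifiers are assumptions made for this example; this post does not describe how our addresses are actually assigned.

```python
import hashlib


def deterministic_mac(service_name: str, cluster_id: str = "zblock-0") -> str:
    """Derive a stable unicast, locally administered MAC for a service interface.

    Hypothetical scheme: hash the (cluster, service) pair and use the first
    six bytes, so the same service gets the same MAC wherever it runs.
    """
    digest = hashlib.sha256(f"{cluster_id}/{service_name}".encode("utf-8")).digest()
    octets = bytearray(digest[:6])
    # Set the locally administered bit (0x02) and clear the multicast bit (0x01)
    # so the address never collides with vendor-assigned or group addresses.
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)


if __name__ == "__main__":
    for name in ("keystone", "nova-api", "rabbitmq"):
        print(name, deterministic_mac(name))
```

Because the address depends only on the service identity, a service that migrates to another host comes up with the same IP-to-MAC mapping, so neighbors holding a cached or static ARP entry do not need to relearn it.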
Network visibility, failure localization and analytics
There are several manual tasks which operators perform when trying to troubleshoot common problems, e.g., various source-destination combinations of pings, checking link statuses, etc. Here are some items that we have automated and use as alerts and/or on-demand diagnostics:
- Link failure/flakiness of any physical link connecting a host
- Periodic network health monitoring – currently we gather the following network statistics:
  - Per-host TX/RX/drop counters from each NIC
  - Per-VM TX/RX/drop counters using libvirt
  - Packet loss rates for pings from each controller VM to every OpenStack service and other controller VMs
  - TCP stats (/proc/net/snmp) from each controller VM
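For readers unfamiliar with /proc/net/snmp, the file consists of pairs of lines per protocol: a header line with counter names followed by a line with the matching values. The following is a minimal sketch of how the TCP counters could be extracted. The function names and the sample text are illustrative assumptions, not our collection code.

```python
# Sample text in the format of /proc/net/snmp (values are made up).
SAMPLE_SNMP = """\
Ip: Forwarding DefaultTTL InReceives
Ip: 1 64 1000
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
Tcp: 1 200 120000 -1 50 40 2 1 10 2000 1000 25 0 3 0
"""


def parse_proc_net_snmp(text):
    """Return {protocol: {counter_name: int_value}} from /proc/net/snmp text."""
    stats = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in (header, values) pairs that share the same protocol prefix.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, _, names = header.partition(":")
        value_proto, _, numbers = values.partition(":")
        if proto != value_proto:
            continue  # malformed pair, skip it
        stats[proto] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return stats


def tcp_retransmit_ratio(stats):
    """Fraction of sent TCP segments that were retransmissions."""
    tcp = stats.get("Tcp", {})
    out_segs = tcp.get("OutSegs", 0)
    return tcp.get("RetransSegs", 0) / out_segs if out_segs else 0.0


if __name__ == "__main__":
    # On a Linux host, read the real file instead:
    #   with open("/proc/net/snmp") as f: text = f.read()
    parsed = parse_proc_net_snmp(SAMPLE_SNMP)
    print(parsed["Tcp"]["RetransSegs"], tcp_retransmit_ratio(parsed))
```

The kernel counters are cumulative since boot, so a periodic collector would typically store the difference between consecutive samples rather than the raw values.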
We have built a data pipeline that allows us to gather the above statistics periodically and store them in a Cassandra-based time-series database in our SaaS platform. This allows us to perform the following tasks:
- Generate alerts for sustained packet losses to various services
- Detect network partitions and glitches
- Help in diagnosing workload performance issues
- Detect several kinds of anomalous behavior that might be indicative of serious problems and require further investigation
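To give a flavor of the first task in the list above, here is a minimal sketch of one way to distinguish sustained packet loss from a momentary glitch: raise an alert only when the loss rate stays above a threshold for several consecutive samples. The class name, the 5% threshold and the three-sample window are hypothetical values chosen for illustration, not the settings we use.

```python
from collections import deque


class SustainedLossDetector:
    """Flag packet loss only when it persists across a full window of samples."""

    def __init__(self, threshold: float = 0.05, window: int = 3):
        self.threshold = threshold          # loss rate above this counts as "bad"
        self.recent = deque(maxlen=window)  # most recent loss-rate samples

    def observe(self, loss_rate: float) -> bool:
        """Record one periodic sample; return True if an alert should fire."""
        self.recent.append(loss_rate)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(r > self.threshold for r in self.recent)


if __name__ == "__main__":
    detector = SustainedLossDetector()
    for sample in [0.0, 0.2, 0.3, 0.1, 0.0]:
        print(sample, detector.observe(sample))
```

A single lossy sample never fires the alert on its own, which keeps short-lived glitches from paging anyone while still catching loss that persists across the window.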
In subsequent posts, we will go over the details of our networking architecture and share some interesting problems we have faced and solved. We are just getting started and are actively working on exciting new features. If you would like to work on challenging problems in cloud networking, we would love to hear from you! Please contact us at firstname.lastname@example.org