The word “Big” in Big Data doesn’t even come close to capturing what is happening today in our industry and what is yet to come. The volume, velocity, and variety of data being generated have overwhelmed the capabilities of the infrastructure and analytics we have today.
We are now experiencing Moore’s law for data growth: data is doubling every 18 months.
No wonder IDC forecasts that by 2025, the global datasphere will grow to 163 zettabytes (a zettabyte is a trillion gigabytes). That’s ten times the data generated in 2016.
Data scientists often have to combine data from sources that differ in volume, variety, and velocity to gain useful insights, and that in turn places different demands on processing power, storage and network performance, latency, and so on.
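As a rough sanity check on these growth figures, the short calculation below compares the two rates quoted above. It uses only the numbers in the text (doubling every 18 months, tenfold growth between 2016 and 2025). The function names are illustrative, and treating both figures as smooth exponential growth is an assumption.

```python
import math


def growth_factor(years: float, doubling_months: float) -> float:
    """Growth multiple after `years` if data doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def implied_doubling_years(factor: float, years: float) -> float:
    """Doubling time (in years) implied by growing `factor`-fold over `years`."""
    return years * math.log(2) / math.log(factor)


years = 2025 - 2016  # nine years between the two IDC reference points

# Multiple implied by "doubling every 18 months" over nine years.
print(growth_factor(years, 18))

# Doubling time implied by IDC's tenfold growth over the same period.
print(round(implied_doubling_years(10, years), 2))
```

Doubling every 18 months would compound to a 64-fold increase over nine years, while the tenfold IDC forecast corresponds to a doubling time of roughly 2.7 years. Either way, the growth is exponential and sustained.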
Here is a quick look at the different types of Big Data sources:
Unstructured data: The type of data generated by sources such as social media, log files, and sensor data is not very structured and hence is generally not amenable to traditional database analysis methods. A large variety of Big Data tools, techniques, and approaches have emerged in the last few years to ingest, analyze, and extract customer sentiment from social media data. Newer approaches include Natural Language Processing, News Analytics, and unstructured text analysis.
Semi-structured data: Some unstructured data does in fact have some structure to it. Examples include email, call center logs, and IoT data. Some in the industry have coined the term “semi-structured data” to describe these sources. Extracting useful insights from them may require a combination of traditional databases and newer Big Data tools.
Streaming data brings in the dimension of higher velocity and real-time processing constraints. Velocity varies widely depending on the type of application. IoT data tends to be small packets of data streamed regularly at low velocity. On the other hand, 4K video streams stretch velocity to the highest end of the spectrum.
The alluring promise of these new use cases, and of the associated emerging technologies and tools, is that they can generate useful insights faster so that companies can take actions to achieve better business outcomes, improve customer experience, and gain significant competitive advantage.
No wonder Big Data projects have been on the CIO top ten initiatives for the past decade – almost 70% of Fortune 1000 firms rate big data as important to their businesses; over 60% already have at least one big data project in place.
While data scientists are dealing with this complexity of how to derive value from diverse data sources, IT practitioners need to figure out the most efficient way to deal with the infrastructure requirements of big data projects. Traditional bare-metal infrastructure with its siloed management of servers, storage, and networks is not flexible enough to tackle the dynamic nature of the new Big Data workloads. This is where cloud-based systems shine. However, many challenges remain to be addressed in the areas of workload scaling, performance and latency, data migration, bandwidth limitations, and application architectures.
There are many pain points that companies experience when they try to deploy and run Big Data applications in their complex environments and/or on cloud platforms, whether public or private, and there are also best practices companies can use to address those pain points.
PAIN POINT 1 – LONG COMMUTE FROM STORAGE TO COMPUTE
As data volumes grow from terabytes to petabytes and beyond, it takes longer and longer to transport this data closer to compute and to perform data processing and analytics, impeding the agility of the organization. Public cloud vendors like AWS, who are all about centralized data centers, want to get your data into their cloud and go to extreme lengths (see AWS Snowmobile) to get it. Furthermore, data transfer fees are mostly unidirectional, i.e., only data that is going out of an AWS service is subject to data transfer fees.
Not only is this a classic lock-in scenario, but it also goes against other key emerging trends:
Edge Computing and Artificial Intelligence, especially for use cases such as IoT, 5G, image/speech recognition, Blockchain, and others, where there is a need to place processing and data closer to each other and/or closer to where the user or device is.
Edge computing delivers faster data analytics results with the data being closer to processing while simultaneously reducing the cost of transporting data to the cloud.
Artificial Intelligence systems are more effective the more data they are given. For example, in deep learning, the more cases (data) you give to the system, the more it learns and the more accurate its results become. This is a case where you need massive parallel processing (e.g., using GPUs) of large data sets. Big Data analytics and AI can complement each other to improve speed of processing and produce more useful and relevant results.
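To put the “long commute” described at the start of this section into numbers, here is a minimal back-of-the-envelope sketch. The 10 Gbps link speed and the `efficiency` parameter are illustrative assumptions, not figures from any vendor; real transfers also pay protocol and contention overhead.

```python
def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `data_bytes` over a link of `link_gbps` gigabits/s.

    `efficiency` (0-1) is a hypothetical knob for the fraction of the link
    that is actually usable for the transfer.
    """
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

for label, size in [("1 TB", TB), ("1 PB", PB)]:
    print(f"{label} over 10 Gbps: {transfer_days(size, 10):.2f} days")
```

Under these assumptions a terabyte moves in about 13 minutes, while a petabyte needs more than nine days of a fully dedicated link, which is why placing compute next to the data becomes attractive at scale.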
To address the need to get data to where the compute is or vice versa, IT leaders should look for hyper-converged, scale-out solutions that bring together compute, storage, and networking, thus reducing data I/O latency and improving data processing and analytics times. For even better performance, they should look for solutions that can bring the computing units (VMs or containers) as close to the physical storage as possible, without losing the manageability of the storage solution and while maintaining multi-tenancy across the cluster. For example, a Hadoop Data Node VM running on the same physical host and accessing local SSDs will experience the highest performance and faster results overall without impacting other workloads running within other tenants.
IT leaders can take advantage of many emerging memory technologies such as persistent memory (a new memory technology between DRAM and flash that will be non-volatile, with low latency and higher capacity than DRAMs), NVMe, and faster flash drives. With prices falling rapidly, there seems little need for spinning disks for primary storage.
IT administrators should implement a central way to manage all the edge computing sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and Project-level RBAC and security controls.
PAIN POINT 2 – DISTRIBUTED TEAMS, LOCAL PERFORMANCE NEEDS
For data science development and testing use cases, companies do not build a single huge data processing cluster in a centralized data center for all of their big data teams spread around the world. Building such a cluster in one location has DR implications, not to mention latency and country-specific data regulation challenges. Typically, companies want to build out separate local/edge clusters based on location, type of application, data locality requirements, and the need for separate development, test, and production environments.
Having a central pane of glass for management becomes crucial in this situation for operational efficiency, simplifying deployment, and upgrading these clusters. Having strict isolation and role-based access control (RBAC) is often a security requirement.
IT administrators should implement a central way to manage diverse infrastructures in multiple sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and Project-level RBAC and security controls.
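One way to picture BU-level and project-level RBAC is as a list of grants that are checked before any cluster operation. The sketch below is a simplified illustration, not any product's access model; the role names, actions, and the `Grant` structure are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical roles and the actions each one permits.
ROLE_PERMISSIONS = {
    "bu_admin": {"deploy_cluster", "delete_cluster", "view_cluster"},
    "project_member": {"deploy_cluster", "view_cluster"},
    "viewer": {"view_cluster"},
}


@dataclass(frozen=True)
class Grant:
    business_unit: str
    project: Optional[str]  # None means the grant covers every project in the BU
    role: str


def is_allowed(grants: Iterable[Grant], business_unit: str, project: str, action: str) -> bool:
    """Return True if any grant covers this BU/project and permits the action."""
    for grant in grants:
        if grant.business_unit != business_unit:
            continue  # grants never cross business-unit boundaries
        if grant.project is not None and grant.project != project:
            continue  # project-scoped grant for a different project
        if action in ROLE_PERMISSIONS.get(grant.role, set()):
            return True
    return False


grants = [
    Grant("analytics", "fraud-model", "project_member"),
    Grant("analytics", None, "viewer"),
]
print(is_allowed(grants, "analytics", "fraud-model", "deploy_cluster"))
print(is_allowed(grants, "analytics", "churn", "deploy_cluster"))
```

The user in this example can deploy clusters only in the project where they are a member, can view clusters anywhere in their own business unit, and has no access at all to other business units.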
PAIN POINT 3 – STUCK ON BARE METAL AND ITS SILO INEFFICIENCIES
Companies still run the majority of their Big Data workloads, particularly Hadoop-based workloads, on bare metal. This is not as scalable, elastic, or flexible as a virtual or cloud platform. Traditional bare metal environments are famous for creating silos where various specialist teams (storage, networking, security) form fiefdoms around their respective functional areas. Silos impede velocity because they lead to complexity of operations, lack of consistency in the environment, and lack of automation. Automating across silos turns into an exercise in custom scripts and a lot of “glue and duct tape,” which makes maintenance and change management complex, slow, and error-prone.
A virtualized environment for Big Data allows data scientists to create their own Hadoop, Spark or Cassandra clusters and evaluate their algorithms. These clusters need to be self-service, elastic and high performing. IT should be able to control the resource allocation to data scientists and teams using quotas and role-based access control.
Better yet, look for an orchestration platform that can deal both with bare metal and virtual environments, so IT can place workloads in the best target environment based on performance and latency requirements.
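The placement decision described above can be sketched as a simple filter-and-rank step. This is a minimal illustration under assumed inputs: the environment attributes, latency numbers, and relative costs below are hypothetical, and a real orchestrator would weigh many more factors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Environment:
    name: str
    kind: str                  # "bare_metal" or "virtual"
    storage_latency_ms: float  # typical storage latency (hypothetical figure)
    has_gpu: bool
    relative_cost: float       # lower is cheaper (hypothetical unit)


@dataclass(frozen=True)
class Workload:
    name: str
    max_storage_latency_ms: float
    needs_gpu: bool = False


def place(workload: Workload, environments: Sequence[Environment]) -> Optional[Environment]:
    """Pick the cheapest environment that meets the workload's requirements."""
    candidates = [
        env for env in environments
        if env.storage_latency_ms <= workload.max_storage_latency_ms
        and (env.has_gpu or not workload.needs_gpu)
    ]
    # default=None covers the case where no environment qualifies.
    return min(candidates, key=lambda env: env.relative_cost, default=None)


environments = [
    Environment("dc1-bare-metal", "bare_metal", 0.2, True, 3.0),
    Environment("dc1-virtual", "virtual", 1.0, False, 1.0),
]
hadoop = Workload("hadoop-etl", max_storage_latency_ms=0.5)
notebook = Workload("dev-notebook", max_storage_latency_ms=5.0)
print(place(hadoop, environments).name)
print(place(notebook, environments).name)
```

With these sample values the latency-sensitive job lands on bare metal and the development notebook lands on the cheaper virtual environment.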
PAIN POINT 4 – BIG DATA TOOLS EXPLOSION AND DEPLOYMENT COMPLEXITY
In the past decade, technologies such as Hadoop and MapReduce have become common frameworks to speed up processing of large datasets by breaking them up into small fragments, running them on distributed clusters of storage and processors, and then collating the results back for consumption. Companies like Cloudera, Hortonworks and others have addressed many of the challenges associated with scheduling, cluster management, resource and data sharing, and performance tuning of these tools. And typically, such deployments are optimized to run on bare metal or on virtualization platforms like VMware and therefore tend to remain in their own silo because of the complexity of deploying and operating these environments.
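The split, process, and collate pattern described above can be illustrated on a single machine. The sketch below is a toy word count that uses a Python thread pool as a stand-in for a cluster; it does not use the Hadoop API, and the function names and fragment size are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_fragments(records, fragment_size):
    """Break the dataset into small fragments of at most `fragment_size` records."""
    return [records[i:i + fragment_size] for i in range(0, len(records), fragment_size)]


def map_fragment(fragment):
    """Map step: count the words within one fragment independently of the others."""
    counts = Counter()
    for line in fragment:
        counts.update(line.lower().split())
    return counts


def word_count(records, fragment_size=2, workers=4):
    """Run the map step on each fragment in parallel, then collate the results."""
    fragments = split_into_fragments(records, fragment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_fragment, fragments))
    # Reduce step: merge the per-fragment counts into one result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


lines = ["big data", "big clusters", "data data"]
print(word_count(lines))
```

In a real deployment the fragments live on different nodes and the framework handles scheduling, data locality, and failures, which is where most of the operational complexity comes from.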
Modern big data use cases, however, need a whole bunch of other technologies and tools. You have Docker. You have Kubernetes. You have Spark. You have NoSQL Databases such as Cassandra and MongoDB. And when you get into machine learning you have TensorFlow, etc.
Deploying Hadoop, which is quite complex, is one thing, arguably made relatively easy by companies like Cloudera and Hortonworks, but then if you need to deploy Cassandra or MongoDB, you have to put in effort to write Ansible or Puppet or Chef scripts to deploy. And depending on the target platform (bare metal, VMware, Microsoft), you will need to maintain and run multiple scripts. You then have to figure out how to network the Hadoop cluster with the Cassandra cluster and of course, inevitably, deal with DNS services, load balancers, firewalls, etc. Add other Big Data tools to be deployed, managed, and integrated, and you will begin to appreciate the challenge.
IT teams should address this challenge with a unifying platform that can not only deploy multiple Big Data tools and platforms from a curated “application and big data catalog,” but also provide a way to virtualize all the underlying infrastructure resources along with an infrastructure-as-code framework via open API access. This greatly simplifies the IT burden when it comes to provisioning the underlying infrastructure resources, and end users can simply deploy the tools they want and need with a single click and have the ability to use APIs to automate their deployment, provisioning, and configuration challenges.
PAIN POINT 5 – ONE BIG DATA CLUSTER DOESN’T ADDRESS ALL NEEDS
Organizations have diverse Big Data teams, production and R&D portfolios, and sometimes conflicting requirements for performance, data locality, cost, or specialized hardware resources. One single, standardized data cluster is not going to meet all of those needs. Companies will need to deploy multiple, independent Big Data clusters with possibly different underlying CPU, memory, and storage footprints. One cluster could be dedicated and fine-tuned for a Hadoop deployment with high local storage IOPS requirements, another one may be running Spark jobs with more CPU and memory-bound configurations, and others like machine learning will need GPU infrastructure. Deploying and managing the complexity of such multiple diverse clusters will place a high operational overhead on the IT team, reducing their ability to respond quickly to Big Data user requests, and making it difficult to manage costs and maintain operational efficiency.
To address this pain point, the IT team should again have a unified orchestration/management platform and be able to set up logical business units that can be assigned to different Big Data teams. This way, each team gets full self-service capability within quota limits imposed by the IT staff, and each team can automatically deploy its own Big Data tools with a few clicks, independently of other teams.
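Self-service within quota limits comes down to checking each deployment request against what a team has left. The following is a minimal sketch of that check; the resource dimensions, team name, and numbers are hypothetical, and a real platform would also handle reservations, releases, and concurrency.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    vcpus: int
    memory_gb: int
    storage_tb: float


@dataclass
class Team:
    name: str
    limit: Quota  # ceiling set by IT
    used: Quota   # resources already consumed by the team's clusters


def can_deploy(team: Team, request: Quota) -> bool:
    """True if the request fits inside the team's remaining quota on every dimension."""
    return (
        team.used.vcpus + request.vcpus <= team.limit.vcpus
        and team.used.memory_gb + request.memory_gb <= team.limit.memory_gb
        and team.used.storage_tb + request.storage_tb <= team.limit.storage_tb
    )


def deploy(team: Team, request: Quota) -> None:
    """Record the request against the team's usage, or refuse it."""
    if not can_deploy(team, request):
        raise ValueError(f"{team.name}: request exceeds quota")
    team.used.vcpus += request.vcpus
    team.used.memory_gb += request.memory_gb
    team.used.storage_tb += request.storage_tb


ml_team = Team("ml", limit=Quota(64, 256, 20.0), used=Quota(48, 128, 5.0))
deploy(ml_team, Quota(16, 64, 1.0))          # fits: the team is now at its vCPU limit
print(can_deploy(ml_team, Quota(1, 1, 0.1)))  # no vCPUs left, so this is refused
```

Because each team is checked only against its own limits, one team exhausting its quota does not affect what the others can deploy.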
PAIN POINT 6 – SKYROCKETING IT OPERATIONS COSTS
Developing, deploying, and operating large-scale enterprise big data clusters can get complex, especially if it involves multiple sites, multiple teams, and diverse infrastructure, as we have seen in previous pain points. Operating these systems can be expensive, and time-consuming when done manually. For example, IT operations teams still need to set up firewalls, load balancers, DNS services, and VPN services, to name a few. They still need to manage infrastructure operations such as physical host maintenance, disk additions/removals/replacements, and physical host additions/removals/replacements. They still need to do capacity planning, and they still need to monitor utilization, allocation, and performance of compute, storage, and networking.
IT teams should look for a solution that addresses this operational overhead through automation and through the use of modern SaaS-based management portals that help the teams optimize sizing, perform predictive capacity planning, and implement seamless failure management.
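As a simple illustration of what predictive capacity planning means, the sketch below fits a straight line to daily storage usage and estimates when the cluster fills up. A linear trend is an assumption chosen for brevity, not the method of any particular product, and the sample numbers are made up.

```python
from typing import Optional, Sequence


def days_until_full(usage_tb: Sequence[float], capacity_tb: float) -> Optional[float]:
    """Estimate days until `capacity_tb` is reached, from daily usage samples.

    Fits a least-squares line through the samples (one per day) and extrapolates
    from the latest sample. Returns None if usage is flat or shrinking.
    """
    n = len(usage_tb)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(usage_tb) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_tb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # TB of growth per day
    if slope <= 0:
        return None
    return max(0.0, (capacity_tb - usage_tb[-1]) / slope)


# Hypothetical samples: usage grows by 2 TB per day on a 20 TB cluster.
print(days_until_full([10, 12, 14, 16], capacity_tb=20))
```

Even a crude estimate like this lets IT order capacity before users hit a wall; production tools typically add seasonality, confidence intervals, and per-tenant breakdowns.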
PAIN POINT 7 – CONSISTENT POLICY-DRIVEN SECURITY AND CUSTOMIZATION REQUIREMENTS
Enterprises have policies around using their specifically hardened and approved gold images of operating systems. The operating systems often need to have security configurations, databases, and other management tools installed before they can be used. Running these on public cloud may not be allowed, or they may run very slowly.
The solution is to enable an on-premises data center image store where enterprises can create customized gold images. Using fine-grained RBAC, the IT team can share these images selectively with various development teams around the world, based on the local security, regulatory, and performance requirements. The local Kubernetes deployments are then carried out using these gold images to provide the underlying infrastructure to run containers.
PAIN POINT 8 – DR STRATEGY FOR EDGE COMPUTING AND BIG DATA CLUSTERS
Any critical application and the data associated with it need to be protected from natural disasters regardless of whether or not these apps are based on containers. None of the existing solutions provides an out-of-the-box disaster recovery feature for critical edge computing clusters or Big Data analytics applications. Customers are left to cobble together their own DR strategy.
As part of a platform’s multi-site capabilities, IT teams should be able to perform remote data replication and disaster recovery between geographically separated sites. This protects persistent data and databases used by these clusters.
THE ZEROSTACK SOLUTION FOR SOLVING BIG DATA WORKLOAD CHALLENGES
The ZeroStack Cloud Platform provides a virtualized environment where data scientists can spin up multiple Big Data application clusters on demand and scale them as needed. These clusters can be geographically closer to data scientists and other users as well as co-located closer to data sources, and can be networked onto the same high-speed local area networks for faster data ingestion and processing. Multiple clusters can be managed and monitored from a single web-based interface for any-time, any-place, any-device access. This is particularly useful for edge computing use cases.
The platform also allows optimal utilization of resources and performance guarantees to run these applications. ZeroStack has unique local storage capabilities to avoid double replication of data both at the infrastructure and application level, specifically designed for Big Data application use cases. IT can allocate projects with specific quotas to one or more users to allow them to work independently.
AUTOMATED DEPLOYMENT TEMPLATES FOR MANY POPULAR BIG DATA TOOLS
ZeroStack supports several Big Data applications via its built-in App Store. This built-in App Store offers pre-built application templates that enable customers to deploy Big Data applications with ease. Some of the example templates include multi-node Hadoop clusters, Apache Spark clusters, Cassandra clusters, Apache Storm clusters, Cloudera Express, Splunk, and HDFS. Users can “import” these templates to their ZeroStack private cloud in a single click and then deploy them. These templates have configuration options that allow for storage, networking, and compute optimization as needed for a given environment. Users can also create and upload their own custom Big Data application templates to the App Store.
The following out-of-the-box capabilities solve many of the challenges outlined earlier:
Resource sharing using virtualization
Eliminate physical silos and consolidate multiple Big Data applications on a single platform. Achieve 50 percent lower CapEx through resource sharing and high utilization, and 90 percent lower OpEx with self-service.
Faster Time to Value
Provide self-service deployment of Big Data applications like Hadoop, Spark, and Cassandra to development and R&D teams. They can deploy applications within minutes.
Control consumption of resources using projects with quotas and policies governed by IT. Monitor over-commitment and add capacity using built-in capacity planning indicators. Get insights to improve efficiency and performance based on actual application stats and machine learning. Optimize capacity using long-term analytics.
Scale on Demand
Scale infrastructure to meet compute performance and data growth. Build with one server at a time and grow based on actual usage.
Eliminate operational complexity
With cloud-based monitoring and analytics in the web-based Z-Brain, customers do not need any local infrastructure monitoring solution. IT teams do not need any certifications or special expertise to operate infrastructure. This reduces total cost of ownership (TCO) by 50 percent while cutting operational complexity by 90 percent.
With the ZeroStack Cloud Platform, enterprises can deploy, operate, and manage Big Data projects with high performance and low overhead. On-premises cloud is the key to centralized management, and ZeroStack is the premier on-premises cloud platform.