Big Data refers to large volumes of data and the processes applied to that data to gain insights for better decision making and strategic planning. Big Data operations include data capture, storage, sharing, analysis, search, transfer, querying, projection, etc. Modern Big Data concepts encompass predictive analytics, user behavior analytics and similar techniques that extract value from data and turn it into insights, products and services. From a Big Data perspective, what matters is not the amount of data but what businesses do with it. Big Data analysis helps businesses save cost and time, develop and implement new technologies, optimize product offerings, improve operational efficiency, etc.
Big Data can be defined by the five Vs:
Volume: the combined quantity of data generated and stored. Size is a determining factor in whether a data set is worth analyzing for insights, and whether it qualifies as Big Data at all.
Variety: the type and format of the data, which can be structured numeric data, unstructured text, media, emails, transactions, etc.
Velocity: the speed at which data is acquired and processed to create the Big Data set, and how this influences the growth and development of the business.
Variability: the inconsistency of the data as well as of the data flow. Seasonal and event-triggered peaks and troughs may affect the processes that handle the data.
Veracity: the variation in the quality of the captured data, which can affect the accuracy of the analysis.
Important sources for Big Data are:
1. Data received from connected devices, including IT systems, electronics and appliances, sensors, monitors, communication equipment, etc.
2. Social media, marketing responses, feedback and surveys, etc.
3. Portals, data banks and other public data sources.
Big Data Technologies
Once data is received, you must decide what to store and what to discard, how much to analyze, and how to use the resulting insights. Storing, processing, managing and analyzing Big Data requires the following:
1. Large amounts of storage at affordable cost.
2. Fast, performance-optimized processors.
3. Affordable and distributed Big Data platforms for hosting the data.
4. Functional capabilities such as parallel processing, clustering, virtualization, large grid environments, and highly available, fault-tolerant systems.
5. Cloud computing and flexible provisioning facilities.
6. Standardized operations in the data center.
7. Data security and privacy.
To handle the vastness and power of Big Data, you need an infrastructure capable of managing and processing high volumes of structured and unstructured data while maintaining strong data privacy and security. There are two classes of Big Data technology:
a. Operational Big Data: systems with operational capabilities for real-time capture, processing and storage of data.
b. Analytical Big Data: systems such as Massively Parallel Processing (MPP) databases and MapReduce, which can be scaled out massively to perform complex analytical computations.
In traditional, centralized RDBMS-based systems there is a limit on the amount of data that can be handled before performance and storage constraints set in, so Big Data needs a more scalable solution. Running Big Data computation on a cloud-based infrastructure allows increased flexibility, better resource utilization, greater scaling capacity and lower costs. A private cloud strengthens security and compliance, while a multi-technology hybrid cloud offers an elastic infrastructure suited to the high-end storage and processing needs of Big Data.
Apache Hadoop, a Big Data framework built around MapReduce, is one such solution. It distributes large-scale data storage and processing across a parallel processing cluster built from commodity hardware, which can be scaled easily to thousands of computational nodes.
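The MapReduce model mentioned above can be illustrated with a minimal, single-process sketch in Python. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration, and a real cluster would run the map and reduce steps in parallel on many nodes rather than in one loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map, shuffle and reduce over a list of text splits."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Each string stands in for a block of input stored on a different node.
    print(word_count(["big data needs big clusters", "clusters process data"]))
```

Because each split is mapped independently and each key is reduced independently, both steps can be spread across as many nodes as there are splits or keys, which is what lets this style of computation scale out.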
Hadoop and OpenStack
Hadoop, combined with the open source cloud platform OpenStack, provides an infrastructure that meets the technical, functional and non-functional requirements of a Big Data processing infrastructure. OpenStack’s Sahara provisions data-intensive Hadoop clusters (or clusters for other frameworks such as Spark) on top of OpenStack. OpenStack supports this architecture through its cloud component services: nova (compute), neutron (networking), cinder (block storage), keystone (identity), swift (object storage), etc. Hadoop provides the distributed computing service through its own components: the Hadoop Distributed File System (HDFS) for replicated data storage, YARN for job scheduling and cluster resource management, MapReduce, a YARN-based system for parallel data processing, and Hadoop Common, a set of utilities that support the other Hadoop modules.
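To make the idea of replicated data storage more concrete, here is a toy sketch of splitting a file into blocks and placing several copies of each block on different nodes. It assumes a simple round-robin placement; the actual HDFS placement policy is rack-aware and more sophisticated, and the function names (place_blocks, readable_after_failure) are hypothetical.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]} using round-robin placement.

    Toy model only: real HDFS also takes rack topology and node load
    into account when choosing replica locations.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    failed = set(failed_nodes)
    return all(any(n not in failed for n in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    # A 300 MB file with 128 MB blocks on a four-node cluster.
    placement = place_blocks(300, 128, ["n1", "n2", "n3", "n4"])
    print(placement)
    print(readable_after_failure(placement, ["n1", "n2"]))
```

With three replicas per block, the file in this sketch stays readable when any two nodes fail, which is the property that lets a cluster of commodity hardware tolerate individual machine failures.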
Hadoop and OpenStack are integrated to provide Big Data Platform as a Service (BDPaaS), a hybrid cloud-based Big Data as a service offering that uses Hadoop’s Big Data analytics capability, RHEL’s OpenStack platform and Intel-based x86 servers. This service collects data from different sources, performs comprehensive strategic and business analysis on that data, and incorporates the resulting insights into business processes. The advantages of such a solution are:
1. Open source and commonly available components.
2. Compliance with industry standards.
3. Simple deployment and flexible operation.
4. Reduced risk, high security and scalability.
5. Intelligent distribution of workloads to the most appropriate business component, whether it is on-premises, in a private cloud or even in a public cloud.
6. Built-in security practices for protecting data and insights.
Hadoop cluster on OpenStack
The main advantage OpenStack brings to this integration is template-based provisioning, which reduces provisioning and deployment time. Templates can be specified at the cluster or node level to automate the provisioning of Hadoop clusters. Templates also allow flexibility in configuration and in defining the cluster type, that is, whether or not the cluster is Hadoop-based. This architecture also provides efficient cluster time-sharing, load distribution and high availability of the infrastructure. OpenStack’s Sahara provides simple provisioning of Hadoop clusters and elastic data processing capabilities. Users can manage the Hadoop clusters through Horizon, the OpenStack dashboard service.
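The template idea can be sketched in plain Python as follows. This is an illustration of the concept only, not the Sahara API: the class names, field names, flavor names and the expand function are all assumptions made for this example. The sketch shows how a node-level template describes one kind of machine, how a cluster-level template combines node templates with counts, and how the same template can be expanded into any number of concrete clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroupTemplate:
    """Hypothetical node-level template: one kind of machine."""
    name: str
    flavor: str        # illustrative instance size label
    processes: tuple   # services that run on this kind of node


@dataclass(frozen=True)
class ClusterTemplate:
    """Hypothetical cluster-level template: node templates plus counts."""
    name: str
    plugin: str        # cluster type, e.g. "hadoop" or "spark"
    node_groups: tuple  # pairs of (NodeGroupTemplate, count)


def expand(template, cluster_name):
    """Turn a cluster template into a list of concrete instance specs."""
    instances = []
    for group, count in template.node_groups:
        for index in range(count):
            instances.append({
                "hostname": f"{cluster_name}-{group.name}-{index}",
                "flavor": group.flavor,
                "processes": list(group.processes),
            })
    return instances


if __name__ == "__main__":
    master = NodeGroupTemplate("master", "m1.large",
                               ("namenode", "resourcemanager"))
    worker = NodeGroupTemplate("worker", "m1.medium",
                               ("datanode", "nodemanager"))
    small = ClusterTemplate("hadoop-small", "hadoop",
                            ((master, 1), (worker, 3)))
    for spec in expand(small, "demo"):
        print(spec)
```

Keeping the node description separate from the cluster layout is what saves time: changing the worker count or swapping the plugin touches one template, and every cluster created from it afterwards picks up the change.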
Organizations that use and benefit from this Big Data Platform as a Service model on OpenStack span a wide range of business domains, including insurance, health care, technology, logistics, media, advertising, education, manufacturing and government departments. The integrated Big Data analysis and insights are used to deliver better products and services and to enhance customer satisfaction and experience, which in turn produces higher sales and larger profits.