
Thursday, March 31, 2016

Big Data Open Source Stack vs Traditional Stack

Introduction
IT and business executives ask us this question all the time. Not "what is big data and how can it be useful for me?" (we get that one too), but rather: in simple terms, how does big data compare to my traditional BI architecture stack?
As you will see below, we will compare the most prominent open source initiatives in big data with the traditional stack, as well as give you a glimpse of the value proposition brought by the open source initiatives in each layer. We will also briefly explain each of the initiatives and address some of the most commonly asked questions.
In Part II, we will map each of the open source stack layers to commercial vendors and explain what they bring to the table. Our aim is that this will help you connect the dots between open source and the marketing emails you are getting flooded with!
Obviously there are further technicalities involved when you consider implementation, but those are beyond the scope of this post.
Open Source Big Data Stack vs. Traditional BI/Analytics Stack
The diagram below compares the different layers of an Analytics Stack, lists the Open Source Initiatives that fit into each layer, and shows the value you gain at each layer with the Open Source Stack.


Ok, have you spent at least 2 minutes looking at the above picture? If not, please go back and look at it again! Let us take this from the bottom…
·         Infrastructure: When a node stops working, the system redirects work to another node that holds a copy of the same data, and processing continues without missing a beat.
·         Data Storage: All of the open source stack initiatives use file-based storage. Come to think of it, traditional RDBMSs are file based too, but you interact with a traditional RDBMS using SQL. In Hadoop, on the other hand, you interact with files directly. What does this mean? It means you move files into Hadoop (more precisely, into HDFS – the Hadoop Distributed File System) and you write code that reads those files and writes to another set of files (speaking simplistically). Yes, there are also ways to manage some of this interaction using SQL-like languages, as we will see below.
·         Data Ingestion and Processing: Just as in traditional BI, you extract data from a source, load it, and apply logic to it; the same process applies to the Hadoop ecosystem. The primary mechanism for doing this is MapReduce, a programming framework that lets you code your input (ingestion), logic, and output (storage). Since source data may exist in different formats (just as in traditional BI, where you may extract data from SFDC, Oracle, etc.), there are mechanisms available to simplify such tasks. Sqoop lets you move data from an RDBMS into Hadoop, while Flume lets you read streaming data (e.g. Twitter feeds) and pass it to Hadoop. Just as your ETL process may consist of multiple steps orchestrated by a native or third-party workflow engine, an Oozie workflow lets you group multiple MapReduce jobs into a functional workflow and manage dependencies and exceptions. Another alternative to hand-written MapReduce is Pig Latin, which provides a framework to LOAD, TRANSFORM, and DUMP/STORE data (internally this is compiled into a set of MapReduce jobs).

Hold on! There is a newer top-level initiative called Apache Spark. Spark is an in-memory data processing framework that enables logic processing up to 100x faster than MapReduce (as if MapReduce was not fast enough!). Spark currently ships with well-defined libraries for SQL (Spark SQL, which grew out of Shark), Spark Streaming, and MLlib for machine learning. A minimal sketch of what a Spark job looks like follows.
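To make that concrete, here is a minimal sketch of a Spark job written against Spark's Java API. It assumes a log file already sits in HDFS at a hypothetical path and simply counts the lines containing "ERROR"; the point is that the filtered data set can be cached in memory and re-queried without re-reading the disk.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ErrorCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ErrorCount");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // Hypothetical file already loaded into HDFS
            JavaRDD<String> lines = sc.textFile("hdfs:///data/logs/app.log");
            // Cache the filtered RDD in memory so repeated queries skip the disk
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();
            System.out.println("Error lines: " + errors.count());
            sc.stop();
        }
    }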
·         Data Access and Data Management: Once you have the data within Hadoop, you need a mechanism to access it, right? That is where Hive and Spark come in. Hive allows people who know SQL to interact with Hadoop using a very SQL-like syntax. Under the covers, Hive queries are converted to MapReduce jobs (no surprise there!).
·         Reporting, Dashboards and Visualization: Pretty much all vendor tools can be used to create and run reports and dashboards off Hadoop. Wait, there is a caveat here. All of these tools (except those purpose-built for Hadoop) use Hive and its associated drivers to run their queries. Since Hive converts the queries into MapReduce and runs them as jobs, the time it takes to return results may be longer than running the same query on a traditional RDBMS. Yes, if you have Spark, the queries will be much faster, for sure. Additionally, Hive does not support all ANSI SQL features as of now, so you may need to consider workarounds (e.g. pre-processing some of the logic) to make sure you can get a reasonable response time.

But here is the good news. We believe the current limitations of Hive will go away very soon (maybe even in months) as the fantastic open source community continues to work on Spark to enable richer, faster SQL interaction. For the curious, a rough sketch of how a reporting tool talks to Hive follows.
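This is only a sketch, assuming a HiveServer2 instance is reachable at a placeholder host and port and that a hypothetical sales table exists; it shows the JDBC-style interaction that most BI tools perform under the covers.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder HiveServer2 host, port and database
            String url = "jdbc:hive2://hiveserver-host:10000/default";
            try (Connection con = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = con.createStatement();
                 // Hive compiles this into MapReduce (or Spark) jobs behind the scenes
                 ResultSet rs = stmt.executeQuery(
                         "SELECT region, COUNT(*) FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " : " + rs.getLong(2));
                }
            }
        }
    }

Response time here is bounded by the underlying execution engine, which is exactly why the faster Spark-based engines mentioned above matter for reporting workloads.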
·         Advanced Analytics: This is the area where open source beats everyone hands down. Advanced analytics is the combination of predictive analysis, algorithms and correlations you need to understand patterns and drive decisions based on the suggested output. Open source initiatives like Mahout (part of the Apache projects) and R can run natively on Hadoop. What does this mean? It means predictive algorithms that used to take days to run can now be run in minutes, and re-run as many times as you want (because these types of analysis require multiple iterations to reach an acceptable level of accuracy and probability).
Now we come to addressing some of the most common questions we get:
·         Is the open source big data stack ready for prime time? The answer is yes. From a technology and commercial vendor-support standpoint, it is mature enough to be deployed in enterprises. This is based on our own experience of implementing the stack at multiple clients and observing the stability of the platform over a period of time.
·         Is this solution applicable to me? This is a tough question to answer, but we will make it simple – if you have less than 10TB of data for analysis/BI and/or see no major value in analyzing data from social feeds like Facebook or Twitter, you should wait. But also ask yourself: is your data size under 10TB because you cannot keep all the data the business wants due to cost? Could the business drive more value if historical data were made available? And if you made that data available, would it exceed 10TB?
·         Can I basically move my Data Warehouse to Hadoop? We wish we could give you a black and white answer. Unfortunately, this answer comes in fifty shades of grey, so to speak! First of all: not yet, at least not completely (refer to the section on Reporting, Dashboards and Visualization above). Secondly, it depends on the pain points and future vision you are addressing. There are certainly components of this solution that can address technology pains you may be experiencing with a traditional BI environment.

However, the overall benefit of this comes when you use all layers of the stack and transform the technology and business to make use of the capabilities brought forth by this architecture.
·         Are there enough people who know this stuff? Unfortunately, no. There are many companies who talk about this, but only a few who have gotten their hands dirty. However, based on our experience and assessment, the pool of people who know the open source stack is growing. Additionally, the opex required to implement and manage a Hadoop environment looks significantly lower once you add in the capex savings realized with this solution.
·         My existing platform, storage and BI vendors claim they work on Hadoop. So can't I just go with that? The grey becomes even greyer here. The big data hype has led many vendors to claim compatibility. Some claims are accurate and some are not, because each vendor has its own definition of compatibility. One simple question you can ask is: "Does your offering run natively on Hadoop, i.e. does it make use of the distributed, parallel processing nodes enabled by Hadoop, or do you basically offer a way to read from/write to Hadoop?"


Wednesday, February 17, 2016

An Introduction to Hadoop

Hadoop: Introduction
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer. The Apache Hadoop framework is composed of the following modules:
·         Hadoop Common - contains the libraries and utilities needed by other Hadoop modules.
·         Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
·         Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
·         Hadoop MapReduce - a programming model for large scale data processing.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.





Hadoop Key Characteristics
·         Reliable: When a node stops working, the system redirects work to another node that holds a copy of the same data, and processing continues without missing a beat.
·         Economical: Hadoop brings massively parallel computing to commodity servers (systems with average configurations). The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
·         Scalable: New nodes can be added whenever the need arises, without requiring any change to the data formats, how data is loaded, how jobs are written, or the applications on top.
·         Flexible: Hadoop is schema-less and can absorb any type of data, structured or unstructured, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Difference between Hadoop and RDBMS
Difference on | RDBMS | Hadoop
Data Types | Structured | Multi and Unstructured
Processing | Limited | Processing coupled with data
Schema | Required on Write | Required on Read
Speed | Reads are fast | Writes are fast
Cost | Software License | Support only
Resources | Known entity | Growing, Complex, Wide
Best Fit Use | Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing Unstructured Data, Massive Storage/Processing
Hadoop Ecosystem
·         Apache Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
·         Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
·         Pig: A data flow language (Pig Latin) and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
·         Mahout: Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.




·         MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
·         HBase: A distributed, column-oriented non-relational database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads); a short example follows this list.
·         HDFS: A distributed filesystem that runs on large clusters of commodity machines.
·         Flume: Flume is a framework for populating Hadoop with data. Agents are placed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
·         Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
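As promised above, here is a rough sketch of an HBase point query using the HBase Java client API. The table name, row key and column family are hypothetical, and the code assumes an HBase cluster whose configuration (hbase-site.xml) is available on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePointQuery {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                // Write one cell: row key "row-001", column family "d", qualifier "status"
                Put put = new Put(Bytes.toBytes("row-001"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("ok"));
                table.put(put);
                // Random read (point query) by row key
                Result r = table.get(new Get(Bytes.toBytes("row-001")));
                System.out.println(Bytes.toString(
                        r.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
            }
        }
    }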
Hadoop Core Components
·         Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
·         Distributed across “nodes”
·         Natively redundant
·         NameNode tracks locations.
·         MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
·         Splits a task across processors "near" the data and assembles results
·         Self-healing, high bandwidth, clustered storage



Hadoop Distributed File System (HDFS) Architecture:
HDFS is a fault tolerant and self-healing distributed file system designed to turn a cluster of industry standard servers into a massively scalable pool of storage. Developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high bandwidth streaming, and scales to proven deployments of 100PB and beyond.
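As a quick illustration of what "interacting with HDFS" looks like in practice, here is a minimal sketch using the HDFS Java FileSystem API. The NameNode address and file paths are placeholders; in a real deployment the address would normally come from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; normally picked up from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a local file into HDFS (hypothetical paths)
                fs.copyFromLocalFile(new Path("/tmp/events.csv"),
                                     new Path("/data/raw/events.csv"));
                // Read it back as a stream of lines
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(new Path("/data/raw/events.csv"))))) {
                    System.out.println("First line: " + reader.readLine());
                }
            }
        }
    }

Everything higher in the stack (Hive, Pig, MapReduce jobs) ultimately reads and writes files through this same interface.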
Main HDFS features:
·         Scale-Out Architecture: Servers can be easily added to increase capacity.
·         Fault Tolerance: Automatically and seamlessly recovers from failures.
·         High Availability: Serves mission-critical workflows and applications.
·         Load Balancing: Places data intelligently, keeping the load balanced for maximum efficiency and utilization.
·         Flexible Access: Multiple and open frameworks for serialization and file system mounts.
·         Tunable Replication: Multiple copies (by default 3) of each file provide data protection and computational performance.
·         Security: POSIX-based file permissions for users and groups with optional LDAP integration
Assumptions and Goals
·         Hardware Failure: Hadoop runs on clusters of commodity hardware, which is low cost and not particularly reliable. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there is a huge number of components, and that each component has a non-trivial probability of failure, means that some component of HDFS is always non-functional. Thus, detection of these faults and quick, automatic recovery from them is an important architectural goal of HDFS.
·         Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. Thus, to increase data throughput rates, some POSIX requirements are relaxed in a few key areas.
·         Large Data Sets: Applications with large data sets (files) run on HDFS. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should be capable of supporting tens of millions of files in a single instance.
·         Coherency Model: HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. This is true of Hadoop 1.x versions; Hadoop 2.x versions support appending writes to files.
·         "Moving Computation is Cheaper than Moving Data": A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the data set is huge. This approach minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
·         Portability across Heterogeneous Hardware and Software Platforms: HDFS is designed to be easily portable from one platform to another.




Main Components of HDFS
·         NameNode: The NameNode is the centerpiece (master) of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster each file's data is kept – in other words, it stores the metadata of the cluster. It does not store the data of these files itself. It is a master server that manages the file system namespace and regulates access to files by clients. Whenever client applications want to locate a file, or want to add/copy/move/delete a file, they talk to the NameNode. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
·         In Hadoop 1.x the NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. (Hadoop 2.x adds NameNode High Availability options to address this.)
·         DataNode: A DataNode stores data in the Hadoop File System. DataNodes are slaves which are deployed on each machine and provide the actual storage. A functional filesystem has more than one DataNode, with data replicated across them. DataNodes are responsible for serving read and write requests for the clients. DataNode instances can talk to each other, which is what they do when they are replicating data. They can also perform block creation, deletion, and replication upon instruction from the NameNode.
·         Relationship between NameNode and DataNodes: The NameNode and DataNode are pieces of software designed to run on commodity machines across heterogeneous operating systems. HDFS is built using the Java programming language; therefore, any machine that supports the Java programming language can run HDFS. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software.
·         The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
·         Secondary NameNode: The secondary namenode regularly connects to the primary namenode and builds snapshots (checkpoints) of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions; the edit log is then applied to bring the directory structure up to date. Despite its name, the secondary namenode is not a hot standby for the primary/master NameNode.
·         File System Namespace: HDFS supports a traditional hierarchical file organization in which a user or an application can create directories and store files inside them. The file system namespace hierarchy is similar to most other existing file systems; you can create, rename, relocate, and remove files. The NameNode maintains the file system namespace. NameNode records any change to the file system namespace or its properties. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
·         Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are of the same size. The blocks of a file are replicated for fault tolerance. We can configure the block size and replication factor per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions in regards to the replication of blocks. Each DataNode in the cluster periodically sends a Heartbeat and Blockreport to the NameNode. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
·         Replica Placement: HDFS uses an intelligent replica placement model for reliability and performance. Optimizing replica placement makes HDFS unique among most other distributed file systems, and is facilitated by a rack-aware replica placement policy that uses network bandwidth efficiently. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. Large HDFS environments typically span multiple racks of computers, and communication between two data nodes on different racks is typically slower than between data nodes on the same rack. Therefore, the name node attempts to optimize communications between data nodes; it identifies the location of data nodes by their rack IDs. A simple but non-optimal policy would be to place every replica on a unique rack. Instead, when the replication factor is 3, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. When a read request arrives, HDFS tries to serve it from the replica that is closest to the reader in order to minimize global bandwidth consumption and read latency. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request.



·         File System Metadata: EditLog is a transaction log used by NameNode to persistently record every change that occurs to file system metadata. For example, when a new file is created in HDFS, the NameNode inserts a record into the EditLog indicating the creation. In the same manner, if replication factor of a file is changed, a new record is inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too. A name node supports multiple copies of the FsImage and EditLog files. With multiple copies of these files in place, any change to either file propagates synchronously to all of the copies. When a name node restarts, it uses the latest consistent version of FsImage and EditLog to initialize itself.
MapReduce
MapReduce is a software framework for easily writing applications which process multi-terabyte data sets in parallel on thousands of nodes of low-cost commodity hardware in a reliable, fault-tolerant manner. A MapReduce job consists of two kinds of tasks: Map tasks and Reduce tasks. A MapReduce job usually splits the input data set into independent chunks, which are then processed by the map tasks in a completely parallel manner. The outputs of the maps are then sorted and given as input to the reduce tasks. The framework takes care of scheduling and monitoring these tasks, and re-executes any failed tasks.
With MapReduce and Hadoop, instead of moving data to the compute location, the computation happens at the location of the data; storage and processing coexist on the same physical nodes in the cluster, which results in very high aggregate bandwidth across the cluster. The MapReduce framework consists of:
·         A single master JobTracker and
·         one slave TaskTracker per cluster-node.
The responsibilities of the master include scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves execute the tasks as directed by the master.



Logical View of MapReduce:
·         The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(k1, v1) → list(k2, v2)
Every key-value pair in the input dataset is processed by the Map function which produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
·         After that, the Reduce function is applied in parallel to each group created by the Map function, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are then collected as the desired result list.
Therefore, MapReduce transforms a list of (key, value) pairs into a list of values. A concrete word-count example, written against the Hadoop Java API, follows the list of advantages below.



The two biggest advantages of MapReduce are:
·         Taking processing to the data.
·         Processing data in parallel.
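Here is the concrete example promised above – the classic word count, sketched against the standard Hadoop Java MapReduce API. The map step emits a (word, 1) pair for every token, and the reduce step sums the counts per word; the input and output HDFS paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: (offset, line) -> list of (word, 1)
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: (word, list of counts) -> (word, total)
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner runs the same logic map-side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, a job like this would typically be submitted with the hadoop jar command, and the results are written back into HDFS as files.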
Hadoop Configuration Files
The main files for configuring Hadoop are:
·         hadoop-env.sh: This file contains the environment variables that are used in the scripts to run Hadoop.
·         core-site.xml: It contains the configuration settings for Hadoop Core such as I/O settings that are common to HDFS and MapReduce.
·         hdfs-site.xml: Configuration settings for the HDFS daemons – the namenode, the secondary namenode and the datanodes – are specified in this file.
·         mapred-site.xml: Configuration settings for the MapReduce daemons – the jobtracker and the tasktrackers – are specified in this file.
·         masters: It contains a list of machines (one per line) that each run a secondary namenode.
·         slaves: It contains a list of machines (one per line) that each run a datanode and a task-tracker.
·         hadoop-metrics.properties: All the properties for controlling how metrics are published in Hadoop can be specified in this file.
·         log4j.properties: It contains the properties for system log files, the namenode audit log and the task log for the task-tracker child process.
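For illustration only, here is roughly what a minimal core-site.xml and hdfs-site.xml might contain on a Hadoop 2.x cluster, in the XML format these files natively use; the NameNode hostname is a placeholder and the values shown are simply common defaults.

    <!-- core-site.xml: tell clients where the NameNode lives (placeholder host) -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode-host:8020</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: set the block replication factor (3 is the default) -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>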
Difference between Hadoop 1.x and Hadoop 2.2
Limitations of Hadoop 1.x:
·         Limited to roughly 4,000 nodes per cluster
·         JobTracker scalability is O(number of tasks in a cluster)
·         JobTracker is a bottleneck – it handles resource management, job scheduling and monitoring
·         Only one namespace for managing HDFS
·         Map and Reduce slots are static
·         The only kind of job that can run is MapReduce



Apache Hadoop 2.2.0 contains significant improvements over the previous stable release (hadoop-1.x). Below is a short overview of the improvements to both HDFS and MapReduce in Hadoop 2.2.
·         HDFS Federation: Multiple independent NameNodes/namespaces are used in federation to scale the name service horizontally. The NameNodes are federated, which means they are independent and do not require any coordination with each other. All the NameNodes use the DataNodes as common storage for blocks. Each DataNode registers with all the NameNodes in the cluster; DataNodes send periodic heartbeats and block reports and handle commands from the NameNodes.
·         MapReduce NextGen (YARN/MRv2): Until Hadoop 1.x, the two major functions of the JobTracker were resource management and job life-cycle management. The new architecture, introduced in hadoop-0.23, divides these tasks into separate components.
The newly introduced ResourceManager manages the global assignment of compute resources to applications, while a per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG (Directed Acyclic Graph) of such jobs. The ResourceManager, together with the NodeManager daemon that manages the user processes on each machine, forms the computation fabric. The per-application ApplicationMaster is a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
Hadoop 2.2 features:
·         Potentially up to 10,000 nodes per cluster
·         Scalability is O(cluster size)
·         Supports multiple namespaces for managing HDFS
·         Efficient cluster utilization (YARN)
·         MRv1 is backward and forward compatible
·         Any application can integrate with Hadoop