BDAHS 第2章 Getting Started with Apache Hadoop and Apache Spark

In this chapter, we will understand the basics of Hadoop and Spark, how Spark is different from MapReduce, and get started with the installation of clusters and setting up the tools needed for analytics.
在本章中,我们将理解Hadoop和Spark的基础知识,以及Spark与MapReduce的区别,并开始安装集群、配置分析所需的工具。

This chapter is divided into the following subtopics:
[TOC]

Introducing Apache Hadoop 介绍Apache Hadoop

Apache Hadoop is a software framework that enables distributed processing on large clusters with thousands of nodes and petabytes of data. Apache Hadoop clusters can be built using commodity hardware where failure rates are generally high. Hadoop is designed to handle these failures gracefully without user intervention. Also, Hadoop uses the move computation to the data approach, thereby avoiding significant network I/O. Users will be able to develop parallel applications quickly, focusing on business logic rather than doing the heavy lifting of distributing data, distributing code for parallel processing, and handling failures.

Apache Hadoop 是一个软件框架,它支持在拥有数千个节点、PB级(拍字节,1PB=1024TB=2的50次方字节)数据的大型集群上进行分布式处理。Apache Hadoop集群可以使用故障率通常较高的普通商用硬件来搭建,而Hadoop的设计目标就是在无需用户干预的情况下妥善处理这些故障。同时,Hadoop采用"将计算移动到数据"的方式,从而避免了大量的网络I/O。用户可以快速开发并行应用,把精力集中在业务逻辑上,而不必费力处理数据分发、并行处理代码的分发以及故障处理等繁重工作。

Apache Hadoop has mainly four projects: Hadoop Common, Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce.
In simple words, HDFS is used to store data, MapReduce is used to process data, and YARN is used to manage the resources (CPU and memory) of the cluster and common utilities that support Hadoop. Apache Hadoop integrates with many other projects, such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark.

Apache Hadoop 主要由四个项目组成:Hadoop Common(Hadoop基础库)、Hadoop分布式文件系统(HDFS,Hadoop Distributed File System)、另一种资源协调者(YARN,Yet Another Resource Negotiator)和MapReduce。
简单来说,HDFS用来存储数据,MapReduce用来处理数据,YARN用来管理集群的资源(CPU和内存),Hadoop Common则提供支撑Hadoop的公共工具。Apache Hadoop还可以与许多其他项目集成,如:Avro、Hive、Pig、HBase、Zookeeper以及Apache Spark。

Hadoop mainly brings the following three components to the table:
• A framework for reliable distributed data storage: HDFS
• Multiple frameworks for parallel processing of data: MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph
• A framework for cluster resource management: YARN and Slider

Hadoop主要提供了如下的部件:

  • 一个可靠的分布式数据存储架构:HDFS
  • 分布式数据处理的多种架构:MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark和 Giraph
  • 一个集群资源管理的架构:YARN和Slider

Let’s take a look at Hadoop’s adoption drivers with respect to the economy, business, and technical areas:
• Economy: Low cost per terabyte processing when compared to commercial solutions. This is because of its open source software and commodity hardware.
• Business: The ability to store and process all the data on a massive scale provides higher business value.
• Technical: The ability to store and process any Variety, Volume, Velocity, and Veracity (all four Vs) of Big Data.

下面列出Hadoop在经济、业务和技术上的优点:

  • 经济上:与商业化解决方案相比,每太字节数据的处理成本更低。这归功于开源软件和普通商用硬件。
  • 业务上:能够大规模地存储和处理全部数据,从而带来更高的业务价值。
  • 技术上:能够存储和处理大数据的任何多样性(Variety)、体量(Volume)、速度(Velocity)和真实性(Veracity)(即4个V)。

The following list provides the typical characteristics of Hadoop:
• Commodity: Hadoop can be installed using commodity hardware on-premise or on any cloud provider.
• Robust: It can handle hardware failures at the software layer without user intervention and process failures gracefully without user intervention.
• Scalable: It can commission new nodes to scale out in order to increase the capacity of the cluster.
• Simple: Developers can focus on business logic only, and not on scalability, fault tolerance, and multithreading.
• Data locality: The data size is up to petabytes whereas code size is up to kilobytes. Moving code to the node where data blocks reside provides great reduction in network I/O.

Hadoop的主要特点如下:

  • 商用硬件:Hadoop可以使用普通商用硬件部署在本地,也可以部署在任何云服务提供商上。
  • 健壮性:可以在软件层面、在无需用户干预的情况下处理硬件故障,并平滑地处理进程故障。
  • 可扩展性:可以通过添加新节点进行横向扩展,以提升集群容量。
  • 简单:开发人员只需专注于业务逻辑,而不必操心可扩展性、容错性和多线程处理。
  • 数据局部性:数据容量可达拍字节,而代码只有千字节量级。将代码移动到数据块所在的节点,可以大大减少网络I/O。

Hadoop Distributed File System Hadoop分布式文件系统

HDFS is a distributed filesystem that provides high scalability and reliability on large clusters of commodity hardware.
HDFS是一个分布式文件系统,它在由普通商用硬件构建的大型集群上提供高可扩展性和高可靠性。

HDFS files are divided into large blocks that are typically 128 MB in size and distributed across the cluster. Each block is replicated (typically three times) to handle hardware failures, and block placement is exposed by the NameNode so that computation can be moved to the data by the MapReduce framework.

HDFS文件被划分成较大的数据块(通常为128MB),并分布在整个集群上。每个数据块都会被复制存储(通常三份)以应对硬件故障;数据块的存放位置由名称节点(NameNode)对外暴露,因此MapReduce框架可以把计算移动到数据所在的位置。


In the preceding image, when storing File1, it’s divided into a single block (B1) as its size (100 MB) is less than the default block size (128 MB) and replicated on Node 1, Node 2, and Node 3. Block1 (B1) is replicated on the first node (Node 1) and then Node 1 replicates on Node 2 and Node 2 replicates on Node 3. File2 is divided into two blocks as its size (150 MB) is greater than the block size, and block2 (B2) and block3 (B3) are replicated on three nodes (B2 on Node 1, Node 3, and Node 4 and B3 on Node 1, Node 2, and Node 3). Blocks’ metadata (file name, blocks, location, date created, and size) is stored in NameNode, as shown in the preceding image. HDFS has a bigger block size to reduce the number of disk seeks needed to read the complete file.

在上图中,存储文件1(File1)时,由于其大小(100MB)小于默认数据块大小(128MB),它只被划分为一个数据块(B1),并复制到节点1、节点2和节点3上:B1先复制到节点1,然后节点1复制到节点2,节点2再复制到节点3。文件2(File2)的大小为150MB,大于默认数据块大小,因此被划分为两个数据块B2和B3,并分别复制到三个节点上(B2在节点1、节点3和节点4上;B3在节点1、节点2和节点3上)。数据块的元数据(文件名、数据块、位置、创建日期、大小)存储在名称节点中,如上图所示。HDFS采用较大的数据块,以减少读取整个文件所需的磁盘寻道次数。

The creation of a file seems like a single file to the user. However, it is stored as blocks on DataNodes and metadata is stored in NameNode. If we lose the NameNode for any reason, blocks stored on DataNodes become useless as there is no way to identify the blocks belonging to the file names. So, creating NameNode high availability and metadata backups is very important in any Hadoop cluster.

对用户来说,创建的文件看起来就是一个单一的文件。而实际上,文件以数据块的形式存储在数据节点(DataNode)上,元数据则存储在名称节点中。如果因为某种原因丢失了名称节点,存储在数据节点上的数据块就失去了意义,因为无法再确定这些数据块属于哪个文件。因此,在任何Hadoop集群中,为名称节点配置高可用并备份元数据都是非常重要的。

Features of HDFS HDFS的特性

HDFS is becoming a standard enterprise Big Data storage system because of its unlimited scalability, while providing most of the features needed for enterprise-grade Big Data applications. The following table explains the important features of HDFS:

HDFS之所以正在成为标准的企业级大数据存储系统,是因为它具有不受限制的可扩展性,同时能够提供企业级大数据应用所需的大部分特性。下表介绍了HDFS的重要特性:

| Feature | Description |
|---|---|
| High availability | Enabling high availability is done by creating a standby NameNode. |
| Data integrity | When blocks are stored on HDFS, computed checksums are stored on the DataNodes as well. Data is verified against the checksum. |
| HDFS ACLs | HDFS implements POSIX-style permissions that enable an owner and group for every file with read, write, and execute permissions. In addition to POSIX permissions, HDFS supports POSIX Access Control Lists (ACLs) to provide access for specific named users or groups. |
| Snapshots | HDFS Snapshots are read-only point-in-time copies of the HDFS filesystem, which are useful to protect datasets from user or application errors. |
| HDFS rebalancing | The HDFS rebalancing feature will rebalance the data uniformly across all DataNodes in the cluster. |
| Caching | Caching of blocks on DataNodes is used for high performance. DataNodes cache the blocks in an off-heap cache. |
| APIs | HDFS provides a native Java API, Pipes API for C++, and Streaming API for scripting languages such as Python, Perl, and others. FileSystem Shell and web browsers can be used to access data as well. Also, WebHDFS and HttpFs can be used to access data over HTTP. |
| Data encryption | HDFS will encrypt the data at rest once enabled. Data encryption and decryption happens automatically without any changes to application code. |
| Kerberos authentication | When Kerberos is enabled, every service in the Hadoop cluster being accessed will have to be authenticated using the Kerberos principal. This provides tight security to Hadoop clusters. |
| NFS access | Using this feature, HDFS can be mounted as part of the local filesystem, and users can browse, download, upload, and append data to it. |
| Metrics | Hadoop exposes many metrics that are useful in troubleshooting. Java Management Extensions (JMX) metrics can be viewed from a web UI or the command line. |
| Rack awareness | Hadoop clusters can be enabled with rack awareness. Once enabled, HDFS block placement will be done as per the rack awareness script, which provides better fault tolerance. |
| Storage policies | Storage policies are introduced in order to allow files to be stored in different storage types according to the storage policy (Hot, Cold, Warm, All_SSD, One_SSD, or Lazy_Persist). |
| WORM | Write Once and Read Many (WORM) is a feature of HDFS that does not allow updating or deleting records in place. However, records can be appended to the files. |

MapReduce

MapReduce (MR) is a framework to write analytical applications in batch mode on terabytes or petabytes of data stored on HDFS. An MR job usually processes each block (excluding replicas) of input file(s) in HDFS with the mapper tasks in a parallel manner. The MR framework sorts and shuffles the outputs of the mappers to the reduce tasks in order to produce the output. The framework takes care of computing the number of tasks needed, scheduling tasks, monitoring them, and re-executing them if they fail. The developer needs to focus only on writing the business logic, and all the heavy lifting is done by the HDFS and MR frameworks.
For example, in Figure 2.1, if an MR job is submitted for File1, one map task will be created and run on any Node 1, 2, or 3 to achieve data locality. In the case of File2, two map tasks will be created with map task 1 running on Node 1, 3, or 4, and map task 2 running on Node 1, 2, or 3, depending on resource availability. The output of the mappers will be sorted and shuffled to reducer tasks. By default, the number of reducers is one. However, the number of reducer tasks can be increased to provide parallelism at the reducer level.

MapReduce(MR)是一个框架,用于编写以批处理方式分析存储在HDFS上的TB级或PB级数据的应用。一个MR作业通常以并行的方式,用映射(mapper)任务处理HDFS中输入文件的每个数据块(不包括副本)。MR框架对映射器的输出进行排序和洗牌(shuffle),并传递给归约(reduce)任务以生成最终输出。框架负责计算所需的任务数量、调度任务、监控任务,并在任务失败时重新执行。开发者只需专注于编写业务逻辑,其余繁重的工作都由HDFS和MR框架完成。


沿用上一个例子,如图2.1所示,如果针对文件1提交一个MR作业,只会创建一个映射任务,它会运行在节点1、节点2或节点3中的任意一个上,以实现数据局部性。对于文件2,会创建两个映射任务:映射任务1运行在节点1、节点3或节点4上,映射任务2运行在节点1、节点2或节点3上,具体取决于资源的可用情况。映射器的输出会经过排序和洗牌后传递给归约任务。默认情况下,归约器只有一个,但可以增加归约任务的数量,以便在归约层面提供并行性。
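To make the Map and Reduce division concrete, here is a minimal word-count sketch in the style of Hadoop Streaming. It is a hedged illustration, not code from the book: the script name, and the idea of switching between map and reduce modes with a command-line argument, are assumptions for this example. The mapper and reducer read stdin and write tab-separated key-value pairs, which is how Hadoop Streaming wires scripts into the MR framework.

```python
#!/usr/bin/env python
"""Word count for Hadoop Streaming: run with 'map' or 'reduce' as the argument."""
import sys

def run_mapper():
    # Hadoop Streaming feeds each input split line by line on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

def run_reducer():
    # Mapper output arrives sorted by key, so identical words are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    run_reducer() if sys.argv[1:2] == ["reduce"] else run_mapper()
```

A job like this would typically be submitted through the hadoop-streaming JAR, pointing -mapper and -reducer at the two modes of the script; the exact submission command depends on the distribution in use.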

MapReduce features

MR provides you with excellent features to build Big Data applications. The following table describes MR’s key features and techniques used, such as sorting and joining:
MR提供了构建大数据应用的绝佳特性。下面的表格介绍了MR的关键特性和所用技术,例如排序和连接(join):

| Feature/technique | Description |
|---|---|
| Data locality | MR moves the computation to the data. It ships the programs to the nodes where HDFS blocks reside. This reduces the network I/O significantly. |
| APIs | Native Java API. Pipes: C++ API. Streaming: Any shell scripting such as Python and Perl. |
| Distributed cache | A distributed cache is used to cache files such as archives, jars, or any files that are needed by applications at runtime. |
| Combiner | The combiner feature is used to reduce the network traffic, or, in other words, reduce the amount of data sent from mappers to reducers over the network. |
| Custom partitioner | This controls which reducer each intermediate key and its associated values go to. A custom partitioner can be used to override the default hash partitioner. |
| Sorting | Sorting is done in the sort and shuffle phase, but there are different ways to achieve and control sorting: total sort, partial sort, and secondary sort. |
| Joining | Joining two massive datasets with the joining process is easy. If the join is performed by the mapper tasks, it is called a map-side join. If the join is performed by the reducer task, it is called a reduce-side join. Map-side joins are always preferred because they avoid sending a lot of data over the network for reducers to join. |
| Counters | The MR framework provides built-in counters that give an insight into how the MR job is performing. It allows the user to define a set of counters in the code, which are then incremented as desired in the mapper or reducer. |

MapReduce v1 versus MapReduce v2

Apache Hadoop’s MapReduce has been a core processing engine that supports the distributed processing of large-scale data workloads. MR has undergone a complete refurbishment in the Hadoop 0.23 version and now it’s called MapReduce 2.0 (MR v2) or YARN.

Apache Hadoop的MapReduce一直是支撑大规模数据工作负载分布式处理的核心处理引擎。MR在Hadoop 0.23版本中经历了一次彻底的重构,现在被称为MapReduce 2.0(MR v2)或YARN。

MapReduce v1, which is also called Classic MapReduce, has three main components:
• An API to develop MR-based applications
• A framework to execute mappers, shuffle the data, and execute reducers
• A resource management framework to schedule and monitor resources

MapReduce v1,也被称为经典MapReduce,包含三个主要的组成部分:

  • 用于开发基于MR的应用的API
  • 一个执行映射任务、对数据进行洗牌并执行归约任务的框架
  • 一个对资源进行调度和监控的资源管理框架

MapReduce v2, which is also called NextGen, moves resource management to YARN.
MapReduce v2,也被称为NextGen,将资源管理的功能转移到了YARN。


MapReduce v1 challenges

MapReduce v1 had three challenges:
• Inflexible CPU slots configured on a cluster for Map and Reduce led to the underutilization of the cluster
• Resources could not be shared with non-MR applications (for example, Impala or Spark)
• Limited scalability, only up to 4,000 nodes

The following table shows you the differences between v1 and v2:

MR v1有三个挑战:

  • 集群上为映射(Map)和归约(Reduce)配置的CPU槽位不够灵活,导致集群利用率低下
  • 资源不能被非MR应用(例如Impala或者Spark)共享
  • 有限的扩展性,最多可扩展到4000个节点

下面的表格显示了v1与v2之间的区别:

| | MR v1 | MR v2 |
|---|---|---|
| Components used | Job tracker as master and task tracker as slave | Resource manager as master and node manager as slave |
| Resource allocation | DataNodes are configured to run a fixed number of map tasks and reduce tasks | Containers are allocated as needed for any type of task |
| Resource management | One job tracker per cluster, which supports up to 4,000 nodes | One resource manager per cluster, which supports up to tens of thousands of nodes |
| Types of jobs | MR jobs only | Supports MR and other frameworks such as Spark, Impala, and Giraph |

YARN

YARN is the resource management framework that enables an enterprise to process data in multiple ways simultaneously for batch processing, interactive analytics, or real-time analytics on shared datasets. While HDFS provides scalable, fault-tolerant, and cost-efficient storage for Big Data, YARN provides resource management to clusters. Figure 2.3 shows you how multiple frameworks are typically run on top of HDFS and YARN frameworks in Hadoop 2.0. YARN is like an operating system for Hadoop, which manages the cluster resources (CPU and Memory) efficiently. Applications such as MapReduce, Spark, and others request YARN to allocate resources for their tasks. YARN allocates containers on nodes with the requested amount of RAM and virtual CPU from the total available on that node:

YARN是一个资源管理框架,它使企业能够以多种方式同时处理数据:对共享数据集进行批处理、交互式分析或实时分析。HDFS为大数据提供了可扩展、高容错、低成本的存储,而YARN则为集群提供资源管理。图2.3展示了在Hadoop 2.0中,多种框架通常如何运行在HDFS和YARN之上。YARN就像Hadoop的操作系统,高效地管理集群资源(CPU和内存)。MapReduce、Spark等应用向YARN请求为它们的任务分配资源,YARN则根据请求的内存和虚拟CPU数量,从节点的总可用资源中在该节点上分配容器:


YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker (which are part of MapReduce v1) into separate entities:
• ResourceManager
• A per-application ApplicationMaster
• A per-node slave NodeManager
• A per-application container running on NodeManager

YARN的初衷是将作业追踪器/任务追踪器(MR v1的一部分)的两大主要工作分离成如下对象(实体):

  • 资源管理器
  • 针对每个应用的应用管理器
  • 针对每个节点的从节点管理器
  • 针对每个应用有一个运行在节点管理器的容器

ResourceManager keeps track of the resource availability of the entire cluster and provides resources to applications when requested by ApplicationMaster. ApplicationMaster negotiates the resources needed by the application to run their tasks. ApplicationMaster also tracks and monitors the progress of the application. Note that this monitoring functionality was handled by TaskTrackers and JobTrackers in MR v1, which led to overloading the JobTracker.

资源管理器(ResourceManager)跟踪整个集群的资源可用情况,并在应用管理器(ApplicationMaster)请求时提供资源。应用管理器为应用协商运行任务所需的资源,同时跟踪和监控应用的进度。注意,在MR v1中,这种监控功能是由任务追踪器(TaskTracker)和作业追踪器(JobTracker)承担的,这也导致了作业追踪器的过载问题。

NodeManager is responsible for launching containers provided by ResourceManager, monitoring the resource usage on the slave nodes, and reporting to ResourceManager.
The application container is responsible for running the tasks of the application. YARN also has pluggable schedulers (Fair Scheduler and Capacity Scheduler) to control the resource assignments to different applications. Detailed steps of the YARN application life cycle are shown in Figure 2.4 with two resource requests by an application:

节点管理器负责启动资源管理器提供的容器,监管从节点的资源使用情况,并向资源管理器反馈。
应用容器负责运行应用的任务。YARN还提供了可插拔的调度器(公平调度器和容量调度器)来控制不同应用之间的资源分配。图2.4以一个应用发出两次资源请求为例,展示了YARN应用生命周期的详细步骤:


The following is our interpretation of the preceding figure:
• The client submits the MR or Spark job
• The YARN ResourceManager creates an ApplicationMaster on one NodeManager
• The ApplicationMaster negotiates the resources with the ResourceManager
• The ResourceManager provides resources, the NodeManager creates the containers, and the ApplicationMaster launches tasks (Map, Reduce, or Spark tasks) in the containers
• Once the tasks are finished, the containers and the ApplicationMaster will be terminated

下面是我们对前面图形的理解:

  • 用户提交MR或者Spark作业
  • YARN资源管理器在一个节点管理器上创建应用管理器
  • 应用管理器与资源管理器协商资源
  • 资源管理器提供资源,节点管理器创建容器,应用管理器在容器中启动任务(Map、Reduce或Spark任务)
  • 一旦任务结束,容器和应用管理器就会被终止

Let’s summarize the preceding points concerning YARN:
• MapReduce v2 is based on YARN:
  • YARN replaced the JobTracker/TaskTracker architecture of MR v1 with the ResourceManager and NodeManager
  • The ResourceManager takes care of scheduling and resource allocation
  • The ApplicationMaster schedules tasks in containers and monitors the tasks
• Why YARN?
  • Better scalability
  • Efficient resource management
  • Flexibility to run multiple frameworks
• Views from the user's perspective:
  • No significant changes: the same API, CLI, and web UIs
  • Backward-compatible with MR v1 without any changes

总结一下以上涉及到YARN的点:

  • MR v2是基于YARN的:
    • YARN使用资源管理器和节点管理器取代了MR v1的作业追踪器/任务追踪器架构
    • 资源管理器负责调度和资源分配
    • 应用管理器调度容器中的任务,并对任务进行监控
  • 为什么使用YARN?
    • 更好的可扩展性
    • 高效的资源管理
    • 可以灵活地运行多种框架
  • 从用户的角度来看:
    • 没有重大改变——同样的API、CLI和Web UI
    • 向后兼容MR v1,无需任何改动
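As a small illustration of how an application asks YARN for containers, the following PySpark snippet requests executors with a given amount of memory and virtual cores, which YARN then allocates as containers on NodeManagers. This is only a sketch, not code from the book; the application name and resource values are arbitrary.

```python
from pyspark.sql import SparkSession

# Each executor below is requested from YARN as a container with
# 2 GB of RAM and 2 virtual cores; the ApplicationMaster itself also
# runs in a container negotiated with the ResourceManager.
spark = (SparkSession.builder
         .appName("yarn-resource-demo")
         .master("yarn")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .config("spark.executor.instances", "4")
         .getOrCreate())

# The application ID can be used to find the containers in the YARN UI.
print(spark.sparkContext.applicationId)
spark.stop()
```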

Storage options on Hadoop

XML and JSON files are well-accepted industry standard formats. So, why can’t we just use XML or JSON files on Hadoop? There are many disadvantages of XML and JSON, including the following:
• Larger size of the data because of storing schema along with the data
• Does not support schema evolution
• Files cannot be split on Hadoop when compressed
• Not efficient when transferring the data over network

XML和JSON文件是被广泛接受的行业标准格式。那么,为什么不能在Hadoop上直接使用XML或JSON文件呢?因为XML和JSON有很多缺点,包括:

  • 由于模式(schema)与数据存储在一起,数据体积更大
  • 不支持模式演化(schema evolution)
  • 压缩之后文件在Hadoop上不可分割
  • 在网络上传输数据时效率不高

When storing data and building applications on Hadoop, some fundamental questions arises: What storage format is useful for my application? What compression codec is optimum for my application? Hadoop provides you with a variety of file formats built for different use cases.
Choosing the right file format and compression codec provides optimum performance for the use case that you are working on. Let’s go through the file formats and understand when to use them.

在Hadoop上存储数据和构建应用时,会遇到一些基本问题:什么样的存储格式适合我的应用?哪种压缩编解码器对我的应用来说是最优的?Hadoop提供了针对不同使用场景的多种文件格式。选择合适的文件格式和压缩编解码器,可以为你的使用场景带来最佳性能。下面我们逐一了解这些文件格式,并理解何时使用它们。


File formats - 文件格式

File formats are divided into two categories. Hadoop can store all the data regardless of what format the data is stored in. Data can be stored in its raw form using the standard file format or the special Hadoop container file format that offers benefits in specific use case scenarios, which can be split even when data is compressed. Broadly, there are two types of file formats: Standard file formats and Hadoop file formats:

Hadoop提供的文件格式可以划分为两类。首先,Hadoop可以存储所有类型的数据。在Hadoop中,数据可以存储为任意其他的格式(或者说是数据原来的格式),或者使用Hadoop提供的容器文件格式,Hadoop的容器文件格式支持对压缩文件的分割,在某些特定的情境下可以获得某些便利。总的来说,文件格式可以划分为两类:标准文件格式和Hadoop文件格式。

• Standard file formats:
  • Structured text data: CSV, TSV, XML, and JSON files
  • Unstructured text data: Log files and documents
  • Unstructured binary data: Images, videos, and audio files
• Hadoop file formats (provide splittable compression):
  • File-based structures: Sequence file
  • Serialization formats: Thrift, Protocol buffers, and Avro
  • Columnar formats: RCFile, ORCFile, and Parquet

  • 标准文件格式
    • 结构文本数据:CSV,TSV,XML和JSON 格式
    • 无格式文本数据:log文件,文档
    • 无格式二进制数据:图像、视频、音频等文件
  • Hadoop文件格式(提供可分割的压缩):
    • 基于文件的结构:序列文件(Sequence file)
    • 序列化格式:Thrift、Protocol buffers(协议缓冲)、Avro
    • 列式格式:RCFile、ORCFile、Parquet

Let's go through the Hadoop file format features and the use cases in which they can be used.

Sequence file - 序列文件

Sequence files store data as binary key-value pairs. It supports the Java language only and does not support schema evolution. It supports the splitting of files even when the data is compressed.

序列文件将数据存储为二进制的键值对。它只支持Java语言,并且不支持模式演化。即使数据被压缩,它也支持文件分割。

Let’s see a use case for the sequence file:应用场景
- Small files problem: On an average, each file occupies 600 bytes of space in memory. One million files of 100 KB need 572 MB of main memory on the NameNode. Additionally, the MR job will create one million mappers.文件小、数量多,需要的MR进程多。
- Solution: Create a sequence file with the key as the filename and value as the content of the file, as shown in the following table. Only 600 bytes of memory space is needed in NameNode and an MR job will create 762 mappers with 128 MB block size:

| Key | Value | Key | Value | Key | Value |
|---|---|---|---|---|---|
| File1.txt | File1.txt content | File2.txt | File2.txt content | FileN.txt | FileN.txt content |
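A sketch of the small-files fix described above, using PySpark (assumed available on the cluster; all paths are illustrative): pack each small file into one (filename, content) record of a sequence file, then read it back as a handful of large, splittable blocks.

```python
from pyspark import SparkContext

sc = SparkContext(appName="pack-small-files")

# wholeTextFiles returns one (path, content) pair per small file.
small_files = sc.wholeTextFiles("hdfs:///data/small_files/*")

# Store the pairs as a sequence file: key = file name, value = file content.
small_files.saveAsSequenceFile("hdfs:///data/packed_seq")

# Later jobs read the packed data instead of a million tiny files.
packed = sc.sequenceFile("hdfs:///data/packed_seq")
print(packed.count())
```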

Protocol buffers and thrift - 协议缓冲与Thrift

Protocol buffers were developed by Google and open sourced in 2008. Thrift was developed at Facebook and offers more features and language support than protocol buffers. Both of these are serialization frameworks that offer high performance while sending over the network. Avro is a specialized serialization format that is designed for Hadoop.
协议缓冲由Google开发并于2008年开源;Thrift由Facebook开发,提供比协议缓冲更多的特性和语言支持。两者都是序列化框架,在通过网络传输数据时性能很高。Avro则是专为Hadoop设计的序列化格式。
A generic usage pattern for protocol buffers and thrift is as follows:
• Use Avro on Hadoop-specific formats and use protocol buffers and thrift for non-Hadoop projects.
协议缓冲和Thrift的一般使用模式如下:
- 在Hadoop相关的场景中使用Avro,在非Hadoop项目中使用协议缓冲和Thrift。

Avro

Avro is a row-based data serialization system used for storage and for sending data over the network efficiently. Avro provides the following benefits:
Avro是基于行的数据序列化系统,用于数据存储以及在网络上高效地传输数据。Avro具有以下优点:
- Rich data structures - 支持丰富的数据结构类型
- Compact and fast binary data format - 紧密且快速的二进制数据格式
- Simple integration with any language - 能够简便地与任何程序语言进行整合
- Support for evolving schemas - 支持模式演化(schema evolution)
- Great interoperability between Hive, Tez, Impala, Pig, and Spark - 能够与Hive, Tez, Impala, Pig和Spark提供很好的交互

A use case for Avro is as follows: Avro的使用场景
- Data warehouse offloading to Hadoop: Data is offloaded to Hadoop where Extract, Transform, and Load (ETL) tasks are performed. The schema changes frequently. 将数据仓库的数据卸载到Hadoop:数据被卸载到Hadoop上执行抽取、转换和加载(ETL)任务,且模式经常变化。
- Solution: Sqoop imports data as Avro files that support schema evolution, less storage space, and faster ETL tasks. 解决方案:使用Sqoop将数据导入为Avro文件,Avro支持模式演化、占用更少的存储空间,并使ETL任务更快。
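A hedged sketch of reading and writing Avro from Spark: it assumes the external spark-avro package (com.databricks:spark-avro in the Spark 2.x era; newer Spark releases bundle an "avro" data source) is on the classpath, and all paths and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-demo").getOrCreate()

# Read Avro files produced by, for example, a Sqoop import.
orders = spark.read.format("com.databricks.spark.avro") \
              .load("hdfs:///staging/orders_avro")

# With schema evolution, newly added columns simply appear as nullable fields.
orders.printSchema()

# Write the transformed result back out as Avro.
orders.filter("amount > 0").write.format("com.databricks.spark.avro") \
      .save("hdfs:///warehouse/orders_clean_avro")
```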

Parquet

Parquet is a columnar format that skips I/O and decompression (if applicable) on columns that are not part of the query. It is generally very efficient in terms of compression on columns because column data is similar within the same column than it is in a block of rows.

Parquet是一种列式格式,对于查询不涉及的列,它会跳过相应的I/O和解压缩(如果适用)。由于同一列内的数据比一块行数据之间更加相似,按列进行压缩通常非常高效。

A use case for Parquet is as follows:Parquet应用
- BI access on Hadoop: Data marts created on Hadoop are accessed by users using Business Intelligence (BI) tools such as Tableau. User queries always need a few columns only. Query performance is poor. Hadoop上的BI访问:用户使用Tableau等商业智能(BI)工具访问在Hadoop上创建的数据集市。用户的查询通常只需要少量几个列,但查询性能较差。
- Solution: Store data in Parquet, which is a columnar format and provides high performance for BI queries.解决方案:使用Parquet来将数据存储为列格式。提高BI query的效率。
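To illustrate why a columnar layout helps BI-style queries, here is a short PySpark sketch (column names and paths are illustrative, not from the book): a wide dataset is written as Parquet, and a later query reads only the two columns it needs, so Spark skips the I/O and decompression for every other column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Wide, row-oriented input converted once into a columnar data mart.
events = spark.read.json("hdfs:///raw/events_json")
events.write.mode("overwrite").parquet("hdfs:///marts/events_parquet")

# Only 'country' and 'revenue' are physically read; other columns are skipped.
spark.read.parquet("hdfs:///marts/events_parquet") \
     .groupBy("country").sum("revenue") \
     .show()
```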

RCFile and ORCFile

Record Columnar File (RCFile) was the first columnar format for Hive that provided efficient query processing. Optimized Row Columnar (ORC) format was introduced in Hive 0.11 and offered better compressions and efficiency than the RCFile format. ORCFile has lightweight indexing that enables the skipping of irrelevant columns.
记录列式文件(Record Columnar File,RCFile)是Hive的第一种列式格式,提供了高效的查询处理。优化的行列式(Optimized Row Columnar,ORC)格式在Hive 0.11中引入,比RCFile提供更好的压缩和更高的效率。ORCFile具有轻量级索引,可以跳过不相关的列。

A use case for ORC and Parquet files is as follows:ORC和Parquet的使用
• Both ORC files and Parquet files are columnar formats and skip columns and rows (predicate pushdown) while reading data. Choose ORC or Parquet, depending on the application and integration requirements with other components of the project. A common use case for ORC will be same as the Parquet use case described earlier, exposing data to end users with BI tools.
ORC文件和Parquet文件都是列式格式,并且在读取数据时可以跳过行和列(谓词下推,predicate pushdown,即在实际读取数据之前先利用条件进行过滤)。选择ORC还是Parquet,取决于应用本身以及与项目中其他组件的集成需求。ORC的常见使用场景与前面描述的Parquet场景相同,即通过BI工具向最终用户提供数据。

Compression formats - 压缩格式

A variety of compression formats are available for Hadoop storage. However, if Hadoop storage is cheap, then why do I need to compress my data? The following list answers your question:
Hadoop存储支持多种压缩格式。不过,既然Hadoop存储很便宜,为什么还需要压缩数据呢?下面的列表回答了这个问题:
- Compressed data can speed up I/O operations - 压缩可以加速输入输出操作
- Compressed data saves storage space - 节省存储空间
- Compressed data speeds up data transfer over the network - 加快网络上的数据传输
Compression and decompression increases CPU time. Understanding these trade-offs is very important in providing optimum performance of jobs running on Hadoop.
压缩和解压会增加CPU时间。理解这些权衡,对于让运行在Hadoop上的作业获得最佳性能非常重要。

Standard compression formats - 标准压缩格式

The following table shows you the standard compression formats available on the Hadoop platform:
下表是Hadoop平台上支持的标准压缩格式:

| Compression format | Tool | Algorithm | File extension | Splittable? |
|---|---|---|---|---|
| gzip | gzip | DEFLATE | .gz | No |
| bzip2 | bzip2 | bzip2 | .bz2 | Yes |
| LZO | lzop | LZO | .lzo | Yes, if indexed |
| Snappy | N/A | Snappy | .snappy | No |

Recommended usage patterns for compression are as follows:压缩格式的使用模式
- For storage only: Use gzip (high compression ratio) - 只存储:使用高压缩比的gzip
- For ETL tasks: Use Snappy (optimum compression ratio and speed) - ETL任务:使用Snappy(压缩率和速度的最佳平衡)
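A small sketch of choosing a codec at write time in Spark (the "compression" writer option is standard; dataset contents and paths are illustrative): Snappy for data that ETL jobs keep re-reading, gzip for data that is mostly archived.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-demo").getOrCreate()
df = spark.range(0, 1000000).withColumnRenamed("id", "event_id")

# Snappy: modest ratio, fast to decompress -- a good default for ETL working data.
df.write.option("compression", "snappy").parquet("hdfs:///work/events_snappy")

# Gzip: better ratio, slower -- suits data written once and rarely read.
df.write.option("compression", "gzip").parquet("hdfs:///archive/events_gzip")
```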

Introducing Apache Spark - 介绍Apache Spark

Hadoop and MR have been around for 10 years and have proven to be the best solution to process massive data with high performance. However, MR lacked performance in iterative computing where the output between multiple MR jobs had to be written to HDFS. In a single MR job, it lacked performance because of the drawbacks of the MR framework.
Let's take a look at the history of computing trends to understand how computing paradigms have changed over the last two decades.
Hadoop和MR已经存在超过10年,并被证明是高性能处理海量数据的最佳解决方案。然而,在迭代计算中,多个MR作业之间的输出必须写入HDFS,导致MR性能不佳;即便在单个MR作业中,MR框架自身的缺陷也会影响性能。
下面我们回顾近20年来计算潮流的变化,来理解计算范式是如何演进的。
The trend has been to Reference the URI when the network was cheaper (in 1990), Replicate when storage became cheaper (in 2000), and Recompute when memory became cheaper (in 2010).
大致的趋势是:1990年前后,当网络变得便宜时,以引用URI(统一资源标识符,Uniform Resource Identifier)为主;2000年左右,存储价格下降,以复制(Replicate)为主;而到了2010年,内存成本降低,重新计算(Recompute)成为主流。

Let's understand why memory-based computing is important and how it provides significant performance benefits.
下面我们来看看基于内存的计算为何重要,以及它如何带来显著的性能提升。
Figure 2.6 indicates the data transfer rates from various mediums to the CPU. Disk to CPU is 100 MB/s, SSD to CPU is 600 MB/s, and over the network to CPU is 1 MB to 1GB/s. However, RAM to CPU transfer speed is astonishingly fast, which is 10 GB/s. So, the idea is to cache all or partial data in-memory so that higher performance can be achieved:

图2.6展示了各种介质到CPU的数据传输速率:磁盘到CPU为100 MB/s,SSD到CPU为600 MB/s,通过网络到CPU为1 MB/s到1 GB/s,而RAM到CPU的传输速度则高达10 GB/s。因此,把全部或部分数据缓存在内存中以获得更高的性能,就成了很自然的想法。

Spark history

Spark started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce and observed that MR was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas such as support for in-memory storage and efficient fault recovery. In 2011, the AMPLab started to develop higher-level components on Spark such as Shark and Spark Streaming. These components are sometimes referred to as the Berkeley Data Analytics Stack (BDAS). Spark was first open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013.
In February 2014, it became a top-level project at the Apache Software Foundation. Spark has since become one of the largest open source communities in Big Data. Now, over 250 contributors in over 50 organizations are contributing to Spark development. The user base has increased tremendously from small companies to Fortune 500 companies. Figure 2.7 shows you the history of Apache Spark:

What is Apache Spark?

Let's understand what Apache Spark is and what makes it a force to reckon with in Big Data analytics (a short code example follows the list): 下面我们来了解什么是Apache Spark,以及它为何成为大数据分析中不可忽视的力量(列表之后附有一个简短的代码示例):

  • Apache Spark is a fast enterprise-grade large-scale data processing engine, which is interoperable with Apache Hadoop. Spark是一个企业级的大规模快速数据处理引擎,而且能够与Hadoop互操作。
  • It is written in Scala, which is both an object-oriented and functional programming language that runs in a JVM. 它使用Scala编写,Scala是一种既面向对象又函数式的编程语言,运行在JVM中。
  • Spark enables applications to distribute data reliably in-memory during processing. This is the key to Spark's performance as it allows applications to avoid expensive disk access and performs computations at memory speeds. Spark能够在处理过程中把数据可靠地分布在内存中。这是Spark高性能的关键:避免了昂贵的磁盘访问,以内存速度完成计算。
  • It is suitable for iterative algorithms by having every iteration access data through memory. 由于每次迭代都通过内存访问数据,它对迭代算法提供了很好的支持。
  • Spark programs perform 100 times faster than MR in-memory or 10 times faster on disk (http://spark.apache.org/). 在内存中比MR快100倍,在磁盘上也快10倍,计算速度很快。
  • It provides native support for Java, Scala, Python, and R languages with interactive shells for Scala, Python, and R. Applications can be developed easily, and often 2 to 10 times less code is needed. 提供对Java、Scala、Python和R的原生支持,并为Scala、Python和R提供交互式shell。应用更容易开发,所需代码量通常减少2到10倍。
  • Spark powers a stack of libraries, including Spark SQL and DataFrames for interactive analytics, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time analytics. You can combine these features seamlessly in the same application. Spark支持一系列的库:用于交互式分析的Spark SQL和DataFrames、为机器学习设计的MLlib、为图计算设计的GraphX,以及用于实时分析的Spark Streaming。在同一个应用中可以把这些组件无缝地组合起来。
  • Spark runs on Hadoop, Mesos, standalone cluster managers, on-premise hardware, or in the cloud. Spark可以运行在Hadoop、Mesos、独立集群管理器、本地硬件或云上。
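As promised above, here is a taste of the API: the classic word count in PySpark. It is a sketch that assumes an input file already exists at the illustrative HDFS path; compare it with the Hadoop Streaming version earlier in the chapter to see how much boilerplate disappears.

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, count in counts.take(10):
    print(word, count)
```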

What Apache Spark is not

Hadoop provides HDFS for storage and MR for compute. However, Spark does not provide any specific storage medium. Spark is mainly a compute engine, but you can store data in-memory or on Tachyon to process it.
Hadoop提供HDFS用于存储,提供MR用于计算,而Spark并不提供任何特定的存储介质。Spark主要是一个计算引擎,你可以把数据存放在内存中,也可以存放在Tachyon上,再由Spark来处理。

tachyon(Apache开源分布式存储系统)
Tachyon是一个高性能、高容错、基于内存的开源分布式存储系统,并具有类Java的文件API、插件式的底层文件系统、兼容Hadoop MapReduce和Apache Spark等特征。Tachyon能够为集群框架(如Spark、MapReduce等)提供内存级速度的跨集群文件共享服务。Tachyon充分利用内存和文件对象之间的世代(Lineage)信息,因此速度很快,官方号称吞吐量最高比HDFS高300倍。目前,很多公司(如Pivotal、EMC、红帽等)已经在使用Tachyon,并且来自20个组织或公司(如雅虎、英特尔、红帽等)的60多个贡献者都在为其贡献代码。Tachyon是UC Berkeley数据分析栈(BDAS)的存储层,它还是Fedora操作系统自带的应用。
(来源于百度:https://baike.baidu.com/item/tachyon/19863509)
tachyon的详细介绍:http://geek.csdn.net/news/detail/51168

Spark has the ability to create distributed datasets from any file stored in the HDFS or other storage systems supported by Hadoop APIs (including your local filesystem, Amazon S3, Cassandra, Hive, HBase, Elasticsearch, and others).
Spark能够从存储在HDFS或其他Hadoop API所支持的存储系统(包括本地文件系统、Amazon S3、Cassandra、Hive、HBase、Elasticsearch等)中的任何文件创建分布式数据集。
It's important to note that Spark is not Hadoop and does not require Hadoop to run it. It simply has support for storage systems implementing Hadoop APIs. Spark supports text files, sequence files, Avro, Parquet, and any other Hadoop InputFormat.
需要注意的是,Spark并不是Hadoop,也不需要Hadoop才能运行,它只是支持实现了Hadoop API的存储系统。Spark支持文本文件、序列文件、Avro、Parquet以及任何其他的Hadoop InputFormat。
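A brief sketch of the point above: the same API reads from the local filesystem, HDFS, or S3 simply by changing the URI scheme. The S3 line assumes the appropriate Hadoop connector and credentials are configured; all paths are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="storage-demo")

local_rdd = sc.textFile("file:///tmp/sample.txt")       # local filesystem
hdfs_rdd = sc.textFile("hdfs:///data/sample.txt")        # HDFS
s3_rdd = sc.textFile("s3a://my-bucket/sample.txt")       # Amazon S3 via the s3a connector

print(local_rdd.count(), hdfs_rdd.count(), s3_rdd.count())
```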

Does Spark replace Hadoop?
Spark is designed to interoperate with Hadoop. It’s not a replacement for Hadoop, but it’s a replacement for the MR framework on Hadoop. All Hadoop processing frameworks (Sqoop, Hive, Pig, Mahout, Cascading, and Crunch) using MR as an engine now use Spark as an additional processing engine.

MapReduce issues - MR存在的问题

MR developers faced challenges with respect to performance and converting every business problem to an MR problem. Let’s understand the issues related to MR and how they are addressed in Apache Spark:
MR开发者在性能方面,以及在把每个业务问题都转换为MR问题方面,都面临挑战。下面我们来了解MR存在的问题,以及Apache Spark是如何解决这些问题的:

  • MR creates separate JVMs for every Mapper and Reducer. Launching JVMs takes a considerable amount of time.MR为每一个Mapper和Reducer都会建立独立的JVMs,在启动JVMs的时候需要占用可观的时间。
  • MR code requires a significant amount of boilerplate coding. The programmer needs to think and design every business problem in terms of Map and Reduce, which makes it a very difficult program. One MR job can rarely do a full computation. You need multiple MR jobs to finish the complete task, and need to design and keep track of optimizations at all levels. Hive and Pig solve this problem. However, they are not suitable for all use cases. MR程序需要大量的样板代码。程序员必须从映射和归约的角度思考和设计每一个业务问题,这使得编程非常困难。单个MR作业很少能完成全部计算,往往需要多个MR作业才能完成整个任务,并且需要在各个层面设计和跟踪优化。Hive和Pig可以解决这一问题,但并不适用于所有场景。
  • An MR job writes the data to disk between each job and hence is not suitable for iterative processing.将中间数据写入到硬盘(或称之为外存)中,不适合迭代计算
  • A higher level of abstraction, such as Cascading and Scalding, provides better programming of MR jobs. However, it does not provide any additional performance benefits.像 Cascading 和 Scalding 这样的更高级别的抽象,为MR工作任务的编程提供了便利,但并没有带来任何的效率上的提高。
  • MR does not provide great APIs either.没有提供好的APIs

Boilerplate code or Boilerplate
In computer programming, boilerplate code or boilerplate refers to sections of code that have to be included in many places with little or no alteration. It is often used when referring to languages that are considered verbose, i.e. the programmer must write a lot of code to do minimal jobs.
The need for boilerplate can be reduced through high-level mechanisms such as metaprogramming (which has the computer automatically write the needed boilerplate code or insert it at compile time), convention over configuration (which provides good default values, reducing the need to specify program details in every project) and model-driven engineering (which uses models and model-to-code generators, eliminating the need for boilerplate manual code).
(来自:https://en.wikipedia.org/wiki/Boilerplate_code)
MR is slow because every job in an MR job flow stores the data on disk. Multiple queries on the same dataset will read data separately and create high disk I/O, as shown in Figure 2.8:
造成MR运行缓慢的主要原因是MR工作流将数据储存在外存里。当大量的查询对相同的一个数据集进行访问的时候,会造成外存的高I/O。

Spark takes the concept of MR to the next level to store intermediate data in-memory and reuses it, as needed, multiple times. This provides high performance at memory speeds, as shown in Figure 2.8.
Spark将MR的思想进一步发展:把中间数据存储在内存中,在需要时可以多次重用。这样就能以内存速度提供高性能,如图2.8所示。
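The difference is easy to see in code. In this sketch (the log path and filter patterns are illustrative), the dataset is read from disk once, cached, and then served to two different queries from memory instead of re-reading HDFS each time.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# cache() marks the RDD to be kept in memory after it is first computed.
logs = sc.textFile("hdfs:///logs/access_log").cache()

errors = logs.filter(lambda line: " 500 " in line).count()    # first action: reads HDFS, fills the cache
warnings = logs.filter(lambda line: " 404 " in line).count()  # second action: served from memory

print(errors, warnings)
```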

If I have only one MR job, does it perform the same as Spark?
No, the performance of the Spark job is superior to the MR job because of in-memory computations and its shuffle improvements. The performance of Spark is superior to MR even when the memory cache is disabled. A new shuffle implementation (sort-based shuffle instead of hash-based shuffle), new network module (based on netty instead of using block manager to send shuffle data), and new external shuffle service make Spark perform the fastest petabyte sort (on 190 nodes with 46 TB RAM) and terabyte sort. Spark sorted 100 TB of data using 206 EC2 i2.8xlarge machines in 23 minutes. The previous world record was 72 minutes, set by a Hadoop MR cluster of 2,100 nodes. This means that Spark sorted the same data 3 times faster using 10 times fewer machines. All the sorting took place on disk (HDFS) without using Spark’s in-memory cache (https://databricks.com/blog/2014/10/10/sparkpetabyte-sort.html).

To summarize, here are the differences between MR and Spark:下面是MR与Spark的比较

| | MR | Spark |
|---|---|---|
| Ease of use | It is not easy to code and use | Easy to code and use, with compact APIs and interactive shells |
| Performance | Performance is relatively poor when compared with Spark | High performance because of in-memory computing and shuffle improvements |
| Iterative processing | Every MR job writes the data to disk and the next iteration reads from the disk | Intermediate data can be cached in-memory and reused across iterations |
| Fault tolerance | It's achieved by replicating the data in HDFS | It's achieved by recomputing lost partitions using RDD lineage |
| Runtime architecture | Every Mapper and Reducer runs in a separate JVM | Tasks run as threads within long-running executor JVMs |
| Shuffle | Stores data on disk | Improved sort-based shuffle |
| Operations | Map and Reduce | Map, Reduce, and many other transformations and actions |
| Execution model | Batch only | Batch, interactive, and streaming |
| Natively supported programming languages | Java only | Java, Scala, Python, and R |

Spark's stack - Spark栈

Spark’s stack components are Spark Core, Spark SQL, Datasets and DataFrames, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR as shown in Figure 2.9:
Spark栈的组件包括Spark Core、Spark SQL、Datasets和DataFrames、Spark Streaming、Structured Streaming、MLlib、GraphX和SparkR,如图2.9所示:

Here is a comparison of Spark components with Hadoop Ecosystem components:
下面是Spark组件与Hadoop组件之间的比较

| Spark | Hadoop Ecosystem |
|---|---|
| Spark Core | MapReduce, Apache Tez |
| Spark SQL, Datasets and DataFrames | Apache Hive, Apache Impala, Apache Tez, Apache Drill |
| Spark Streaming, Structured Streaming | Apache Storm, Apache Storm Trident, Apache Flink, Apache Apex, Apache Samza |
| Spark MLlib | Apache Mahout |
| Spark GraphX | Apache Giraph |
| SparkR | RMR2, RHive |

To understand the Spark framework at a higher level, let’s take a look at these core components of Spark and their integrations:
下面介绍Spark组件和组件之间的交互

• Programming languages: Java, Scala, Python, and R, with Scala, Python, and R shells for quick development.
• Core execution engine:
  • Spark Core: Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides Java, Scala, Python, and R APIs for the ease of development.
  • Tungsten: It provides Memory Management and Binary Processing, Cache-aware Computation, and Code generation.
• Frameworks:
  • Spark SQL, Datasets, and DataFrames: Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called Datasets and DataFrames and can also act as a distributed SQL query engine.
  • Spark Streaming: Spark Streaming enables us to build scalable and fault-tolerant streaming applications. It integrates with a wide variety of data sources, including File Systems, HDFS, Flume, Kafka, and Twitter.
  • Structured Streaming: Structured Streaming is a new paradigm shift in streaming computing that enables building continuous applications with an end-to-end exactly-once guarantee and data consistency, even in the case of node delays and failures.
  • MLlib: MLlib is a machine learning library used to create data products or extract deep meaning from the data. MLlib provides high performance because of in-memory caching of data.
  • GraphX: GraphX is a graph computation engine with graph algorithms to build graph applications.
  • SparkR: SparkR overcomes R's single-threaded process issues and memory limitations with Spark's distributed in-memory processing engine. SparkR provides a distributed DataFrame based on the DataFrame API and Distributed Machine Learning using MLlib.
• Off-heap storage:
  • Tachyon: Reliable data sharing at memory speed within and across cluster frameworks/jobs. Spark's default OFF_HEAP (experimental) storage is Tachyon.
• Cluster resource managers:
  • Standalone: By default, applications are submitted to the standalone mode cluster and each application will try to use all the available nodes and resources.
  • YARN: YARN controls the resource allocation and provides dynamic resource allocation capabilities.
  • Mesos: Mesos has two modes, coarse-grained and fine-grained. The coarse-grained approach has a static number of resources just like the standalone resource manager. The fine-grained approach has dynamic resource allocation just like YARN.
• Storage: HDFS, S3, and other filesystems with the support of Hadoop InputFormat.
• Database integrations: HBase, Cassandra, MongoDB, Neo4J, and RDBMS databases.
• Integrations with streaming sources: Flume, Kafka and Kinesis, Twitter, ZeroMQ, and File Streams.
• Packages: http://spark-packages.org/ provides a list of third-party data source APIs and packages.
• Distributions: Distributions from Cloudera, Hortonworks, MapR, and DataStax.
• Notebooks: Jupyter and Apache Zeppelin.
• Dataflows: Apache NiFi, Apache Beam, and StreamSets.

The Spark ecosystem is a unified stack that provides you with the power of combining SQL, streaming, and machine learning in one program. The advantages of unification are as follows:
Spark生态系统是一个统一的技术栈,能够在同一个程序中结合SQL、流处理和机器学习。它的统一性带来的优点如下:

  • No need of copying or ETL of data between systems - 不需要在系统时间进行数据的复制或者ETL
  • Combines processing types in one program - 在一个程序中集成不同的数据处理类型
  • Code reuse - 代码重复利用
  • One system to learn - 只需学习一个系统
  • One system to maintain - 只需维护一个系统
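A compact sketch of that unification (the file path, column names, and the "active" filter are illustrative; it assumes Spark 2.x with MLlib available): the same program ingests a file, filters it with SQL, and hands the result straight to a machine learning algorithm, with no export/import step between systems.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# SQL on a CSV file...
spark.read.csv("hdfs:///data/customers.csv", header=True, inferSchema=True) \
     .createOrReplaceTempView("customers")
active = spark.sql("SELECT age, income FROM customers WHERE active = 1")

# ...feeds machine learning in the same program, with no ETL between systems.
features = VectorAssembler(inputCols=["age", "income"], outputCol="features") \
               .transform(active)
model = KMeans(k=3, seed=42).fit(features)
print(model.clusterCenters())
```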

Why Hadoop plus Spark?

Apache Spark shines better when it is combined with Hadoop. To understand this, let’s take a look at Hadoop and Spark features.
Apache Spark与Hadoop结合使用时表现更佳。为了理解这一点,我们先来看看Hadoop和Spark各自的特性。

Hadoop features

• Unlimited scalability:
  • Stores unlimited data by scaling out HDFS
  • Effectively manages cluster resources with YARN
  • Runs multiple applications along with Spark
  • Thousands of simultaneous users
• Enterprise grade:
  • Provides security with Kerberos authentication and ACLs authorization
  • Data encryption
  • High reliability and integrity
  • Multi-tenancy
• Wide range of applications:
  • Files: Structured, semi-structured, and unstructured
  • Streaming sources: Flume and Kafka
  • Databases: Any RDBMS and NoSQL database

Spark features

• Easy development:
  • No boilerplate coding
  • Multiple native APIs such as Java, Scala, Python, and R
  • REPL for Scala, Python, and R
• Optimized performance:
  • Caching
  • Optimized shuffle
  • Catalyst Optimizer
• Unification: Batch, SQL, machine learning, streaming, and graph processing
• High-level APIs: DataFrames, Datasets, and Data Sources APIs

When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 2.11:
将两者的优势结合在一起,我们就可以得到商用级别(企业级别)的达到内存效率的应用。

Frequently asked questions about Spark - Spark的常见问题

The following are frequent questions that practitioners raise about Spark:

  • My dataset does not fit in-memory. How can I use Spark?
    • Spark's operators spill the data to disk if it does not fit in-memory, allowing it to run on data of any size. Likewise, cached datasets that do not fit in-memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark will recompute the partitions that don't fit in-memory. The storage level can be changed to MEMORY_AND_DISK to spill partitions to disk (see the sketch after this list). Figure 2.12 shows you the performance difference between fully cached and on disk.
  • How does fault recovery work in Spark?
    • Spark's built-in fault tolerance based on the RDD lineage will automatically recover from failures. Figure 2.13 shows you the performance over failure in the 6th iteration of a k-means algorithm.
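As referenced in the first answer above, here is a minimal sketch (the path is illustrative) of changing the storage level so that partitions which do not fit in memory spill to disk instead of being recomputed:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="storage-level-demo")

big = sc.textFile("hdfs:///data/bigger_than_memory")

# MEMORY_AND_DISK: keep what fits in memory, spill the remaining partitions to disk.
big.persist(StorageLevel.MEMORY_AND_DISK)

print(big.count())                                         # first action populates the cache
print(big.filter(lambda line: "ERROR" in line).count())    # reuses the cached partitions
```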

Installing Hadoop plus Spark clusters - 安装Hadoop和Spark集群

Before installing Hadoop and Spark, let’s understand the versions of Hadoop and Spark. Spark is offered as a service in all three popular Hadoop distributions from Cloudera, Hortonworks, and MapR. The current Hadoop and Spark versions are 2.7.2 and 2.0 respectively as of writing this book. However, Hadoop distributions might have a lower version of Spark as Hadoop and Spark release cycles do not coincide.
在安装Hadoop和Spark之前,让我们先来了解一下它们的版本。在Cloudera、Hortonworks和MapR这三个流行的Hadoop发行版中,Spark都是作为一种服务提供的。截至本书撰写时,Hadoop和Spark的当前版本分别为2.7.2和2.0。不过,由于Hadoop和Spark的发布周期并不一致,Hadoop发行版中自带的Spark版本可能较低。
For the upcoming chapters' practical exercises, let's use one of the free virtual machines (VM) from Cloudera, Hortonworks, and MapR, or use an open source version of Apache Spark. These VMs make it easy to get started with Spark and Hadoop. The same exercises can be run on bigger clusters as well.
在接下来各章的实践练习中,我们可以使用Cloudera、Hortonworks或MapR提供的免费虚拟机之一,或者使用开源版本的Apache Spark。这些虚拟机让Spark和Hadoop的入门变得更加容易。相同的练习同样可以在更大的集群上运行。
The prerequisites to use virtual machines on your laptop are as follows:

  • RAM of 8 GB and above
  • At least two virtual CPUs
  • The latest VMWare Player or Oracle VirtualBox must be installed for Windows or Linux OS
  • The latest Oracle VirtualBox or VMWare Fusion for Mac
  • Virtualization is enabled in BIOS
  • Chrome 25+, IE 9+, Safari 6+, or Firefox 18+ is recommended (HDP Sandbox will not run on IE 10)
  • Putty
  • WinSCP

The instructions to download and run Cloudera Distribution for Hadoop (CDH) are as follows:
1. Download the latest quickstart CDH VM from http://www.cloudera.com/content/www/en-us/downloads.html. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
2. Extract it to a directory (use 7-Zip or WinZip).
3. In case of VMWare Player, click on Open a Virtual Machine, and point to the directory where you have extracted the VM. Select the cloudera-quickstart-vm-5.x.x-x-vmware.vmx file and click on Open.
4. Click on Edit virtual machine settings and then increase memory to 7 GB (if your laptop has 8 GB RAM) or 8 GB (if your laptop has more than 8 GB RAM). Increase the number of processors to four. Click on OK.
5. Click on Play virtual machine.
6. Select I copied it and click on OK.
7. This should get your VM up and running.
8. Cloudera Manager is installed on the VM but is turned off by default. If you would like to use Cloudera Manager, double-click and run Launch Cloudera Manager Express to set up Cloudera Manager. This will be helpful in the starting / stopping / restarting of services on the cluster.
9. Credentials for the VM are username (cloudera) and password (cloudera).

If you would like to use the Cloudera Quickstart Docker image, follow the instructions on http://blog.cloudera.com/blog/2015/12/docker-is-the-newquickstart-option-for-apache-hadoop-and-cloudera.

The instructions to download and run Hortonworks Data Platform (HDP) Sandbox are as follows:

  1. Download the latest HDP Sandbox from http://hortonworks.com/products/hortonworks-sandbox/#install. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
  2. Follow the instructions from install guides on the same downloads page.
  3. Open the browser and enter the address as shown in sandbox, for example, http://192.168.139.158/. Click on View Advanced Options to see all the links.
  4. Access the sandbox with putty as the root user and hadoop as the initial password. You need to change the password on the first login. Also, run the ambari-admin-password-reset command to reset Ambari admin password.
  5. To start using Ambari, open the browser and enter ipaddressofsandbox:8080 with admin credentials created in the preceding step. Start the services needed in Ambari.
  6. To map the hostname to the IP address in Windows, go to C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname with a space separator. You need admin rights to do this.

The instructions to download and run MapR Sandbox are as follows:

  1. Download the latest sandbox from https://www.mapr.com/products/mapr-sandbox-hadoop/download. Download the appropriate version based on the virtualization software (VirtualBox or VMWare) installed on the laptop.
  2. Follow the instructions to set up Sandbox at http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop.
  3. Use Putty to log in to the sandbox.
  4. The root password is mapr.
  5. To launch HUE or MapR Control System (MCS), navigate to the URL provided by MapR Sandbox.
  6. To map the hostname to the IP address in Windows, go to C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname with a space separator.

The instructions to download and run Apache Spark prebuilt binaries, in case you have a preinstalled Hadoop cluster, are given here. The following instructions can also be used to install the latest version of Spark and use it on the preceding VMs:

  1. Download Spark prebuilt for Hadoop from the following location:
    wget http://apache.mirrors.tds.net/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
    tar xzvf spark-2.0.0-bin-hadoop2.7.tgz
    cd spark-2.0.0-bin-hadoop2.7

  2. Add SPARK_HOME and PATH variables to the profile script as shown in the following commands so that these environment variables will be set every time you log in:
    [cloudera@quickstart ~]$ cat /etc/profile.d/spark2.sh
    export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
    export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin

  3. Let Spark know about the Hadoop configuration directory and Java home by adding the following environment variables to spark-env.sh. Copy the template files in the conf directory:
    cp conf/spark-env.sh.template conf/spark-env.sh
    cp conf/spark-defaults.conf.template conf/spark-defaults.conf
    vi conf/spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera

  4. Copy hive-site.xml to the conf directory of Spark:
    cp /etc/hive/conf/hive-site.xml conf/

  5. Change the log level to ERROR in the spark-2.0.0-bin-hadoop2.7/conf/log4j.properties file after copying the template file.

Programming languages version requirements to run Spark:
  • Java: 7+
  • Python: 2.6+/3.1+
  • R: 3.1+
  • Scala: 2.10 for Spark 1.6 and below, 2.11 for Spark 2.0 and above
Note that the preceding virtual machines are single node clusters. If you are planning to set up multi-node clusters, follow the guidelines as per the distribution, such as CDH, HDP, or MapR. If you are planning to use a standalone cluster manager, the setup is described in the following chapter.
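Once a VM or the prebuilt binaries are in place, a quick PySpark session is an easy sanity check that Spark can reach YARN and HDFS. This is only a sketch: the HDFS path is illustrative and assumes some test file already exists there.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("install-check")
         .master("yarn")          # or "local[*]" to test without a cluster
         .getOrCreate())

print(spark.version)  # should report 2.0.0 for the setup described above
print(spark.sparkContext.textFile("hdfs:///tmp/test.txt").count())

spark.stop()
```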

Summary

Apache Hadoop provides you with a reliable and scalable framework (HDFS) for Big Data storage and a powerful cluster resource management framework (YARN) to run and manage multiple Big Data applications. Apache Spark provides in-memory performance in Big Data processing and libraries and APIs for interactive exploratory analytics, real-time analytics, machine learning, and graph analytics. While MR was the primary processing engine on top of Hadoop, it had multiple drawbacks, such as poor performance and inflexibility in designing applications. Apache Spark is a replacement for MR. All MR-based tools, such as Hive, Pig, Mahout, and Crunch, have already started offering Apache Spark as an additional execution engine apart from MR.
Nowadays, Big Data projects are being implemented in many businesses, from large Fortune 500 companies to small start-ups. Organizations gain an edge if they can go from raw data to decisions quickly with easy-to-use tools to develop applications and explore data. Apache Spark brings this speed and sophistication to Hadoop clusters. In the next chapter, let's dive deep into Spark and learn how to use it.