Spark DataFrame Overview

The DataFrame was not invented by Spark SQL; it existed much earlier in R and in Pandas. Official documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes

A Dataset is a distributed collection of data.

A DataFrame is a Dataset organized into named columns. 

In other words, it is a distributed dataset organized by columns (each column has a name, a type, and values), with every column given its own name.
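A minimal sketch of this idea in Scala, assuming a local SparkSession; the column names and sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (assumption: a local SparkSession; the data is invented).
val spark = SparkSession.builder()
  .appName("DataFrameOverview")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Each column of a DataFrame carries a name, a type, and values.
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

people.printSchema()   // prints the named, typed columns (the schema)
people.show()          // prints the distributed rows as a table
```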

It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
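A hedged sketch of these construction paths in Scala; the file path, table name, and JDBC options below are placeholders, not values from the original text:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("Sources").master("local[*]").getOrCreate()

// 1. From a structured data file (JSON here; Parquet/CSV/ORC work the same way).
val fromJson = spark.read.json("people.json")               // placeholder path

// 2. From a Hive table (requires enableHiveSupport() on the builder).
// val fromHive = spark.table("default.people")             // placeholder table

// 3. From an external database over JDBC.
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:mysql://localhost:3306/test")     // placeholder URL
//   .option("dbtable", "people")
//   .load()

// 4. From an existing RDD, paired with an explicit schema.
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val fromRDD = spark.createDataFrame(rowRDD, schema)
```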

The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows.
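In the Scala API this is literally a type alias, so a DataFrame and a Dataset[Row] are interchangeable; a small sketch (sample data invented):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("TypeAlias").master("local[*]").getOrCreate()
import spark.implicits._

// In org.apache.spark.sql, DataFrame is defined as: type DataFrame = Dataset[Row]
val df: DataFrame = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
val ds: Dataset[Row] = df           // same type, no conversion needed

val rows: Array[Row] = df.collect() // untyped results come back as Row objects
```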

Summary:

  • A distributed collection of rows organized into named columns (an RDD with a schema)
  • Conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood
  • An abstraction for selecting, filtering, aggregating and plotting structured data (see the sketch after this list)
  • Inspired by the single-machine, small-data processing experience of R and Pandas, applied to distributed big data
  • Previously known as SchemaRDD (cf. Spark < 1.3)
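
A short Scala sketch of the selection / filtering / aggregation abstraction mentioned above; the column names and data are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("Ops").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 30, "HR"), ("Bob", 25, "IT"), ("Carol", 35, "IT"))
  .toDF("name", "age", "dept")

people.select("name", "age").show()            // selection
people.filter($"age" > 26).show()              // filtering
people.groupBy("dept").agg(avg("age")).show()  // aggregation
```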
Comparison of DataFrame data-processing capabilities across platforms