Spark DataFrame Overview

The DataFrame was not invented by Spark SQL; it existed much earlier in R and in Pandas. Official documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes

A Dataset is a distributed collection of data.

A DataFrame is a Dataset organized into named columns. 

In other words, it is a distributed dataset organized by columns (each column has a name, a type, and values), with every column given its own name.
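A minimal sketch of this idea in Scala, assuming a local SparkSession; the column names and sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (assumption: a local SparkSession; the data is invented).
val spark = SparkSession.builder()
  .appName("DataFrameOverview")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Each column of a DataFrame carries a name, a type, and values.
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

people.printSchema()   // prints the named, typed columns (the schema)
people.show()          // prints the distributed rows as a table
```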

It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
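A hedged sketch of these construction paths in Scala; the file path, table name, and JDBC options below are placeholders, not values from the original text:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("Sources").master("local[*]").getOrCreate()

// 1. From a structured data file (JSON here; Parquet/CSV/ORC work the same way).
val fromJson = spark.read.json("people.json")               // placeholder path

// 2. From a Hive table (requires enableHiveSupport() on the builder).
// val fromHive = spark.table("default.people")             // placeholder table

// 3. From an external database over JDBC.
// val fromJdbc = spark.read.format("jdbc")
//   .option("url", "jdbc:mysql://localhost:3306/test")     // placeholder URL
//   .option("dbtable", "people")
//   .load()

// 4. From an existing RDD, paired with an explicit schema.
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val fromRDD = spark.createDataFrame(rowRDD, schema)
```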

The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows.
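In the Scala API this is literally a type alias, so a DataFrame and a Dataset[Row] are interchangeable; a small sketch (sample data invented):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("TypeAlias").master("local[*]").getOrCreate()
import spark.implicits._

// In org.apache.spark.sql, DataFrame is defined as: type DataFrame = Dataset[Row]
val df: DataFrame = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
val ds: Dataset[Row] = df           // same type, no conversion needed

val rows: Array[Row] = df.collect() // untyped results come back as Row objects
```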

Summary:

  • A distributed collection of rows organized into named columns (an RDD with a schema)
  • Conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood
  • An abstraction for selecting, filtering, aggregating and plotting structured data (see the sketch after this list)
  • Inspired by the single-machine, small-data processing experience of R and Pandas, applied to distributed big data
  • Previously known as SchemaRDD (cf. Spark < 1.3)
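
A short Scala sketch of the selection / filtering / aggregation abstraction mentioned above; the column names and data are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("Ops").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 30, "HR"), ("Bob", 25, "IT"), ("Carol", 35, "IT"))
  .toDF("name", "age", "dept")

people.select("name", "age").show()            // selection
people.filter($"age" > 26).show()              // filtering
people.groupBy("dept").agg(avg("age")).show()  // aggregation
```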
Comparison of DataFrame data-processing capabilities across platforms