Spark DataFrame Overview
The DataFrame is not an invention of Spark SQL; it existed earlier in R and in Pandas. Official docs: http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns: a distributed dataset structured as columns, where each column has a name, a type, and values.
It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
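Since the text notes that Spark's DataFrame is conceptually the same as a data frame in R/Python, the named-column idea can be illustrated with a minimal pandas sketch (the column names and values below are made up for illustration; Spark's DataFrame applies the same model to distributed data):

```python
import pandas as pd

# A DataFrame is conceptually a table: named, typed columns over rows.
# Column names and sample values here are purely illustrative.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],  # string column
    "age": [30, 25, 35],                # integer column
})

print(df.dtypes)  # each named column carries its own type
print(df)
```

Each column is addressable by name and carries its own type, which is exactly the structure Spark's Catalyst optimizer exploits.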
DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows.
Summary:
- A distributed collection of rows organized into named columns (an RDD with a schema)
- It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood
- An abstraction for selecting, filtering, aggregating, and plotting structured data
- Inspired by R and Pandas: single-machine, small-data processing experience applied to distributed big data
- Previously known as SchemaRDD (in Spark < 1.3)
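The selection/filtering/aggregation abstraction in the summary above can be sketched with pandas, which the text names as one of Spark's inspirations (the sample data is hypothetical; Spark's DataFrame exposes analogous operations such as select, filter, and groupBy):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "dept": ["eng", "eng", "sales"],
    "salary": [100, 120, 90],
})

# Selection: pick named columns (analogous to select in Spark).
selected = df[["dept"]]

# Filtering: keep rows matching a predicate (analogous to filter in Spark).
filtered = df[df["salary"] > 95]

# Aggregation: group by a column and aggregate (analogous to groupBy + avg in Spark).
avg_salary = df.groupby("dept")["salary"].mean()
print(avg_salary)
```

Because the operations are expressed over named columns rather than opaque objects, the same API style carries over to Spark, where the engine can additionally optimize and distribute the computation.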
Comparing DataFrame data-processing capabilities across platforms