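Get the minimum value from a nested array in a Spark Dataset
Problem description:
I am using Spark 2.2.0 with the Java API. I parse a JSON server log file and load it into a Dataset:
Dataset<Row> df = spark.read().json(args[0]);
This produces the following schema:
df.printSchema();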
root
|-- timestamp: long (nullable = true)
|-- results: struct (nullable = true)
| |-- entities: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- entity_id: string (nullable = true)
| | | |-- score: long (nullable = true)
| | | |-- is_available: boolean (nullable = true)
| |-- number_of_results: long (nullable = true)
I want the entity with the lowest score among those that are available, so that I end up with a Dataset like this:
root
|-- timestamp: long (nullable = true)
|-- results: struct (nullable = true)
| |-- entity: struct (nullable = true)
| | |-- entity_id: string (nullable = true)
| | |-- score: long (nullable = true)
| | |-- is_available: boolean (nullable = true)
How can I perform this transformation?
Answer
You can apply a user-defined function (UDF) to the array column:
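import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import scala.collection.JavaConversions;
import scala.collection.Seq;

// Define a UDF that returns the available entity with the lowest score
UDF1<Seq<Row>, Row> getElement = seq -> {
    if (seq == null) return null; // results.entities is nullable
    Row bestRow = null;
    long bestRowScore = Long.MAX_VALUE;
    for (Row r : JavaConversions.seqAsJavaList(seq)) {
        // Look fields up by name: the inferred field order may differ from the order printed above
        boolean available = r.getBoolean(r.fieldIndex("is_available"));
        long score = r.getLong(r.fieldIndex("score"));
        if (available && score < bestRowScore) {
            bestRow = r;
            bestRowScore = score;
        }
    }
    // null when the array contains no available entity
    return bestRow;
};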
// Define the return type of UDF
ArrayType arrayType = (ArrayType) df.select(df.col("results.entities")).schema().fields()[0].dataType();
DataType elementType = arrayType.elementType();
// Register UDF (spark is the SparkSession used to read the JSON above)
spark.udf().register("getElement", getElement, elementType);
// Apply UDF on dataset and name the resulting struct column
Dataset<Row> transformedDF = df.select(
    df.col("timestamp"),
    functions.callUDF("getElement", df.col("results.entities")).alias("entity"));
transformedDF.printSchema();
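To see the selection rule in isolation, here is a minimal plain-Java sketch of what the UDF computes, without Spark. The `Entity` class and the `minAvailable` method are hypothetical names used only for illustration and are not part of any Spark API; `Optional.empty()` plays the role of the `null` that the UDF returns when no entity is available.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class MinAvailableEntity {

    // Hypothetical stand-in for one element of results.entities
    static final class Entity {
        final String entityId;
        final long score;
        final boolean available;

        Entity(String entityId, long score, boolean available) {
            this.entityId = entityId;
            this.score = score;
            this.available = available;
        }
    }

    // Orders entities by ascending score
    static final Comparator<Entity> BY_SCORE = Comparator.comparingLong(e -> e.score);

    // Returns the available entity with the lowest score, or empty if none is available
    static Optional<Entity> minAvailable(List<Entity> entities) {
        return entities.stream()
                .filter(e -> e.available)
                .min(BY_SCORE);
    }

    public static void main(String[] args) {
        List<Entity> entities = Arrays.asList(
                new Entity("a", 5L, true),
                new Entity("b", 1L, false), // lowest score, but not available
                new Entity("c", 3L, true));
        // prints "c"
        System.out.println(minAvailable(entities).map(e -> e.entityId).orElse("none"));
    }
}
```

Entity "b" has the lowest score overall but is skipped because it is not available, which is the same filtering the UDF applies before comparing scores.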
Answer
You can use a window function (for example, row_number) to achieve this:
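// Explode the array so each entity is its own row, then expose it to SQL
df.selectExpr("timestamp", "explode(results.entities) AS entity").createOrReplaceTempView("df")
// This keeps the lowest-scoring available row per entity_id; to keep one entity per
// log record, partition by a per-record key such as timestamp instead
val minPerEntityDF = spark.sql(
  """SELECT *, row_number() OVER (PARTITION BY entity.entity_id ORDER BY entity.score ASC) AS rn
    |FROM df WHERE entity.is_available""".stripMargin)
  .filter("rn = 1")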
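As a rough illustration of what numbering rows within a partition by ascending score and keeping only row number 1 amounts to, the following plain-Java sketch keeps the lowest-scoring row for each partition key. `ScoredRow`, `partitionKey` and `lowestPerKey` are hypothetical names, and this is only an in-memory analogy under those assumptions, not how Spark executes the query.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstRowPerPartition {

    // Hypothetical flattened row: one entity per row after exploding the array
    static final class ScoredRow {
        final String partitionKey;
        final String entityId;
        final long score;

        ScoredRow(String partitionKey, String entityId, long score) {
            this.partitionKey = partitionKey;
            this.entityId = entityId;
            this.score = score;
        }
    }

    // For each partition key, keeps the row with the lowest score
    // (the first such row wins when scores are equal)
    static Map<String, ScoredRow> lowestPerKey(List<ScoredRow> rows) {
        Map<String, ScoredRow> best = new HashMap<>();
        for (ScoredRow row : rows) {
            best.merge(row.partitionKey, row, (kept, next) -> next.score < kept.score ? next : kept);
        }
        return best;
    }

    public static void main(String[] args) {
        List<ScoredRow> rows = Arrays.asList(
                new ScoredRow("t1", "a", 5L),
                new ScoredRow("t1", "b", 2L),
                new ScoredRow("t2", "c", 7L));
        Map<String, ScoredRow> best = lowestPerKey(rows);
        System.out.println(best.get("t1").entityId); // prints "b"
        System.out.println(best.get("t2").entityId); // prints "c"
    }
}
```

The choice of partition key decides what "minimum" means: grouping by a per-record key yields one entity per log record, while grouping by entity id yields the lowest score each entity ever received.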