将数据转换为火花scala中的类对象列表
问题描述:
我想写一个火花转换代码将下面的数据转换为以下类的对象列表,我对scala和spark完全陌生并试图分割数据并将它们放入案例班,但我无法将其追回。请求你的帮助。将数据转换为火花scala中的类对象列表
数据:
FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2
所需的输出:
PLayerStats{ String FirstName,
String LastName,
String Country,
Map <String,Int> matchandscore
}
答
第一线转换成键值对说(Cristiano, rest of data)
然后应用groupByKey
或reduceByKey
也能正常工作,然后尝试将键值对数据转换通过放置值将groupByKey或reduceByKey应用到您的类中之后。借助着名的单词计数程序。
答
你可以尝试一些如下:
val file = sc.textFile("myfile.csv")
val df = file.map(line => line.split(",")). // split line by comma
filter(lineSplit => lineSplit(0) != "FirstName"). // filter out first row
map(lineSplit => { // transform lines
(lineSplit(0), lineSplit(1), lineSplit(2), Map((lineSplit(3), lineSplit(4).toInt)))}).
toDF("FirstName", "LastName", "Country", "MatchAndScore")
df.schema
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))
df.show
+---------+--------+---------+----------------+
|FirstName|LastName| Country| MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
| Lionel| Messi|Argentina|Map(Match1 -> 1)|
| Lionel| Messi|Argentina|Map(Match2 -> 2)|
| Lionel| Messi|Argentina|Map(Match3 -> 1)|
| Lionel| Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+
答
假设你已经加载数据到一个名为data
一个RDD[String]
:
case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])
val result: RDD[PlayerStats] = data
.filter(!_.startsWith("FirstName")) // remove header
.map(_.split(",")).map { // map into case classes
case Array(fn, ln, cntry, mn, g) => PlayerStats(fn, ln, cntry, Map(mn -> g.toInt))
}
.keyBy(p => (p.FirstName, p.LastName)) // key by player
.reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore))
.map(_._2) // remove key
谢谢!! ti工作 – Bhushan
@Bhushan很高兴帮助 - 您可以接受/提出答案,让未来的读者知道这是有用的 –