将数据转换为火花scala中的类对象列表

问题描述:

我想写一个火花转换代码将下面的数据转换为以下类的对象列表,我对scala和spark完全陌生并试图分割数据并将它们放入案例班,但我无法将其追回。请求你的帮助。将数据转换为火花scala中的类对象列表

数据:

FirstName,LastName,Country,match,Goals 
Cristiano,Ronaldo,Portugal,Match1,1 
Cristiano,Ronaldo,Portugal,Match2,1 
Cristiano,Ronaldo,Portugal,Match3,0 
Cristiano,Ronaldo,Portugal,Match4,2 
Lionel,Messi,Argentina,Match1,1 
Lionel,Messi,Argentina,Match2,2 
Lionel,Messi,Argentina,Match3,1 
Lionel,Messi,Argentina,Match4,2 

所需的输出:

PLayerStats{ String FirstName, 
    String LastName, 
    String Country, 
    Map <String,Int> matchandscore 
} 

第一线转换成键值对说(Cristiano, rest of data)然后应用groupByKeyreduceByKey也能正常工作,然后尝试将键值对数据转换通过放置值将groupByKey或reduceByKey应用到您的类中之后。借助着名的单词计数程序。

http://spark.apache.org/examples.html

你可以尝试一些如下:

val file = sc.textFile("myfile.csv") 

val df = file.map(line => line.split(",")).  // split line by comma 
       filter(lineSplit => lineSplit(0) != "FirstName"). // filter out first row 
       map(lineSplit => {   // transform lines 
       (lineSplit(0), lineSplit(1), lineSplit(2), Map((lineSplit(3), lineSplit(4).toInt)))}). 
       toDF("FirstName", "LastName", "Country", "MatchAndScore")   

df.schema 
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true)) 

df.show 

+---------+--------+---------+----------------+ 
|FirstName|LastName| Country| MatchAndScore| 
+---------+--------+---------+----------------+ 
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)| 
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)| 
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)| 
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)| 
| Lionel| Messi|Argentina|Map(Match1 -> 1)| 
| Lionel| Messi|Argentina|Map(Match2 -> 2)| 
| Lionel| Messi|Argentina|Map(Match3 -> 1)| 
| Lionel| Messi|Argentina|Map(Match4 -> 2)| 
+---------+--------+---------+----------------+ 

假设你已经加载数据到一个名为data一个RDD[String]

case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int]) 

val result: RDD[PlayerStats] = data 
    .filter(!_.startsWith("FirstName")) // remove header 
    .map(_.split(",")).map { // map into case classes 
    case Array(fn, ln, cntry, mn, g) => PlayerStats(fn, ln, cntry, Map(mn -> g.toInt)) 
    } 
    .keyBy(p => (p.FirstName, p.LastName)) // key by player 
    .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore)) 
    .map(_._2) // remove key 
+0

谢谢!! ti工作 – Bhushan

+0

@Bhushan很高兴帮助 - 您可以接受/提出答案,让未来的读者知道这是有用的 –