Spark - Reading a CSV with no newline characters
Problem description:
I am parsing a CSV file that has no newline characters:
"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3"
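Is it possible to do this efficiently in Spark? (I want to end up with a dataset of 3 rows with 3 fields each.)
Answer
If I understand your question correctly, you have input data with no line separators, such as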
"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3"
and you want the output to be
+-------------+-------------+-------------+
|Column1      |Column2      |Column3      |
+-------------+-------------+-------------+
|"line1field1"|"line1field2"|"line1field3"|
|"line2field1"|"line2field2"|"line2field3"|
|"line3field1"|"line3field2"|"line3field3"|
+-------------+-------------+-------------+
The following code should help you achieve this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumes an existing SparkContext `sc` and SQLContext `sqlContext`.
val data = sc.textFile("path to the input file")
val todf = data
  // Split each physical line on commas, then regroup the fields three at a time.
  .flatMap(line => line.split(",").grouped(3))
  // Trim each field and pad a short trailing group with empty strings,
  // so that every row has exactly three fields.
  .map(group => Row.fromSeq(group.map(_.trim).padTo(3, "").toSeq))
val schema = StructType(Array(
  StructField("Column1", StringType, true),
  StructField("Column2", StringType, true),
  StructField("Column3", StringType, true)))
sqlContext.createDataFrame(todf, schema).show(false)
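The regrouping step can be checked without a cluster. The following is a minimal sketch in plain Scala, meant to be pasted into a REPL or run as a script. The helper name `regroup` and the shortened four-field sample line are illustrative assumptions, not part of the answer above.

```scala
// Hypothetical helper (not from the original answer): split one
// comma-separated line into rows of `width` trimmed fields.
def regroup(line: String, width: Int): List[List[String]] =
  line.split(",")
    .map(_.trim)                      // drop the spaces that follow each comma
    .grouped(width)                   // consecutive, non-overlapping groups
    .map(_.padTo(width, "").toList)   // pad an incomplete last group with ""
    .toList

// Shortened sample with four fields, so the second row has to be padded.
val sample = "\"line1field1\", \"line1field2\", \"line1field3\", \"line2field1\""
val rows = regroup(sample, 3)
// rows(0) holds the three line1 fields, quotes included;
// rows(1) holds "line2field1" (quoted) followed by two empty strings.
```

The surrounding double quotes are kept as part of each field, which matches the quoted values shown in the output table above.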
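I hope this answer helps.
Answer
If you want to do this in a Spark-oriented way, the following should work. I loaded the CSV file from the resources folder, but you can put your own file location in its place.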
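import sqlContext.implicits._

// Assumes an existing SparkContext `sparkContext` and SQLContext `sqlContext`.
val columnNames: Seq[String] = Seq("Col1", "Col2", "Col3")
sparkContext.textFile(this.getClass.getResource("/test.csv").toString) // your file location here
  // Split each line on commas and take non-overlapping windows of three fields.
  .flatMap(x => x.split(',').sliding(3, 3))
  .map(x => x.toList)
  // Cleanup needed here to convert to a tuple: a window that does not hold
  // exactly three fields would raise a MatchError.
  .map { case List(a, b, c) => (a, b, c) }
  .toDF(columnNames: _*)
  .show(truncate = false)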
This produces:
+-----------+-----------+-----------+
|Col1       |Col2       |Col3       |
+-----------+-----------+-----------+
|line1field1|line1field2|line1field3|
|line2field1|line2field2|line2field3|
|line3field1|line3field2|line3field3|
+-----------+-----------+-----------+
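To handle a different number of columns, change the sliding window size to match your column count. You will also need to change the tuple mapping, so this approach may not be practical for a large number of columns.
You may want to look at the List to Tuple answer to see how to map lists of unknown size to tuples.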
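One possible way to avoid the tuple mapping altogether is to build `Row` objects and derive the schema from the list of column names, so that the column count is set in a single place. The following is only a sketch under stated assumptions: it uses a local `SparkSession` rather than the `sqlContext` of the answers above, and it parallelizes an in-memory sample line instead of reading a file. The names `columnNames`, `width`, and `sample` are illustrative.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("regroup-csv").getOrCreate()

// Illustrative column names; the row width is derived from this list.
val columnNames = Seq("Col1", "Col2", "Col3")
val width = columnNames.length

// Build the schema from the names instead of hard-coding a tuple arity.
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

// In-memory stand-in for the file; swap in spark.sparkContext.textFile(...) for real input.
val sample = "line1field1,line1field2,line1field3,line2field1,line2field2,line2field3,line3field1,line3field2,line3field3"

val rows = spark.sparkContext
  .parallelize(Seq(sample))
  // Split on commas, trim, and regroup into consecutive groups of `width` fields.
  .flatMap(_.split(",").map(_.trim).grouped(width))
  // Pad an incomplete last group with empty strings so it fits the schema.
  .map(group => Row.fromSeq(group.padTo(width, "").toSeq))

val df = spark.createDataFrame(rows, schema)
df.show(truncate = false)
```

With this shape, supporting more columns only requires extending `columnNames`. Note that every column is typed as a string here, so any casting would have to happen afterwards.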