How to add header information to each row when parsing XML with Spark
Question:
I have an XML structure like this:
<root>
<bookinfo>
<time>1232314973</time>
<requestID>233</requestID>
<supplier>asd123</supplier>
</bookinfo>
<books>
<book>
<name>book1</name>
<pages>124</pages>
</book>
<book>
<name>book2</name>
<pages>456</pages>
</book>
<book>
<name>book4</name>
<pages>789</pages>
</book>
</books>
</root>
I know I can parse the books like this:
val xml = sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag", "book").load("FILENAME")
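But I would like to add header information such as supplier to every row.
Is there a way to add this "header info" to all rows with Spark, without loading the file twice and storing the information in a global variable/val?
Thanks in advance!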
Answer
You can read the whole XML starting from the "root" tag and then explode the tags you need:
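import org.apache.spark.sql.functions.{col, explode}

// Use "root" as the row tag so the whole document becomes a single row
val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")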
df.printSchema()
df.show(false)
println("-- supplier --")
val supplierDF = df.select(col("bookinfo.supplier"))
supplierDF.printSchema()
supplierDF.show(false)
println("-- books --")
val booksDF = df.select(explode(col("books.book")).alias("bookDetails"))
booksDF.printSchema()
booksDF.show(false)
println("-- bookDetails --")
val booksDetailsDF = booksDF.select(col("bookDetails.name"), col("bookDetails.pages"))
booksDetailsDF.printSchema()
booksDetailsDF.show(false)
Output:
root
|-- bookinfo: struct (nullable = true)
| |-- requestID: long (nullable = true)
| |-- supplier: string (nullable = true)
| |-- time: long (nullable = true)
|-- books: struct (nullable = true)
| |-- book: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- name: string (nullable = true)
| | | |-- pages: long (nullable = true)
+-----------------------+-----------------------------------------------------+
|bookinfo |books |
+-----------------------+-----------------------------------------------------+
|[233,asd123,1232314973]|[WrappedArray([book1,124], [book2,456], [book4,789])]|
+-----------------------+-----------------------------------------------------+
-- supplier --
root
|-- supplier: string (nullable = true)
+--------+
|supplier|
+--------+
|asd123 |
+--------+
-- books --
root
|-- bookDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- pages: long (nullable = true)
+-----------+
|bookDetails|
+-----------+
|[book1,124]|
|[book2,456]|
|[book4,789]|
+-----------+
-- bookDetails --
root
|-- name: string (nullable = true)
|-- pages: long (nullable = true)
+-----+-----+
|name |pages|
+-----+-----+
|book1|124 |
|book2|456 |
|book4|789 |
+-----+-----+
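The snippets above show supplier and the books separately. To get supplier on every book row, which is what the question asks for, both can be selected in one pass. The following is only a minimal sketch: it assumes Spark 2.x or later with a `SparkSession` and an XML data source (for example spark-xml) on the classpath, and the names `SupplierPerBook` and `withSupplier` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object SupplierPerBook {
  // Hypothetical helper: expects the single-row DataFrame read with rowTag "root".
  def withSupplier(df: DataFrame): DataFrame =
    df
      // explode yields one row per book; supplier is repeated on each of them
      .select(col("bookinfo.supplier").alias("supplier"),
              explode(col("books.book")).alias("book"))
      // flatten the book struct into plain columns
      .select(col("supplier"), col("book.name"), col("book.pages"))

  def main(args: Array[String]): Unit = {
    // Assumption: local SparkSession; the answer above uses hiveContext instead
    val spark = SparkSession.builder()
      .appName("supplier-per-book")
      .master("local[*]")
      .getOrCreate()

    // Same read as in the answer: the file is loaded only once
    val df = spark.read.format("xml").option("rowTag", "root").load("books.xml")

    withSupplier(df).show(false)
    spark.stop()
  }
}
```

Because supplier and the exploded books come from the same row, the file is read once and no global val is needed. Other header fields such as bookinfo.requestID or bookinfo.time could be added to the first select in the same way.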
Thanks, that helps. I'll upvote the answer. – kf2