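Using the length function in Spark

Problem description:

I want to use the length function inside the substring function on a DataFrame column, but it gives an error: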

val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))

Below is the error:

error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Int

I am using Spark 2.1.

Answer

The expr function can be used for this:

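import org.apache.spark.sql.functions.expr

val data = List("first", "second", "third")
val df = sparkContext.parallelize(data).toDF("value")
// expr parses a SQL expression, so length(value) can be used as the len argument
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)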

Output:

+------+------+
|value |cutted|
+------+------+
|first |firs  |
|second|secon |
|third |thir  |
+------+------+

Answer

You get this error because the signature of substring is

def substring(str: Column, pos: Int, len: Int): Column

The len argument you are passing is a Column, but it must be an Int.

You may want to implement a simple UDF to solve this problem.

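import org.apache.spark.sql.functions.udf

// Drops the last character, matching the question (substring(1) would drop the first one instead)
val dropLast = udf((str: String) => if (str == null) null else str.dropRight(1))
testDF.withColumn("newCol", dropLast($"col"))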
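
A UDF is not the only way around the Int-typed len parameter. As a minimal sketch, the Column class has a substr method with an overload that takes Column arguments for both position and length, so the start position only needs to be wrapped in lit. The object name, the dropLastChar helper, the local SparkSession, and the sample data below are assumptions made for illustration and are not part of the original thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, lit}

object SubstrWithColumnLength {
  // Hypothetical helper: adds `newcol` holding `colName` without its last character.
  // Column.substr(startPos: Column, len: Column) accepts Columns, unlike functions.substring.
  def dropLastChar(df: DataFrame, colName: String): DataFrame =
    df.withColumn("newcol", col(colName).substr(lit(1), length(col(colName)) - 1))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("substr-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for testDF from the question.
    val testDF = Seq("first", "second", "third").toDF("col")
    dropLastChar(testDF, "col").show(false)

    spark.stop()
  }
}
```

For the sample data this should give the same values as the expr version (firs, secon, thir), while keeping the expression in the typed Column API instead of a SQL string.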

Answer

If you only want to remove the last character of the string, you can also do it without a UDF, by using regexp_replace:

testDF.show
+---+----+
| id|name|
+---+----+
|  1|abcd|
|  2|qazx|
+---+----+

testDF.withColumn("newcol", regexp_replace($"name", ".$" , "")).show
+---+----+------+
| id|name|newcol|
+---+----+------+
|  1|abcd|   abc|
|  2|qazx|   qaz|
+---+----+------+
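
Spark's regexp_replace follows Java regular expression syntax, so the behaviour of the ".$" pattern can be tried out on plain Scala strings without a cluster. The sketch below is only an illustration: the object and the dropLastChar helper are hypothetical names, and it does not cover strings that contain line breaks.

```scala
object DropLastCharRegex {
  // Same pattern as in the answer: "." matches one character, "$" anchors it at the end.
  def dropLastChar(s: String): String = s.replaceAll(".$", "")

  def main(args: Array[String]): Unit = {
    // An empty string has no character for "." to match, so it is returned unchanged.
    Seq("abcd", "qazx", "a", "").foreach { s =>
      println(s"'$s' -> '${dropLastChar(s)}'")
    }
  }
}
```

Unlike substring with a computed length, the pattern needs no length arithmetic, which is why an empty input simply passes through.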
