PySpark DataFrame: changing two columns at the same time based on a condition

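Question:

I would like to know whether there is a way to change two (or more) columns of a PySpark DataFrame at the same time. Right now I am using withColumn, but I don't know whether that means the condition will be checked twice (which would be too expensive for a large DataFrame). The code basically checks the values in two other columns (of the same row) and, based on that, sets two columns to None/null.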

condition = is_special_id_udf(col("id")) & should_hide_response_udf(col("response_created"))


new_df = df.withColumn(
    "response_text",
    when(condition, None)
    .otherwise(col("response_text"))
)

# Chain on new_df (not df) so the first change is not discarded.
new_df = new_df.withColumn(
    "response_created",
    when(condition, None)
    .otherwise(col("response_created"))
)

Please share complete code and sample data. Your code is not reproducible. – mtoto

Do you really need the data? The code works fine; I just want to know whether there is a better way to do the same thing. – mfcabrera

You are creating two identical columns; is your question how to do that? – mtoto

Answer

First of all, you can simply compute the UDF result as a new column, use that column in the calculations, and then drop it:

condition = is_special_id_udf(col("id")) & should_hide_response_udf(col("response_created"))

# Store the condition in a temporary column, reuse it twice, then drop it.
new_df = df.withColumn("tmp", condition).withColumn(
    "response_text",
    when(col("tmp"), None)
    .otherwise(col("response_text"))
).withColumn(
    "response_created",
    when(col("tmp"), None)
    .otherwise(col("response_created"))
).drop("tmp")

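If you really want to generate both columns at once, you can create a struct column and flatten it (of course, you then need to add any other columns you want to keep to the select):

# The whole struct becomes null when the condition holds;
# "myStruct.*" expands it back into its two fields.
new_df = df.withColumn(
    "myStruct",
    when(condition, None)
    .otherwise(struct(col("response_text"), col("response_created")))
).select("myStruct.*")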

The second option is quite nice and makes my answer obsolete. – mtoto

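I think the first part of the answer is what I was looking for. I was wondering whether there was a way to do it without creating a column for the condition result, but this looks cleaner. – mfcabrera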
