SparkSQL: conditional sum using two columns

码头工人有信仰

I hope you can help me. I have a DataFrame like this:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
.withColumn("dateIns", to_date($"dateIns"))
.withColumn("dateTrans", to_date($"dateTrans"))
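I would like to group by prodId and sum "value" over date ranges defined by the difference between the "dateIns" and "dateTrans" columns. In particular, I am looking for a way to define a conditional sum that adds up all values whose difference between those two columns falls within a predefined maximum, i.e. all values that occurred within 10, 20 or 30 days of dateIns (measured as 'dateTrans' - 'dateIns').
Is there a predefined aggregate function in Spark that can do a conditional sum? Or would you recommend writing an aggregate UDF (and if so, any suggestions)? I am using PySpark, but I would be very happy with a Scala solution too. Thanks a lot!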

Solution:


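Let's make your data a bit more interesting, so that there are some events in the window:
import org.apache.spark.sql.functions.to_date
// $"..." needs the SQL implicits, which spark-shell imports automatically
val df = sc.parallelize(Seq(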
  (1, "a", "2014-12-30", "2015-01-01", 100),
  (2, "a", "2014-12-21", "2015-01-02", 150),
  (3, "a", "2014-12-10", "2015-01-03", 120),
  (4, "b", "2014-12-05", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
.withColumn("dateIns", to_date($"dateIns"))
.withColumn("dateTrans", to_date($"dateTrans"))
What you need is more or less this:
import org.apache.spark.sql.functions.{col, datediff, lit, sum}
// Find difference in tens of days
val diff = (datediff(col("dateTrans"), col("dateIns")) / 10)
  .cast("integer") * 10
val dfWithDiff = df.withColumn("diff", diff)
val aggregated = dfWithDiff
  // Keep only the ranges of interest: lower bounds 0, 10 and 20
  .where((col("diff") >= 0) && (col("diff") < 30))
  .groupBy(col("prodId"), col("diff"))
  .agg(sum(col("value")))
Result:
aggregated.show
// +------+----+----------+
// |prodId|diff|sum(value)|
// +------+----+----------+
// |     a|  20|       120|
// |     b|  20|       100|
// |     a|   0|       100|
// |     a|  10|       150|
// +------+----+----------+
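Here diff is the lower bound of each range (0 -> [0, 10), 10 -> [10, 20), ...). If you drop val and adjust the imports, this should work in PySpark as well.
Edit (one aggregated column per range):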
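As a sanity check on the bucketing arithmetic, here is a minimal plain-Python sketch (no Spark involved) that applies the same "round down to tens of days" rule to the four example rows and keeps the ranges 0, 10 and 20 shown in the result table. The helper name bucket and its width parameter are made up for this illustration. One assumption to note: Python's // floors, while Spark's cast to integer truncates toward zero, so the two only agree for non-negative differences, which is all this data contains.

```python
from collections import defaultdict
from datetime import date

# Same rows as the example DataFrame: (id, prodId, dateIns, dateTrans, value)
rows = [
    (1, "a", date(2014, 12, 30), date(2015, 1, 1), 100),
    (2, "a", date(2014, 12, 21), date(2015, 1, 2), 150),
    (3, "a", date(2014, 12, 10), date(2015, 1, 3), 120),
    (4, "b", date(2014, 12, 5), date(2015, 1, 1), 100),
]


def bucket(date_ins, date_trans, width=10):
    """Lower bound of the width-day range that contains the day difference."""
    days = (date_trans - date_ins).days
    return (days // width) * width


# Sum value per (prodId, diff), keeping only ranges with lower bound 0, 10, 20
totals = defaultdict(int)
for _id, prod_id, date_ins, date_trans, value in rows:
    diff = bucket(date_ins, date_trans)
    if 0 <= diff < 30:
        totals[(prod_id, diff)] += value

for key in sorted(totals):
    print(key, totals[key])
```

The printed totals should agree with the sum(value) column of the table above, although Spark does not guarantee the row order.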
import org.apache.spark.sql.functions.when
// One conditional sum per range: rows outside the range contribute 0
val exprs = Seq(0, 10, 20).map(x => sum(
  when(col("diff") === lit(x), col("value"))
    .otherwise(lit(0)))
  .alias(x.toString))
// +------+---+---+---+
// |prodId|  0| 10| 20|
// +------+---+---+---+
// |    a|100|150|120|
// |    b|  0|  0|100|
// +------+---+---+---+
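Python equivalent:
# Note: sum here is Spark's aggregate function and shadows Python's builtin
from pyspark.sql.functions import col, lit, sum, when

def make_col(x):
    # Conditional sum: rows outside range x contribute 0
    cnd = when(col("diff") == lit(x), col("value")).otherwise(lit(0))
    return sum(cnd).alias(str(x))

exprs = [make_col(x) for x in range(0, 30, 10)]
dfWithDiff.groupBy(col("prodId")).agg(*exprs).show()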
## +------+---+---+---+
## |prodId|  0| 10| 20|
## +------+---+---+---+
## |    a|100|150|120|
## |    b|  0|  0|100|
## +------+---+---+---+
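Note that the columns above cover disjoint ranges, whereas the question asks for sums "within" a maximum difference of 10, 20 or 30 days, which reads as cumulative. Presumably the same when/otherwise pattern would cover that in Spark by changing the condition to something like datediff(col("dateTrans"), col("dateIns")) <= x, but that is an assumption and not something the answer states. The plain-Python sketch below only illustrates the intended arithmetic on the example rows; conditional_sums and thresholds are hypothetical names.

```python
from datetime import date

# Example rows reduced to (prodId, dateIns, dateTrans, value)
rows = [
    ("a", date(2014, 12, 30), date(2015, 1, 1), 100),
    ("a", date(2014, 12, 21), date(2015, 1, 2), 150),
    ("a", date(2014, 12, 10), date(2015, 1, 3), 120),
    ("b", date(2014, 12, 5), date(2015, 1, 1), 100),
]


def conditional_sums(rows, thresholds):
    """Per prodId, sum every value whose day difference is at most each threshold."""
    result = {}
    for prod_id, date_ins, date_trans, value in rows:
        days = (date_trans - date_ins).days
        sums = result.setdefault(prod_id, {t: 0 for t in thresholds})
        for t in thresholds:
            # Cumulative condition: a 2-day gap counts toward 10, 20 and 30
            if 0 <= days <= t:
                sums[t] += value
    return result


cumulative = conditional_sums(rows, thresholds=[10, 20, 30])
for prod_id in sorted(cumulative):
    print(prod_id, cumulative[prod_id])
```

With cumulative thresholds each column includes everything counted by the smaller thresholds, so the values never decrease from left to right.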
