PySpark sqlContext JSON: querying all values of an array

唐增清

I have a JSON file that I am trying to query with sqlContext.sql(). It looks like this:
{
  "sample": {
    "persons": [
      {
        "id": "123"
      },
      {
        "id": "456"
      }
    ]
  }
}
If I only want the first value, I can write:
sqlContext.sql("SELECT sample.persons[0] FROM test")
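But I want all the values of "persons" without writing a loop. A loop would consume too much processing power and, given the size of these files, would be impractical.
I thought I could put a range inside the [] brackets, but I can't find any syntax for that.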

Solution:


If your schema looks like this:
root
 |-- sample: struct (nullable = true)
 |    |-- persons: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
and you want to access the individual structs in the persons array, all you need to do is explode it:
from pyspark.sql.functions import explode
df.select(explode("sample.persons").alias("person")).select("person.id")
