To change the type of a Spark DataFrame column, you can use `withColumn()` together with the `cast()` function, `selectExpr()`, or a SQL expression. Note that the target type must be a subclass of the DataType class.
In Spark, a DataFrame column can be changed (cast) to any of the following types, all of which are subclasses of DataType:
ArrayType
BinaryType
BooleanType
CalendarIntervalType
DateType
HiveStringType
MapType
NullType
NumericType
ObjectType
StringType
StructType
TimestampType
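Besides a DataType instance, `cast()` also accepts the canonical string name of a type, which saves the `org.apache.spark.sql.types` import. A small sketch, assuming a DataFrame `df` with an `age` column (both lines are equivalent):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// cast() accepts either a DataType instance or its canonical string name
val byName = df.withColumn("age", col("age").cast("string"))
val byType = df.withColumn("age", col("age").cast(StringType))
```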
Examples
1. Create a DataFrame
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Create an RDD
val simpleData = Seq(
Row("James",34,"2006-01-01","true","M",3000.60),
Row("Michael",33,"1980-01-10","true","F",3300.80),
Row("Robert",37,"1992-06-01","false","M",5000.50)
)
val simpleRDD = spark.sparkContext.parallelize(simpleData)
// Create the schema
val fields = Array(
StructField("firstName",StringType,true),
StructField("age",IntegerType,true),
StructField("jobStartDate",StringType,true),
StructField("isGraduated", StringType, true),
StructField("gender", StringType, true),
StructField("salary", DoubleType, true)
)
val simpleSchema = StructType(fields)
// Convert the RDD to a DataFrame using the specified schema
val df = spark.createDataFrame(simpleRDD,simpleSchema)
df.printSchema
df.cache
df.show(false)
The output is as follows:
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = true)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = true)
+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James |34 |2006-01-01 |true |M |3000.6|
|Michael |33 |1980-01-10 |true |F |3300.8|
|Robert |37 |1992-06-01 |false |M |5000.5|
+---------+---+------------+-----------+------+------+
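For small examples like this, the same DataFrame can also be built more compactly from a Seq of tuples with `toDF` (a sketch; the explicit schema above is then unnecessary because column types are inferred from the tuple fields):

```scala
import spark.implicits._  // required for toDF on a Seq

// Types are inferred: age becomes int, jobStartDate and isGraduated stay strings
val dfAlt = Seq(
  ("James", 34, "2006-01-01", "true", "M", 3000.60),
  ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
  ("Robert", 37, "1992-06-01", "false", "M", 5000.50)
).toDF("firstName", "age", "jobStartDate", "isGraduated", "gender", "salary")
```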
2. Change column types with withColumn and cast
Change the age column to string, the isGraduated column to boolean, and the jobStartDate column to date.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("age",col("age").cast(StringType))
.withColumn("isGraduated",col("isGraduated").cast(BooleanType))
.withColumn("jobStartDate",col("jobStartDate").cast(DateType))
df2.printSchema()
df2.show(false)
The output is as follows:
root
|-- firstName: string (nullable = true)
|-- age: string (nullable = true)
|-- jobStartDate: date (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = true)
+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James |34 |2006-01-01 |true |M |3000.6|
|Michael |33 |1980-01-10 |true |F |3300.8|
|Robert |37 |1992-06-01 |false |M |5000.5|
+---------+---+------------+-----------+------+------+
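Note that under Spark's default (non-ANSI) SQL mode, `cast` does not fail on values that cannot be converted; it produces null instead. A small sketch with hypothetical data:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// "abc" cannot be parsed as an integer, so the cast yields null for that row
val bad = Seq("34", "abc").toDF("age")
bad.withColumn("age", col("age").cast(IntegerType)).show()
```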
3. Change column types with selectExpr
Again, change the age column to string, the isGraduated column to boolean, and the jobStartDate column to date.
val df3 = df.selectExpr(
"cast(age as string) age",
"cast(isGraduated as boolean) isGraduated",
"cast(jobStartDate as date) jobStartDate")
df3.printSchema
df3.show(false)
The output is as follows:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true |2006-01-01 |
|33 |true |1980-01-10 |
|37 |false |1992-06-01 |
+---+-----------+------------+
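Unlike `withColumn`, `selectExpr` returns only the expressions you list, which is why the result above has just three columns. To keep every column, the untouched ones must be listed as well; a sketch:

```scala
// List the unchanged columns alongside the cast expressions
val df3b = df.selectExpr(
  "firstName",
  "cast(age as string) age",
  "cast(jobStartDate as date) jobStartDate",
  "cast(isGraduated as boolean) isGraduated",
  "gender",
  "salary")
```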
4. Change column types with a SQL expression
We can also change Spark DataFrame column types with a SQL expression.
df.createOrReplaceTempView("CastExample")
val df4 = spark.sql("""
SELECT STRING(age),BOOLEAN(isGraduated),DATE(jobStartDate)
from CastExample
""")
df4.printSchema
df4.show(false)
The output is as follows:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true |2006-01-01 |
|33 |true |1980-01-10 |
|37 |false |1992-06-01 |
+---+-----------+------------+
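The `STRING(...)`, `BOOLEAN(...)`, and `DATE(...)` calls above are Spark SQL shorthand; the standard SQL `CAST` syntax works as well and produces the same result:

```scala
// Equivalent query using the standard CAST(expr AS type) syntax
val df5 = spark.sql("""
  SELECT CAST(age AS STRING) AS age,
         CAST(isGraduated AS BOOLEAN) AS isGraduated,
         CAST(jobStartDate AS DATE) AS jobStartDate
  FROM CastExample
""")
```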