Spark SQL: How Do You Change the Data Type of a DataFrame Column?

To change the type of a Spark DataFrame column, you can use withColumn() together with the cast() function, selectExpr(), or a SQL expression. Note that the target type must be a subclass of the DataType class.

In Spark, a DataFrame column can be changed (cast) to any of the following types, all of which are subclasses of DataType:

ArrayType
BinaryType
BooleanType
CalendarIntervalType
DateType
HiveStringType
MapType
NullType
NumericType
ObjectType
StringType
StructType
TimestampType
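
These type objects live in the org.apache.spark.sql.types package. As a side note (a minimal sketch, not part of the original example, assuming a DataFrame df like the one built below), cast() also accepts the SQL name of the type as a plain string, which saves importing the DataType object:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// These two forms are equivalent: cast the "age" column to a string
val byDataType = df.withColumn("age", col("age").cast(StringType))
val byTypeName = df.withColumn("age", col("age").cast("string"))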

Examples

1. Create a DataFrame.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Create an RDD of Rows
val simpleData = Seq(
  Row("James", 34, "2006-01-01", "true", "M", 3000.60),
  Row("Michael", 33, "1980-01-10", "true", "F", 3300.80),
  Row("Robert", 37, "1992-06-01", "false", "M", 5000.50)
)
val simpleRDD = spark.sparkContext.parallelize(simpleData)

// Define the schema
val fields = Array(
  StructField("firstName", StringType, true),
  StructField("age", IntegerType, true),
  StructField("jobStartDate", StringType, true),
  StructField("isGraduated", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", DoubleType, true)
)
val simpleSchema = StructType(fields)

// Convert the RDD to a DataFrame using the schema
val df = spark.createDataFrame(simpleRDD, simpleSchema)
df.printSchema
df.cache
df.show(false)

The output is:

root
 |-- firstName: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- jobStartDate: string (nullable = true)
 |-- isGraduated: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James    |34 |2006-01-01  |true       |M     |3000.6|
|Michael  |33 |1980-01-10  |true       |F     |3300.8|
|Robert   |37 |1992-06-01  |false      |M     |5000.5|
+---------+---+------------+-----------+------+------+
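
As an aside (a sketch that is not part of the original article; dfFromTuples is an illustrative name), the same DataFrame can be built more concisely from a Seq of tuples with toDF, letting Spark infer the column types:

import spark.implicits._

// Column types are inferred from the tuple: Int for age, Double for salary, String for the rest
val dfFromTuples = Seq(
  ("James", 34, "2006-01-01", "true", "M", 3000.60),
  ("Michael", 33, "1980-10-01", "true", "F", 3300.80),
  ("Robert", 37, "1992-06-01", "false", "M", 5000.50)
).toDF("firstName", "age", "jobStartDate", "isGraduated", "gender", "salary")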

2. Change column types with withColumn and cast

Change the age column to String, the isGraduated column to Boolean, and the jobStartDate column to Date.

import org.apache.spark.sql.functions._

val df2 = df.withColumn("age", col("age").cast(StringType))
  .withColumn("isGraduated", col("isGraduated").cast(BooleanType))
  .withColumn("jobStartDate", col("jobStartDate").cast(DateType))
df2.printSchema()
df2.show(false)

The output is:

root
 |-- firstName: string (nullable = true)
 |-- age: string (nullable = true)
 |-- jobStartDate: date (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: double (nullable = true)

+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James    |34 |2006-01-01  |true       |M     |3000.6|
|Michael  |33 |1980-01-10  |true       |F     |3300.8|
|Robert   |37 |1992-06-01  |false      |M     |5000.5|
+---------+---+------------+-----------+------+------+
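
When several columns need the same treatment, the withColumn calls can be generated from a map of column names to target types (a minimal sketch; toCast and dfCasted are illustrative names, not from the original article):

// Map each column to the SQL name of its target type
val toCast = Map(
  "age"          -> "string",
  "isGraduated"  -> "boolean",
  "jobStartDate" -> "date")

// Apply one withColumn per entry; all other columns are left untouched
val dfCasted = toCast.foldLeft(df) { case (acc, (name, tpe)) =>
  acc.withColumn(name, col(name).cast(tpe))
}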

3. Change column types with selectExpr

Again, change the age column to String, the isGraduated column to Boolean, and the jobStartDate column to Date.

val df3 = df.selectExpr(
  "cast(age as string) age",
  "cast(isGraduated as boolean) isGraduated",
  "cast(jobStartDate as date) jobStartDate")
df3.printSchema
df3.show(false)

The output is:

root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)

+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true       |2006-01-01  |
|33 |true       |1980-01-10  |
|37 |false      |1992-06-01  |
+---+-----------+------------+
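
Note that selectExpr returns only the columns listed, which is why firstName, gender, and salary no longer appear. The same projection can also be written with select and expr() (a sketch assuming the imports from the earlier examples; df3b is an illustrative name):

import org.apache.spark.sql.functions.expr

val df3b = df.select(
  expr("cast(age as string) as age"),
  expr("cast(isGraduated as boolean) as isGraduated"),
  expr("cast(jobStartDate as date) as jobStartDate"))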

4. Change column types with a SQL expression

We can also change the type of Spark DataFrame columns with a SQL expression.

df.createOrReplaceTempView("CastExample")

val df4 = spark.sql("""
  SELECT STRING(age), BOOLEAN(isGraduated), DATE(jobStartDate)
  FROM CastExample
""")
df4.printSchema
df4.show(false)

The output is:

root
 |-- age: string (nullable = true)
 |-- isGraduated: boolean (nullable = true)
 |-- jobStartDate: date (nullable = true)

+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true       |2006-01-01  |
|33 |true       |1980-01-10  |
|37 |false      |1992-06-01  |
+---+-----------+------------+
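
The STRING(), BOOLEAN(), and DATE() functions used above are shorthand; the standard CAST syntax gives the same result (a sketch; df4b is an illustrative name):

val df4b = spark.sql("""
  SELECT CAST(age AS STRING) AS age,
         CAST(isGraduated AS BOOLEAN) AS isGraduated,
         CAST(jobStartDate AS DATE) AS jobStartDate
  FROM CastExample
""")
df4b.printSchema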

