To change the type of a Spark DataFrame column, you can use `withColumn()` together with the `cast()` function, `selectExpr()`, or a SQL expression. Note that the target type must be a subclass of the DataType class.
In Spark, a DataFrame column can be changed (cast) to any of the following types, all of which are subclasses of DataType:
ArrayType
BinaryType
BooleanType
CalendarIntervalType
DateType
HiveStringType
MapType
NullType
NumericType
ObjectType
StringType
StructType
TimestampType
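Besides a DataType instance, `cast()` also accepts the canonical string name of a type, which saves the `org.apache.spark.sql.types` import. A small sketch, assuming a DataFrame `df` with an `age` column (both lines are equivalent):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// cast() accepts either a DataType instance or its canonical string name
val byName = df.withColumn("age", col("age").cast("string"))
val byType = df.withColumn("age", col("age").cast(StringType))
```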
Examples
1. Create a DataFrame
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Create an RDD
val simpleData = Seq(
Row("James",34,"2006-01-01","true","M",3000.60),
Row("Michael",33,"1980-01-10","true","F",3300.80),
Row("Robert",37,"1992-06-01","false","M",5000.50)
)
val simpleRDD = spark.sparkContext.parallelize(simpleData)
// Create the schema
val fields = Array(
StructField("firstName",StringType,true),
StructField("age",IntegerType,true),
StructField("jobStartDate",StringType,true),
StructField("isGraduated", StringType, true),
StructField("gender", StringType, true),
StructField("salary", DoubleType, true)
)
val simpleSchema = StructType(fields)
// Convert the RDD to a DataFrame using the specified schema
val df = spark.createDataFrame(simpleRDD,simpleSchema)
df.printSchema
df.cache
df.show(false)
The output is as follows:
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = true)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = true)
+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James |34 |2006-01-01 |true |M |3000.6|
|Michael |33 |1980-01-10 |true |F |3300.8|
|Robert |37 |1992-06-01 |false |M |5000.5|
+---------+---+------------+-----------+------+------+
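For small examples like this, the same DataFrame can also be built more compactly from a Seq of tuples with `toDF` (a sketch; the explicit schema above is then unnecessary because column types are inferred from the tuple fields):

```scala
import spark.implicits._  // required for toDF on a Seq

// Types are inferred: age becomes int, jobStartDate and isGraduated stay strings
val dfAlt = Seq(
  ("James", 34, "2006-01-01", "true", "M", 3000.60),
  ("Michael", 33, "1980-01-10", "true", "F", 3300.80),
  ("Robert", 37, "1992-06-01", "false", "M", 5000.50)
).toDF("firstName", "age", "jobStartDate", "isGraduated", "gender", "salary")
```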
2. Change column types with withColumn and cast
Change the age column to string, the isGraduated column to boolean, and the jobStartDate column to date.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("age",col("age").cast(StringType))
.withColumn("isGraduated",col("isGraduated").cast(BooleanType))
.withColumn("jobStartDate",col("jobStartDate").cast(DateType))
df2.printSchema()
df2.show(false)
The output is as follows:
root
|-- firstName: string (nullable = true)
|-- age: string (nullable = true)
|-- jobStartDate: date (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = true)
+---------+---+------------+-----------+------+------+
|firstName|age|jobStartDate|isGraduated|gender|salary|
+---------+---+------------+-----------+------+------+
|James |34 |2006-01-01 |true |M |3000.6|
|Michael |33 |1980-01-10 |true |F |3300.8|
|Robert |37 |1992-06-01 |false |M |5000.5|
+---------+---+------------+-----------+------+------+
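Note that under Spark's default (non-ANSI) SQL mode, `cast` does not fail on values that cannot be converted; it produces null instead. A small sketch with hypothetical data:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// "abc" cannot be parsed as an integer, so the cast yields null for that row
val bad = Seq("34", "abc").toDF("age")
bad.withColumn("age", col("age").cast(IntegerType)).show()
```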
3. Change column types with selectExpr
Again, change the age column to string, the isGraduated column to boolean, and the jobStartDate column to date.
val df3 = df.selectExpr(
"cast(age as string) age",
"cast(isGraduated as boolean) isGraduated",
"cast(jobStartDate as date) jobStartDate")
df3.printSchema
df3.show(false)
The output is as follows:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true |2006-01-01 |
|33 |true |1980-01-10 |
|37 |false |1992-06-01 |
+---+-----------+------------+
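Unlike `withColumn`, `selectExpr` returns only the expressions you list, which is why the result above has just three columns. To keep every column, the untouched ones must be listed as well; a sketch:

```scala
// List the unchanged columns alongside the cast expressions
val df3b = df.selectExpr(
  "firstName",
  "cast(age as string) age",
  "cast(jobStartDate as date) jobStartDate",
  "cast(isGraduated as boolean) isGraduated",
  "gender",
  "salary")
```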
4. Change column types with a SQL expression
We can also change Spark DataFrame column types with a SQL expression.
df.createOrReplaceTempView("CastExample")
val df4 = spark.sql("""
SELECT STRING(age),BOOLEAN(isGraduated),DATE(jobStartDate)
from CastExample
""")
df4.printSchema
df4.show(false)
The output is as follows:
root
|-- age: string (nullable = true)
|-- isGraduated: boolean (nullable = true)
|-- jobStartDate: date (nullable = true)
+---+-----------+------------+
|age|isGraduated|jobStartDate|
+---+-----------+------------+
|34 |true |2006-01-01 |
|33 |true |1980-01-10 |
|37 |false |1992-06-01 |
+---+-----------+------------+
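The `STRING(...)`, `BOOLEAN(...)`, and `DATE(...)` calls above are Spark SQL shorthand; the standard SQL `CAST` syntax works as well and produces the same result:

```scala
// Equivalent query using the standard CAST(expr AS type) syntax
val df5 = spark.sql("""
  SELECT CAST(age AS STRING) AS age,
         CAST(isGraduated AS BOOLEAN) AS isGraduated,
         CAST(jobStartDate AS DATE) AS jobStartDate
  FROM CastExample
""")
```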