Replacing empty strings with None/null values in a DataFrame

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1),
                                     Row(col1='', col2=2),
                                     Row(col1=None, col2='')])
testDF.show()

## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+

## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple

## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()

## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+

Solution

It is as simple as this:

from pyspark.sql.functions import col, when

def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()

## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+

dfWithEmptyReplaced.na.drop().show()

## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+
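As an aside, when(col(x) != "", col(x)) with no otherwise branch behaves the same way: when no condition matches, Spark's when already evaluates to null, so the explicit .otherwise(None) arguably just makes the intent clearer.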

If you want to convert multiple columns, you can, for example, use reduce (on Python 3 it must be imported from functools):

from functools import reduce

to_convert = set([...]) # Some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
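Note that each withColumn call layers one more projection onto the query plan, which can get heavy with hundreds of columns, so a single select built from a list of expressions may be easier on the optimizer;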

or use a comprehension:

exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns]

testDF.select(*exprs)
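A quick end-to-end run of the comprehension variant (to_convert = {'col1'} below is just an illustrative choice for the test frame above; select accepts a mix of plain column names and Column expressions):

to_convert = {'col1'} ## Illustrative: convert only col1

exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns]

testDF.select(*exprs).show()
## Should print the same frame as dfWithEmptyReplaced above.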

If you want to operate specifically on string fields, check the answer by robin-loxley.
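That answer isn't reproduced here, but a minimal sketch of the same idea, assuming you pick out string-typed columns from the schema via dtypes, could look like this:

## Sketch: restrict the conversion to string-typed columns.
## testDF.dtypes yields (name, type) pairs such as ('col1', 'string').
string_cols = {c for c, t in testDF.dtypes if t == 'string'}

exprs = [
    blank_as_null(c).alias(c) if c in string_cols else c
    for c in testDF.columns]

testDF.select(*exprs)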

