Apache Spark, add a “CASE WHEN … ELSE …” calculated column to an existing DataFrame

In the upcoming Spark 1.4.0 release (which should be out in the next couple of days) you can use the when/otherwise syntax:

// Create the DataFrame
val df = Seq("Red", "Green", "Blue").map(Tuple1.apply).toDF("color")

// Use when/otherwise syntax
val df1 = df.withColumn("Green_Ind", when($"color" === "Green", 1).otherwise(0))

If you are using Spark 1.3.0 you can choose to use a … Read more
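The same when/otherwise pattern is available from Python; a minimal PySpark sketch of the example above, assuming a modern Spark version where SparkSession is available:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

# Create the same single-column DataFrame as in the Scala example
df = spark.createDataFrame([("Red",), ("Green",), ("Blue",)], ["color"])

# Add a 1/0 indicator column with when/otherwise
df1 = df.withColumn("Green_Ind", when(col("color") == "Green", 1).otherwise(0))
df1.show()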

Convert null values to empty array in Spark DataFrame

You can use a UDF:

import org.apache.spark.sql.functions.udf

val array_ = udf(() => Array.empty[Int])

combined with WHEN or COALESCE:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))

df.withColumn("myCol", coalesce(myCol, array_())).show

In recent versions you can use the array function:

import org.apache.spark.sql.functions.{array, lit}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))

df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

Please note that it will work only if conversion from string to the … Read more
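For reference, the coalesce-with-empty-array variant translates directly to PySpark; a sketch assuming a modern Spark version and a hypothetical DataFrame with a nullable array column named myCol:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, array, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: id 2 has a NULL array
df = spark.createDataFrame([(1, [1, 2]), (2, None)], ["id", "myCol"])

# Replace NULL arrays with an empty array of the matching element type
df = df.withColumn("myCol", coalesce(col("myCol"), array().cast("array<bigint>")))
df.show()

As in the Scala answer, the cast is what gives the empty array a concrete element type to coalesce against.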

How to drop columns according to NaN percentage for a DataFrame?

You can use isnull with mean to get the ratio of NaN values per column, then drop columns by boolean indexing with loc (loc because you are removing columns). Note that the condition must be inverted, so keeping columns with a ratio < .8 removes all columns with a NaN ratio >= 0.8:

df = df.loc[:, df.isnull().mean() < .8]

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = … Read more
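A self-contained version of the same idea (column 'D' from the truncated sample is left untouched here), showing the per-column NaN ratios before filtering:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan  # rows 0-80 -> 81% NaN, above the threshold
df.loc[:5, 'C'] = np.nan   # rows 0-5 -> 6% NaN, kept

# Ratio of NaN values per column
print(df.isnull().mean())

# Keep only columns with less than 80% NaN
df = df.loc[:, df.isnull().mean() < .8]
print(df.columns.tolist())  # ['B', 'C', 'D', 'E']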