Reading CSV files with quoted fields containing embedded commas

I noticed that your problematic line escapes quotes by doubling the double-quote character itself: "32 XIY ""W"" JK, RE LK", which should be interpreted simply as 32 XIY "W" JK, RE LK. As described in RFC 4180, page 2: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped …" … Read more
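In Spark's CSV reader, this RFC 4180 style can be handled by pointing the escape option at the quote character itself. A minimal sketch, assuming Spark 2.x and a hypothetical input file data.csv:

```scala
import org.apache.spark.sql.SparkSession

object CsvQuoteSandbox extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // RFC 4180 escapes a quote inside a quoted field by doubling it ("" -> "),
  // so set Spark's escape character to the quote character itself.
  val df = spark.read
    .option("quote", "\"")
    .option("escape", "\"") // "" inside a quoted field becomes a literal "
    .csv("data.csv")        // hypothetical path

  df.show(truncate = false)
}
```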

What are the various join types in Spark?

Here is a simple illustrative experiment:

```scala
import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = …
```

… Read more
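The excerpt cuts off at the joinTypes list. As a hedged sketch of how such an experiment typically continues (the join type strings below are Spark's documented ones, but the exact list and loop the author used are an assumption; this continues from the left and right DataFrames defined above):

```scala
  // Join type strings accepted by Dataset.join; which subset the original
  // experiment iterated over is an assumption.
  val joinTypes = Seq("inner", "outer", "left_outer", "right_outer", "left_semi", "left_anti")

  joinTypes.foreach { joinType =>
    println(s"JOIN TYPE: $joinType")
    left.join(right, Seq("id"), joinType).orderBy("id").show()
  }
```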

Spark 2.0 Dataset vs DataFrame

The difference between df.select("foo") and df.select($"foo") is the signature. The former takes at least one String; the latter takes zero or more Columns. Beyond that there is no practical difference. myDataSet.map(foo.someVal) type checks, but because any Dataset operation works on an RDD of objects, it carries significant overhead compared to DataFrame operations. Let's take a look … Read more
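A small sketch of the two select signatures and the typed map; the Foo case class and its fields are hypothetical stand-ins for the question's myDataSet:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the element type of the question's myDataSet.
case class Foo(foo: String, someVal: Int)

object SelectSandbox extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val myDataSet = Seq(Foo("a", 1), Foo("b", 2)).toDS()
  val df = myDataSet.toDF()

  df.select("foo").show()          // select(col: String, cols: String*) -- at least one String
  df.select($"foo").show()         // select(cols: Column*) -- zero or more Columns
  df.select($"someVal" + 1).show() // Columns additionally allow expressions

  // Compiles only if someVal exists on Foo, but each row is deserialized
  // into a Foo object first -- the overhead mentioned above.
  myDataSet.map(foo => foo.someVal).show()
}
```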