Reading CSV files with quoted fields containing embedded commas

I noticed that your problematic line escapes quotes by doubling the double-quote character itself: "32 XIY ""W"" JK, RE LK", which should be interpreted simply as 32 XIY "W" JK, RE LK. As described in RFC 4180, page 2: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped …" … Read more
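In Spark's CSV reader, this RFC 4180 style can be handled by pointing the escape option at the quote character itself. A minimal sketch, assuming Spark 2.x and a hypothetical input file data.csv:

```scala
import org.apache.spark.sql.SparkSession

object CsvQuoteSandbox extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // RFC 4180 escapes a quote inside a quoted field by doubling it ("" -> "),
  // so set Spark's escape character to the quote character itself.
  val df = spark.read
    .option("quote", "\"")
    .option("escape", "\"") // "" inside a quoted field becomes a literal "
    .csv("data.csv")        // hypothetical path

  df.show(truncate = false)
}
```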

What are the various join types in Spark?

Here is a simple illustrative experiment:

```scala
import org.apache.spark.sql._

object SparkSandbox extends App {
  implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
  val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

  println("LEFT")
  left.orderBy("id").show()

  println("RIGHT")
  right.orderBy("id").show()

  val joinTypes = …
```

… Read more
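The excerpt cuts off at the joinTypes list. As a hedged sketch of how such an experiment typically continues (the join type strings below are Spark's documented ones, but the exact list and loop the author used are an assumption; this continues from the left and right DataFrames defined above):

```scala
  // Join type strings accepted by Dataset.join; which subset the original
  // experiment iterated over is an assumption.
  val joinTypes = Seq("inner", "outer", "left_outer", "right_outer", "left_semi", "left_anti")

  joinTypes.foreach { joinType =>
    println(s"JOIN TYPE: $joinType")
    left.join(right, Seq("id"), joinType).orderBy("id").show()
  }
```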

Spark 2.0 Dataset vs DataFrame

The difference between df.select("foo") and df.select($"foo") is the signature. The former takes at least one String; the latter takes zero or more Columns. Beyond that there is no practical difference. myDataSet.map(foo.someVal) type checks, but because any Dataset operation works on an RDD of objects, it carries significant overhead compared to DataFrame operations. Let's take a look … Read more
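A small sketch of the two select signatures and the typed map; the Foo case class and its fields are hypothetical stand-ins for the question's myDataSet:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the element type of the question's myDataSet.
case class Foo(foo: String, someVal: Int)

object SelectSandbox extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val myDataSet = Seq(Foo("a", 1), Foo("b", 2)).toDS()
  val df = myDataSet.toDF()

  df.select("foo").show()          // select(col: String, cols: String*) -- at least one String
  df.select($"foo").show()         // select(cols: Column*) -- zero or more Columns
  df.select($"someVal" + 1).show() // Columns additionally allow expressions

  // Compiles only if someVal exists on Foo, but each row is deserialized
  // into a Foo object first -- the overhead mentioned above.
  myDataSet.map(foo => foo.someVal).show()
}
```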