How to load jar dependencies in IPython Notebook

You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

    export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
    export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

This property can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:

    import os

    packages = "com.databricks:spark-csv_2.11:1.3.0"
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages {0} pyspark-shell".format(packages)
    )
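If more than one package is needed, --packages takes a single comma-separated list of Maven coordinates. A minimal sketch of building the value, assuming a hypothetical helper named build_submit_args (it is not part of PySpark):

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of Maven coordinates.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # --packages expects one comma-separated list of coordinates.
    return "--packages {0} pyspark-shell".format(",".join(packages))


# This must run before the SparkContext / SparkSession (and its JVM) is created,
# otherwise the setting is ignored.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["com.databricks:spark-csv_2.11:1.3.0"]
)
```

The trailing pyspark-shell token is kept, as in the answer above.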

Is there a way to include commas in CSV columns without breaking the formatting?

Enclose the field in quotes, e.g.

    field1_value,field2_value,"field 3,value",field4, etc…

See Wikipedia.

Updated: To encode a quote, double it: one double-quote symbol in a field is encoded as "", and a field consisting of a single double quote becomes """". So if you see the following in e.g. Excel:

    ---------------------------------------
    | regular_value |,,,"| ,"", |""" |"|
    ---------------------------------------

the

Dealing with commas in a CSV file

There's actually a spec for the CSV format, RFC 4180, which covers how to handle commas:

    Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

So, to have values foo and bar,baz, you do this:

    foo,"bar,baz"

Another important requirement to consider (also from the spec):

    If double-quotes are used to enclose
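In practice it is usually safer to let a CSV library apply the quoting rule than to build the line by hand. A minimal sketch using Python's standard csv module, whose default dialect quotes only the fields that need it and doubles embedded quotes; the helper names to_csv_line and from_csv_line are illustrative, not part of any library:

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record using the csv module's default dialect."""
    buffer = io.StringIO()
    # Fields containing commas, quotes, or line breaks get enclosed in
    # double quotes; embedded double quotes are doubled.
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def from_csv_line(line):
    """Parse one CSV record back into a list of field values."""
    return next(csv.reader(io.StringIO(line)))


record = ["foo", "bar,baz", 'say "hi"']
line = to_csv_line(record)

# Round trip: the comma and the quotes survive intact.
assert from_csv_line(line) == record
```

Here to_csv_line(["foo", "bar,baz"]) yields foo,"bar,baz" followed by a CRLF line ending.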

What’s the most robust way to efficiently parse CSV using awk?

If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):

    $ echo 'foo,"field,""with"",commas",bar' | awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF; i++) print i " <" $i ">"}'
    1 <foo>
    2 <"field,""with"",commas">
    3 <bar>

or the equivalent using any awk:

    $ echo 'foo,"field,""with"",commas",bar' | awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{ rec