How to load jar dependencies in IPython Notebook

You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

The same property can also be set dynamically in your code, before the SparkContext / SparkSession and the corresponding JVM have been started:

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
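
Putting the dynamic variant together in a notebook cell, a minimal sketch (the app name and CSV path are placeholders, and this assumes a Spark 2.x-style SparkSession; on Spark 1.x you would use SQLContext the same way, and on Spark 2.x CSV support is built in as format "csv"):

import os
from pyspark.sql import SparkSession

packages = "com.databricks:spark-csv_2.11:1.3.0"
# Must run before the first SparkContext/SparkSession starts the JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)

spark = SparkSession.builder.appName("csv-demo").getOrCreate()
# spark-csv registers the com.databricks.spark.csv data source.
df = (spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .load("example.csv"))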

Is there a way to include commas in CSV columns without breaking the formatting?

Enclose the field in quotes, e.g.:

field1_value,field2_value,"field 3,value",field4,…

See Wikipedia. Updated: to encode a quote, double it: a double-quote symbol inside a field is encoded as "", so a field consisting of just one double quote becomes """". So if you see the following in e.g. Excel:

---------------------------------------
| regular_value |,,,"| ,"", |""" |"|
---------------------------------------

the …
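
A minimal sketch of the same rule using Python's csv module, which quotes fields containing commas and doubles embedded quotes automatically (the field values are illustrative):

import csv
import io

buf = io.StringIO()
# Fields with commas get quoted; embedded quotes are doubled, per the rule above.
csv.writer(buf).writerow(['field1_value', 'field 3,value', 'one " quote'])
print(buf.getvalue())  # field1_value,"field 3,value","one "" quote"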

Dealing with commas in a CSV file

There's actually a spec for the CSV format, RFC 4180, which says how to handle commas: fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double quotes. http://tools.ietf.org/html/rfc4180 So, to have the values foo and bar,baz, you do this:

foo,"bar,baz"

Another important requirement to consider (also from the spec): if double-quotes are used to enclose …
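
As a quick check, Python's csv module follows these RFC 4180 quoting rules; a minimal sketch (the sample string is the one from the answer):

import csv
import io

# The quoted comma stays inside the second field.
row = next(csv.reader(io.StringIO('foo,"bar,baz"')))
print(row)  # ['foo', 'bar,baz']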

What’s the most robust way to efficiently parse CSV using awk?

If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,""with"",commas",bar' | awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF; i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

or the equivalent using any awk:

$ echo 'foo,"field,""with"",commas",bar' | awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{ rec …
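
For comparison, here is a minimal sketch of the same FPAT idea in Python (not from the answer; the pattern, variable names, and loop are illustrative). The quoted alternative is tried first, so a quoted field wins over the greedy unquoted one:

import re

record = 'foo,"field,""with"",commas",bar'
# Each match is one field followed by a comma or the end of the record.
pat = re.compile(r'("(?:[^"]|"")*"|[^,]*)(,|$)')
fields, pos = [], 0
while True:
    m = pat.match(record, pos)
    fields.append(m.group(1))
    if m.group(2) == '':  # hit end of record
        break
    pos = m.end()
print(fields)  # ['foo', '"field,""with"",commas"', 'bar']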