How to make good reproducible Apache Spark examples

Provide small sample data that can be easily recreated.

At the very least, posters should provide a couple of rows and columns from their DataFrame, along with code that can be used to create it easily. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.

I have the following dataframe:

+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+

which can be created with this code:

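df = sqlCtx.createDataFrame(
    # rows of sample data, one tuple per row
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    # column names, in the same order as the tuple fields
    ('index', 'X', 'label', 'date')
)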

Show the desired output.

Ask your specific question and show us your desired output.

How can I create a new column 'is_divisible' that has the value 'yes' if the day of the month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?

Desired output:

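+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+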

Explain how to get your output.

Explain, in great detail, how you get your desired output. It helps to show an example calculation.

For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is ‘yes’.

Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 2017-01-11, so the day of the month is 11. Since 11 % 7 is not 0, the answer is ‘no’.
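
If it helps, the worked calculation can also be written as a few lines of plain Python so that readers can check every row of the desired output. This is only a sketch for verifying the expected values, not a Spark solution; the helper name is_divisible_label is made up for illustration.

```python
from datetime import date, timedelta

# Sample rows copied from the example DataFrame: (index, X, label, date)
rows = [
    (1, 1, "A", "2017-01-01"),
    (2, 3, "B", "2017-01-02"),
    (3, 5, "A", "2017-01-03"),
    (4, 7, "B", "2017-01-04"),
]


def is_divisible_label(x, date_str, days=7):
    """Return 'yes' if the day of month of (date + days) is divisible by x."""
    shifted = date.fromisoformat(date_str) + timedelta(days=days)
    return "yes" if shifted.day % x == 0 else "no"


# One label per sample row, in row order
results = [is_divisible_label(x, d) for _, x, _, d in rows]
print(results)
```

Stating the rule this precisely makes it much harder for an answerer to misread what you are asking for.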

Share your existing code.

Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.

(*You can leave out the code to create the spark context, but you should include all imports.)

I know how to add a new column that is date plus 7 days but I’m having trouble getting the day of the month as an integer.

from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))

Include versions, imports, and use syntax highlighting

  • Full details in this answer written by desertnaut.
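
As a rough illustration of gathering version details to paste into a question, something like the following may work. It is a sketch, not taken from the linked answer: environment_report is a hypothetical helper, and it assumes PySpark, if installed, is importable as pyspark.

```python
import platform


def environment_report():
    """Collect version strings worth pasting into a question."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        # Optional: only available where PySpark is installed
        import pyspark
        report["pyspark"] = pyspark.__version__
    except ImportError:
        report["pyspark"] = "not installed"
    return report


for name, version in environment_report().items():
    print(f"{name}: {version}")
```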

For performance tuning posts, include the execution plan

  • Full details in this answer written by Alper t. Turker.
  • It helps to use standardized names for contexts.

Parsing spark output files

  • MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
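
To give a feel for the idea, here is a minimal standard-library sketch that turns pasted show()-style output into a list of dicts. It is not MaxU's code (see the linked answer for that); parse_show_output is a hypothetical helper, and it assumes every data row starts and ends with a pipe character.

```python
def parse_show_output(text):
    """Parse pipe-delimited rows, as printed by DataFrame.show(), into dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Skip border lines such as +-----+---+ and anything that is not a row
        if not line.startswith("|"):
            continue
        # Drop the empty strings produced by the leading and trailing pipes
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        rows.append(cells)
    if not rows:
        return []
    header, *data = rows
    # All values stay as strings; cast them yourself if you need numbers or dates
    return [dict(zip(header, record)) for record in data]


sample = """
|index|  X|label|      date|
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
"""

parsed = parse_show_output(sample)
print(parsed[0])
```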

Other notes.

  • Be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example first.
  • Read the other answers to this question, which are linked above.
  • Have a good, descriptive title.
  • Be polite. People on SO are volunteers, so ask nicely.
