pandas – Make Me Engineer

Convert a spark DataFrame to pandas DF

June 10, 2023 by Tarik

The following should work.

Sample DataFrame:

    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")]
    ).toDF(["user_id", "phone_number"])

Converting the Spark DataFrame to a pandas DataFrame:

    pandas_df = some_df.toPandas()

Splitting multiple columns into rows in pandas dataframe

June 1, 2023 by Tarik

You can first split the columns, create a Series with stack, and remove whitespace with strip:

    s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series to df1:

    df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove the old columns value and date and join:

    print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
      ticker account value      date
    0     aa  assets   100  20121231
    …
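The split, stack, and join steps can be tried end to end with a small frame. This is only a sketch: the sample rows, the helper name `split_to_series`, and the extra `dropna()` call (added defensively so that cells with fewer items do not leave empty rows) are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: 'value' and 'date' hold comma-separated lists.
df = pd.DataFrame({
    "ticker": ["aa", "bb"],
    "account": ["assets", "liabilities"],
    "value": ["100, 200", "50"],
    "date": ["20121231, 20131231", "20141231"],
})


def split_to_series(column: pd.Series) -> pd.Series:
    """Split on commas, stack the pieces into one Series, strip whitespace."""
    return (
        column.str.split(",", expand=True)
        .stack()
        .dropna()  # assumption: drop padding left by shorter lists
        .str.strip()
        .reset_index(level=1, drop=True)  # keep only the original row label
    )


s1 = split_to_series(df["value"])
s2 = split_to_series(df["date"])

# Both Series share the same (repeated) row labels, so they line up column-wise.
df1 = pd.concat([s1, s2], axis=1, keys=["value", "date"])

# Drop the old list columns and join the exploded ones back on the row label.
result = df.drop(["value", "date"], axis=1).join(df1).reset_index(drop=True)
print(result)
```

Note that this relies on `value` and `date` having the same number of items in each row; if they differ, the two Series no longer share an index and `pd.concat` may fail.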

Pandas, group by count and add count to original dataframe?

May 30, 2023 by Tarik

IIUC:

    In [247]: df['count'] = df.groupby('kind').transform('count')

    In [248]: df
    Out[248]:
      kind         msg  count
    0  aaa  aaa text 1      3
    1  aaa  aaa text 2      3
    2  aaa  aaa text 3      3
    3   bb   bb text 1      4
    4   bb   bb text 2      4
    5   bb   bb text 3      4
    6   bb   bb text 4      4
    …
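A self-contained version of the same idea is sketched below. The frame is rebuilt from the output shown above; selecting a single column before `transform` is my own adjustment, so that the result is a Series even when the frame has more than two columns.

```python
import pandas as pd

# Sample frame reconstructed from the output shown in the answer.
df = pd.DataFrame({
    "kind": ["aaa", "aaa", "aaa", "bb", "bb", "bb", "bb"],
    "msg": [
        "aaa text 1", "aaa text 2", "aaa text 3",
        "bb text 1", "bb text 2", "bb text 3", "bb text 4",
    ],
})

# transform('count') returns one value per original row, aligned to df's index,
# so it can be assigned straight back as a new column.
df["count"] = df.groupby("kind")["msg"].transform("count")
print(df)
```

`count` ignores missing values in the selected column; if every row should be counted regardless of NaNs, `transform("size")` is the usual alternative.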

Suppress output of object when plotting in IPython

May 28, 2023 by Tarik

Just put a semicolon (;) after the code. It works only in Jupyter Notebook.

    plt.hist(…);

(pandas) Create new column based on first element in groupby object

May 18, 2023 by Tarik

You need transform with first:

    print (df.groupby('Person')['Color'].transform('first'))
    0      blue
    1     green
    2    orange
    3      blue
    4     green
    5    orange
    Name: Color, dtype: object

    df['First_Col'] = df.groupby('Person')['Color'].transform('first')
    print (df)
        Color Person First_Col
    0    blue    bob      blue
    1   green    jim     green
    2  orange    joe    orange
    3  yellow    bob      blue
    4    pink    jim     green
    5  purple    joe    orange

Grouped seaborn.barplot from a wide pandas.DataFrame

April 29, 2023 by Tarik

I think you need melt if you want to use barplot:

    data = df.melt('date', var_name="a", value_name="b")
    print (data)
             date  a     b
    0  2017-09-05  A    25
    1  2017-09-06  A   261
    2  2017-09-07  A   188
    3  2017-09-08  A   200
    4  2017-09-09  A   292
    5  2017-09-05  B   261
    6  2017-09-06  B  1519
    7  2017-09-07  B  1545
    8  2017-09-08  B  2110
    9  …
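To see what melt does to a wide frame, here is a minimal sketch using the first two dates from the output above. The variable name `wide` is an assumption, and the plotting call is left as a comment because it assumes seaborn is installed and imported as `sns`.

```python
import pandas as pd

# Wide layout: one row per date, one column per series (values from the excerpt).
wide = pd.DataFrame({
    "date": ["2017-09-05", "2017-09-06"],
    "A": [25, 261],
    "B": [261, 1519],
})

# Long layout: one row per (date, series) pair. 'a' holds the former column
# name and 'b' holds the value, matching the names used in the answer.
data = wide.melt("date", var_name="a", value_name="b")
print(data)

# With seaborn available, the long frame can then be plotted as grouped bars:
# sns.barplot(x="date", y="b", hue="a", data=data)
```

The `hue` column is what produces one bar per series within each date group, which is why the long layout is needed.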

Adding a column that's the result of the difference in consecutive rows in pandas

April 27, 2023 by Tarik

Use shift:

    df['dA'] = df['A'] - df['A'].shift(-1)
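The direction of the shift decides which neighbour is subtracted, so a tiny example may help. The column values here are made up for illustration, and the `dA_prev` and `dA_diff` columns are my additions for comparison.

```python
import pandas as pd

# Hypothetical sample values.
df = pd.DataFrame({"A": [10, 7, 3, 1]})

# shift(-1) pulls the NEXT row up, so this is "current minus next".
# The last row has no successor and becomes NaN.
df["dA"] = df["A"] - df["A"].shift(-1)

# shift(1) pushes the PREVIOUS row down, so this is "current minus previous".
# The first row has no predecessor and becomes NaN.
df["dA_prev"] = df["A"] - df["A"].shift(1)

# Series.diff() is a shorthand for the "current minus previous" form.
df["dA_diff"] = df["A"].diff()

print(df)
```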

using time zone in pandas to_datetime

November 29, 2022 by Tarik

You can use tz_localize to set the timezone to UTC/+0000, and then tz_convert to convert to the timezone you want:

    start = pd.to_datetime('2015-02-24')
    rng = pd.date_range(start, periods=10)

    df = pd.DataFrame({'Date': rng, 'a': range(10)})

    df.Date = df.Date.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
    print (df)
                           Date  a
    0 2015-02-24 05:30:00+05:30  0
    1 2015-02-25 05:30:00+05:30  1
    2 2015-02-26 05:30:00+05:30  2
    3 2015-02-27 05:30:00+05:30  3
    …
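As a side note, the localize step can also be folded into the parsing call. The sketch below compares the two routes on two made-up dates; the `utc=True` variant is an alternative I am adding, not something the original answer uses.

```python
import pandas as pd

raw = pd.Series(["2015-02-24", "2015-02-25"])

# Route 1 (as in the answer): parse naive timestamps, declare them to be UTC,
# then convert to the target zone.
naive = pd.to_datetime(raw)
local = naive.dt.tz_localize("UTC").dt.tz_convert("Asia/Kolkata")

# Route 2 (alternative): let to_datetime return UTC-aware values directly,
# then convert.
direct = pd.to_datetime(raw, utc=True).dt.tz_convert("Asia/Kolkata")

print(local)
print((local == direct).all())
```

tz_localize only attaches a zone to naive timestamps without moving the clock time, while tz_convert changes the displayed wall time for the same instant. That is why midnight UTC shows up as 05:30 in Asia/Kolkata.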

Jupyter notebook display two pandas tables side by side

November 25, 2022 by Tarik

I have ended up writing a function that can do this:
[update: added titles based on suggestions (thnx @Antony_Hatchkins et al.)]

    from IPython.display import display_html
    from itertools import chain,cycle
    def display_side_by_side(*args,titles=cycle([''])):
        html_str=''
        for df,title in zip(args, chain(titles,cycle(['</br>'])) ):
            html_str+='<th style="text-align:center"><td style="vertical-align:top">'
            html_str+=f'<h2 style="text-align: center;">{title}</h2>'
            html_str+=df.to_html().replace('table','table style="display:inline"')
            html_str+='</td></th>'
        display_html(html_str,raw=True)

Example usage:

    df1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns=['A','B','C','D',])
    df2 = …

Trouble installing Pandas on new MacBook Air M1

November 10, 2022 by Tarik

Maybe it is too late, but the only solution that worked for me was installing from source, if you do not want to use Rosetta 2 or Miniconda:

    python3 -m pip install virtualenv
    virtualenv -p python3.8 venv
    source venv/bin/activate
    pip install --upgrade pip
    pip install numpy cython
    git clone --depth 1 https://github.com/pandas-dev/pandas.git
    cd pandas
    python3 setup.py install