Rolling OLS Regressions and Predictions by Group

You should be able to achieve what you want using the groupby / apply pattern. The below code should be helpful.

Create example data:

from statsmodels.regression.rolling import RollingOLS
from statsmodels.tools.tools import add_constant
import pandas as pd
import numpy as np

# make some toy data
race_dates = pd.to_datetime(['2020-06-09']*3 + ['2020-12-01']*4 + ['2021-01-21']*4 + ['2021-05-04']*5)
distance = [8]*3 + [7]*4 + [4]*4 + [6]*5
avg_speed = 80 + np.random.randn(len(distance))*10
names = ['Smith', 'Douglas', 'Mostern',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson',
         'Smith', 'Douglas', 'Mostern', 'Robinson', 'Thomas']

df = pd.DataFrame({'Date': race_dates,
                  'Name': names,
                  'Distance': distance,
                  'avg_speed_calc': avg_speed})

Define helper functions:

def speed_preference(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a dataframe containing model parameters
    """
    # we should have atleast (number_of_parameters) + 1 observations
    min_obs = 3 if intercept else 2

    # if there are less than min_obs rows in df_racer, RollingOLS will throw an error
    # Instead, handle this case separately
    if df_racer.shape[0] < min_obs:
        cols = ['const', 'Distance'] if intercept else ['Distance']
        return pd.DataFrame(index=df_racer.index, columns=cols)
    
    y = df_racer['avg_speed_calc']
    x = add_constant(df_racer['Distance']) if intercept else df_racer['Distance']
    
    model = RollingOLS(y, x, expanding=True, min_nobs=min_obs).fit()
    return model.params


def speed_prediction(df_racer, intercept=False):
    """
    Function to operate on the data of a single driver.
    Assumes df_racer has the columns 'Distance' and 'avg_speed_calc' available.
    Returns a series containing predicted speed
    """
    params = speed_preference(df_racer, intercept)
    params_shifted = params.shift(1)
    
    if intercept:
        return (params_shifted.mul(add_constant(df_racer['Distance']), axis=0)\
                .sum(axis=1, min_count=1)).to_frame('predicted_speed_calc')
    
    return (params_shifted.mul(df_racer['Distance'], axis=0))\
                .rename({'Distance': 'predicted_speed_calc'}, axis=1)

The speed_preference function computes the rolling OLS for a single driver, and return the fitted parameter(s). The speed_prediction function computes the predicted speed, using the model up the previous race (notice params_shifted) as requested. To put it all together, just a simple groupby and join is required:

No intercept

grouped = df.groupby('Name')
params = grouped.apply(speed_preference)
predictions = grouped.apply(speed_prediction)

df_out_no_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_no_intercept

output no intercept

With intercept

grouped = df.groupby('Name')
params = grouped.apply(lambda x: speed_preference(x, True))
predictions = grouped.apply(lambda x: speed_prediction(x, True))

df_out_w_intercept = df.join(params, rsuffix='_coef').join(predictions)
df_out_w_intercept

output with intercept

EDIT

If you would like to fit a model from formula instead, you could use:

def speed_preference_from_formula(df_racer, formula, min_nobs):
    """
    Function to operate on the data of a single driver. "formula" should reference column names in df_racer.
    min_nobs should always be >= (# of parameters in the model)+1
    """
    
    # if there are less than min_obs rows in df_racer, RollingOLS will throw an error
    # Instead, handle this case separately
    if df_racer.shape[0] < min_nobs:

        return None
    model = RollingOLS.from_formula(formula, data=df_racer, expanding=True, min_nobs=min_nobs, window=None).fit()
    return model.params

Then for a polynomial model, you could compute the parameters as follows:

grouped = df.groupby('Name')
formula = "avg_speed_calc ~ 1 + Distance + Distance^2"
grouped.apply(lambda x: speed_preference_from_formula(x, formula, 4)) 

Output:

output 3

Note you would need to also edit the speed prediction function to handle the parameters and generated predictions correctly.

Notice that in the formula, I reference column names present in the dataframe being passed in. The 1 indicates an intercept should be used (similar to sm.add_constant), the Distance means use the value in Distance column directly, and the Distance^2 means square the value in Distance column, and then use that value as a feature. If you wanted to fit a cubic model, you can add the term + Distance^3.

For a good reference on how to use thes “R-style” formulas, see here.

EDIT 2: On debugging groupby

Think of the groupby.apply pattern as simply splitting the dataframe into multiple smaller dataframes, applying some function to each of the dataframes individually, and then joining back together. To see each sub-dataframe that is generated by the groupby, you can use:

grouped = df.groupby('Name')
for name, group in grouped:
    print(f"The sub-dataframe for: {name}")
    print(group)

This is useful, because you can now see exactly what gets passed to the function you are using inside the .apply().

So for the error mentioned in comments, you can apply the function to each group individually to narrow down where the error is occurring.

Leave a Comment