Bootstrapped LOWESS for LIGO Estimation Confidence Intervals¶
This notebook outlines how to use the moepy library to generate confidence intervals around LOWESS estimates, using the famous LIGO gravitational-wave data as an example.
N.B. I have no expertise in signal processing in this particular context; this is merely an example of how LOWESS confidence intervals can be used to limit the domain of your prediction to where you have the greatest certainty in it.
Imports¶
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from moepy import lowess, eda
Loading Data¶
We'll start by loading in the LIGO data.
df_LIGO = pd.read_csv('../data/lowess_examples/LIGO.csv')
df_LIGO.head()
Unnamed: 0 | frequency | L1 | H1 | H1_smoothed |
---|---|---|---|---|
0 | 0 | 2.38353e-18 | 2.21569e-20 | 3.24e-18 |
1 | 0.25 | 1.68562e-18 | 2.01396e-20 | 2.6449e-19 |
2 | 0.5 | 1.24297e-21 | 5.5185e-21 | 9e-20 |
3 | 0.75 | 6.6778e-22 | 2.54612e-21 | 4.48443e-20 |
4 | 1 | 6.80032e-22 | 3.33945e-21 | 2.67769e-20 |
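Before fitting anything, it can be worth a quick sanity check of the data's size and frequency range; the snippet below is purely illustrative and simply confirms what the series output later in this notebook shows, namely a grid running from 0 to 2048 in steps of 0.25 (8,193 rows).
# Quick sanity check of the data (illustrative only)
print(f"Rows: {df_LIGO.shape[0]:,}")
print(f"Frequency range: {df_LIGO['frequency'].min()} to {df_LIGO['frequency'].max()}")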
Baseline Fit¶
We'll quickly plot the observed data alongside the smoothed estimate provided in the raw data.
fig, ax = plt.subplots(dpi=150)
ax.scatter(df_LIGO['frequency'], df_LIGO['H1'], color='k', linewidth=0, s=1)
ax.plot(df_LIGO['frequency'], df_LIGO['H1_smoothed'], color='r', alpha=1, label='Existing Smoothing')
ax.set_yscale('log')
ax.legend(frameon=False)
eda.hide_spines(ax)
LOWESS Fit¶
We'll now fit a bootstrapped LOWESS model to the natural log of the H1 values; working in log-space keeps the fit well-behaved over the many orders of magnitude the data spans.
x = df_LIGO['frequency'].values
y = np.log(df_LIGO['H1']).values
df_bootstrap = lowess.bootstrap_model(x, y, num_runs=500, frac=0.2, num_fits=30)
df_bootstrap.head()
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -52.5227 | -52.724 | -52.4039 | -52.5183 | -52.5897 | -52.4715 | -52.5306 | -52.4727 | -52.5534 | -52.5212 | ... | -52.5536 | -52.6798 | -52.508 | -52.554 | -52.4346 | -52.6022 | -52.355 | -52.5031 | -52.5719 | -52.5057 |
0.25 | -52.5235 | -52.7245 | -52.4049 | -52.5192 | -52.5904 | -52.4723 | -52.5314 | -52.4737 | -52.5542 | -52.522 | ... | -52.5544 | -52.6804 | -52.5088 | -52.5547 | -52.4356 | -52.6029 | -52.3561 | -52.504 | -52.5727 | -52.5066 |
0.5 | -52.5244 | -52.725 | -52.4059 | -52.5201 | -52.5912 | -52.4732 | -52.5322 | -52.4746 | -52.555 | -52.5228 | ... | -52.5553 | -52.681 | -52.5096 | -52.5555 | -52.4365 | -52.6037 | -52.3573 | -52.5048 | -52.5735 | -52.5075 |
0.75 | -52.5252 | -52.7256 | -52.4069 | -52.521 | -52.5919 | -52.4741 | -52.533 | -52.4755 | -52.5559 | -52.5236 | ... | -52.5561 | -52.6816 | -52.5105 | -52.5562 | -52.4375 | -52.6044 | -52.3584 | -52.5057 | -52.5743 | -52.5084 |
1 | -52.5261 | -52.7261 | -52.4079 | -52.5218 | -52.5927 | -52.4749 | -52.5338 | -52.4764 | -52.5567 | -52.5244 | ... | -52.5569 | -52.6822 | -52.5113 | -52.5569 | -52.4384 | -52.6051 | -52.3595 | -52.5065 | -52.5751 | -52.5093 |
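Each column of df_bootstrap holds the LOWESS prediction from a single bootstrap run, evaluated over the full frequency grid (the x index). As a rough illustration of what one such run involves, a sketch is given below; this is an assumption about the general approach rather than a description of bootstrap_model's exact internals, and it assumes the lowess.Lowess fit/predict interface accepts the frac and num_fits keywords used here.
# Rough sketch of a single bootstrap run (illustrative only):
# resample the observations with replacement, fit a LOWESS model
# to the resample, then predict back over the full frequency grid
resample_idxs = np.random.choice(len(x), size=len(x), replace=True)
lowess_model = lowess.Lowess()
lowess_model.fit(x[resample_idxs], y[resample_idxs], frac=0.2, num_fits=30)
y_pred_single_run = lowess_model.predict(x)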
Using df_bootstrap we can calculate the confidence interval of our predictions; the Pandas DataFrame quantile method makes this particularly simple.
df_conf_intvl = lowess.get_confidence_interval(df_bootstrap, conf_pct=0.95)
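For reference, roughly the same interval can be computed directly with the quantile method; the sketch below assumes a symmetric 95% interval (the 2.5% and 97.5% quantiles across the bootstrap runs) and isn't necessarily identical to get_confidence_interval's internals.
# Illustrative equivalent using the DataFrame quantile method directly -
# each column of df_bootstrap is one bootstrap run, so we take quantiles across columns
df_conf_intvl_manual = (df_bootstrap
                        .quantile([0.025, 0.975], axis=1)
                        .T
                        .rename(columns={0.025: 'min', 0.975: 'max'})
                       )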
# Plotting
fig, ax = plt.subplots(dpi=150)
ax.scatter(df_LIGO['frequency'], df_LIGO['H1'], color='k', linewidth=0, s=1, zorder=1)
# The fit was carried out in log-space, so we exponentiate the interval bounds back to the original scale
ax.fill_between(df_conf_intvl.index, np.exp(df_conf_intvl['min']), np.exp(df_conf_intvl['max']), color='r', alpha=1, label='95% Confidence')
ax.set_yscale('log')
ax.legend(frameon=False)
eda.hide_spines(ax)
We can see that we capture the middle and higher frequencies fairly well, but due to the smaller number of data-points in the low-frequency region the LOWESS fit is unable to model it as well. The frac parameter could be decreased to improve this, but that comes at the cost of less accurate estimates in the other regions. One way to address this could be to introduce an option where frac can be varied for each local regression model.
For now we'll just limit our prediction to the domain where the confidence interval is 'reasonable'. We'll start by calculating the width of the 95% confidence interval at each frequency.
# Width of the 95% interval at each frequency (97.5% quantile minus 2.5% quantile)
s_95pct_conf_intvl = (df_bootstrap
.quantile([0.025, 0.975], axis=1)
.diff()
.dropna(how='all')
.T
.rename(columns={0.975: '95pct_pred_intvl'})
['95pct_pred_intvl']
)
s_95pct_conf_intvl
x
0.00 0.404673
0.25 0.404049
0.50 0.403448
0.75 0.402822
1.00 0.402187
...
2047.00 0.077741
2047.25 0.077827
2047.50 0.077913
2047.75 0.077999
2048.00 0.078085
Name: 95pct_pred_intvl, Length: 8193, dtype: float64
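As an aside, the same widths can be computed a little more explicitly without the diff/transpose chaining; a minimal equivalent sketch (the _alt variable name is just illustrative):
# Equivalent interval-width calculation, written out explicitly
df_quantiles = df_bootstrap.quantile([0.025, 0.975], axis=1).T
s_95pct_conf_intvl_alt = df_quantiles[0.975] - df_quantiles[0.025]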
We'll now define the 'reasonable' confidence interval threshold, in this case taking the width that includes 95% of the values (i.e. the 95th percentile of the interval widths).
# Fraction of the interval widths that the 'reasonable' threshold should cover
x_max_pct = 0.95
# Plotting
fig, ax = plt.subplots(dpi=150)
hist = sns.histplot(s_95pct_conf_intvl, ax=ax)
y_max = np.ceil(max([h.get_height() for h in hist.patches])/1e2)*1e2
x_max = s_95pct_conf_intvl.quantile(x_max_pct)
ax.plot([x_max, x_max], [0, y_max], linestyle='--', color='k', label='95% Coverage')
ax.set_ylim(0, y_max)
eda.hide_spines(ax)
We'll then only plot our confidence interval where it falls within this 'reasonable' range.
# Keep only the frequencies where the interval width is below the threshold
conf_intvl_idxs_to_keep = s_95pct_conf_intvl.index[s_95pct_conf_intvl < x_max]
# Plotting
fig, ax = plt.subplots(dpi=150)
ax.scatter(df_LIGO['frequency'], df_LIGO['H1'], color='k', linewidth=0, s=1, zorder=1)
ax.fill_between(conf_intvl_idxs_to_keep, np.exp(df_conf_intvl.loc[conf_intvl_idxs_to_keep, 'min']), np.exp(df_conf_intvl.loc[conf_intvl_idxs_to_keep, 'max']), color='r', alpha=1, label='95% Confidence')
ax.set_yscale('log')
ax.legend(frameon=False)
eda.hide_spines(ax)
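Finally, if this domain-limiting step were needed elsewhere, the logic above could be wrapped into a small helper. The function below is hypothetical (it isn't part of moepy) and simply restates the steps in this notebook: compute the interval widths from the bootstrap runs, then keep the locations where the width falls below a chosen percentile of all widths.
def get_reasonable_domain_idxs(df_bootstrap, conf_pct=0.95, width_pct=0.95):
    # Hypothetical helper (not part of moepy) wrapping the steps above
    lower, upper = (1-conf_pct)/2, 1-(1-conf_pct)/2
    df_quantiles = df_bootstrap.quantile([lower, upper], axis=1).T
    s_widths = df_quantiles[upper] - df_quantiles[lower]
    # Keep only the locations where the interval width is below the width_pct quantile
    return s_widths.index[s_widths < s_widths.quantile(width_pct)]
Calling get_reasonable_domain_idxs(df_bootstrap) would then reproduce the index used to mask the fill_between plot above.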