3. Working with Time-series Data¶
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from neighbors import NNMF_sgd, estimate_performance, load_toymat, create_sparse_mask
def plot_mat(mat, vmin=1, vmax=100, cmap="Blues", ax=None, title=None):
    "Quick helper function to nicely plot a user x item matrix"
    ax = sns.heatmap(mat, cmap=cmap, vmin=vmin, vmax=vmax, square=False, cbar=False, ax=ax)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.yaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.set(xlabel="Time-point")
    if title:
        ax.set(title=title)
def plot_timeseries(model, dilate_by_nsamples):
    _, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    arr1 = model.masked_data.iloc[0, :].copy()
    model.dilate_mask(dilate_by_nsamples)
    arr2 = model.masked_data.iloc[0, :].copy()
    ax1.plot(range(len(arr1)), arr1)
    ax1.set(ylabel="Rating", xlabel="Time-point", title="Single User Observed Ratings")
    ax2.plot(range(len(arr2)), arr2)
    ax2.set(ylabel="Rating", xlabel="Time-point", title="Single User Dilated Ratings")
    sns.despine()
In the last two tutorials we saw how to work with dense and sparse data. In this final tutorial we'll demonstrate a unique feature of the toolbox for working with sparse time-series data. Such data encompass situations in which users provide continuous ratings over time rather than over unique items, for example while watching a single movie or listening to a single audio track. However, as before, these ratings may be sparse, such that not every user has a rating for every time-point.
Like before, we'll begin with the load_toymat function to generate a sample dataset where 50 users provided ratings for 200 time-points on a scale from 1-100. Additionally, we'll sparsify this dataset by masking out 50% of the ratings using the create_sparse_mask function.
Let's plot the mask below:
toy_data = load_toymat(users=50, items=200, random_state=0)
mask = create_sparse_mask(toy_data, n_mask_items=.5)
masked_data = toy_data[mask]
plot_mat(mask, vmin=0, vmax=1, cmap='Greys')
Dilating a time-series within a model¶
One way to make higher quality predictions given such data is to leverage the fact that time-series often have intrinsic autocorrelation. In other words, successive time-points are more likely to have similar ratings than more distant time-points. The degree of similarity is dictated by the autocorrelation function of the data, which may be difficult or impossible to estimate from a sparse time-series. Nonetheless, model predictions can often benefit from some initial interpolation or temporal smoothing, whereby missing time-points are filled in with values computed from neighboring observed time-points.
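For instance, with the dense toy_data generated above we can compute a rough lag-1 autocorrelation directly, whereas the same calculation on the masked data would propagate NaNs. This is a minimal sketch written for this tutorial, not part of the toolbox:

import numpy as np

# Rough lag-1 autocorrelation for a single user's *dense* ratings.
# Running the same calculation on masked_data would propagate NaNs,
# illustrating why autocorrelation is hard to estimate from a sparse series.
dense_user = toy_data.iloc[0, :].to_numpy(dtype=float)
lag1 = np.corrcoef(dense_user[:-1], dense_user[1:])[0, 1]
print(lag1)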
Models can use the .dilate_mask method or the dilate_by_nsamples argument of their .fit method to perform this operation prior to estimation. Let's see what that looks like using a single user's ratings, where we dilate their observed time-series by 20 samples. Dilation occurs by convolving the observed time-series with a box-car function of the requested width (in number of samples). Overlapping dilated samples are simply averaged:
model = NNMF_sgd(masked_data, random_state=0)
# This is just a convenience function that plots a single user's time-series
# It makes use of model.dilate_mask under-the-hood
# See the function definition at the top of this tutorial
plot_timeseries(model, dilate_by_nsamples=20)
data contains NaNs...treating as pre-masked
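To build intuition for this operation, here is a minimal NumPy sketch of a box-car dilation. This is an illustrative approximation written for this tutorial (the function name boxcar_dilate is hypothetical, and the exact windowing, centered here, may differ from the toolbox's internal implementation):

import numpy as np

def boxcar_dilate(ratings, n_samples):
    "Spread each observed rating over a window of n_samples time-points, averaging overlaps"
    ratings = np.asarray(ratings, dtype=float)
    observed = ~np.isnan(ratings)
    kernel = np.ones(n_samples)
    # Sum of observed values falling inside each time-point's window
    values = np.convolve(np.where(observed, ratings, 0.0), kernel, mode="same")
    # Number of observed samples contributing to each window
    counts = np.convolve(observed.astype(float), kernel, mode="same")
    dilated = np.full_like(ratings, np.nan)
    dilated[counts > 0] = values[counts > 0] / counts[counts > 0]
    return dilated

sparse_series = np.array([np.nan, 10, np.nan, np.nan, 30, np.nan, np.nan])
boxcar_dilate(sparse_series, n_samples=3)
# array([10., 10., 10., 30., 30., 30., nan])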
Performing this for all users, we can now see the difference between the original sparse time-series data and the dilated data. Notice that this has the effect of making the data dense through smoothing, which in many cases can dramatically improve model predictions.
_, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Calling dilate_mask with no arguments "undilates" the data
model.dilate_mask()
plot_mat(model.masked_data, ax=ax1, title="Observed Ratings")

# Now dilate by 20 time-points
model.dilate_mask(20)
plot_mat(model.masked_data, ax=ax2, title="Dilated Ratings")
For convenience, it's possible to request dilation directly from a model's .fit method without the need to call .dilate_mask first. To verify whether a model's mask has been dilated, you can check its .is_mask_dilated attribute:
model.fit(dilate_by_nsamples=20)
model.is_mask_dilated
True
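Conversely, calling .dilate_mask with no arguments undilates the data, as we did earlier when plotting, and this should be reflected in the same attribute:

model.dilate_mask()
model.is_mask_dilated  # expected to be False after undilating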
Because we're dealing with sparse data, we don't have ground-truth observations with which to evaluate model performance. But as demonstrated in the previous tutorial, we can use the estimate_performance function to approximate model performance using cross-validation. In particular, this function takes a fit_kwargs argument, which is a dictionary of arguments passed to a model's .fit method. We can make use of this to request dilation during estimation:
group_res_d, user_res_d = estimate_performance(
    NNMF_sgd, masked_data, fit_kwargs={"dilate_by_nsamples": 20}
)
Data sparsity is 50.0%. Using cross-validation...
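As a quick sanity check, one approach is to compare cross-validated estimates with and without dilation. The sketch below assumes estimate_performance can be called without fit_kwargs, as in the previous tutorial; inspect the returned results in your own session to compare the two runs:

# Baseline estimates without dilation, for comparison with the dilated run above
group_res, user_res = estimate_performance(NNMF_sgd, masked_data)

Comparing group_res to group_res_d gives a rough sense of whether dilation improved out-of-sample predictions for this particular dataset.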
A Few Notes on Dilation¶
- It's critical to note that when using dilation alongside the estimate_performance function for dense data, we're never "leaking" observed ratings to the held-out observations. The process of dilation simply makes use of the observed ratings to fill in missing observations. The values that get filled in make no use of the true ratings at these time-points.
- Likewise, observation splitting for cross-validation of sparse data occurs before dilation, so that observed ratings treated as test time-points have no influence on the dilation operation.
- The optimal amount of dilation for a particular dataset will vary based on a number of factors, including the type and quality of the data, its intrinsic auto-correlation, time-series length, and time-series sparsity.
- Dilation can often be most helpful in extremely sparse cases (e.g. < 50% of ratings are observed) when undilated model estimates are poor. However, over-dilating can actually make model estimates worse in some cases.
- It's therefore important to consider the plausible rate-of-change of a user's responses given the sampling rate. For example, dilating emotion ratings collected at 1Hz (once every second) by 60 samples (1 minute) may be too much and smooth over important temporal dynamics; the sketch below shows how to translate a desired smoothing duration into a sample count.
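As a final illustration of that last point, here is a trivial helper (hypothetical, written for this tutorial) for converting a desired smoothing duration into the number of samples to pass as dilate_by_nsamples, given the sampling rate of the ratings:

def dilation_in_samples(duration_seconds, sampling_rate_hz):
    "Convert a smoothing duration in seconds to a sample count for dilate_by_nsamples"
    return int(duration_seconds * sampling_rate_hz)

# Emotion ratings collected at 1Hz: a 5-second window is 5 samples,
# whereas a 60-second window (60 samples) may smooth over real dynamics
dilation_in_samples(5, sampling_rate_hz=1)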