2. Working with Sparse Data
import seaborn as sns
import matplotlib.ticker as ticker
from neighbors import (
    NNMF_sgd,
    estimate_performance,
    load_toymat,
    create_sparse_mask,
)
def plot_mat(mat, vmin=1, vmax=100, cmap='Blues'):
    """Quick helper function to nicely plot a user x item matrix"""
    ax = sns.heatmap(mat, cmap=cmap, vmin=vmin, vmax=vmax)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.yaxis.set_major_formatter(ticker.ScalarFormatter())
In the previous tutorial we saw the basics of how to fit a model, inspect its predictions, and estimate its performance using dense data. In this tutorial we'll see the small differences to keep in mind when working with sparse data.
Like before we'll begin with the load_toymat function to generate a sample dataset where 50 users rated 50 items on a scale from 1-100. Additionally, we'll sparsify this dataset by masking out 25% of the ratings using the create_sparse_mask function.
Let's plot the mask below. Dark colors indicate where we observed a rating for a user, while light colors indicate no observed rating:
toy_data = load_toymat(users=50, items=50, random_state=0)
mask = create_sparse_mask(toy_data, n_mask_items=.25)
masked_data = toy_data[mask]
plot_mat(mask, vmin=0, vmax=1, cmap='Greys')
Fitting a model
Just like in the previous tutorial, we initialize a model with the sparse data, but this time we omit the mask and n_mask_items arguments. Models are smart enough to realize that the data contains missing values, and all future operations will take this into account.
model = NNMF_sgd(masked_data, random_state=0)
# Take a look at the first 10 users x items. In this case the model's data and masked data are the same
model.data.iloc[:10,:10]
data contains NaNs...treating as pre-masked
| User \ Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.440675 | 35.759468 | 30.138169 | 27.244159 | 21.182740 | NaN | 21.879361 | 44.588650 | 48.183138 | NaN |
| 1 | 28.509839 | 21.930076 | NaN | NaN | 10.443838 | 13.065476 | NaN | NaN | NaN | 12.221280 |
| 2 | 33.890827 | 13.500399 | 36.759701 | 48.109427 | 12.437657 | 33.807867 | 29.602097 | NaN | NaN | 47.637451 |
| 3 | 7.472415 | 43.406303 | 8.124647 | 30.777978 | NaN | 47.400411 | 40.365948 | 28.455037 | 20.359165 | 3.458350 |
| 4 | 15.589794 | 34.817174 | 18.887592 | 8.980184 | 1.233936 | 8.362482 | 33.969639 | 22.684842 | 26.828961 | 44.833565 |
| 5 | 17.780637 | 47.021597 | 38.266263 | 37.433181 | 45.185987 | NaN | NaN | 29.223803 | 48.096819 | 14.607376 |
| 6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | 20.362059 | 16.611707 | NaN | NaN | 36.279718 | NaN |
| 7 | 32.278512 | 1.768122 | 21.520122 | 25.500843 | NaN | NaN | 13.879805 | 6.443028 | 19.633784 | NaN |
| 8 | 20.062975 | 46.464571 | 4.980747 | 47.265077 | NaN | NaN | 16.335044 | NaN | NaN | 1.653730 |
| 9 | 14.827813 | 49.600562 | 12.471002 | 5.295308 | 47.547631 | 16.671013 | 34.488413 | 2.917818 | NaN | NaN |
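Before fitting, it's worth sanity-checking how sparse the data actually is. This is a quick check with plain pandas rather than anything neighbors-specific, and should land near the 25% we requested:

# Fraction of all user x item cells that are missing (NaN)
sparsity = masked_data.isna().to_numpy().mean()
print(f"Data sparsity: {sparsity:.1%}")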
Like before we can use .fit to train on the observed ratings and generate predictions for the missing values. We can use .plot_predictions to inspect the filled-in user-item ratings matrix. However, because we have no ground truth ratings, it's not possible to score the model's predictions or generate a scatter plot like before.
model.fit()
model.plot_predictions();
/Users/Esh/Documents/pypackages/neighbors/neighbors/base.py:243: UserWarning: Cannot score predictions on missing data because true values were never observed!
predictions = model.transform()
# Take a look at the first 10 user x item predictions
predictions.iloc[:10,:10]
| User \ Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.440675 | 35.759468 | 30.138169 | 27.244159 | 21.182740 | 37.294706 | 21.879361 | 44.588650 | 48.183138 | 19.172076 |
| 1 | 28.509839 | 19.193658 | -0.609682 | 5.102241 | 10.443838 | 13.065476 | 27.556131 | 12.664580 | 23.315539 | 12.221280 |
| 2 | 33.890827 | 13.500399 | 36.759701 | 48.109427 | 12.437657 | 33.807867 | 29.602097 | 28.612595 | 11.154082 | 47.637451 |
| 3 | 7.472415 | 43.406303 | 8.124647 | 30.777978 | 6.190999 | 47.400411 | 37.153168 | 28.455037 | 20.359165 | 3.458350 |
| 4 | 15.589794 | 34.817174 | 18.887592 | 8.980184 | 1.233936 | 8.362482 | 33.969639 | 22.684842 | 26.828961 | 44.833565 |
| 5 | 17.780637 | 47.021597 | 38.266263 | 37.433181 | 45.185987 | 9.171122 | 27.609623 | 17.176145 | 48.096819 | 14.607376 |
| 6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | 20.362059 | 16.611707 | 6.624382 | 2.671359 | 36.279718 | 0.571373 |
| 7 | 32.278512 | 1.768122 | 21.520122 | 25.500843 | 39.659211 | 39.069626 | 15.485929 | 84.756570 | 19.633784 | 47.820286 |
| 8 | 20.062975 | 46.464571 | 4.980747 | 47.265077 | 26.221923 | 27.708120 | 17.318159 | 6.951076 | 30.723235 | 1.653730 |
| 9 | 14.827813 | 49.600562 | 12.471002 | 5.295308 | 47.547631 | 16.671013 | 34.488413 | 2.917818 | 36.535455 | 44.086011 |
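Comparing this table to the model's data above, the observed ratings appear to carry through unchanged and only the NaN cells were filled in. Using plain pandas (again, nothing neighbors-specific), we can count exactly how many cells were imputed:

# Count how many previously missing cells were filled in by the model
was_missing = masked_data.isna()
n_filled = (predictions.notna() & was_missing).to_numpy().sum()
print(f"{n_filled} previously missing ratings were imputed")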
A model's .summary method will now omit scores for "missing" data because no ground truth was ever observed:
model.summary(verbose=True)
User performance results (not returned) are accessible using .user_results
Overall performance results (returned) are accessible using .overall_results
| | algorithm | dataset | group | metric | score |
|---|---|---|---|---|---|
| 0 | NNMF_sgd | observed | all | correlation | 0.998231 |
| 1 | NNMF_sgd | observed | all | mae | 0.918352 |
| 2 | NNMF_sgd | observed | all | mse | 1.800620 |
| 3 | NNMF_sgd | observed | all | rmse | 1.341872 |
| 4 | NNMF_sgd | observed | user | correlation | 0.997944 |
| 5 | NNMF_sgd | observed | user | mae | 0.918352 |
| 6 | NNMF_sgd | observed | user | mse | 1.800620 |
| 7 | NNMF_sgd | observed | user | rmse | 1.289602 |
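As the printed message notes, these results are also stored on the model itself. Assuming .overall_results is an ordinary pandas DataFrame (which the table above suggests), we can slice it like any other, e.g. to keep just the matrix-wide metrics:

# Keep only the matrix-wide ("all") rows from the stored summary
overall = model.overall_results
overall[overall["group"] == "all"]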
Benchmarking via Cross-Validation
While we can't assess a model's performance in the absence of ground truth ratings, we can approximate this performance via cross-validation. Our old friend estimate_performance is smart enough to understand we're working with sparse data and can perform this kind of approximation. Behind the scenes it works similarly to other libraries like Surprise. A few key caveats are important to note:
- Approximating performance with sparse data works by holding out some of the observed ratings for testing purposes while using the rest of the observed ratings for training purposes. Notice how we're not trying to use the missing ratings to benchmark, because that's not possible!
- This has the effect of increasing the sparsity of the already sparse data during training. This can be controlled using the n_folds parameter: more folds means less additional sparsity. In the example below we use 10 folds, which means that 90% of the observed ratings will be used for training the model while 10% of the observed ratings will be used for testing. We can calculate the additional sparsity incurred as follows (verified in the short snippet after this list):
    - 50 x 50 = 2500 total ratings; at 25% sparsity, 2500 * 75% = 1875 observed ratings
    - 90% * 1875 ≈ 1687 ratings for training
    - 10% * 1875 ≈ 187 ratings for testing
    - 1 - (1687 / 2500) = 32.5% effective training sparsity
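For reference, here's that arithmetic verified in plain Python:

# Effective training sparsity for 10-fold CV on a 50 x 50 matrix
total = 50 * 50                # 2500 possible ratings
observed = int(total * 0.75)   # 25% masked -> 1875 observed ratings
train = int(observed * 0.9)    # ~1687 ratings per training fold
print(f"Effective training sparsity: {1 - train / total:.1%}")  # ~32.5%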
overall_results, user_results = estimate_performance(
NNMF_sgd,
masked_data,
n_folds=10,
)
overall_results
Data sparsity is 24.0%. Using cross-validation...
| | algorithm | dataset | group | metric | mean | std |
|---|---|---|---|---|---|---|
| 0 | NNMF_sgd | test | all | correlation | 0.364798 | 0.080220 |
| 1 | NNMF_sgd | test | all | mae | 17.308642 | 0.882994 |
| 2 | NNMF_sgd | test | all | mse | 468.587220 | 52.112955 |
| 3 | NNMF_sgd | test | all | rmse | 21.616657 | 1.205249 |
| 4 | NNMF_sgd | test | user | correlation | 0.272998 | 0.091077 |
| 5 | NNMF_sgd | test | user | mae | 17.336230 | 1.176862 |
| 6 | NNMF_sgd | test | user | mse | 468.855204 | 64.910656 |
| 7 | NNMF_sgd | test | user | rmse | 19.954421 | 1.357769 |
Similar to the previous tutorial, we can also see if predictive performance varied by user to identify some users that were particularly difficult to generate predictions for.
user_results.head()
| user | rmse_test | mse_test | mae_test | correlation_test |
|---|---|---|---|---|
| 0 | 17.652991 | 376.946158 | 15.818702 | 0.425894 |
| 1 | 21.795164 | 564.910024 | 17.555891 | 0.038531 |
| 2 | 20.349158 | 457.097971 | 16.913393 | 0.238913 |
| 3 | 16.505700 | 343.458521 | 13.852083 | 0.525783 |
| 4 | 21.345487 | 490.732213 | 18.725056 | -0.352414 |
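Because user_results is a regular pandas DataFrame indexed by user, sorting by test RMSE is an easy way to surface the hardest-to-predict users (a pandas one-liner rather than a neighbors-specific method):

# Users with the highest test RMSE were hardest to predict
user_results.sort_values("rmse_test", ascending=False).head()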
Summary
Fitting a model to sparse data is very similar to working with dense data, with the exception that missing ratings have no ground truth against which to score model performance. However, the estimate_performance function can be used to approximate performance via cross-validation. This approach is the de facto standard in several other collaborative filtering toolboxes such as Surprise.
In the last tutorial we'll see one more feature that's particularly useful for working with user ratings that were collected over time.