1. Working with Dense Data¶
import seaborn as sns
import matplotlib.ticker as ticker
from neighbors import (
    NNMF_sgd,
    estimate_performance,
    load_toymat,
)

def plot_mat(mat):
    """Quick helper function to nicely plot a user x item matrix"""
    ax = sns.heatmap(mat, cmap="Blues", vmin=1, vmax=100)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.yaxis.set_major_formatter(ticker.ScalarFormatter())
All toolbox algorithms operate on 2D pandas DataFrames with rows as unique users and columns as unique items. Models distinguish between two kinds of datasets (a small sketch of each follows the list below):
- Dense data, in which all users rated all items. Such datasets are useful for estimating the performance of an algorithm by testing how well some percentage of ratings can be masked out and then recovered via prediction. This is useful for benchmarking model performance and simulating situations with datasets of varying sparsity. Conceptually this is equivalent to supervised learning, where we make predictions with knowledge of the "correct answers" that can be used to compute model performance.
- Sparse data, in which some user-item ratings were never observed. This is the primary intended use case of the toolbox. A model can be trained on the observed ratings using various collaborative filtering algorithms to generate predictions for these missing ("unobserved") ratings.
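To make the distinction concrete, here's a minimal sketch in plain numpy/pandas (not toolbox code; the labels and values are made up for illustration) of a tiny dense matrix and a sparse version of the same matrix:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Dense data: every user rated every item (3 users x 4 items, ratings 1-100)
dense = pd.DataFrame(
    rng.integers(1, 101, size=(3, 4)),
    index=pd.Index(range(3), name="User"),
    columns=pd.Index(range(4), name="Item"),
)

# Sparse data: some user-item ratings were never observed (NaN)
sparse = dense.astype(float)
sparse.iloc[0, 2] = np.nan
sparse.iloc[2, 0] = np.nan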
In this tutorial we'll demonstrate basic toolbox features on dense data. The load_toymat function can be used to generate some sample data for our purposes. Let's generate a dataset in which each of 50 users rated 50 items on a scale from 1-100.
Note: the numbers chosen are just for illustrative purposes, and the number of users and items doesn't have to be equal.
toy_data = load_toymat(users=50, items=50, random_state=0)
plot_mat(toy_data)
Fitting a model¶
Fitting a model works similarly to libraries like sklearn. You just need to initialize a model object and call its .fit method. When working with dense data, i.e. every user rated every item, we need to initialize the model with a mask or a value between 0 and 1 that indicates what proportion of the observed data should be treated as "missing." This allows us to simulate a situation in which we hadn't observed these ratings at all.
Using n_mask_items we can mask out 25% of the ratings and retain 75%. Notice how some user-item combinations are now set to NaN.
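To build intuition about what this masking does, here's a minimal sketch in plain numpy/pandas (not how the toolbox implements it internally) of hiding a random 25% of a dense matrix:
import numpy as np

# Boolean array: True = keep the rating, False = hide it from the model
rng = np.random.default_rng(0)
keep = rng.random(toy_data.shape) > 0.25
masked_by_hand = toy_data.where(keep)  # hidden cells become NaN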
model = NNMF_sgd(toy_data, n_mask_items=.25, random_state=0)
# Take a look at the first 10 users x items after masking
model.masked_data.iloc[:10,:10]
Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
User | ||||||||||
0 | 27.440675 | 35.759468 | 30.138169 | NaN | 21.182740 | 37.294706 | NaN | 44.588650 | 48.183138 | 19.172076 |
1 | 28.509839 | 21.930076 | 49.418692 | 5.102241 | 10.443838 | 13.065476 | 32.655416 | NaN | 23.315539 | 12.221280 |
2 | 33.890827 | 13.500399 | 36.759701 | NaN | 12.437657 | 33.807867 | NaN | 28.612595 | 11.154082 | 47.637451 |
3 | 7.472415 | 43.406303 | NaN | 30.777978 | 6.190999 | NaN | 40.365948 | 28.455037 | 20.359165 | 3.458350 |
4 | 15.589794 | 34.817174 | NaN | 8.980184 | NaN | 8.362482 | 33.969639 | 22.684842 | 26.828961 | NaN |
5 | 17.780637 | NaN | 38.266263 | 37.433181 | NaN | 9.171122 | 27.609623 | NaN | 48.096819 | NaN |
6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | NaN | 16.611707 | NaN | 2.671359 | 36.279718 | 0.571373 |
7 | 32.278512 | NaN | 21.520122 | NaN | 26.808875 | NaN | 13.879805 | 6.443028 | NaN | NaN |
8 | NaN | 46.464571 | NaN | 47.265077 | NaN | 27.708120 | NaN | NaN | NaN | NaN |
9 | 14.827813 | 49.600562 | 12.471002 | NaN | 47.547631 | 16.671013 | 34.488413 | 2.917818 | 36.535455 | 44.086011 |
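As a quick sanity check, plain pandas on the masked_data attribute shown above confirms that roughly 25% of the cells are now missing:
# Fraction of masked (NaN) cells across the whole matrix; should be close to 0.25
model.masked_data.isna().mean().mean()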
Now we can try to predict these missing ratings by fitting the model and plotting its predictions.
The left matrix is the input data after masking. The middle is the model's predictions for the missing ratings plus the ratings we did observe. The right is a scatter plot of model predictions for the missing ratings vs the true values of these ratings.
For convenience the plot title contains the RMSE and correlation of the missing ratings (averaged across users to account for user-level clustering). RMSE is interpretable as the average misprediction on the same scale as the original ratings, in this case 1-100.
model.fit()
model.plot_predictions();
To retrieve the matrix containing the model predictions we can use the .transform method. By default this will return a matrix containing the original ratings for values that were observed and predictions for values that were missing (i.e. masked out). To return model predictions for the observed values as well, i.e. not passing those values forward, set return_only_predictions=True (a short example follows the table below).
Now the masked out ratings have been replaced with model predictions:
predictions = model.transform()
# Take a look at the first 10 users x items after filling in predictions
predictions.iloc[:10,:10]
Item | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
User | ||||||||||
0 | 27.440675 | 35.759468 | 30.138169 | 36.660519 | 21.182740 | 37.294706 | 56.100570 | 44.588650 | 48.183138 | 19.172076 |
1 | 28.509839 | 21.930076 | 49.418692 | 5.102241 | 10.443838 | 13.065476 | 32.655416 | 21.191779 | 23.315539 | 12.221280 |
2 | 33.890827 | 13.500399 | 36.759701 | 23.337345 | 12.437657 | 33.807867 | 30.578051 | 28.612595 | 11.154082 | 47.637451 |
3 | 7.472415 | 43.406303 | 15.179831 | 30.777978 | 6.190999 | 38.134221 | 40.365948 | 28.455037 | 20.359165 | 3.458350 |
4 | 15.589794 | 34.817174 | 5.257692 | 8.980184 | 55.577116 | 8.362482 | 33.969639 | 22.684842 | 26.828961 | 25.627142 |
5 | 17.780637 | 41.777774 | 38.266263 | 37.433181 | 61.015818 | 9.171122 | 27.609623 | 40.282928 | 48.096819 | 36.694523 |
6 | 45.327775 | 38.702367 | 16.657258 | 4.055069 | 26.051788 | 16.611707 | 23.999081 | 2.671359 | 36.279718 | 0.571373 |
7 | 32.278512 | 13.093412 | 21.520122 | 31.101120 | 26.808875 | 23.499822 | 13.879805 | 6.443028 | 17.163445 | 36.159055 |
8 | 18.240560 | 46.464571 | 14.868483 | 47.265077 | 25.477137 | 27.708120 | 29.443827 | 28.951392 | 41.470059 | 32.561340 |
9 | 14.827813 | 49.600562 | 12.471002 | 12.572948 | 47.547631 | 16.671013 | 34.488413 | 2.917818 | 36.535455 | 44.086011 |
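As mentioned above, you can instead ask for model predictions everywhere, including the cells whose true ratings were observed; a minimal sketch using the return_only_predictions flag described earlier:
# Predictions for every user x item cell, ignoring the observed ratings
all_predictions = model.transform(return_only_predictions=True)
all_predictions.iloc[:10, :10]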
For NNMF models it's easy to inspect and debug model training using the .plot_learning method. It's also possible to get more detail while fitting by passing verbose=True to .fit.
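For example, refitting the same model with verbose output prints the training progress (the exact messages depend on the library version):
# Refit the model, printing progress while training
model.fit(verbose=True)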
The plot title below also displays the final RMSE on the observed ratings during training and indicates whether the model fit converged within the number of iterations.
model.plot_learning();
Scoring a model's predictions¶
Working with dense data affords us a ground truth that can be used to assess the model's performance. We support a number of different metrics to do this (RMSE and correlation in the plots above are just two). To return a model's performance you can use the .score method. However, the .summary method may be more convenient as it returns all supported metrics along with separate scoring for both the observed ratings (model training performance) and missing ratings (model testing performance).
Additionally, metrics are scored in two different ways. user metrics below score performance separately for each user first and then average these scores. This approach is more common in the social sciences, where observations are treated as "clustered" by user. all simply scores all ratings together, ignoring the fact that multiple ratings come from each user. This method is more common in machine learning, where overall model performance is of primary interest. (A small sketch of the two approaches follows the table below.)
model.summary(verbose=True)
User performance results (not returned) are accessible using .user_results
Overall performance results (returned) are accessible using .overall_results
algorithm | dataset | group | metric | score | |
---|---|---|---|---|---|
0 | NNMF_sgd | missing | all | correlation | 0.360521 |
1 | NNMF_sgd | missing | all | mae | 16.760530 |
2 | NNMF_sgd | missing | all | mse | 441.263698 |
3 | NNMF_sgd | missing | all | rmse | 21.006278 |
4 | NNMF_sgd | missing | user | correlation | 0.297074 |
5 | NNMF_sgd | missing | user | mae | 16.760530 |
6 | NNMF_sgd | missing | user | mse | 441.263698 |
7 | NNMF_sgd | missing | user | rmse | 20.445180 |
8 | NNMF_sgd | observed | all | correlation | 0.995488 |
9 | NNMF_sgd | observed | all | mae | 1.519559 |
10 | NNMF_sgd | observed | all | mse | 4.655738 |
11 | NNMF_sgd | observed | all | rmse | 2.157716 |
12 | NNMF_sgd | observed | user | correlation | 0.994861 |
13 | NNMF_sgd | observed | user | mae | 1.519559 |
14 | NNMF_sgd | observed | user | mse | 4.655738 |
15 | NNMF_sgd | observed | user | rmse | 2.081369 |
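To make the user vs all distinction concrete, here's a minimal pandas sketch of the two ways of pooling a correlation score, using hypothetical truth and pred DataFrames with the same user x item layout as above (illustrative code, not the toolbox's internal implementation):
import numpy as np
import pandas as pd

# Hypothetical ground-truth ratings and noisy predictions (50 users x 50 items)
rng = np.random.default_rng(0)
truth = pd.DataFrame(rng.integers(1, 101, size=(50, 50)).astype(float))
pred = truth + rng.normal(0, 10, size=truth.shape)

# "user": correlate each user's predictions with their true ratings, then average
user_score = truth.corrwith(pred, axis=1).mean()

# "all": pool every rating into one long vector and compute a single correlation
all_score = np.corrcoef(truth.values.ravel(), pred.values.ravel())[0, 1]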
Benchmarking a model's performance¶
The performance above is specific to the exact ratings we masked out. But how does the model perform in general when 25% of the data is missing?
While we could repeat the procedure above for different random masks of the same size, doing so by hand is a bit tedious. Fortunately, the estimate_performance function is designed exactly for this purpose. Just pass it a model class (not a model object), some data, and the amount of masking, and it will repeatedly refit the model with new random masks and return the average performance across all iterations. This is functionally equivalent to randomized cross-validation, where the size of the training and testing splits is controlled via the n_mask_items argument. In the example below, masking 25% of the data is analogous to 4-fold cross-validation, where training is done using 3 folds and testing is performed on the held-out fold.
overall_results, user_results = estimate_performance(
NNMF_sgd, toy_data, n_iter=10, n_mask_items=.25
)
overall_results
Data sparsity is 0.0%. Using random masking...
algorithm | dataset | group | metric | mean | std | |
---|---|---|---|---|---|---|
0 | NNMF_sgd | missing | all | correlation | 0.361321 | 0.039224 |
1 | NNMF_sgd | missing | all | mae | 17.722848 | 0.532381 |
2 | NNMF_sgd | missing | all | mse | 493.397942 | 34.549618 |
3 | NNMF_sgd | missing | all | rmse | 22.200547 | 0.770019 |
4 | NNMF_sgd | missing | user | correlation | 0.322631 | 0.038444 |
5 | NNMF_sgd | missing | user | mae | 17.722848 | 0.532381 |
6 | NNMF_sgd | missing | user | mse | 493.397942 | 34.549618 |
7 | NNMF_sgd | missing | user | rmse | 21.695961 | 0.672878 |
We can also check whether predictive performance varied by user, to identify users that were particularly difficult to generate predictions for.
user_results.head()
rmse_missing | mse_missing | mae_missing | correlation_missing | |
---|---|---|---|---|
user | ||||
0 | 19.988851 | 408.599276 | 16.442676 | 0.306068 |
1 | 25.840342 | 748.642060 | 20.213965 | 0.192995 |
2 | 18.997861 | 375.325530 | 15.596407 | 0.482213 |
3 | 21.942705 | 499.427873 | 17.603425 | 0.271063 |
4 | 22.217112 | 506.409075 | 18.474751 | 0.358182 |
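For example, to surface the hardest users you can sort this DataFrame by one of its error columns (plain pandas, using the rmse_missing column shown above):
# Users with the largest average prediction error on the masked ratings
user_results.sort_values("rmse_missing", ascending=False).head()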
By default estimate_performance only returns performance on the missing data. To see performance on all subsets use return_full_performance=True. You can also use return_agg=False if you want to see performance for each iteration separately rather than the mean and std across all iterations.
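For instance, a sketch combining the two options just described (argument names as given above; the shape of the returned results may differ from the aggregated case):
# Score observed and missing ratings, and keep each iteration's results separately
results = estimate_performance(
    NNMF_sgd,
    toy_data,
    n_iter=10,
    n_mask_items=.25,
    return_full_performance=True,  # also score the observed (training) ratings
    return_agg=False,              # per-iteration rows instead of mean/std
)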
Summary¶
Those are the basics of working with models and dense data. The next tutorial illustrates the primary use case for the toolbox: working with sparse data.